Playwright
Scraping Browser 提供一个高性能的无服务器平台,旨在简化从动态网站提取数据的过程。通过与 Playwright 的无缝集成,开发者可以运行、管理和监控无头浏览器,而无需专用服务器资源,从而实现高效的 Web 自动化和数据收集。
安装必要的库
首先,安装 playwright-core,这是一个轻量级的 Playwright 版本,用于连接到现有的浏览器实例:
npm install playwright-core
编写代码连接到 Scraping Browser
在你的 Playwright 代码中,使用以下方法连接到 Scraping Browser:
const { Playwright } = require('@scrapeless-ai/sdk');
(async () => {
const browser = await Playwright.connect({
session_name: 'sdk_test',
session_ttl: 180,
proxy_country: 'US',
session_recording: true,
defaultViewport: null
});
const context = browser.contexts()[0];
const page = await context.newPage();
await page.goto('https://www.scrapeless.com');
console.log(await page.title());
await browser.close();
})();
这允许你利用 Scraping Browser 的基础设施,包括可扩展性、IP 轮换和全球访问。
实际示例
以下是一些集成 Scraping Browser 后常见的 Playwright 操作:
- 导航和页面内容提取
const page = await browser.newPage();
await page.goto('https://www.example.com');
console.log(await page.title());
const html = await page.content();
console.log(html);
await browser.close();
- 截取屏幕截图
const context = browser.contexts()[0];
const page = await context.newPage();
await page.goto('https://www.example.com');
await page.screenshot({ path: 'example.png' });
console.log('Screenshot saved as example.png');
await browser.close();
- 运行自定义代码
const context = browser.contexts()[0];
const page = await context.newPage();
await page.goto('https://www.example.com');
const result = await page.evaluate(() => document.title);
console.log('Page title:', result);
await browser.close();
- 模拟鼠标点击
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.realClick('button[type="submit"]');
- 模拟键盘输入
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.realFill('#login-email', 'scrapeless@gmail.com');
- 使用 Scrapeless Agent 获取当前页面 URL
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
const { error, liveURL } = await cdpSession.liveURL();
- 解决图片验证码
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.imageToText({
imageSelector: '.captcha__image',
inputSelector: 'input[name="captcha"]',
timeout: 30000,
});
- 禁用自动验证码解决
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.disableCaptchaAutoSolve();
- 使用指定选项手动解决验证码
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.solveCaptcha();
- 等待页面上检测到验证码
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.waitCaptchaDetected({ timeout: 30000 });
- 等待验证码解决(成功或失败)
const { createPlaywrightCDPSession } = require('@scrapeless-ai/sdk');
// ... connect to Scraping Browser as shown above
const cdpSession = await createPlaywrightCDPSession(page);
await cdpSession.waitCaptchaSolved({ timeout: 30000 });