Optimizing Cost

Scraping browser proxy traffic optimization solutions

Introduction

When using Puppeteer for data scraping, traffic consumption is an important consideration. Especially when using proxy services, traffic costs can increase significantly. To optimize traffic usage, we can adopt the following strategies:

  1. Resource interception: Reduce traffic consumption by intercepting unnecessary resource requests.
  2. Request URL interception: Further reduce traffic by intercepting specific requests based on URL characteristics.
  3. Simulate mobile devices: Use mobile device configurations to obtain lighter page versions.
  4. Comprehensive optimization: Combine the above methods to achieve the best results.

Optimization Scheme 1: Resource Interception

Resource Interception Introduction

In Puppeteer, page.setRequestInterception(true) can capture every network request initiated by the browser and decide to continue (request.continue()), terminate (request.abort()), or customize the response (request.respond()).

This method can significantly reduce bandwidth consumption, especially suitable for crawling, screenshotting, and performance optimization scenarios.

Interceptable Resource Types and Suggestions

| Resource Type | Description | Example | Impact After Interception | Recommendation |
|---|---|---|---|---|
| image | Image resources | JPG/PNG/GIF/WebP images | Images will not be displayed | ⭐ Safe |
| font | Font files | TTF/WOFF/WOFF2 fonts | System default fonts will be used instead | ⭐ Safe |
| media | Media files | Video/audio files | Media content cannot be played | ⭐ Safe |
| manifest | Web App Manifest | PWA configuration file | PWA functionality may be affected | ⭐ Safe |
| prefetch | Prefetch resources | `<link rel="prefetch">` | Minimal impact on the page | ⭐ Safe |
| stylesheet | CSS stylesheets | External CSS files | Page styles are lost, may affect layout | ⚠️ Caution |
| websocket | WebSocket | Real-time communication connections | Real-time functionality disabled | ⚠️ Caution |
| eventsource | Server-Sent Events | Server push data | Push functionality disabled | ⚠️ Caution |
| preflight | CORS preflight requests | OPTIONS requests | Cross-origin requests fail | ⚠️ Caution |
| script | JavaScript scripts | External JS files | Dynamic functionality disabled, SPA may not render | ❌ Avoid |
| xhr | XHR requests | AJAX data requests | Unable to obtain dynamic data | ❌ Avoid |
| fetch | Fetch requests | Modern AJAX requests | Unable to obtain dynamic data | ❌ Avoid |
| document | Main document | The HTML page itself | Page cannot load | ❌ Avoid |

Recommendation Level Explanation:

  • ⭐ Safe: Interception has almost no impact on data scraping or first-screen rendering; blocking by default is recommended.
  • ⚠️ Caution: May break styles, real-time functions, or cross-origin requests; requires business judgment.
  • ❌ Avoid: High probability of preventing SPA/dynamic sites from rendering or returning data; block only if you are certain these resources are not needed.

Resource Interception Example Code

import puppeteer from 'puppeteer-core';
 
const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&session_ttl=180&proxy_country=ANY';
 
async function scrapeWithResourceBlocking(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
    const page = await browser.newPage();
 
    // Enable request interception
    await page.setRequestInterception(true);
 
    // Define resource types to block
    const BLOCKED_TYPES = new Set([
        'image',
        'font',
        'media',
        'stylesheet',
    ]);
 
    // Intercept requests
    page.on('request', (request) => {
        if (BLOCKED_TYPES.has(request.resourceType())) {
            request.abort();
            console.log(`Blocked: ${request.resourceType()} - ${request.url().substring(0, 50)}...`);
        } else {
            request.continue();
        }
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithResourceBlocking('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));
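The introduction also mentions request.respond(), which the example above does not use. Some pages treat aborted requests as load errors, so answering trackers with an empty 200 stub can be gentler than request.abort(). The handler below is a hedged sketch; the URL hints are illustrative, and the attachment line is commented out because it needs a live page:

```javascript
// Hypothetical handler showing request.respond(): serve an empty stub for
// tracker-like URLs instead of aborting them. The hint list is illustrative.
const STUB_HINTS = ['analytics', 'tracker', 'pixel'];

function handleRequest(request) {
    const url = request.url();
    if (request.resourceType() === 'image') {
        return request.abort(); // images are safe to drop outright
    }
    if (STUB_HINTS.some((hint) => url.includes(hint))) {
        // Serve a harmless empty script so the page sees a "successful" response
        return request.respond({
            status: 200,
            contentType: 'application/javascript',
            body: '',
        });
    }
    return request.continue();
}

// Attach after `await page.setRequestInterception(true)`:
// page.on('request', handleRequest);
```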

Optimization Scheme 2: Request URL Interception

In addition to intercepting by resource type, more granular interception control can be performed based on URL characteristics. This is particularly effective for blocking ads, analytics scripts, and other unnecessary third-party requests.

URL Interception Strategies

  1. Intercept by domain: Block all requests from a specific domain
  2. Intercept by path: Block requests from a specific path
  3. Intercept by file type: Block files with specific extensions
  4. Intercept by keyword: Block requests whose URLs contain specific keywords
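The four strategies above can be combined into a single predicate that the interception handler calls per request. A minimal sketch, in which the specific domains, paths, extensions, and keywords are illustrative examples:

```javascript
// Hypothetical blocklists; tune them to the target site.
const BLOCKED_DOMAINS = ['doubleclick.net', 'google-analytics.com'];
const BLOCKED_PATHS = ['/ads/', '/tracking/'];
const BLOCKED_EXTENSIONS = ['.mp4', '.webm', '.mp3'];
const BLOCKED_KEYWORDS = ['beacon', 'pixel'];

function shouldBlockUrl(rawUrl) {
    const url = new URL(rawUrl);
    return (
        BLOCKED_DOMAINS.some((d) => url.hostname.endsWith(d)) ||        // 1. by domain
        BLOCKED_PATHS.some((p) => url.pathname.includes(p)) ||          // 2. by path
        BLOCKED_EXTENSIONS.some((ext) => url.pathname.endsWith(ext)) || // 3. by file type
        BLOCKED_KEYWORDS.some((k) => rawUrl.includes(k))                // 4. by keyword
    );
}

// In the interception handler:
// page.on('request', (request) =>
//     shouldBlockUrl(request.url()) ? request.abort() : request.continue());
```

Matching domains against `url.hostname` rather than the whole URL string avoids false positives, e.g. a page path that merely mentions a blocked domain.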

Common Interceptable URL Patterns

| URL Pattern | Description | Example | Recommendation |
|---|---|---|---|
| Advertising services | Advertising network domains | ad.doubleclick.net, googleadservices.com | ⭐ Safe |
| Analytics services | Statistics and analytics scripts | google-analytics.com, hotjar.com | ⭐ Safe |
| Social media plugins | Social sharing buttons, etc. | platform.twitter.com, connect.facebook.net | ⭐ Safe |
| Tracking pixels | Pixels that track user behavior | URLs containing pixel, beacon, tracker | ⭐ Safe |
| Large media files | Large video and audio files | Extensions like .mp4, .webm, .mp3 | ⭐ Safe |
| Font services | Online font services | fonts.googleapis.com, use.typekit.net | ⭐ Safe |
| CDN resources | Static resource CDNs | cdn.jsdelivr.net, unpkg.com | ⚠️ Caution |

URL Interception Example Code

import puppeteer from 'puppeteer-core';
 
const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&session_ttl=180&proxy_country=ANY';
 
async function scrapeWithUrlBlocking(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
    const page = await browser.newPage();
 
    // Enable request interception
    await page.setRequestInterception(true);
 
    // Define domains and URL patterns to block
    const BLOCKED_DOMAINS = [
        'google-analytics.com',
        'googletagmanager.com',
        'doubleclick.net',
        'facebook.net',
        'twitter.com',
        'linkedin.com',
        'adservice.google.com',
    ];
 
    const BLOCKED_PATHS = [
        '/ads/',
        '/analytics/',
        '/pixel/',
        '/tracking/',
        '/stats/',
    ];
 
    // Intercept requests
    page.on('request', (request) => {
        const url = request.url();
 
        // Check domain
        if (BLOCKED_DOMAINS.some(domain => url.includes(domain))) {
            request.abort();
            console.log(`Blocked domain: ${url.substring(0, 50)}...`);
            return;
        }
 
        // Check path
        if (BLOCKED_PATHS.some(path => url.includes(path))) {
            request.abort();
            console.log(`Blocked path: ${url.substring(0, 50)}...`);
            return;
        }
 
        // Allow other requests
        request.continue();
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithUrlBlocking('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));

Optimization Scheme 3: Simulate Mobile Devices

Simulating mobile devices is another effective traffic optimization strategy because mobile websites usually provide lighter page content.

Advantages of Mobile Device Simulation

  1. Lighter page versions: Many websites provide more concise content for mobile devices
  2. Smaller image resources: Mobile versions usually load smaller images
  3. Simplified CSS and JavaScript: Mobile versions usually use simplified styles and scripts
  4. Reduced ads and non-core content: Mobile versions often remove some non-core functionality
  5. Adaptive response: Obtain content layouts optimized for small screens

Mobile Device Simulation Configuration

Here are the configuration parameters for several commonly used mobile devices:

const iPhoneX = {
    // page.emulate() expects both a userAgent and a viewport
    userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    viewport: {
        width: 375,
        height: 812,
        deviceScaleFactor: 3,
        isMobile: true,
        hasTouch: true,
        isLandscape: false
    }
};

Alternatively, use Puppeteer's built-in device descriptors to simulate a mobile device:

import puppeteer, { KnownDevices } from 'puppeteer-core';
const iPhone = KnownDevices['iPhone 15 Pro'];
 
const browser = await puppeteer.connect({
    browserWSEndpoint: scrapelessUrl
});
const page = await browser.newPage();
await page.emulate(iPhone);

Mobile Device Simulation Example Code

import puppeteer, {KnownDevices} from 'puppeteer-core';
 
const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&session_ttl=180&proxy_country=ANY';
 
async function scrapeWithMobileEmulation(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
 
    const page = await browser.newPage();
 
    // Set mobile device simulation
    const iPhone = KnownDevices['iPhone 15 Pro'];
    await page.emulate(iPhone);
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithMobileEmulation('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));

Comprehensive Optimization Example

Here is a comprehensive example combining all optimization schemes:

import puppeteer, {KnownDevices} from 'puppeteer-core';
 
const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&session_ttl=180&proxy_country=ANY';
 
async function optimizedScraping(url) {
    console.log(`Starting optimized scraping: ${url}`);
 
    // Record traffic usage
    let totalBytesUsed = 0;
 
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
 
    const page = await browser.newPage();
 
    // Set mobile device simulation
    const iPhone = KnownDevices['iPhone 15 Pro'];
    await page.emulate(iPhone);
 
    // Set request interception
    await page.setRequestInterception(true);
 
    // Define resource types to block
    const BLOCKED_TYPES = [
        'image',
        'media',
        'font'
    ];
 
    // Define domains to block
    const BLOCKED_DOMAINS = [
        'google-analytics.com',
        'googletagmanager.com',
        'facebook.net',
        'doubleclick.net',
        'adservice.google.com'
    ];
 
    // Define URL paths to block
    const BLOCKED_PATHS = [
        '/ads/',
        '/analytics/',
        '/tracking/'
    ];
 
    // Intercept requests
    page.on('request', (request) => {
        const url = request.url();
        const resourceType = request.resourceType();
 
        // Check resource type
        if (BLOCKED_TYPES.includes(resourceType)) {
            console.log(`Blocked resource type: ${resourceType} - ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check domain
        if (BLOCKED_DOMAINS.some(domain => url.includes(domain))) {
            console.log(`Blocked domain: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check path
        if (BLOCKED_PATHS.some(path => url.includes(path))) {
            console.log(`Blocked path: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Allow other requests
        request.continue();
    });
 
    // Monitor network traffic (responses without a Content-Length header,
    // e.g. chunked transfers, are counted as 0 here)
    page.on('response', async (response) => {
        const headers = response.headers();
        const contentLength = headers['content-length'] ? parseInt(headers['content-length'], 10) : 0;
        totalBytesUsed += contentLength;
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Simulate scrolling to trigger lazy-loading content
    await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight);
    });
 
    await new Promise(resolve => setTimeout(resolve, 1000))
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000),
            links: Array.from(document.querySelectorAll('a')).slice(0, 10).map(a => ({
                text: a.innerText,
                href: a.href
            }))
        };
    });
 
    // Output traffic usage statistics
    console.log(`\nTraffic Usage Statistics:`);
    console.log(`Used: ${(totalBytesUsed / 1024 / 1024).toFixed(2)} MB`);
 
    await browser.close();
    return data;
}
 
// Usage
optimizedScraping('https://www.scrapeless.com')
    .then(data => console.log('Scraping complete:', data))
    .catch(error => console.error('Scraping failed:', error));
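The Content-Length-based accounting above undercounts chunked or header-less responses. A more accurate meter can read `encodedDataLength` (the compressed, on-the-wire byte count) from the Chrome DevTools Protocol's `Network.loadingFinished` event. A sketch, with the CDP attachment commented out because it requires a live page:

```javascript
// Small traffic meter; the accumulator itself is testable without a browser.
function makeTrafficMeter() {
    let totalBytes = 0;
    return {
        add(bytes) { totalBytes += bytes; },
        totalMB() { return (totalBytes / 1024 / 1024).toFixed(2); },
    };
}

// Attach it to an existing Puppeteer `page` via CDP:
// const meter = makeTrafficMeter();
// const client = await page.createCDPSession();
// await client.send('Network.enable');
// client.on('Network.loadingFinished', (event) => meter.add(event.encodedDataLength));
// ...after scraping: console.log(`Used: ${meter.totalMB()} MB`);
```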

Optimization Comparison

To compare traffic before and after optimization, we remove the optimization code from the comprehensive example and run it again. Here is the unoptimized example code:

import puppeteer from 'puppeteer-core';
 
const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&session_ttl=180&proxy_country=ANY';
 
async function unoptimizedScraping(url) {
  console.log(`Starting unoptimized scraping: ${url}`);
 
  // Record traffic usage
  let totalBytesUsed = 0;
 
  const browser = await puppeteer.connect({
    browserWSEndpoint: scrapelessUrl,
    defaultViewport: null
  });
 
  const page = await browser.newPage();
 
  // Set request interception
  await page.setRequestInterception(true);
 
  // Allow all requests through (no blocking)
  page.on('request', (request) => {
    request.continue();
  });
 
  // Monitor network traffic
  page.on('response', async (response) => {
    const headers = response.headers();
    const contentLength = headers['content-length'] ? parseInt(headers['content-length'], 10) : 0;
    totalBytesUsed += contentLength;
  });
 
  await page.goto(url, {waitUntil: 'domcontentloaded'});
 
  // Simulate scrolling to trigger lazy-loading content
  await page.evaluate(() => {
    window.scrollBy(0, window.innerHeight);
  });
 
  await new Promise(resolve => setTimeout(resolve, 1000));
 
  // Extract data
  const data = await page.evaluate(() => {
    return {
      title: document.title,
      content: document.body.innerText.substring(0, 1000),
      links: Array.from(document.querySelectorAll('a')).slice(0, 10).map(a => ({
        text: a.innerText,
        href: a.href
      }))
    };
  });
 
  // Output traffic usage statistics
  console.log(`\nTraffic Usage Statistics:`);
  console.log(`Used: ${(totalBytesUsed / 1024 / 1024).toFixed(2)} MB`);
 
  await browser.close();
  return data;
}
 
// Usage
unoptimizedScraping('https://www.scrapeless.com')
  .then(data => console.log('Scraping complete:', data))
  .catch(error => console.error('Scraping failed:', error));

Running both versions and comparing the printed statistics shows the traffic difference clearly:

| Scenario | Traffic Used (MB) | Saving Ratio |
|---|---|---|
| Unoptimized | 6.03 | – |
| Optimized | 0.81 | ≈ 86.6% |

The saving ratio is (6.03 − 0.81) / 6.03 ≈ 86.6%.

By combining the above optimization schemes, you can significantly reduce proxy traffic consumption and improve scraping efficiency while still obtaining the required core content.