Cost Optimization

Introduction

When you scrape data with Puppeteer, traffic consumption is an important factor to consider. This is especially true with proxy services, where traffic costs can rise significantly. To optimize traffic usage, we can adopt the following strategies:

Resource interception: Reduce traffic consumption by blocking unnecessary resource requests.
Request URL interception: Reduce traffic further by blocking specific requests based on URL characteristics.
Mobile device emulation: Use a mobile device configuration to obtain lighter versions of pages.
Comprehensive optimization: Combine the methods above to achieve the best result.

Optimization Scheme 1: Resource Interception

Introduction to resource interception

In Puppeteer, page.setRequestInterception(true) lets you capture every network request the browser initiates and decide whether to continue it (request.continue()), abort it (request.abort()), or supply a custom response (request.respond()).

This approach can significantly reduce bandwidth consumption and is especially suited to data scraping, screenshots, and performance optimization.
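
As a minimal sketch of the third option, a handler can answer a request locally with request.respond() instead of aborting it, which some pages may tolerate better than a failed request. The handler name stubOrContinue, the STUBBED_TYPES set, and the empty 200 response are illustrative assumptions and are not part of the Scrapeless SDK.

```javascript
// Resource types to answer locally instead of fetching through the proxy (assumed choice).
const STUBBED_TYPES = new Set(['image', 'font', 'media']);

// Request handler sketch: fulfil heavy resources with an empty response
// and let every other request through unchanged.
function stubOrContinue(request) {
    // Skip requests that another handler has already resolved.
    if (request.isInterceptResolutionHandled()) return 'skipped';

    if (STUBBED_TYPES.has(request.resourceType())) {
        // Respond locally, so the resource is not downloaded.
        request.respond({ status: 200, contentType: 'text/plain', body: '' });
        return 'stubbed';
    }

    request.continue();
    return 'continued';
}

// Usage, after `await page.setRequestInterception(true)`:
// page.on('request', stubOrContinue);
```

The return value is only there to make the handler easy to test; Puppeteer ignores it.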

Interceptable resource types and recommendations

Resource type	Description	Example	Impact after blocking	Recommendation
`image`	Image resources	JPG/PNG/GIF/WebP images	Images are not displayed	⭐ Safe
`font`	Font files	TTF/WOFF/WOFF2 fonts	System default fonts are used instead	⭐ Safe
`media`	Media files	Video/audio files	Media content cannot be played	⭐ Safe
`manifest`	Web App Manifest	PWA configuration file	PWA functionality may be affected	⭐ Safe
`prefetch`	Prefetch resources	`<link rel="prefetch">`	Minimal impact on the page	⭐ Safe
`stylesheet`	CSS stylesheets	External CSS files	Page styles are lost, which may affect layout	⚠️ Caution
`websocket`	WebSocket	Real-time communication connections	Real-time functionality is disabled	⚠️ Caution
`eventsource`	Server-Sent Events	Data pushed from the server	Push functionality is disabled	⚠️ Caution
`preflight`	CORS preflight requests	OPTIONS requests	Cross-origin requests fail	⚠️ Caution
`script`	JavaScript scripts	External JS files	Dynamic functionality is disabled; SPAs may not render	❌ Avoid
`xhr`	XHR requests	AJAX data requests	Dynamic data cannot be retrieved	❌ Avoid
`fetch`	Fetch requests	Modern AJAX requests	Dynamic data cannot be retrieved	❌ Avoid
`document`	Main document	The main HTML page	The page cannot load	❌ Avoid

Recommendation levels explained:

⭐ Safe: Blocking has almost no effect on data scraping or first-screen rendering; block by default.
⚠️ Caution: Blocking may break styling, real-time functionality, or cross-origin requests; evaluate this against your own requirements.
❌ Avoid: Blocking is very likely to prevent SPAs and dynamic pages from rendering or retrieving data properly; only block these when you are certain the resources are not needed.
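
One way to keep these levels explicit in code is to derive the block list from the table instead of hard-coding it. The sketch below is only an illustration: the names RECOMMENDATION and buildBlockedTypes are hypothetical, and the level assigned to each type is copied from the table above.

```javascript
// Recommendation level per resource type, copied from the table above.
const RECOMMENDATION = {
    image: 'safe', font: 'safe', media: 'safe', manifest: 'safe', prefetch: 'safe',
    stylesheet: 'caution', websocket: 'caution', eventsource: 'caution', preflight: 'caution',
    script: 'avoid', xhr: 'avoid', fetch: 'avoid', document: 'avoid',
};

// Build the set of resource types to block.
// Every "safe" type is included; `extra` lets the caller opt in to specific
// "caution" types after testing them against the target site.
function buildBlockedTypes(extra = []) {
    const blocked = new Set(
        Object.keys(RECOMMENDATION).filter((type) => RECOMMENDATION[type] === 'safe')
    );
    for (const type of extra) {
        if (RECOMMENDATION[type] === 'avoid') {
            throw new Error(`Refusing to block "${type}": it is marked as avoid`);
        }
        blocked.add(type);
    }
    return blocked;
}

// Usage: block all safe types plus stylesheets.
// const BLOCKED_TYPES = buildBlockedTypes(['stylesheet']);
```

The resulting Set can be used in a request handler through BLOCKED_TYPES.has(request.resourceType()).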

Resource interception code example

import { Scrapeless } from '@scrapeless-ai/sdk';
import puppeteer from 'puppeteer-core';
 
const client = new Scrapeless({ apiKey: 'API Key' });
 
const { browserWSEndpoint } = client.browser.create({
    sessionName: 'sdk_test',
    sessionTTL: 180,
    proxyCountry: 'ANY',
    sessionRecording: true,
    // fingerprint, // optional: uncomment once you have defined a fingerprint configuration object
});
 
async function scrapeWithResourceBlocking(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint,
        defaultViewport: null
    });
    const page = await browser.newPage();
 
    // Enable request interception
    await page.setRequestInterception(true);
 
    // Define resource types to block
    const BLOCKED_TYPES = new Set([
        'image',
        'font',
        'media',
        'stylesheet',
    ]);
 
    // Intercept requests
    page.on('request', (request) => {
        if (BLOCKED_TYPES.has(request.resourceType())) {
            request.abort();
            console.log(`Blocked: ${request.resourceType()} - ${request.url().substring(0, 50)}...`);
        } else {
            request.continue();
        }
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithResourceBlocking('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));

Optimization Scheme 2: Request URL Interception

Besides blocking by resource type, you can apply finer-grained interception based on URL characteristics. This is particularly effective for blocking ads, analytics scripts, and other unnecessary third-party requests.

URL interception strategies

Block by domain: Block all requests to a specific domain
Block by path: Block requests to a specific path
Block by file type: Block files with specific extensions
Block by keyword: Block requests whose URL contains specific keywords
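
The four strategies can be sketched as a single matcher function. This is a minimal illustration, not the SDK's behavior: the RULES values and the name matchBlockRule are assumptions. It compares the parsed hostname rather than searching the whole URL string, which avoids false positives when a blocked domain only appears in a query parameter.

```javascript
// Example rules; every value here is an illustrative assumption.
const RULES = {
    domains: ['doubleclick.net', 'google-analytics.com'],
    paths: ['/ads/', '/tracking/'],
    extensions: ['.mp4', '.webm', '.mp3'],
    keywords: ['pixel', 'beacon'],
};

// Return the name of the first matching strategy, or null if the URL is allowed.
function matchBlockRule(rawUrl, rules = RULES) {
    let url;
    try {
        url = new URL(rawUrl);
    } catch (e) {
        return null; // Not a parseable URL: leave the decision to the caller.
    }
    const host = url.hostname.toLowerCase();
    const path = url.pathname.toLowerCase();

    // Domain: the exact host or any subdomain of a listed domain.
    if (rules.domains.some((d) => host === d || host.endsWith(`.${d}`))) return 'domain';
    // Path: a listed segment appears in the path.
    if (rules.paths.some((p) => path.includes(p))) return 'path';
    // File type: the path ends with a listed extension.
    if (rules.extensions.some((ext) => path.endsWith(ext))) return 'extension';
    // Keyword: a listed keyword appears anywhere in the URL.
    if (rules.keywords.some((k) => url.href.toLowerCase().includes(k))) return 'keyword';
    return null;
}

// Usage inside a request handler:
// matchBlockRule(request.url()) ? request.abort() : request.continue();
```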

Commonly blocked URL patterns

URL pattern	Description	Example	Recommendation
Advertising services	Ad network domains	`ad.doubleclick.net`, `googleadservices.com`	⭐ Safe
Analytics services	Statistics and analytics scripts	`google-analytics.com`, `hotjar.com`	⭐ Safe
Social media plugins	Social sharing buttons and similar widgets	`platform.twitter.com`, `connect.facebook.net`	⭐ Safe
Tracking pixels	Pixels that track user behavior	URLs containing `pixel`, `beacon`, `tracker`	⭐ Safe
Large media files	Large video and audio files	Extensions such as `.mp4`, `.webm`, `.mp3`	⭐ Safe
Font services	Online font services	`fonts.googleapis.com`, `use.typekit.net`	⭐ Safe
CDN resources	Static resource CDNs	`cdn.jsdelivr.net`, `unpkg.com`	⚠️ Caution

URL interception code example

import puppeteer from 'puppeteer-core';
import { Scrapeless } from '@scrapeless-ai/sdk';
 
const client = new Scrapeless({ apiKey: 'API Key' });
 
const { browserWSEndpoint } = client.browser.create({
    sessionName: 'sdk_test',
    sessionTTL: 180,
    proxyCountry: 'ANY',
    sessionRecording: true,
    // fingerprint, // optional: uncomment once you have defined a fingerprint configuration object
});
 
async function scrapeWithUrlBlocking(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint,
        defaultViewport: null
    });
    const page = await browser.newPage();
 
    // Enable request interception
    await page.setRequestInterception(true);
 
    // Define domains and URL patterns to block
    const BLOCKED_DOMAINS = [
        'google-analytics.com',
        'googletagmanager.com',
        'doubleclick.net',
        'facebook.net',
        'twitter.com',
        'linkedin.com',
        'adservice.google.com',
    ];
 
    const BLOCKED_PATHS = [
        '/ads/',
        '/analytics/',
        '/pixel/',
        '/tracking/',
        '/stats/',
    ];
 
    // Intercept requests
    page.on('request', (request) => {
        const url = request.url();
 
        // Check domain
        if (BLOCKED_DOMAINS.some(domain => url.includes(domain))) {
            request.abort();
            console.log(`Blocked domain: ${url.substring(0, 50)}...`);
            return;
        }
 
        // Check path
        if (BLOCKED_PATHS.some(path => url.includes(path))) {
            request.abort();
            console.log(`Blocked path: ${url.substring(0, 50)}...`);
            return;
        }
 
        // Allow other requests
        request.continue();
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithUrlBlocking('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));

Optimization Scheme 3: Mobile Device Emulation

Mobile device emulation is another effective traffic optimization strategy, because mobile sites usually serve lighter page content.

Advantages of mobile device emulation

Lighter page versions: Many websites serve more concise content to mobile devices
Smaller image resources: Mobile versions usually load smaller images
Simpler CSS and JavaScript: Mobile versions usually use simpler styles and scripts
Fewer ads and less non-core content: Mobile versions often drop some non-core functionality
Responsive layout: You get content layouts optimized for small screens

Mobile device emulation configuration

Below is an example configuration for a commonly used mobile device:

const iPhoneX = {
    viewport: {
        width: 375,
        height: 812,
        deviceScaleFactor: 3,
        isMobile: true,
        hasTouch: true,
        isLandscape: false
    }
};
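
A viewport object like this only takes effect once it is applied to a page, and many sites choose the mobile variant based on the user agent rather than the viewport size. The sketch below shows one way to apply both with page.setUserAgent() and page.setViewport(). The helper name applyMobileProfile and the user-agent string are assumed example values, not something this document or the SDK prescribes.

```javascript
// Hypothetical profile: the viewport matches the iPhoneX object above;
// the user-agent string is an assumed example value.
const mobileProfile = {
    userAgent:
        'Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 ' +
        '(KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1',
    viewport: {
        width: 375,
        height: 812,
        deviceScaleFactor: 3,
        isMobile: true,
        hasTouch: true,
        isLandscape: false,
    },
};

// Apply the profile before navigating so the first request is already sent as mobile.
async function applyMobileProfile(page, profile = mobileProfile) {
    await page.setUserAgent(profile.userAgent);
    await page.setViewport(profile.viewport);
}

// Usage:
// const page = await browser.newPage();
// await applyMobileProfile(page);
// await page.goto(url, { waitUntil: 'domcontentloaded' });
```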

Alternatively, use Puppeteer's built-in device descriptors to emulate a mobile device directly:

import puppeteer, { KnownDevices } from 'puppeteer-core';
const iPhone = KnownDevices['iPhone 15 Pro'];
 
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.emulate(iPhone);

Mobile device emulation code example

import puppeteer, { KnownDevices } from 'puppeteer-core';
import { Scrapeless } from '@scrapeless-ai/sdk';
 
const client = new Scrapeless({ apiKey: 'API Key' });
 
const { browserWSEndpoint } = client.browser.create({
    sessionName: 'sdk_test',
    sessionTTL: 180,
    proxyCountry: 'ANY',
    sessionRecording: true,
    // fingerprint, // optional: uncomment once you have defined a fingerprint configuration object
});
 
async function scrapeWithMobileEmulation(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint,
        defaultViewport: null
    });
 
    const page = await browser.newPage();
 
    // Set mobile device simulation
    const iPhone = KnownDevices['iPhone 15 Pro'];
    await page.emulate(iPhone);
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });
 
    await browser.close();
    return data;
}
 
// Usage
scrapeWithMobileEmulation('https://www.scrapeless.com')
    .then(data => console.log('Scraping result:', data))
    .catch(error => console.error('Scraping failed:', error));

Comprehensive optimization example

Here is a comprehensive example that combines all of the optimization schemes above:

import puppeteer, {KnownDevices} from 'puppeteer-core';
import { Scrapeless } from '@scrapeless-ai/sdk';
 
const client = new Scrapeless({ apiKey: 'API Key' });
 
const { browserWSEndpoint } = client.browser.create({
    sessionName: 'sdk_test',
    sessionTTL: 180,
    proxyCountry: 'ANY',
    sessionRecording: true,
    // fingerprint, // optional: uncomment once you have defined a fingerprint configuration object
});
 
async function optimizedScraping(url) {
    console.log(`Starting optimized scraping: ${url}`);
 
    // Record traffic usage
    let totalBytesUsed = 0;
 
    const browser = await puppeteer.connect({
        browserWSEndpoint,
        defaultViewport: null
    });
 
    const page = await browser.newPage();
 
    // Set mobile device simulation
    const iPhone = KnownDevices['iPhone 15 Pro'];
    await page.emulate(iPhone);
 
    // Set request interception
    await page.setRequestInterception(true);
 
    // Define resource types to block
    const BLOCKED_TYPES = [
        'image',
        'media',
        'font'
    ];
 
    // Define domains to block
    const BLOCKED_DOMAINS = [
        'google-analytics.com',
        'googletagmanager.com',
        'facebook.net',
        'doubleclick.net',
        'adservice.google.com'
    ];
 
    // Define URL paths to block
    const BLOCKED_PATHS = [
        '/ads/',
        '/analytics/',
        '/tracking/'
    ];
 
    // Intercept requests
    page.on('request', (request) => {
        const url = request.url();
        const resourceType = request.resourceType();
 
        // Check resource type
        if (BLOCKED_TYPES.includes(resourceType)) {
            console.log(`Blocked resource type: ${resourceType} - ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check domain
        if (BLOCKED_DOMAINS.some(domain => url.includes(domain))) {
            console.log(`Blocked domain: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check path
        if (BLOCKED_PATHS.some(path => url.includes(path))) {
            console.log(`Blocked path: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Allow other requests
        request.continue();
    });
 
    // Monitor network traffic
    page.on('response', async (response) => {
        const headers = response.headers();
        const contentLength = headers['content-length'] ? parseInt(headers['content-length'], 10) : 0;
        totalBytesUsed += contentLength;
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Simulate scrolling to trigger lazy-loading content
    await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight);
    });
 
    await new Promise(resolve => setTimeout(resolve, 1000))
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000),
            links: Array.from(document.querySelectorAll('a')).slice(0, 10).map(a => ({
                text: a.innerText,
                href: a.href
            }))
        };
    });
 
    // Output traffic usage statistics
    console.log(`\nTraffic Usage Statistics:`);
    console.log(`Used: ${(totalBytesUsed / 1024 / 1024).toFixed(2)} MB`);
 
    await browser.close();
    return data;
}
 
// Usage
optimizedScraping('https://www.scrapeless.com')
    .then(data => console.log('Scraping complete:', data))
    .catch(error => console.error('Scraping failed:', error));

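The same example can also connect through the Scrapeless WebSocket URL directly, without creating the session through the SDK:

import puppeteer, {KnownDevices} from 'puppeteer-core';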
 
const scrapelessUrl = 'wss://browser.scrapeless.com/api/v2/browser?token=your_api_key&sessionTTL=180&proxyCountry=ANY';
 
async function optimizedScraping(url) {
    console.log(`Starting optimized scraping: ${url}`);
 
    // Record traffic usage
    let totalBytesUsed = 0;
 
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
 
    const page = await browser.newPage();
 
    // Set mobile device simulation
    const iPhone = KnownDevices['iPhone 15 Pro'];
    await page.emulate(iPhone);
 
    // Set request interception
    await page.setRequestInterception(true);
 
    // Define resource types to block
    const BLOCKED_TYPES = [
        'image',
        'media',
        'font'
    ];
 
    // Define domains to block
    const BLOCKED_DOMAINS = [
        'google-analytics.com',
        'googletagmanager.com',
        'facebook.net',
        'doubleclick.net',
        'adservice.google.com'
    ];
 
    // Define URL paths to block
    const BLOCKED_PATHS = [
        '/ads/',
        '/analytics/',
        '/tracking/'
    ];
 
    // Intercept requests
    page.on('request', (request) => {
        const url = request.url();
        const resourceType = request.resourceType();
 
        // Check resource type
        if (BLOCKED_TYPES.includes(resourceType)) {
            console.log(`Blocked resource type: ${resourceType} - ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check domain
        if (BLOCKED_DOMAINS.some(domain => url.includes(domain))) {
            console.log(`Blocked domain: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Check path
        if (BLOCKED_PATHS.some(path => url.includes(path))) {
            console.log(`Blocked path: ${url.substring(0, 50)}...`);
            request.abort();
            return;
        }
 
        // Allow other requests
        request.continue();
    });
 
    // Monitor network traffic
    page.on('response', async (response) => {
        const headers = response.headers();
        const contentLength = headers['content-length'] ? parseInt(headers['content-length'], 10) : 0;
        totalBytesUsed += contentLength;
    });
 
    await page.goto(url, {waitUntil: 'domcontentloaded'});
 
    // Simulate scrolling to trigger lazy-loading content
    await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight);
    });
 
    await new Promise(resolve => setTimeout(resolve, 1000))
 
    // Extract data
    const data = await page.evaluate(() => {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000),
            links: Array.from(document.querySelectorAll('a')).slice(0, 10).map(a => ({
                text: a.innerText,
                href: a.href
            }))
        };
    });
 
    // Output traffic usage statistics
    console.log(`\nTraffic Usage Statistics:`);
    console.log(`Used: ${(totalBytesUsed / 1024 / 1024).toFixed(2)} MB`);
 
    await browser.close();
    return data;
}
 
// Usage
optimizedScraping('https://www.scrapeless.com')
    .then(data => console.log('Scraping complete:', data))
    .catch(error => console.error('Scraping failed:', error));

Optimization comparison

Let's remove the optimization code from the comprehensive example to compare traffic before and after optimization. Below is the unoptimized code example:

import puppeteer from 'puppeteer-core';
import { Scrapeless } from '@scrapeless-ai/sdk';
 
const client = new Scrapeless({ apiKey: 'API Key' });
 
const { browserWSEndpoint } = client.browser.create({
    sessionName: 'sdk_test',
    sessionTTL: 180,
    proxyCountry: 'ANY',
    sessionRecording: true,
    // fingerprint, // optional: uncomment once you have defined a fingerprint configuration object
});
 
async function unoptimizedScraping(url) {
  console.log(`Starting unoptimized scraping: ${url}`);
 
  // Record traffic usage
  let totalBytesUsed = 0;
 
  const browser = await puppeteer.connect({
    browserWSEndpoint,
    defaultViewport: null
  });
 
  const page = await browser.newPage();
 
  // Set request interception
  await page.setRequestInterception(true);
 
  // Intercept requests
  page.on('request', (request) => {
    request.continue();
  });
 
  // Monitor network traffic
  page.on('response', async (response) => {
    const headers = response.headers();
    const contentLength = headers['content-length'] ? parseInt(headers['content-length'], 10) : 0;
    totalBytesUsed += contentLength;
  });
 
  await page.goto(url, {waitUntil: 'domcontentloaded'});
 
  // Simulate scrolling to trigger lazy-loading content
  await page.evaluate(() => {
    window.scrollBy(0, window.innerHeight);
  });
 
  await new Promise(resolve => setTimeout(resolve, 1000))
 
  // Extract data
  const data = await page.evaluate(() => {
    return {
      title: document.title,
      content: document.body.innerText.substring(0, 1000),
      links: Array.from(document.querySelectorAll('a')).slice(0, 10).map(a => ({
        text: a.innerText,
        href: a.href
      }))
    };
  });
 
  // Output traffic usage statistics
  console.log(`\nTraffic Usage Statistics:`);
  console.log(`Used: ${(totalBytesUsed / 1024 / 1024).toFixed(2)} MB`);
 
  await browser.close();
  return data;
}
 
// Usage
unoptimizedScraping('https://www.scrapeless.com')
  .then(data => console.log('Scraping complete:', data))
  .catch(error => console.error('Scraping failed:', error));
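
Both scripts estimate traffic from the Content-Length response header, which servers may omit, for example with chunked transfer encoding. As a hedged sketch, the counter can fall back to the length of the response body when the header is missing. The helper name responseSize is hypothetical, and the total remains an estimate: headers are not counted, and the fallback measures the decoded body rather than the bytes sent over the wire.

```javascript
// Estimate the size of one response in bytes.
async function responseSize(response) {
    const declared = Number.parseInt(response.headers()['content-length'], 10);
    if (Number.isFinite(declared)) return declared;

    try {
        // Fallback: read the body. This can throw, for example for redirects,
        // so a failure is counted as zero bytes.
        const body = await response.buffer();
        return body.length;
    } catch (e) {
        return 0;
    }
}

// Usage inside either script:
// page.on('response', async (response) => {
//     totalBytesUsed += await responseSize(response);
// });
```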

After running the unoptimized code, the printed output makes the difference in traffic directly visible:

Scenario	Traffic used (MB)	Savings
Unoptimized	6.03	-
Optimized	0.81	≈ 86.6%
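
The savings figure follows directly from the two measurements in the table: (6.03 - 0.81) / 6.03. A small helper, named savingsPercent here purely for illustration, makes the same calculation repeatable for your own runs.

```javascript
// Percentage of traffic saved relative to the unoptimized run.
function savingsPercent(beforeMb, afterMb) {
    if (!(beforeMb > 0)) throw new RangeError('beforeMb must be a positive number');
    return ((beforeMb - afterMb) / beforeMb) * 100;
}

console.log(`${savingsPercent(6.03, 0.81).toFixed(1)}%`); // prints "86.6%"
```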

By combining the optimization schemes above, you can significantly reduce proxy traffic consumption and improve scraping efficiency while still retrieving the core content you need.
