Crawl Quickstart

Getting Started

Crawl Single Page

The Crawl API lets you get the data you want from web pages with a single call. You can scrape page content and capture it in various formats.

Scrapeless exposes endpoints for starting a scrape request and for getting its status and results. By default, scraping is handled asynchronously: you start the job, then check its status until it is completed. However, our SDKs provide a simple function that handles the whole flow and returns the data once the job is completed.
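If you prefer to manage that flow yourself, the pattern is: start the job, then poll for its status. The sketch below illustrates the idea only; the endpoint paths, auth header, and response fields are assumed placeholders, not the documented API, so consult the API Reference for the actual routes and payloads.

// Illustrative sketch of the manual start-then-poll flow.
// The endpoint paths, auth header, and response fields below are assumed
// placeholders, not the documented API -- see the API Reference.
const API_BASE = "https://api.scrapeless.com"; // assumed base URL
const API_KEY = "your-api-key";
 
async function scrapeManually(url: string) {
  // 1. Start the scrape job.
  const startRes = await fetch(`${API_BASE}/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-token": API_KEY },
    body: JSON.stringify({ url }),
  });
  const { jobId } = await startRes.json(); // assumed response shape
 
  // 2. Poll until the job reports completion.
  while (true) {
    const statusRes = await fetch(`${API_BASE}/scrape/${jobId}`, {
      headers: { "x-api-token": API_KEY },
    });
    const job = await statusRes.json();
    if (job.status === "completed") return job.data;
    if (job.status === "failed") throw new Error("Scrape job failed");
    await new Promise((resolve) => setTimeout(resolve, 2000)); // back off before polling again
  }
}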

Installation

npm install @scrapeless-ai/sdk
 
pnpm add @scrapeless-ai/sdk
 

Usage

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.scrapeUrl(
    "https://example.com"
  );
 
  console.log(result);
})();
 

Browser Configurations

You can also configure the browser session used to execute the scrape job, for example by routing it through a proxy.

Scrapeless automatically handles common CAPTCHA types, including reCAPTCHA v2 and Cloudflare Turnstile/Challenge.

No additional setup is required; Scrapeless takes care of it during scraping. 👉 For more details, check out the Captcha Solving documentation.

To see all available browser parameters, check out the API Reference or Browser Parameters.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.scrapeUrl(
    "https://example.com",
    {
      browserOptions: {
        proxy_country: "ANY",
        session_name: "Crawl",
        session_recording: true,
        session_ttl: 900,
      },
    }
  );
 
  console.log(result);
})();
 

Scrape Configurations

You can also specify optional parameters for the scrape job, such as response formats, main-content-only extraction, a maximum page navigation timeout, and more.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.scrapeUrl(
    "https://example.com",
    {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 15000,
    }
  );
 
  console.log(result);
})();
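Continuing from the example above, the fields returned on the result correspond to the formats you requested. Assuming the scrape result mirrors the per-page objects of the crawl response shown later in this guide (markdown, html, links, plus metadata), consuming it might look like this:

// Illustrative only: field names assume the scrape result exposes one
// property per requested format plus page metadata, mirroring the per-page
// objects in the crawl response shown later in this guide.
const page = result as any;
console.log(page.metadata?.title);         // page title, if present
console.log(page.markdown?.slice(0, 200)); // first 200 characters of markdown
console.log(page.links?.length);           // number of extracted links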
 

For a full reference on the scrape endpoint, check out the API Reference.

Batch Scrape

Batch Scrape works the same as regular scrape, except instead of a single URL, you provide a list of URLs to scrape at once.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.batchScrapeUrls(
    ["https://example.com", "https://scrapeless.com"],
    {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 15000,
      browserOptions: {
        proxy_country: "ANY",
        session_name: "Crawl",
        session_recording: true,
        session_ttl: 900,
      },
    }
  );
 
  console.log(result);
})();
 

Crawl Subpages

Crawl a website and its linked pages to extract comprehensive data. For detailed usage, check out the Crawl API Reference.

By default, the crawl skips sublinks that aren’t part of the URL hierarchy you specify. For example, crawling https://example.com/products/ wouldn’t capture pages under https://example.com/promotions/deal-567. To include such links, enable the allowBackwardLinks parameter.
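A minimal sketch of enabling it follows; this assumes allowBackwardLinks is accepted as a top-level crawl option alongside limit, so check the Crawl API Reference for its exact placement.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
const client = new ScrapingCrawl({
  apiKey: "your-api-key",
});
 
(async () => {
  // Sketch only: assumes allowBackwardLinks sits alongside the other
  // top-level crawl options such as limit.
  const result = await client.crawlUrl("https://example.com/products/", {
    limit: 10,
    allowBackwardLinks: true, // also follow links outside the /products/ hierarchy
    scrapeOptions: {
      formats: ["markdown"],
    },
  });
 
  console.log(result);
})();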

Scrapeless exposes endpoints for starting a crawl request and getting its status and results. By default, crawling is handled asynchronously: start the job first, then check its status until completed. However, with our SDKs, we provide a simple function that handles the entire flow and returns the data once the job is finished.

Installation

npm install @scrapeless-ai/sdk
 
pnpm add @scrapeless-ai/sdk
 

Usage

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.crawlUrl(
    "https://example.com",
    {
      limit: 2,
      scrapeOptions: {
        formats: ["markdown", "html", "links"],
        onlyMainContent: false,
        timeout: 15000,
      },
      browserOptions: {
        proxy_country: "ANY",
        session_name: "Crawl",
        session_recording: true,
        session_ttl: 900,
      },
    }
  );
 
  console.log(result);
})();
 

Response

{
  "success": true,
  "status": "completed",
  "completed": 2,
  "total": 2,
  "data": [
    {
      "url": "https://example.com",
      "metadata": {
        "title": "Example Page",
        "description": "A sample webpage"
      },
      "markdown": "# Example Page\nThis is content...",
      ...
    },
    ...
  ]
}
 

Each crawled page has its own status of completed or failed and may include its own error field, so check these when processing results.
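Continuing from the crawlUrl example above, a defensive pass over the result might look like this. The sample response above omits the per-page status and error fields, so treat the field names here as assumptions and verify them against the full schema.

// Assumes the crawl response shape shown above: a `data` array of per-page
// objects, each carrying its own status and, on failure, an error field.
function logCrawlResult(result: { data: any[] }) {
  for (const page of result.data) {
    if (page.status === "failed") {
      console.warn(`Failed to crawl ${page.url}: ${page.error}`);
      continue;
    }
    console.log(`${page.url}: ${page.metadata?.title ?? "untitled"}`);
  }
}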

To see the full schema, check out the API Reference.

Browser Configurations

You can also configure the browser session used to execute the scrape job, for example by routing it through a proxy.

For a complete list of available parameters, refer to the API Reference or Browser Parameters.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.crawlUrl(
    "https://example.com",
    {
      limit: 2,
      browserOptions: {
        proxy_country: "ANY",
        session_name: "Crawl",
        session_recording: true,
        session_ttl: 900,
      },
    }
  );
 
  console.log(result);
})();
 

Scrape Configurations

You can also specify optional parameters for the scrape job, such as response formats, main-content-only extraction, a maximum page navigation timeout, and more.

import { ScrapingCrawl } from "@scrapeless-ai/sdk";
 
// Initialize the client
const client = new ScrapingCrawl({
  apiKey: "your-api-key", // Get your API key from https://scrapeless.com
});
 
(async () => {
  const result = await client.crawlUrl(
    "https://example.com",
    {
      limit: 2,
      scrapeOptions: {
        formats: ["markdown", "html", "links"],
        onlyMainContent: false,
        timeout: 15000,
      }
    }
  );
 
  console.log(result);
})();
 

For a full reference on the crawl endpoint, check out the API Reference.