Fundamentals

Learn how to create and use Actors in the Scrapeless Dashboard to perform web scraping and automation.

Actor

Create an Actor

Actors are built from Docker-based source code and run in the Scrapeless cloud. Building your own Actors is not yet open to the public; for now, Scrapeless provides custom-built Actors based on user requirements.

Step 1. Go to the Actor list and click “Create Actor” to start.

Step 2. Choose whether to pull the source code from a GitHub or GitLab repository. After you authorize access to your Git repository, the system automatically fetches the project and detects the version numbers in it. When building, you can pick any of the available versions, which gives you precise control over versioning.


Input Parameters

When creating or configuring an Actor, you can use input parameters to define environment variables such as the target site or data rules. The parameters are generated automatically from the input_schema.json file in the Git repository. The system builds a visual configuration form from the field types, names, and descriptions defined in the schema, so users can fill in the inputs when running the Actor.

[Learn more about input_schema.json format → (Insert hyperlink)]
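
As a rough illustration of how a schema can drive a form, the sketch below derives one form field per schema entry. The schema layout here (the `properties`, `type`, `title`, `description`, and `default` keys) is an assumption modeled on JSON Schema, not the documented Scrapeless format, and the widget names are made up for the example.

```python
import json

# Hypothetical input_schema.json content; the real format may differ.
SCHEMA_TEXT = """
{
  "properties": {
    "target_url": {"type": "string", "title": "Target site",
                   "description": "Page to scrape"},
    "max_pages": {"type": "integer", "title": "Max pages",
                  "description": "Stop after this many pages", "default": 10}
  }
}
"""

# Assumed mapping from schema field types to form widgets.
WIDGETS = {"string": "text box", "integer": "number box", "boolean": "checkbox"}


def build_form(schema_text):
    """Return one form-field description per schema property."""
    schema = json.loads(schema_text)
    fields = []
    for name, spec in schema["properties"].items():
        fields.append({
            "name": name,
            "label": spec.get("title", name),
            "help": spec.get("description", ""),
            "widget": WIDGETS.get(spec.get("type"), "text box"),
            "default": spec.get("default"),
        })
    return fields


form = build_form(SCHEMA_TEXT)
for field in form:
    print(f'{field["label"]}: {field["widget"]} ({field["help"]})')
```

Each field keeps its label and help text next to the widget type, which is the information a form renderer needs to display the parameter.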


Run Record

A run record is generated each time the Actor runs. Under Run Record, you can view all run records for this Actor.


Information

The Information description comes from the README file in your Git repository. When creating an Actor, Scrapeless will automatically read the README file for this repo.


Builds

An Actor can have multiple versions of its source code and related settings. Before running it, build the versions you need, then select the target one at run time.

  1. Click “Build” to start the build. In “Build Details”, you can monitor the status and logs of the version.

  2. Once the build completes, return to “Actor Details” to view the new build version.


Name & Description

Basic information about the Actor, such as its name and icon, is also fetched from the Git repository. The description can be edited.

Run Actor

Learn how to start, run, and manage Actors.

Run Options

Before starting the Actor, you can configure the runtime environment:

  • Actor Version: Select the version of the Actor to run.
  • Timeout: Set the timeout duration for the Actor in the pending state.
  • Memory: Allocate memory for the Actor’s execution.
  • Server Mode:
    • Server: The Actor runs continuously in the background and listens for incoming requests—ideal for long-running tasks.
    • Once: The Actor runs a single time and then stops, which suits one-off or fixed tasks. You can also set a maximum execution time for the Actor.
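
The options above can be pictured as a small settings object. This is only a sketch: the field names, units, and allowed values are illustrative assumptions, not the Scrapeless API.

```python
from dataclasses import dataclass


# Field names and allowed values are illustrative assumptions,
# not the Scrapeless API.
@dataclass
class RunOptions:
    version: str
    timeout_seconds: int
    memory_mb: int
    server_mode: str = "once"  # "server" or "once"

    def validate(self):
        """Return a list of problems; an empty list means the options look usable."""
        problems = []
        if not self.version:
            problems.append("version is required")
        if self.timeout_seconds <= 0:
            problems.append("timeout must be positive")
        if self.memory_mb <= 0:
            problems.append("memory must be positive")
        if self.server_mode not in ("server", "once"):
            problems.append("server_mode must be 'server' or 'once'")
        return problems


options = RunOptions(version="1.0.0", timeout_seconds=300, memory_mb=1024)
print(options.validate())
```

Checking the options before a run starts surfaces a bad value early, instead of partway through execution.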

Starting Actor

You can start an Actor in two ways:

  • Manual Start: Click “Start” to launch the Actor manually.
  • Scheduler: Set up a scheduled task to execute the Actor automatically.

Execution

Each time the Actor runs, our system automatically generates a record that allows you to view its status and details.

Concurrent Execution

The same Actor can be started multiple times simultaneously to achieve concurrent execution of tasks and improve processing efficiency.
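
The idea can be sketched locally with a thread pool. Here `start_actor` is a hypothetical stand-in that only returns a fake run summary; it does not call Scrapeless.

```python
from concurrent.futures import ThreadPoolExecutor


def start_actor(run_input):
    """Hypothetical stand-in for starting one Actor run; returns a fake summary."""
    return {"input": run_input, "status": "succeeded"}


# Three independent inputs, one run each.
inputs = [{"url": f"https://example.com/page/{n}"} for n in range(1, 4)]

with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    # map() preserves input order even though the runs overlap in time.
    runs = list(pool.map(start_actor, inputs))

for run in runs:
    print(run["input"]["url"], run["status"])
```

Splitting the work by input keeps each run independent, so one slow or failed run does not block the others.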

Run Record

A Run Record captures the state, input parameters, output data, and related logs of an Actor run. You can view all historical run records in the Record list.

Each Run Record provides the following information:

  1. Output: Output data of the Actor.
  2. Storage: Access to data saved during execution.
  3. Input: Environment variables and input parameters used.
  4. Log: Logs generated during execution.

⚠️ Records are retained for 30 days. Older ones are automatically deleted, so back up important data before it expires.
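
To plan backups, it can help to work out when a record falls out of the 30-day window. The sketch below assumes the clock starts when the run finishes, which this page does not state.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # retention period stated above


def deletion_time(finished_at):
    """Earliest time a record may be deleted (assumes the clock starts at finish)."""
    return finished_at + RETENTION


def is_expired(finished_at, now):
    """True once the retention window for the record has passed."""
    return now >= deletion_time(finished_at)


finished = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(deletion_time(finished).isoformat())
```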

Output

Output is the data the Actor produces when it runs. It is stored in the Dataset by default.

Storage

After execution, results are saved in the default Dataset. You can view them in the run details and download them from the Storage page.

Input

Displays the input parameters the Actor used at run time, so you can review the configuration it started with.

Log

The Log page captures detailed logs from the Actor’s execution, helping with debugging and issue resolution.

Schedule

Learn how to run an Actor automatically at specified times by setting a schedule.

Creating a Schedule

Run Frequency Configuration

You can set the automatic run frequency of an Actor using a Cron expression. If you’re unfamiliar with Cron syntax, we recommend visiting crontab.guru for guidance and examples.
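
To make the five-field syntax concrete, here is a minimal matcher sketch. It handles only `*`, single numbers, lists, ranges, and steps in the order minute, hour, day of month, month, day of week. It is not the scheduler Scrapeless uses; names such as MON and the value 7 for Sunday are not supported.

```python
from datetime import datetime, timedelta

# (min, max) for minute, hour, day of month, month, day of week (0 = Sunday).
RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand(field, low, high):
    """Expand one cron field such as '*/15', '1-5', or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return values


def matches(expr, moment):
    """Return True if `moment` (to the minute) satisfies the cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 space-separated fields")
    minute, hour, dom, month, dow = (
        expand(f, lo, hi) for f, (lo, hi) in zip(fields, RANGES)
    )
    cron_weekday = (moment.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    dom_ok = moment.day in dom
    dow_ok = cron_weekday in dow
    # Standard cron: when both day fields are restricted, either may match.
    if fields[2] != "*" and fields[4] != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return (moment.minute in minute and moment.hour in hour
            and moment.month in month and day_ok)


def next_runs(expr, start, count=5):
    """List up to `count` matching minutes after `start`, scanning one year at most."""
    moment = start.replace(second=0, microsecond=0)
    runs = []
    for _ in range(366 * 24 * 60):
        moment += timedelta(minutes=1)
        if matches(expr, moment):
            runs.append(moment)
            if len(runs) == count:
                break
    return runs


# Every 15 minutes during the 9 o'clock hour, Monday to Friday.
for run in next_runs("*/15 9 * * 1-5", datetime(2025, 1, 3, 9, 40)):
    print(run.strftime("%a %Y-%m-%d %H:%M"))
```

Listing the next few matches is a quick way to confirm that an expression fires when you expect before saving it.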

Time Zone

Times are displayed in your browser's system time zone, so you can see at a glance when a Cron expression will fire. The Next Time preview lists the next 5 scheduled run times so you can verify that the configuration does what you expect.
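
The conversion behind such a display can be sketched as follows. Two assumptions are made only for this example: run times are held as UTC instants, and a fixed UTC+8 offset stands in for the browser's zone.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: run times are held as UTC instants.
next_runs_utc = [
    datetime(2025, 3, 1, 0, 0, tzinfo=timezone.utc) + timedelta(hours=6 * n)
    for n in range(5)
]


def render(instants, viewer_zone):
    """Format each instant in the viewer's time zone."""
    return [t.astimezone(viewer_zone).strftime("%Y-%m-%d %H:%M %z")
            for t in instants]


# Fixed UTC+8 offset standing in for the browser's zone; zoneinfo.ZoneInfo
# would be the choice for named zones with daylight saving rules.
viewer = timezone(timedelta(hours=8))
for line in render(next_runs_utc, viewer):
    print(line)
```

The same five instants print differently for viewers in different zones, while the underlying run times stay the same.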

Add Actor to Schedule

Each schedule must include at least one Actor and can include up to 5. All added Actors will run simultaneously at the scheduled time.

You can configure unique input variables for each Actor to ensure proper task behavior.
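
The limit of 1 to 5 Actors, each with its own input, can be modeled as below. The class names, field names, and Actor names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

MAX_ACTORS = 5  # limit stated above


@dataclass
class ScheduledActor:
    actor_name: str
    run_input: dict = field(default_factory=dict)  # per-Actor input variables


@dataclass
class Schedule:
    cron: str
    actors: list

    def validate(self):
        """Raise ValueError unless the schedule holds between 1 and 5 Actors."""
        if not 1 <= len(self.actors) <= MAX_ACTORS:
            raise ValueError(
                f"a schedule needs 1 to {MAX_ACTORS} Actors, got {len(self.actors)}"
            )


schedule = Schedule(
    cron="0 6 * * *",
    actors=[
        ScheduledActor("price-monitor", {"target_url": "https://example.com/a"}),
        ScheduledActor("stock-checker", {"target_url": "https://example.com/b"}),
    ],
)
schedule.validate()
print(len(schedule.actors), "Actors scheduled")
```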

Schedule Log

View execution records of scheduled runs. Quickly identify whether each scheduled task was executed successfully or encountered errors—helpful for monitoring and troubleshooting.

Storage

Actors support three types of storage: Dataset, Key-Value, and Queue. They can help store, access, and manage your scraped data efficiently.

Dataset

View and download scraped data via the Dataset tab. Supported features include:

  1. Download: Export data in CSV or JSON format.
  2. Select Fields: Choose specific fields to download.
  3. Data retention: Stored data is available for 30 days before automatic deletion.
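
The effect of Select Fields on an export can be sketched with the standard library. The records and field names are sample data, and the dashboard's actual export logic is not shown on this page.

```python
import csv
import io
import json

# Sample records standing in for scraped Dataset items.
records = [
    {"title": "Widget", "price": 9.99, "url": "https://example.com/widget"},
    {"title": "Gadget", "price": 24.5, "url": "https://example.com/gadget"},
]


def select_fields(items, fields):
    """Keep only the chosen fields, filling missing ones with None."""
    return [{name: item.get(name) for name in fields} for item in items]


def to_csv(items, fields):
    """Serialize the items as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


chosen = select_fields(records, ["title", "price"])
print(to_csv(chosen, ["title", "price"]))
print(json.dumps(chosen, indent=2))
```

Selecting fields before exporting keeps the file small and leaves out columns you do not need downstream.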

Key-Value

This flexible storage can store any type of data—JSON, HTML, ZIP, images, or plain text. Each entry includes its MIME type for proper handling.

Each time an Actor runs, the system allocates it an independent key-value storage space, which keeps data isolated and easier to manage.

Entries are stored for 30 days and automatically deleted after they expire.
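
A minimal in-memory sketch shows how a MIME type can travel with each entry and how one store per run keeps data apart. `KeyValueStore` and its methods are hypothetical, and guessing the type from the key's extension is an assumption made for the example.

```python
import json
import mimetypes


class KeyValueStore:
    """In-memory stand-in for one run's key-value space (hypothetical class)."""

    def __init__(self):
        self._entries = {}

    def set(self, key, data, content_type=None):
        # Fall back to guessing from the key's extension, then to a generic type.
        if content_type is None:
            content_type = mimetypes.guess_type(key)[0] or "application/octet-stream"
        self._entries[key] = (data, content_type)

    def get(self, key):
        """Return the (data, content_type) pair stored under `key`."""
        return self._entries[key]


# One store per run keeps the entries of different runs apart.
stores = {"run-1": KeyValueStore(), "run-2": KeyValueStore()}
stores["run-1"].set("result.json", json.dumps({"ok": True}).encode())
stores["run-1"].set("page.html", b"<html></html>")
stores["run-2"].set("notes", b"plain bytes")

print(stores["run-1"].get("result.json")[1])
print(stores["run-2"].get("notes")[1])
```

Storing the type next to the bytes lets a reader decide how to handle an entry without inspecting its content.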

Queue

A Queue manages and schedules large numbers of requests. It supports adding and retrieving request information such as URLs, HTTP methods, and additional parameters.

Queues are ideal for scalable workflows like dynamic web crawling or batch processing.

Data is also retained for 30 days by default.
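
The add-and-retrieve pattern can be sketched with an in-memory FIFO. `Request` and `RequestQueue` are hypothetical names, and skipping duplicate URLs is an assumption that is common in crawlers but not stated on this page.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)  # any extra per-request data


class RequestQueue:
    """In-memory FIFO stand-in for a request queue (hypothetical class)."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, request):
        """Enqueue a request unless the same method and URL was added before."""
        key = (request.method, request.url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(request)
        return True

    def fetch_next(self):
        """Return the oldest pending request, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = RequestQueue()
queue.add(Request("https://example.com/"))
queue.add(Request("https://example.com/search", "POST", {"body": {"q": "shoes"}}))
queue.add(Request("https://example.com/"))  # duplicate, ignored

while (request := queue.fetch_next()) is not None:
    print(request.method, request.url)
```

In a crawl, pages fetched from the queue typically add newly discovered links back to it, which is what lets the workload grow dynamically.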