Fundamentals
Learn how to create and use Actors in the Scrapeless Dashboard to perform web scraping and automation.
Actor
Create an Actor
Actors are built from Docker-based source code and run in the Scrapeless cloud. Actor building is not yet open to the public; Scrapeless currently offers custom-built Actors based on user requirements.
Step 1. Go to the Actor list and click “Create Actor” to start.
Step 2. Choose a GitHub or GitLab repository as the source from which to build the Actor. After you authorize your Git repository, the system automatically fetches the project and detects the version numbers in it. When building, you can choose any of the available versions, which makes precise control and version management easier.
Input Parameters
When creating or configuring an Actor, you can define environment variables such as the target site or data rules via input parameters. These are generated automatically from the input_schema.json file in the Git repository. The system builds a visual parameter configuration interface from the field types, names, and descriptions defined in the schema, so users can fill in the corresponding input when running the Actor.
[Learn more about input_schema.json format → (Insert hyperlink)]
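As a rough illustration only, a schema of this general shape might define two inputs. This sketch is modeled loosely on JSON Schema conventions; the exact keys Scrapeless expects may differ, and the field names here (startUrl, maxPages) are hypothetical:

```json
{
  "title": "Example Actor Input",
  "type": "object",
  "properties": {
    "startUrl": {
      "type": "string",
      "title": "Start URL",
      "description": "The page where scraping begins."
    },
    "maxPages": {
      "type": "integer",
      "title": "Max Pages",
      "description": "Upper limit on pages to crawl."
    }
  },
  "required": ["startUrl"]
}
```

The visual configuration interface would render each property as a form field using its title and description.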
Run Record
Run record data is generated each time the Actor runs. From the Run Record tab, you can view all of this Actor's past runs.
Information
The Information description comes from the README file in your Git repository. When you create an Actor, Scrapeless automatically reads the README file from that repository.
Builds
An Actor can have multiple versions of its source code and related settings, so before running it, build the versions you need and select the target version at runtime.
- Click “Build” to start the above process. In “Build Details”, you can monitor the build’s status and logs.
- Once completed, return to “Actor Details” to view the new build version.
Name & Description
Basic Actor info such as the name and icon is also fetched from the Git repository, but the description can be edited.
Run Actor
Learn how to start, run, and manage Actors.
Run Options
Before starting the Actor, you can configure the runtime environment:
- Actor Version: Select the version of the Actor to run.
- Timeout: Set how long the Actor may remain in the pending state before it times out.
- Memory: Allocate memory for the Actor’s execution.
- Server Mode:
- Server: The Actor runs continuously in the background and listens for incoming requests. Ideal for long-running tasks.
- Once: The Actor runs a single time and then stops. Best for one-off or fixed tasks; you can also set a maximum execution time for the Actor.
Starting Actor
You can start an Actor in two ways:
- Manual Start: Click “Start” to launch the Actor manually.
- Scheduler: Set up a scheduled task to execute the Actor automatically.
Execution
Each time the Actor runs, our system automatically generates a record that allows you to view its status and details.
Concurrent Execution
The same Actor can be started multiple times simultaneously to achieve concurrent execution of tasks and improve processing efficiency.
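Starting multiple runs concurrently can be sketched with a thread pool. This is illustrative only: start_actor_run below is a placeholder for whatever call actually starts a run (the Scrapeless API or dashboard), and the input shape is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def start_actor_run(run_input):
    """Placeholder for the call that starts one Actor run; here it
    simply echoes the input so the sketch is runnable."""
    return {"status": "running", "input": run_input}


# Start the same Actor five times in parallel, each with its own input.
inputs = [{"startUrl": f"https://example.com/page/{i}"} for i in range(5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    runs = list(pool.map(start_actor_run, inputs))

print(len(runs))  # 5 runs were started
```

Each concurrent run produces its own Run Record, so results stay separated per run.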
Run Record
Run Record stores the state, input parameters, output data, and related logs of each Actor run. You can view all historical run records in the Record list.
You can get the following information in every Run Record:
- Output: Output data of the Actor.
- Storage: Access to data saved during execution.
- Input: Environment variables and input parameters used.
- Log: Logs generated during execution.
⚠️ Records are retained for 30 days; older records are deleted automatically. Please back up important data before it expires.
Output
Output is the data result generated after the Actor runs, which is stored in the Dataset by default.
Storage
After execution, results are saved in the default Dataset. You can view them in the run details and download them from the Storage page.
Input
Displays the Input parameters used by the Actor during runtime, making it easy to review the parameter configuration at startup.
Log
The Log page captures detailed logs from the Actor’s execution, helping with debugging and issue resolution.
Schedule
Learn how to run an Actor automatically by setting a schedule, which executes the Actor at the times you specify.
Creating a Schedule
Run Frequency Configuration
You can set the automatic run frequency of an Actor using a Cron expression. If you’re unfamiliar with Cron syntax, we recommend visiting crontab.guru for guidance and examples.
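For reference, a few common Cron expressions (the five fields are minute, hour, day of month, month, and day of week):

```
*/30 * * * *   every 30 minutes
0 9 * * *      every day at 09:00
0 0 * * 1      every Monday at 00:00
0 8 1 * *      the 1st of each month at 08:00
```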
Time Zone
Times are displayed in your browser’s current time zone to help you understand when the Cron expression will fire. The Next Time preview shows the next 5 scheduled run times so you can verify that the configuration meets expectations.
Add Actor to Schedule
Each schedule must include at least one Actor and can include up to 5. All added Actors will run simultaneously at the scheduled time.
You can configure unique input variables for each Actor to ensure proper task behavior.
Schedule Log
View execution records of scheduled runs. Quickly identify whether each scheduled task was executed successfully or encountered errors—helpful for monitoring and troubleshooting.
Storage
Actors support three types of storage: Dataset, Key-Value, and Queue. They can help store, access, and manage your scraped data efficiently.
Dataset
View and download scraped data via the Dataset tab. Supported features include:
- Download: Export data in CSV or JSON format.
- Select Fields: Choose specific fields to download.
- Data Retention: Stored data is available for 30 days before automatic deletion.
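The Select Fields option is a projection over the stored records. A stdlib-only sketch of the same idea, trimming downloaded JSON records to chosen fields before writing CSV (the record shape here is hypothetical, not the Scrapeless export format):

```python
import csv
import io
import json

# Hypothetical records, as they might appear in a downloaded Dataset (JSON).
raw = json.loads(
    '[{"url": "https://example.com", "title": "Home", "html": "<html>...</html>"},'
    ' {"url": "https://example.com/about", "title": "About", "html": "<html>...</html>"}]'
)

# Keep only the fields we care about, mirroring "Select Fields".
fields = ["url", "title"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(raw)  # the "html" field is dropped from every row

print(buf.getvalue())
```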
Key-Value
This flexible storage can store any type of data—JSON, HTML, ZIP, images, or plain text. Each entry includes its MIME type for proper handling.
Each time an Actor runs, the system allocates an independent key-value storage space to it, which keeps data isolated and easy to manage.
Stored for 30 days; automatically deleted after expiry.
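A minimal sketch of the pattern described above: each run gets its own isolated store, and every entry carries a MIME type so it can be handled correctly later. This is illustrative only, not the Scrapeless storage API; the class and method names are hypothetical.

```python
import json


class KeyValueStore:
    """Toy per-run key-value store; each entry keeps its MIME type."""

    def __init__(self, run_id):
        self.run_id = run_id      # each run gets its own isolated store
        self._records = {}

    def set(self, key, value, content_type):
        self._records[key] = {"value": value, "contentType": content_type}

    def get(self, key):
        return self._records.get(key)


store = KeyValueStore(run_id="run-123")
store.set("results", json.dumps({"pages": 10}), "application/json")
store.set("snapshot", "<html>...</html>", "text/html")

entry = store.get("results")
print(entry["contentType"])  # application/json
```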
Queue
Used for managing and scheduling large numbers of requests. It supports adding and retrieving request information such as URLs, HTTP methods, and additional parameters.
Queues are ideal for scalable workflows like dynamic web crawling or batch processing.
Data is also retained for 30 days by default.
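The request-queue pattern can be sketched with a stdlib deque: requests carry a URL, an HTTP method, and optional extra parameters, and a crawler drains them in FIFO order. Again, this illustrates the concept rather than the actual Scrapeless Queue API.

```python
from collections import deque

queue = deque()


def add_request(url, method="GET", **params):
    """Enqueue a request with its URL, HTTP method, and extra parameters."""
    queue.append({"url": url, "method": method, "params": params})


def next_request():
    """Dequeue the oldest pending request, or None when the queue is empty."""
    return queue.popleft() if queue else None


# Seed the queue as a crawler might, then pull the next request to process.
add_request("https://example.com")
add_request("https://example.com/search", method="POST", q="actors")

req = next_request()
print(req["url"], req["method"])  # https://example.com GET
```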