The Scrapfly Crawler API enables recursive website crawling at scale. Crawl entire websites with configurable limits, extract content in multiple formats simultaneously, and retrieve results as industry-standard artifacts: WARC and Parquet for large-scale storage and processing, and HAR for easy visualization.
Early Access Feature
The Crawler API is currently in early access. Features and API may evolve based on user feedback.
Quick Start: Choose Your Workflow
The Crawler API supports two integration patterns. Choose the approach that best fits your use case:
Polling Workflow
Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete. Best for batch processing, testing, and simple integrations.
Schedule Crawl
Create a crawler with a single API call. The API returns immediately with a crawler UUID:
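A minimal request might look like the sketch below. The POST path (https://api.scrapfly.io/crawl) and the url body field are assumptions for illustration (only include_only_paths is named explicitly on this page); confirm field names against the parameter reference further down.

# Schedule a crawl (sketch; POST path and "url" field name are assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "include_only_paths": ["/blog/*"]
  }'

The response contains the crawler UUID used by the status, URL, content, and artifact endpoints described below.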
Webhook Workflow
Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time. Best for real-time data ingestion, streaming pipelines, and event-driven architectures.
Webhook Setup Required
Before using webhooks, you must configure a webhook in your dashboard with your endpoint URL and authentication. Then reference it by name in your API call.
Schedule Crawl with Webhook
Create a crawler and specify the webhook name configured in your dashboard:
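The sketch below mirrors the request above but attaches a dashboard-configured webhook. The webhook field name and the "my-crawler-webhook" value are illustrative assumptions; use the name you configured in your dashboard.

# Schedule a crawl that sends events to a dashboard-configured webhook
# ("webhook" field name and "my-crawler-webhook" value are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "webhook": "my-crawler-webhook"
  }'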
Create a new crawler job with custom configuration. The API returns immediately with a crawler UUID that you can use to monitor progress and retrieve results.
Query Parameters (Authentication)
These parameters must be passed in the URL query string, not in the request body.
Your Scrapfly API key for authentication. You can find your key on your dashboard.
Query Parameter Only: Must be passed as a URL query parameter (e.g., ?key=YOUR_KEY), never in the POST request body. This applies to all Crawler API endpoints.
?key=16eae084cff64841be193a95fc8fa67d (append to the endpoint URL)
Request Body (Crawler Configuration)
These parameters configure the crawler behavior and must be sent in the JSON request body.
Starting URL for the crawl. Must be a valid HTTP/HTTPS URL. The crawler will begin discovering and crawling linked pages from this seed URL. Must be URL-encoded.
Maximum number of pages to crawl. Must be non-negative. Set to 0 for unlimited (subject to subscription limits). Use this to limit crawl scope and control costs.
Maximum link depth from starting URL. Must be non-negative. Depth 0 is the starting URL, depth 1 is links from the starting page, etc. Set to 0 for unlimited depth. Use lower values for focused crawls, higher values for comprehensive site crawling.
Only crawl URLs matching these path patterns. Supports wildcards (*). Maximum 100 paths. Mutually exclusive with exclude_paths. Useful for focusing on specific sections like blogs or product pages.
include_only_paths=["/blog/*"]
include_only_paths=["/blog/*", "/articles/*"]
include_only_paths=["/products/*/reviews"]
Advanced Crawl Configuration (domain restrictions, delays, headers, sitemaps)
By default, the crawler only follows links within the same base path as the starting URL. For example, starting from https://example.com/blog restricts crawling to /blog/*. Enable this to allow crawling any path on the same domain.
Whitelist of external domains to crawl when follow_external_links=true. Maximum 250 domains. Supports fnmatch-style wildcards (*) for flexible pattern matching.
Pattern Matching Examples:
*.example.com - Matches all subdomains of example.com
specific.org - Exact domain match only
blog.*.com - Matches blog.anything.com
Scraping vs. Crawling External Pages
When a page contains a link to an allowed external domain:
The crawler WILL: scrape the external page (extract content, consume credits).
The crawler WILL NOT: follow links found on that external page.
Example: Crawling example.com with allowed_external_domains=["*.wikipedia.org"] will scrape Wikipedia pages linked from example.com, but will NOT crawl additional links discovered on Wikipedia.
allowed_external_domains=["cdn.example.com"] Only follow links to cdn.example.com
allowed_external_domains=["*.example.com"] Follow all subdomains of example.com
allowed_external_domains=["blog.example.com", "docs.example.com"] Follow multiple specific domains
Wait time in milliseconds after page load before extraction. Set to 0 to disable browser rendering (HTTP-only mode). Range: 0 or 1-25000ms (max 25 seconds). Only applies when browser rendering is enabled. Use this for pages that load content dynamically.
Maximum number of concurrent scrape requests. Controls crawl speed and resource usage. Limited by your account's concurrency limit. Set to 0 to use account/project default.
Add a delay between requests in milliseconds. Range: 0-15000ms (max 15 seconds). Use this to be polite to target servers and avoid overwhelming them with requests. Value must be provided as a string.
Custom User-Agent string to use for all requests. If not specified, Scrapfly will use appropriate User-Agent headers automatically. This is a shorthand for setting the User-Agent header.
Important: ASP Compatibility
When asp=true (Anti-Scraping Protection is enabled), this parameter is ignored. ASP manages User-Agent headers automatically for optimal bypass performance.
Choose one approach:
Use ASP (asp=true) - Automatic User-Agent management with advanced bypass
Use custom User-Agent (user_agent=...) - Manual control, ASP disabled
Cache time-to-live in seconds. Range: 0-604800 seconds (max 7 days). Only applies when cache=true. Set to 0 to use default TTL. After this duration, cached pages will be considered stale and re-crawled.
Ignore rel="nofollow" attributes on links. By default, links with nofollow are not crawled. Enable this to crawl all links regardless of the nofollow attribute.
List of content formats to extract from each crawled page. You can specify multiple formats to extract different representations simultaneously. Extracted content is available via the /contents endpoint or in downloaded artifacts. A configuration sketch follows the format list below.
Available formats:
html - Raw HTML content
clean_html - HTML with boilerplate removed
markdown - Markdown format (ideal for LLM training)
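For example, to receive both a markdown and a clean HTML representation of every page, list both formats in the crawl configuration. In the sketch below, the formats field name (like url) is an assumption not confirmed by this page.

# Request markdown and clean HTML for every crawled page
# ("url" and "formats" field names are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "formats": ["markdown", "clean_html"]
  }'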
Maximum crawl duration in seconds. Range: 15-10800 seconds (15s to 3 hours). The crawler will stop after this time limit is reached, even if there are more pages to crawl. Use this to prevent long-running crawls.
Maximum API credits to spend on this crawl. Must be non-negative. The crawler will stop when this credit limit is reached. Set to 0 for no credit limit. Useful for controlling costs on large crawls.
Extraction rules to extract structured data from each page. Maximum 100 rules. Each rule maps a URL pattern (max 1000 chars) to an extraction config with type and value.
Supported types:
prompt - AI extraction prompt (max 10000 chars)
model - Pre-defined extraction model
template - Extraction template (name or JSON)
Comprehensive Guide: See the Extraction Rules documentation for detailed examples, pattern matching rules, and best practices.
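As a rough illustration of the rule shape, the sketch below scopes an AI extraction prompt to product pages. The extraction_rules field name and the exact config keys are assumptions; the Extraction Rules documentation is authoritative.

# Map a URL pattern to an AI extraction prompt
# ("extraction_rules" field name and config keys are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "extraction_rules": {
      "/products/*": {"type": "prompt", "value": "Extract the product name, price, and availability"}
    }
  }'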
List of webhook events to subscribe to. If webhook name is provided but events list is empty, defaults to basic events: crawler_started, crawler_stopped, crawler_cancelled, crawler_finished.
Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type. The price varies based on the pool used. See proxy dashboard for available pools.
Proxy country location in ISO 3166-1 alpha-2 (2 letters) country codes. The available countries are listed on your proxy dashboard. Supports exclusions (minus prefix) and weighted distribution (colon suffix with weight 0-255).
Anti Scraping Protection - Enable advanced anti-bot bypass features including browser rendering, fingerprinting, and automatic retry with upgraded configurations. When enabled, the crawler will automatically use headless browsers and adapt to bypass protections.
Note When ASP is enabled, any custom user_agent parameter is ignored. ASP manages User-Agent headers automatically for optimal bypass performance.
asp=true
asp=false
Get Crawler Status
Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates while the crawler is running. Key response fields are listed below, followed by a minimal polling sketch.
status - Current status (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
state.urls_discovered - Total URLs discovered
state.urls_crawled - URLs successfully crawled
state.urls_pending - URLs waiting to be crawled
state.urls_failed - URLs that failed to crawl
state.api_credits_used - Total API credits consumed
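A simple loop can drive the polling workflow from the quick start. The sketch below assumes the status endpoint is GET /crawl/{uuid} (mirroring the /urls, /contents, and /artifact paths shown later) and that status is a top-level field in a JSON response.

# Poll crawler status every 10 seconds until it leaves PENDING/RUNNING
# (GET /crawl/{uuid} path and JSON response shape are assumptions)
while true; do
  STATUS=$(curl -s "https://api.scrapfly.io/crawl/{uuid}?key=YOUR_KEY" | jq -r '.status')
  echo "crawler status: $STATUS"
  [ "$STATUS" != "PENDING" ] && [ "$STATUS" != "RUNNING" ] && break
  sleep 10
done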
Get Crawled URLs
Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.
GET https://api.scrapfly.io/crawl/{uuid}/urls
# Get all visited URLs
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=&status=visited"

# Get failed URLs with pagination
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=&status=failed&page=1&per_page=100"
Query Parameters:
key - Your API key (required)
status - Filter by URL status: visited, pending, failed
page - Page number for pagination (default: 1)
per_page - Results per page (default: 100, max: 1000)
Get Content
Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.
Single URL or All Pages (GET)
GET https://api.scrapfly.io/crawl/{uuid}/contents
# Get all content in markdown format
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=&format=markdown"

# Get content for a specific URL
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=&format=html&url=https://example.com/page"
Query Parameters:
key - Your API key (required)
format - Content format to retrieve (must be one of the formats specified in crawl config)
url - Optional: Retrieve content for a specific URL only
Batch Retrieval for Multiple URLs (POST)
POST https://api.scrapfly.io/crawl/{uuid}/contents/batch
Retrieve content for multiple specific URLs in a single request. More efficient than making individual GET requests for each URL. Maximum 100 URLs per request.
# Batch retrieve content for multiple URLs
curl -X POST "https://api.scrapfly.io/crawl/{uuid}/contents/batch?key=&formats=markdown,text" \
  -H "Content-Type: text/plain" \
  -d "https://example.com/page1
https://example.com/page2
https://example.com/page3"
Query Parameters:
key - Your API key (required)
formats - Comma-separated list of formats (e.g., markdown,text,html)
Request Body:
Content-Type: text/plain - Plain text with URLs separated by newlines
Maximum 100 URLs per request
Response Format:
Content-Type: multipart/related - Standard HTTP multipart format (RFC 2387)
X-Scrapfly-Requested-URLs header - Number of URLs in the request
X-Scrapfly-Found-URLs header - Number of URLs found in the crawl results
Each part contains Content-Type and Content-Location headers identifying the format and URL
Efficient Streaming Format
The multipart format eliminates JSON escaping overhead, providing ~50% bandwidth savings for text content and constant memory usage during streaming. See the Results documentation for parsing examples in Python, JavaScript, and Go.
Download Artifact
Download industry-standard archive files containing all crawled data, including HTTP requests, responses, headers, and extracted content. Perfect for storing bulk crawl results offline or in object storage (S3, Google Cloud Storage).
GET https://api.scrapfly.io/crawl/{uuid}/artifact
# Download WARC artifact (gzip compressed, recommended for large crawls)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=&type=warc" -o crawl.warc.gz

# Download HAR artifact (JSON format)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=&type=har" -o crawl.har
Query Parameters:
key - Your API key (required)
type - Artifact type:
warc - Web ARChive format (gzip compressed, industry standard)
har - HTTP Archive format (JSON, browser-compatible)
Billing
Crawler API billing is simple: the cost equals the sum of all Web Scraping API calls made during the crawl. Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).