The Scrapfly Crawler API enables recursive website crawling at scale. Crawl entire websites with configurable limits, extract content in multiple formats simultaneously, and retrieve results as industry-standard artifacts: WARC and Parquet for large-scale storage and processing, and HAR for easy visualization.
Early Access Feature
The Crawler API is currently in early access. Features and API may evolve based on user feedback.
Quick Start: Choose Your Workflow
The Crawler API supports two integration patterns. Choose the approach that best fits your use case:
Polling Workflow
Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete. Best for batch processing, testing, and simple integrations.
Schedule Crawl
Create a crawler with a single API call. The API returns immediately with a crawler UUID:
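A minimal request might look like the sketch below. The POST path (https://api.scrapfly.io/crawl) and the url body field are assumptions for illustration (only include_only_paths is named explicitly on this page); confirm field names against the parameter reference further down.

# Schedule a crawl (sketch; POST path and "url" field name are assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "include_only_paths": ["/blog/*"]
  }'

The response contains the crawler UUID used by the status, URL, content, and artifact endpoints described below.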
Webhook Workflow
Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time. Best for real-time data ingestion, streaming pipelines, and event-driven architectures.
Webhook Setup Required
Before using webhooks, you must configure a webhook in your dashboard with your endpoint URL and authentication. Then reference it by name in your API call.
Schedule Crawl with Webhook
Create a crawler and specify the webhook name configured in your dashboard:
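The sketch below mirrors the request above but attaches a dashboard-configured webhook. The webhook field name and the "my-crawler-webhook" value are illustrative assumptions; use the name you configured in your dashboard.

# Schedule a crawl that sends events to a dashboard-configured webhook
# ("webhook" field name and "my-crawler-webhook" value are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "webhook": "my-crawler-webhook"
  }'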
Create a new crawler job with custom configuration. The API returns immediately with a crawler UUID that you can use to monitor progress and retrieve results.
Query Parameters (Authentication)
These parameters must be passed in the URL query string, not in the request body.
Your Scrapfly API key for authentication. You can find your key on your dashboard.
Query Parameter Only: Must be passed as a URL query parameter (e.g., ?key=YOUR_KEY), never in the POST request body. This applies to all Crawler API endpoints.
?key=16eae084cff64841be193a95fc8fa67d (append to the endpoint URL)
Request Body (Crawler Configuration)
These parameters configure the crawler behavior and must be sent in the JSON request body.
Starting URL for the crawl. Must be a valid HTTP/HTTPS URL. The crawler will begin discovering and crawling linked pages from this seed URL. Must be URL-encoded.
Maximum number of pages to crawl. Must be non-negative. Set to 0 for unlimited (subject to subscription limits). Use this to limit crawl scope and control costs.
Maximum link depth from starting URL. Must be non-negative. Depth 0 is the starting URL, depth 1 is links from the starting page, etc. Set to 0 for unlimited depth. Use lower values for focused crawls, higher values for comprehensive site crawling.
Only crawl URLs matching these path patterns. Supports wildcards (*). Maximum 100 paths. Mutually exclusive with exclude_paths. Useful for focusing on specific sections like blogs or product pages.
include_only_paths=["/blog/*"]
include_only_paths=["/blog/*", "/articles/*"]
include_only_paths=["/products/*/reviews"]
Advanced Crawl Configuration (domain restrictions, delays, headers, sitemaps)
By default, the crawler only follows links within the same base path as the starting URL. For example, starting from https://example.com/blog restricts crawling to /blog/*. Enable this to allow crawling any path on the same domain.
Whitelist of external domains to crawl when follow_external_links=true. Maximum 250 domains. Supports fnmatch-style wildcards (*) for flexible pattern matching.
Pattern Matching Examples:
*.example.com - Matches all subdomains of example.com
specific.org - Exact domain match only
blog.*.com - Matches blog.anything.com
Scraping vs. Crawling External Pages
When a page contains a link to an allowed external domain:
The crawler WILL: scrape the external page (extract content, consume credits).
The crawler WILL NOT: follow links found on that external page.
Example: Crawling example.com with allowed_external_domains=["*.wikipedia.org"] will scrape Wikipedia pages linked from example.com, but will NOT crawl additional links discovered on Wikipedia.
allowed_external_domains=["cdn.example.com"] Only follow links to cdn.example.com
allowed_external_domains=["*.example.com"] Follow all subdomains of example.com
allowed_external_domains=["blog.example.com", "docs.example.com"] Follow multiple specific domains
Wait time in milliseconds after page load before extraction. Set to 0 to disable browser rendering (HTTP-only mode). Range: 0 or 1-25000ms (max 25 seconds). Only applies when browser rendering is enabled. Use this for pages that load content dynamically.
Maximum number of concurrent scrape requests. Controls crawl speed and resource usage. Limited by your account's concurrency limit. Set to 0 to use account/project default.
Add a delay between requests in milliseconds. Range: 0-15000ms (max 15 seconds). Use this to be polite to target servers and avoid overwhelming them with requests. Value must be provided as a string.
Custom User-Agent string to use for all requests. If not specified, Scrapfly will use appropriate User-Agent headers automatically. This is a shorthand for setting the User-Agent header.
Important: ASP Compatibility
When asp=true (Anti-Scraping Protection is enabled), this parameter is ignored. ASP manages User-Agent headers automatically for optimal bypass performance.
Choose one approach:
Use ASP (asp=true) - Automatic User-Agent management with advanced bypass
Use custom User-Agent (user_agent=...) - Manual control, ASP disabled
Cache time-to-live in seconds. Range: 0-604800 seconds (max 7 days). Only applies when cache=true. Set to 0 to use default TTL. After this duration, cached pages will be considered stale and re-crawled.
Ignore rel="nofollow" attributes on links. By default, links with nofollow are not crawled. Enable this to crawl all links regardless of the nofollow attribute.
List of content formats to extract from each crawled page. You can specify multiple formats to extract different representations simultaneously. Extracted content is available via the /contents endpoint or in downloaded artifacts. A configuration sketch follows the format list below.
Available formats:
html - Raw HTML content
clean_html - HTML with boilerplate removed
markdown - Markdown format (ideal for LLM training)
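For example, to receive both a markdown and a clean HTML representation of every page, list both formats in the crawl configuration. In the sketch below, the formats field name (like url) is an assumption not confirmed by this page.

# Request markdown and clean HTML for every crawled page
# ("url" and "formats" field names are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog",
    "formats": ["markdown", "clean_html"]
  }'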
Maximum crawl duration in seconds. Range: 15-10800 seconds (15s to 3 hours). The crawler will stop after this time limit is reached, even if there are more pages to crawl. Use this to prevent long-running crawls.
Maximum API credits to spend on this crawl. Must be non-negative. The crawler will stop when this credit limit is reached. Set to 0 for no credit limit. Useful for controlling costs on large crawls.
Extraction rules to extract structured data from each page. Maximum 100 rules. Each rule maps a URL pattern (max 1000 chars) to an extraction config with type and value.
Supported types:
prompt - AI extraction prompt (max 10000 chars)
model - Pre-defined extraction model
template - Extraction template (name or JSON)
Comprehensive Guide: See the Extraction Rules documentation for detailed examples, pattern matching rules, and best practices.
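As a rough illustration of the rule shape, the sketch below scopes an AI extraction prompt to product pages. The extraction_rules field name and the exact config keys are assumptions; the Extraction Rules documentation is authoritative.

# Map a URL pattern to an AI extraction prompt
# ("extraction_rules" field name and config keys are illustrative assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "extraction_rules": {
      "/products/*": {"type": "prompt", "value": "Extract the product name, price, and availability"}
    }
  }'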
List of webhook events to subscribe to. If webhook name is provided but events list is empty, defaults to basic events: crawler_started, crawler_stopped, crawler_cancelled, crawler_finished.
Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type. The price varies based on the pool used. See proxy dashboard for available pools.
Proxy country location in ISO 3166-1 alpha-2 (2 letters) country codes. The available countries are listed on your proxy dashboard. Supports exclusions (minus prefix) and weighted distribution (colon suffix with weight 0-255).
Anti Scraping Protection - Enable advanced anti-bot bypass features including browser rendering, fingerprinting, and automatic retry with upgraded configurations. When enabled, the crawler will automatically use headless browsers and adapt to bypass protections.
Note When ASP is enabled, any custom user_agent parameter is ignored. ASP manages User-Agent headers automatically for optimal bypass performance.
asp=true
asp=false
Get Crawler Status
Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates while the crawler is running. Key response fields are listed below, followed by a minimal polling sketch.
status - Current status (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
state.urls_discovered - Total URLs discovered
state.urls_crawled - URLs successfully crawled
state.urls_pending - URLs waiting to be crawled
state.urls_failed - URLs that failed to crawl
state.api_credits_used - Total API credits consumed
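A simple loop can drive the polling workflow from the quick start. The sketch below assumes the status endpoint is GET /crawl/{uuid} (mirroring the /urls, /contents, and /artifact paths shown later) and that status is a top-level field in a JSON response.

# Poll crawler status every 10 seconds until it leaves PENDING/RUNNING
# (GET /crawl/{uuid} path and JSON response shape are assumptions)
while true; do
  STATUS=$(curl -s "https://api.scrapfly.io/crawl/{uuid}?key=YOUR_KEY" | jq -r '.status')
  echo "crawler status: $STATUS"
  [ "$STATUS" != "PENDING" ] && [ "$STATUS" != "RUNNING" ] && break
  sleep 10
done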
Get Crawled URLs
Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.
GET https://api.scrapfly.io/crawl/{uuid}/urls
# Get all visited URLs
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=&status=visited"

# Get failed URLs with pagination
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=&status=failed&page=1&per_page=100"
Query Parameters:
key - Your API key (required)
status - Filter by URL status: visited, pending, failed
page - Page number for pagination (default: 1)
per_page - Results per page (default: 100, max: 1000)
Get Content
Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.
Single URL or All Pages (GET)
GET https://api.scrapfly.io/crawl/{uuid}/contents
# Get all content in markdown format
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=&format=markdown"

# Get content for a specific URL
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=&format=html&url=https://example.com/page"
Query Parameters:
key - Your API key (required)
format - Content format to retrieve (must be one of the formats specified in crawl config)
url - Optional: Retrieve content for a specific URL only
Batch Retrieval for Multiple URLs (POST)
POST https://api.scrapfly.io/crawl/{uuid}/contents/batch
Retrieve content for multiple specific URLs in a single request. More efficient than making individual GET requests for each URL. Maximum 100 URLs per request.
# Batch retrieve content for multiple URLs
curl -X POST "https://api.scrapfly.io/crawl/{uuid}/contents/batch?key=&formats=markdown,text" \
  -H "Content-Type: text/plain" \
  -d "https://example.com/page1
https://example.com/page2
https://example.com/page3"
Query Parameters:
key - Your API key (required)
formats - Comma-separated list of formats (e.g., markdown,text,html)
Request Body:
Content-Type: text/plain - Plain text with URLs separated by newlines
Maximum 100 URLs per request
Response Format:
Content-Type: multipart/related - Standard HTTP multipart format (RFC 2387)
X-Scrapfly-Requested-URLs header - Number of URLs in the request
X-Scrapfly-Found-URLs header - Number of URLs found in the crawl results
Each part contains Content-Type and Content-Location headers identifying the format and URL
Efficient Streaming Format
The multipart format eliminates JSON escaping overhead, providing ~50% bandwidth savings for text content and constant memory usage during streaming. See the Results documentation for parsing examples in Python, JavaScript, and Go.
Download Artifact
Download industry-standard archive files containing all crawled data, including HTTP requests, responses, headers, and extracted content. Perfect for storing bulk crawl results offline or in object storage (S3, Google Cloud Storage).
GET https://api.scrapfly.io/crawl/{uuid}/artifact
# Download WARC artifact (gzip compressed, recommended for large crawls)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=&type=warc" -o crawl.warc.gz

# Download HAR artifact (JSON format)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=&type=har" -o crawl.har
Query Parameters:
key - Your API key (required)
type - Artifact type:
warc - Web ARChive format (gzip compressed, industry standard)
har - HTTP Archive format (JSON, browser-compatible)
Billing
Crawler API billing is simple: the cost equals the sum of all Web Scraping API calls made during the crawl. Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).