A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
- 🚀 Declarative Scraping: Define scraping workflows using JSON templates
- 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
- 📊 Data Collection: Extract text, HTML, values, and files from web pages
- 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
- 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- 📥 File Downloads: Download files with automatic directory creation
- 🔁 Looping & Iteration: ForEach loops for processing multiple elements
- 📡 Streaming Results: Real-time result processing with callbacks
- 🎯 Error Handling: Graceful error handling with configurable termination
- 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors
Install with your preferred package manager:

```bash
# Using pnpm (recommended)
pnpm add stepwright

# Using npm
npm install stepwright

# Using yarn
yarn add stepwright
```

Quick start:

```typescript
import { runScraper } from 'stepwright';

const templates = [
  {
    tab: 'example',
    steps: [
      {
        id: 'navigate',
        action: 'navigate',
        value: 'https://example.com'
      },
      {
        id: 'get_title',
        action: 'data',
        object_type: 'tag',
        object: 'h1',
        key: 'title',
        data_type: 'text'
      }
    ]
  }
];

const results = await runScraper(templates);
console.log(results);
```

The repository includes basic examples demonstrating core functionality:
- Basic Usage (TypeScript): `examples/basic-usage.ts` - Simple navigation and data extraction
- Basic Usage (JavaScript): `examples/basic-usage.js` - Same example in JavaScript
Run the examples:
```bash
# Run all examples
./examples/run-examples.sh

# Or run individual examples
node examples/basic-usage.js
npx tsx examples/basic-usage.ts
```

For more complex scenarios, check out:
- Advanced Usage (TypeScript): `examples/advanced-usage.ts` - Pagination, file downloads, and multi-tab handling
- Advanced Usage (JavaScript): `examples/advanced-usage.js` - Same advanced features in JavaScript
`runScraper(templates, options?)` - Main function for executing scraping templates.
Parameters:
- `templates`: Array of `TabTemplate` objects
- `options`: Optional `RunOptions` object
Returns: `Promise<Record<string, any>[]>`
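As a sketch of the options argument, a run with custom browser launch options might look like this (the `templates` array is the one from the quick start above; `headless` is a standard Playwright `LaunchOptions` flag):

```typescript
const results = await runScraper(templates, {
  browser: { headless: true } // Playwright LaunchOptions, passed through to the browser launch
});

// Each result is a Record<string, any>; it typically carries the values
// collected by data steps, stored under each step's `key`.
console.log(`Collected ${results.length} result(s)`);
```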
`runScraperWithCallback(templates, onResult, options?)` - Execute scraping with streaming results via callback.
Parameters:
- `templates`: Array of `TabTemplate` objects
- `onResult`: Callback function invoked for each result
- `options`: Optional `RunOptions` object
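A minimal sketch of the callback form, assuming `runScraperWithCallback` is exported alongside `runScraper` and reusing the quick-start `templates`:

```typescript
import { runScraperWithCallback } from 'stepwright';

// Results are delivered one at a time instead of accumulating in memory.
await runScraperWithCallback(templates, async (result, index) => {
  console.log(`Result ${index}:`, result);
});
```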
Templates, steps, and run options use the following types:

```typescript
interface TabTemplate {
  tab: string;
  initSteps?: BaseStep[];     // Steps executed once before pagination
  perPageSteps?: BaseStep[];  // Steps executed for each page
  steps?: BaseStep[];         // Legacy single steps array
  pagination?: PaginationConfig;
}

interface BaseStep {
  id: string;
  description?: string;
  object_type?: SelectorType; // 'id' | 'class' | 'tag' | 'xpath'
  object?: string;
  action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF';
  value?: string;
  key?: string;
  data_type?: DataType;       // 'text' | 'html' | 'value' | 'default'
  wait?: number;
  terminateonerror?: boolean;
  subSteps?: BaseStep[];
}

interface RunOptions {
  browser?: LaunchOptions;
  onResult?: (result: Record<string, any>, index: number) => void | Promise<void>;
}
```
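As an illustration of how these pieces fit together, here is a sketch of a paginated template combining `initSteps`, `perPageSteps`, and a `pagination` config. The URL and selectors are placeholders, and the `PaginationConfig` shape follows the pagination examples later in this README:

```typescript
// A TabTemplate-shaped object
const paginatedTemplate = {
  tab: 'listing',
  // Run once, before pagination starts
  initSteps: [
    { id: 'open_listing', action: 'navigate', value: 'https://example.com/listing' }
  ],
  // Run on every page
  perPageSteps: [
    {
      id: 'collect_titles',
      action: 'data',
      object_type: 'class',
      object: 'item-title',
      key: 'item_title',
      data_type: 'text'
    }
  ],
  // Click the "next" button between pages, up to 3 pages
  pagination: {
    strategy: 'next',
    nextButton: { object_type: 'class', object: 'next-page', wait: 2000 },
    maxPages: 3
  }
};
```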
Navigate to a URL.

```typescript
{
  id: 'go_to_page',
  action: 'navigate',
  value: 'https://example.com'
}
```

Fill form fields.
```typescript
{
  id: 'search',
  action: 'input',
  object_type: 'id',
  object: 'search-box',
  value: 'search term'
}
```

Click on elements.
```typescript
{
  id: 'submit',
  action: 'click',
  object_type: 'class',
  object: 'submit-button'
}
```

Extract data from elements.
```typescript
{
  id: 'get_title',
  action: 'data',
  object_type: 'tag',
  object: 'h1',
  key: 'title',
  data_type: 'text'
}
```
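The `data_type` field controls what is extracted; besides `text`, the `DataType` union also includes `html` and `value` (presumably the element's HTML and a form field's current value, respectively). A sketch with placeholder selectors:

```typescript
// Capture an element's HTML instead of its text
{
  id: 'get_description_html',
  action: 'data',
  object_type: 'class',
  object: 'description',
  key: 'description_html',
  data_type: 'html'
}

// Read the current value of a form field
{
  id: 'get_search_value',
  action: 'data',
  object_type: 'id',
  object: 'search-box',
  key: 'search_value',
  data_type: 'value'
}
```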
Process multiple elements.

```typescript
{
  id: 'process_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    // Steps to execute for each item
  ]
}
```

Download files.

```typescript
{
  id: 'download_file',
  action: 'download',
  object_type: 'class',
  object: 'download-link',
  value: './downloads/file.pdf',
  key: 'downloaded_file'
}
```

Save the current page as a PDF.

```typescript
{
  id: 'save_pdf',
  action: 'savePDF',
  value: './output/page.pdf',
  key: 'pdf_file'
}
```

Trigger a print-to-PDF action.

```typescript
{
  id: 'print_pdf',
  action: 'printToPDF',
  object_type: 'id',
  object: 'print-button',
  value: './output/printed.pdf',
  key: 'printed_file'
}
```

Next-button pagination:

```typescript
pagination: {
  strategy: 'next',
  nextButton: { object_type: 'class', object: 'next-page', wait: 2000 },
  maxPages: 10
}
```

Scroll-based pagination:

```typescript
pagination: {
  strategy: 'scroll',
  scroll: { offset: 800, delay: 1500 },
  maxPages: 5
}
```

Proxy configuration:

```typescript
const results = await runScraper(templates, {
  browser: {
    proxy: {
      server: 'http://proxy-server:8080',
      username: 'user',
      password: 'pass'
    }
  }
});
```

Custom browser options:

```typescript
const results = await runScraper(templates, {
  browser: {
    headless: false,
    slowMo: 1000,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  }
});
```

Streaming results with a callback:

```typescript
await runScraperWithCallback(templates, async (result, index) => {
  console.log(`Result ${index}:`, result);
  // Process result immediately
}, {
  browser: { headless: true }
});
```

Use collected data in subsequent steps:
```typescript
{
  id: 'save_with_title',
  action: 'savePDF',
  value: './output/{{meeting_title}}.pdf',
  key: 'meeting_pdf'
}
```
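For instance, an earlier data step could store a page heading under the `meeting_title` key, which the savePDF step above then interpolates into its output path (the selector below is a placeholder):

```typescript
{
  id: 'get_meeting_title',
  action: 'data',
  object_type: 'tag',
  object: 'h1',
  key: 'meeting_title',
  data_type: 'text'
}
```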
Steps can be configured to terminate on error:

```typescript
{
  id: 'critical_step',
  action: 'click',
  object_type: 'id',
  object: 'important-button',
  terminateonerror: true
}
```

Development:

```bash
# Install dependencies
pnpm install

# Build the project
pnpm build

# Run tests
pnpm test

# Run tests in watch mode
pnpm test:watch

# Lint code
pnpm lint

# Format code
pnpm format
```

Testing:

```bash
# Run all tests
pnpm test

# Run tests with coverage
pnpm test:coverage

# Run specific test file
pnpm test scraper.test.ts
```

To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
MIT License - see LICENSE file for details.
- 🐛 Issues: GitHub Issues