Skip to content

Framework-Island/stepwright

Repository files navigation

StepWright

A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.

Features

  • 🚀 Declarative Scraping: Define scraping workflows using JSON templates
  • 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
  • 📊 Data Collection: Extract text, HTML, values, and files from web pages
  • 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
  • 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
  • 📥 File Downloads: Download files with automatic directory creation
  • 🔁 Looping & Iteration: ForEach loops for processing multiple elements
  • 📡 Streaming Results: Real-time result processing with callbacks
  • 🎯 Error Handling: Graceful error handling with configurable termination
  • 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors

Installation

# Using pnpm (recommended) pnpm add stepwright # Using npm npm install stepwright # Using yarn yarn add stepwright

Quick Start

Basic Usage

import { runScraper } from 'stepwright'; const templates = [ { tab: 'example', steps: [ { id: 'navigate', action: 'navigate', value: 'https://example.com' }, { id: 'get_title', action: 'data', object_type: 'tag', object: 'h1', key: 'title', data_type: 'text' } ] } ]; const results = await runScraper(templates); console.log(results);

Examples

Basic Examples

The repository includes basic examples demonstrating core functionality:

  • Basic Usage (TypeScript): examples/basic-usage.ts - Simple navigation and data extraction
  • Basic Usage (JavaScript): examples/basic-usage.js - Same example in JavaScript

Run the examples:

# Run all examples ./examples/run-examples.sh # Or run individual examples node examples/basic-usage.js npx tsx examples/basic-usage.ts

Advanced Examples

For more complex scenarios, check out:

  • Advanced Usage (TypeScript): examples/advanced-usage.ts - Pagination, file downloads, and multi-tab handling
  • Advanced Usage (JavaScript): examples/advanced-usage.js - Same advanced features in JavaScript

API Reference

Core Functions

runScraper(templates, options?)

Main function to execute scraping templates.

Parameters:

  • templates: Array of TabTemplate objects
  • options: Optional RunOptions object

Returns: Promise<Record<string, any>[]>

runScraperWithCallback(templates, onResult, options?)

Execute scraping with streaming results via callback.

Parameters:

  • templates: Array of TabTemplate objects
  • onResult: Callback function for each result
  • options: Optional RunOptions object

Types

TabTemplate

interface TabTemplate { tab: string; initSteps?: BaseStep[]; // Steps executed once before pagination perPageSteps?: BaseStep[]; // Steps executed for each page steps?: BaseStep[]; // Legacy single steps array pagination?: PaginationConfig; }

BaseStep

interface BaseStep { id: string; description?: string; object_type?: SelectorType; // 'id' | 'class' | 'tag' | 'xpath' object?: string; action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF'; value?: string; key?: string; data_type?: DataType; // 'text' | 'html' | 'value' | 'default' wait?: number; terminateonerror?: boolean; subSteps?: BaseStep[]; }

RunOptions

interface RunOptions { browser?: LaunchOptions; onResult?: (result: Record<string, any>, index: number) => void | Promise<void>; }

Step Actions

Navigate

Navigate to a URL.

{ id: 'go_to_page', action: 'navigate', value: 'https://example.com' }

Input

Fill form fields.

{ id: 'search', action: 'input', object_type: 'id', object: 'search-box', value: 'search term' }

Click

Click on elements.

{ id: 'submit', action: 'click', object_type: 'class', object: 'submit-button' }

Data Extraction

Extract data from elements.

{ id: 'get_title', action: 'data', object_type: 'tag', object: 'h1', key: 'title', data_type: 'text' }

ForEach Loop

Process multiple elements.

{ id: 'process_items', action: 'foreach', object_type: 'class', object: 'item', subSteps: [ // Steps to execute for each item ] }

File Operations

Download

{ id: 'download_file', action: 'download', object_type: 'class', object: 'download-link', value: './downloads/file.pdf', key: 'downloaded_file' }

Save PDF

{ id: 'save_pdf', action: 'savePDF', value: './output/page.pdf', key: 'pdf_file' }

Print to PDF

{ id: 'print_pdf', action: 'printToPDF', object_type: 'id', object: 'print-button', value: './output/printed.pdf', key: 'printed_file' }

Pagination

Next Button Pagination

pagination: { strategy: 'next', nextButton: { object_type: 'class', object: 'next-page', wait: 2000 }, maxPages: 10 }

Scroll Pagination

pagination: { strategy: 'scroll', scroll: { offset: 800, delay: 1500 }, maxPages: 5 }

Advanced Features

Proxy Support

const results = await runScraper(templates, { browser: { proxy: { server: 'http://proxy-server:8080', username: 'user', password: 'pass' } } });

Custom Browser Options

const results = await runScraper(templates, { browser: { headless: false, slowMo: 1000, args: ['--no-sandbox', '--disable-setuid-sandbox'] } });

Streaming Results

await runScraperWithCallback(templates, async (result, index) => { console.log(`Result ${index}:`, result); // Process result immediately }, { browser: { headless: true } });

Data Placeholders

Use collected data in subsequent steps:

{ id: 'save_with_title', action: 'savePDF', value: './output/{{meeting_title}}.pdf', key: 'meeting_pdf' }

Error Handling

Steps can be configured to terminate on error:

{ id: 'critical_step', action: 'click', object_type: 'id', object: 'important-button', terminateonerror: true }

Development

Setup

# Install dependencies pnpm install # Build the project pnpm build # Run tests pnpm test # Run tests in watch mode pnpm test:watch # Lint code pnpm lint # Format code pnpm format

Testing

# Run all tests pnpm test # Run tests with coverage pnpm test:coverage # Run specific test file pnpm test scraper.test.ts

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support