StepWright

A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.

Features

🚀 Declarative Scraping: Define scraping workflows using JSON templates
🔄 Pagination Support: Built-in support for next button and scroll-based pagination
📊 Data Collection: Extract text, HTML, values, and files from web pages
🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
📥 File Downloads: Download files with automatic directory creation
🔁 Looping & Iteration: ForEach loops for processing multiple elements
📡 Streaming Results: Real-time result processing with callbacks
🎯 Error Handling: Graceful error handling with configurable termination
🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors

Installation

    # Using pnpm (recommended)
    pnpm add stepwright
    # Using npm
    npm install stepwright
    # Using yarn
    yarn add stepwright

Quick Start

Basic Usage

    import { runScraper } from 'stepwright';

    const templates = [
      {
        tab: 'example',
        steps: [
          {
            id: 'navigate',
            action: 'navigate',
            value: 'https://example.com'
          },
          {
            id: 'get_title',
            action: 'data',
            object_type: 'tag',
            object: 'h1',
            key: 'title',
            data_type: 'text'
          }
        ]
      }
    ];

    const results = await runScraper(templates);
    console.log(results);

Examples

Basic Examples

The repository includes basic examples demonstrating core functionality:

Basic Usage (TypeScript): examples/basic-usage.ts - Simple navigation and data extraction
Basic Usage (JavaScript): examples/basic-usage.js - Same example in JavaScript

Run the examples:

    # Run all examples
    ./examples/run-examples.sh
    # Or run individual examples
    node examples/basic-usage.js
    npx tsx examples/basic-usage.ts

Advanced Examples

For more complex scenarios, check out:

Advanced Usage (TypeScript): examples/advanced-usage.ts - Pagination, file downloads, and multi-tab handling
Advanced Usage (JavaScript): examples/advanced-usage.js - Same advanced features in JavaScript

API Reference

Core Functions

`runScraper(templates, options?)`

Main function to execute scraping templates.

Parameters:

templates: Array of TabTemplate objects
options: Optional RunOptions object

Returns: Promise<Record<string, any>[]>

`runScraperWithCallback(templates, onResult, options?)`

Execute scraping with streaming results via callback.

Parameters:

templates: Array of TabTemplate objects
onResult: Callback function for each result
options: Optional RunOptions object

Types

`TabTemplate`

    interface TabTemplate {
      tab: string;
      initSteps?: BaseStep[];      // Steps executed once before pagination
      perPageSteps?: BaseStep[];   // Steps executed for each page
      steps?: BaseStep[];          // Legacy single steps array
      pagination?: PaginationConfig;
    }

`BaseStep`

    interface BaseStep {
      id: string;
      description?: string;
      object_type?: SelectorType;  // 'id' | 'class' | 'tag' | 'xpath'
      object?: string;
      action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF';
      value?: string;
      key?: string;
      data_type?: DataType;        // 'text' | 'html' | 'value' | 'default'
      wait?: number;
      terminateonerror?: boolean;
      subSteps?: BaseStep[];
    }
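The `object_type` and `object` fields together identify an element. The sketch below shows one way such a pair could be turned into a Playwright selector string. It is an illustration only: `toPlaywrightSelector` is a hypothetical helper, not part of StepWright's documented API, and the library's internal mapping may differ.

```typescript
type SelectorType = 'id' | 'class' | 'tag' | 'xpath';

// Hypothetical helper (an assumption, not StepWright's actual implementation):
// maps an object_type/object pair to a selector string Playwright understands.
function toPlaywrightSelector(objectType: SelectorType, object: string): string {
  switch (objectType) {
    case 'id':
      return `#${object}`;        // CSS id selector
    case 'class':
      return `.${object}`;        // CSS class selector
    case 'tag':
      return object;              // bare tag name is already valid CSS
    case 'xpath':
      return `xpath=${object}`;   // Playwright's explicit XPath prefix
    default:
      throw new Error(`Unsupported selector type: ${String(objectType)}`);
  }
}
```

For example, `toPlaywrightSelector('id', 'search-box')` yields `#search-box`, which matches the `Input` step shown later in this document.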

`RunOptions`

    interface RunOptions {
      browser?: LaunchOptions;
      onResult?: (result: Record<string, any>, index: number) => void | Promise<void>;
    }

Step Actions

Navigate

Navigate to a URL.

    {
      id: 'go_to_page',
      action: 'navigate',
      value: 'https://example.com'
    }

Input

Fill form fields.

    {
      id: 'search',
      action: 'input',
      object_type: 'id',
      object: 'search-box',
      value: 'search term'
    }

Click

Click on elements.

    {
      id: 'submit',
      action: 'click',
      object_type: 'class',
      object: 'submit-button'
    }

Data Extraction

Extract data from elements.

    {
      id: 'get_title',
      action: 'data',
      object_type: 'tag',
      object: 'h1',
      key: 'title',
      data_type: 'text'
    }

ForEach Loop

Process multiple elements.

    {
      id: 'process_items',
      action: 'foreach',
      object_type: 'class',
      object: 'item',
      subSteps: [
        // Steps to execute for each item
      ]
    }
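The `subSteps` array in the example above is left empty. The sketch below fills it with two `data` steps to show how a `foreach` step nests other steps. The class names `item-title` and `item-price` are hypothetical, and `collectStepIds` is an illustrative helper that is not part of the library. How sub-step selectors are scoped to each matched item is not stated in this README, so treat that as something to verify.

```typescript
// Local copy of the relevant BaseStep fields so the sketch is self-contained.
interface BaseStep {
  id: string;
  action: string;
  object_type?: 'id' | 'class' | 'tag' | 'xpath';
  object?: string;
  key?: string;
  data_type?: 'text' | 'html' | 'value' | 'default';
  subSteps?: BaseStep[];
}

const processItems: BaseStep = {
  id: 'process_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    // Assumed selectors: 'item-title' and 'item-price' are made up for illustration.
    { id: 'item_title', action: 'data', object_type: 'class', object: 'item-title', key: 'title', data_type: 'text' },
    { id: 'item_price', action: 'data', object_type: 'class', object: 'item-price', key: 'price', data_type: 'text' }
  ]
};

// Hypothetical helper: walks nested subSteps depth first and returns every step id.
function collectStepIds(steps: BaseStep[]): string[] {
  const ids: string[] = [];
  for (const step of steps) {
    ids.push(step.id);
    if (step.subSteps) {
      for (const nestedId of collectStepIds(step.subSteps)) {
        ids.push(nestedId);
      }
    }
  }
  return ids;
}
```

A walker like this can be handy for logging or for checking a template before running it, since `BaseStep` is recursive through `subSteps`.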

File Operations

Download

    {
      id: 'download_file',
      action: 'download',
      object_type: 'class',
      object: 'download-link',
      value: './downloads/file.pdf',
      key: 'downloaded_file'
    }

Save PDF

    {
      id: 'save_pdf',
      action: 'savePDF',
      value: './output/page.pdf',
      key: 'pdf_file'
    }

Print to PDF

    {
      id: 'print_pdf',
      action: 'printToPDF',
      object_type: 'id',
      object: 'print-button',
      value: './output/printed.pdf',
      key: 'printed_file'
    }

Pagination

Next Button Pagination

    pagination: {
      strategy: 'next',
      nextButton: {
        object_type: 'class',
        object: 'next-page',
        wait: 2000
      },
      maxPages: 10
    }

Scroll Pagination

    pagination: {
      strategy: 'scroll',
      scroll: {
        offset: 800,
        delay: 1500
      },
      maxPages: 5
    }
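The pagination fragments above sit inside a `TabTemplate`. The following sketch shows how `initSteps`, `perPageSteps`, and `pagination` might fit together in one template. The `PaginationConfig` shape here is inferred from the two examples above and may not match the library's real type. `buildListingTemplate`, the `listing` tab name, and the `get_heading` step are hypothetical.

```typescript
type SelectorType = 'id' | 'class' | 'tag' | 'xpath';

// Trimmed local copies of the documented types so the sketch is self-contained.
interface BaseStep {
  id: string;
  action: string;
  object_type?: SelectorType;
  object?: string;
  value?: string;
  key?: string;
  data_type?: 'text' | 'html' | 'value' | 'default';
}

// Assumption: inferred from the README's pagination examples, not from the source code.
interface PaginationConfig {
  strategy: 'next' | 'scroll';
  nextButton?: { object_type: SelectorType; object: string; wait?: number };
  scroll?: { offset?: number; delay?: number };
  maxPages?: number;
}

interface TabTemplate {
  tab: string;
  initSteps?: BaseStep[];
  perPageSteps?: BaseStep[];
  pagination?: PaginationConfig;
}

// Hypothetical builder: navigate once, then extract the <h1> on every page.
function buildListingTemplate(startUrl: string, maxPages: number): TabTemplate {
  return {
    tab: 'listing',
    initSteps: [{ id: 'open_listing', action: 'navigate', value: startUrl }],
    perPageSteps: [
      { id: 'get_heading', action: 'data', object_type: 'tag', object: 'h1', key: 'heading', data_type: 'text' }
    ],
    pagination: {
      strategy: 'next',
      nextButton: { object_type: 'class', object: 'next-page', wait: 2000 },
      maxPages
    }
  };
}
```

The resulting object can be placed in the array passed to `runScraper`, for example `[buildListingTemplate('https://example.com', 10)]`.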

Advanced Features

Proxy Support

    const results = await runScraper(templates, {
      browser: {
        proxy: {
          server: 'http://proxy-server:8080',
          username: 'user',
          password: 'pass'
        }
      }
    });

Custom Browser Options

    const results = await runScraper(templates, {
      browser: {
        headless: false,
        slowMo: 1000,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
      }
    });

Streaming Results

    await runScraperWithCallback(
      templates,
      async (result, index) => {
        console.log(`Result ${index}:`, result);
        // Process result immediately
      },
      {
        browser: { headless: true }
      }
    );
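As a sketch of what "process result immediately" could look like, the collector below turns each streamed result into one JSON line. `createCollector` is a hypothetical helper and not part of StepWright. Its `onResult` method has the same `(result, index)` shape as the callback documented above, so it could be passed as the second argument to `runScraperWithCallback`.

```typescript
type ScrapeResult = Record<string, unknown>;

// Hypothetical helper: accumulates streamed results as JSON Lines in memory.
function createCollector() {
  const lines: string[] = [];
  return {
    // Matches the documented callback shape: (result, index) => void.
    // Uses the closure rather than `this`, so it is safe to pass around detached.
    onResult(result: ScrapeResult, index: number): void {
      lines.push(JSON.stringify({ index, ...result }));
    },
    // Returns everything collected so far, one JSON object per line.
    toJsonl(): string {
      return lines.join('\n');
    }
  };
}
```

Usage would look like `const collector = createCollector(); await runScraperWithCallback(templates, collector.onResult);` followed by writing `collector.toJsonl()` wherever the output should go.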

Data Placeholders

Use collected data in subsequent steps:

    {
      id: 'save_with_title',
      action: 'savePDF',
      value: './output/{{meeting_title}}.pdf',
      key: 'meeting_pdf'
    }
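To make the `{{meeting_title}}` syntax concrete, here is a minimal sketch of how placeholder substitution could work, assuming that an earlier `data` step stored a value under the key `meeting_title`. `resolvePlaceholders` is a hypothetical function written for illustration. The library's actual resolution rules (for example, what happens with unknown keys or characters that are invalid in file names) are not described in this README.

```typescript
// Hypothetical helper: replaces {{key}} tokens with values collected so far.
// Assumption: unknown keys are left untouched rather than replaced with an empty string.
function resolvePlaceholders(template: string, collected: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w-]+)\s*\}\}/g, (match: string, key: string) => {
    const value = collected[key];
    return value === undefined || value === null ? match : String(value);
  });
}
```

With `{ meeting_title: 'Budget Review' }` as the collected data, the `value` in the step above would resolve to `./output/Budget Review.pdf` under these assumptions.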

Error Handling

Steps can be configured to terminate on error:

    {
      id: 'critical_step',
      action: 'click',
      object_type: 'id',
      object: 'important-button',
      terminateonerror: true
    }
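The sketch below illustrates the control flow that a `terminateonerror` flag suggests: a failing step either aborts the run or is recorded and skipped. It is an assumption about the behavior, not the library's code. `runSteps` and the `execute` callback are hypothetical, and the loop is synchronous for brevity, whereas real browser steps would be asynchronous.

```typescript
interface Step {
  id: string;
  terminateonerror?: boolean;
}

interface StepFailure {
  id: string;
  message: string;
}

// Hypothetical runner: executes steps in order and returns the non-fatal failures.
// A failure in a step marked terminateonerror stops the run by rethrowing.
function runSteps(steps: Step[], execute: (step: Step) => void): StepFailure[] {
  const failures: StepFailure[] = [];
  for (const step of steps) {
    try {
      execute(step);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      if (step.terminateonerror) {
        throw new Error(`Step '${step.id}' failed and terminateonerror is set: ${message}`);
      }
      // Assumed default: record the failure and continue with the next step.
      failures.push({ id: step.id, message });
    }
  }
  return failures;
}
```

Under this model, marking only the steps whose failure makes the rest of the run meaningless (such as a login click) keeps optional extractions from aborting an otherwise useful scrape.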

Development

Setup

    # Install dependencies
    pnpm install
    # Build the project
    pnpm build
    # Run tests
    pnpm test
    # Run tests in watch mode
    pnpm test:watch
    # Lint code
    pnpm lint
    # Format code
    pnpm format

Testing

    # Run all tests
    pnpm test
    # Run tests with coverage
    pnpm test:coverage
    # Run specific test file
    pnpm test scraper.test.ts

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Run the test suite
Submit a pull request

License

MIT License - see LICENSE file for details.

Support

🐛 Issues: GitHub Issues
