SylphxAI/pdf-reader-mcp

📄 @sylphx/pdf-reader-mcp

Production-ready PDF processing server for AI agents

5-10x faster parallel processing • Y-coordinate content ordering • 94%+ test coverage • 103 tests passing

🚀 Overview

PDF Reader MCP is a Model Context Protocol (MCP) server that lets AI agents extract text, images, and metadata from PDF files. It processes pages in parallel, returns content in reading order, and isolates errors per page so that one bad page does not fail the whole document.

The Problem:

    // Traditional PDF processing
    - Sequential page processing (slow)
    - No natural content ordering
    - Complex path handling
    - Poor error isolation

The Solution:

    // PDF Reader MCP
    - 5-10x faster parallel processing ⚡
    - Y-coordinate based ordering
    - Flexible path support (absolute/relative) 🎯
    - Per-page error resilience 🛡️
    - 94%+ test coverage ✅

Result: Production-ready PDF processing that scales.


⚑ Key Features

Performance

  • πŸš€ 5-10x faster than sequential with automatic parallelization
  • ⚑ 12,933 ops/sec error handling, 5,575 ops/sec text extraction
  • πŸ’¨ Process 50-page PDFs in seconds with multi-core utilization
  • πŸ“¦ Lightweight with minimal dependencies

Developer Experience

  • 🎯 Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
  • πŸ–ΌοΈ Smart Ordering - Y-coordinate based content preserves document layout
  • πŸ›‘οΈ Type Safe - Full TypeScript with strict mode enabled
  • πŸ“š Battle-tested - 103 tests, 94%+ coverage, 98%+ function coverage
  • 🎨 Simple API - Single tool handles all operations elegantly

πŸ“Š Performance Benchmarks

Real-world performance from production testing:

| Operation | Ops/sec | Performance | Use Case |
| --- | --- | --- | --- |
| Error handling | 12,933 | ⚡⚡⚡⚡⚡ | Validation & safety |
| Extract full text | 5,575 | ⚡⚡⚡⚡ | Document analysis |
| Extract page | 5,329 | ⚡⚡⚡⚡ | Single page ops |
| Multiple pages | 5,242 | ⚡⚡⚡⚡ | Batch processing |
| Metadata only | 4,912 | ⚡⚡⚡ | Quick inspection |

Parallel Processing Speedup

| Document | Sequential | Parallel | Speedup |
| --- | --- | --- | --- |
| 10-page PDF | ~2s | ~0.3s | 5-8x faster |
| 50-page PDF | ~10s | ~1s | 10x faster |
| 100+ pages | ~20s | ~2s | Linear scaling with CPU cores |

Benchmarks vary based on PDF complexity and system resources.
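
The speedup column is sequential time divided by parallel time, and an ops/sec figure converts to an average time per operation by taking its reciprocal. The sketch below only reproduces that arithmetic from the approximate figures in the tables above; the helper names `speedup` and `msPerOp` are illustrative and are not part of the package.

```typescript
// Speedup is sequential time divided by parallel time.
function speedup(sequentialSeconds: number, parallelSeconds: number): number {
  if (parallelSeconds <= 0) throw new RangeError("parallelSeconds must be positive");
  return sequentialSeconds / parallelSeconds;
}

// Average time per operation, in milliseconds, for a given throughput.
function msPerOp(opsPerSecond: number): number {
  if (opsPerSecond <= 0) throw new RangeError("opsPerSecond must be positive");
  return 1000 / opsPerSecond;
}

console.log(speedup(2, 0.3).toFixed(1)); // 10-page row: "6.7", inside the quoted 5-8x band
console.log(speedup(10, 1).toFixed(1));  // 50-page row: "10.0"
console.log(msPerOp(5575).toFixed(3));   // full-text extraction: "0.179" ms per operation
```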


📦 Installation

Claude Code

claude mcp add pdf-reader -- npx @sylphx/pdf-reader-mcp

Claude Desktop

Add to claude_desktop_config.json:

{ "mcpServers": { "pdf-reader": { "command": "npx", "args": ["@sylphx/pdf-reader-mcp"] } } }
Config file locations:
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

VS Code

code --add-mcp '{"name":"pdf-reader","command":"npx","args":["@sylphx/pdf-reader-mcp"]}'

Cursor

  1. Open Settings → MCP → Add new MCP Server
  2. Select Command type
  3. Enter: npx @sylphx/pdf-reader-mcp

Windsurf

Add to your Windsurf MCP config:

{ "mcpServers": { "pdf-reader": { "command": "npx", "args": ["@sylphx/pdf-reader-mcp"] } } }

Cline

Add to Cline's MCP settings:

{ "mcpServers": { "pdf-reader": { "command": "npx", "args": ["@sylphx/pdf-reader-mcp"] } } }

Warp

  1. Go to Settings → AI → Manage MCP Servers → Add
  2. Command: npx, Args: @sylphx/pdf-reader-mcp

Smithery (One-click)

npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude

Manual Installation

    # Quick start - zero installation
    npx @sylphx/pdf-reader-mcp

    # Or install globally
    npm install -g @sylphx/pdf-reader-mcp

🎯 Quick Start

Basic Usage

{ "sources": [{ "path": "documents/report.pdf" }], "include_full_text": true, "include_metadata": true, "include_page_count": true }

Result:

  • ✅ Full text content extracted
  • ✅ PDF metadata (author, title, dates)
  • ✅ Total page count

Extract Specific Pages

{ "sources": [{ "path": "documents/manual.pdf", "pages": "1-5,10,15-20" }], "include_full_text": true }

Absolute Paths (v1.3.0+)

    // Windows - escaped backslashes and forward slashes both work
    { "sources": [{ "path": "C:\\Users\\John\\Documents\\report.pdf" }], "include_full_text": true }

    // Unix/Mac
    { "sources": [{ "path": "/home/user/documents/contract.pdf" }], "include_full_text": true }

No more "Absolute paths are not allowed" errors!

Extract Images with Natural Ordering

{ "sources": [{ "path": "presentation.pdf", "pages": [1, 2, 3] }], "include_images": true, "include_full_text": true }

Response includes:

  • Text and images in exact document order (Y-coordinate sorted)
  • Base64-encoded images with metadata (width, height, format)
  • Natural reading flow preserved for AI comprehension

Batch Processing

{ "sources": [ { "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" }, { "path": "/home/user/Q2.pdf", "pages": "1-10" }, { "url": "https://example.com/Q3.pdf" } ], "include_full_text": true }

⚑ All PDFs processed in parallel automatically!


✨ Features

Core Capabilities

  • ✅ Text Extraction - Full document or specific pages with intelligent parsing
  • ✅ Image Extraction - Base64-encoded with complete metadata (width, height, format)
  • ✅ Content Ordering - Y-coordinate based layout preservation for natural reading flow
  • ✅ Metadata Extraction - Author, title, creation date, and custom properties
  • ✅ Page Counting - Fast enumeration without loading full content
  • ✅ Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
  • ✅ Batch Processing - Multiple PDFs processed concurrently

Advanced Features

  • ⚑ 5-10x Performance - Parallel page processing with Promise.all
  • 🎯 Smart Pagination - Extract ranges like "1-5,10-15,20"
  • πŸ–ΌοΈ Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
  • πŸ›‘οΈ Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
  • πŸ” Error Resilience - Per-page error isolation with detailed messages
  • πŸ“ Large File Support - Efficient streaming and memory management
  • πŸ“ Type Safe - Full TypeScript with strict mode enabled
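
The Promise.all-based parallelism and per-page error isolation listed above can be combined roughly as follows. This is a minimal sketch, not the package's actual implementation: it assumes a hypothetical `extractPage` callback that returns the text of one page, and the names `extractPagesInParallel`, `PageResult`, and `ExtractPage` are illustrative.

```typescript
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Hypothetical per-page extractor; a real server would call its PDF engine here.
type ExtractPage = (page: number) => Promise<string>;

async function extractPagesInParallel(
  pages: number[],
  extractPage: ExtractPage,
): Promise<PageResult[]> {
  // Every page task catches its own error, so Promise.all never rejects as a whole
  // and the results come back in the same order as the requested pages.
  return Promise.all(
    pages.map(async (page): Promise<PageResult> => {
      try {
        return { page, ok: true, text: await extractPage(page) };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { page, ok: false, error };
      }
    }),
  );
}

// Usage with a fake extractor that fails on page 2.
extractPagesInParallel([1, 2, 3], async (page) => {
  if (page === 2) throw new Error("corrupt page");
  return `text of page ${page}`;
}).then((results) => console.log(results.map((r) => (r.ok ? "ok" : r.error))));
```

Because each failure is caught inside its own task, one corrupt page yields an error entry while the remaining pages still return their text.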

🆕 What's New in v1.3.0

🎉 Absolute Paths Now Supported!

    // ✅ Windows
    { "path": "C:\\Users\\John\\Documents\\report.pdf" }
    { "path": "C:/Users/John/Documents/report.pdf" }

    // ✅ Unix/Mac
    { "path": "/home/john/documents/report.pdf" }
    { "path": "/Users/john/Documents/report.pdf" }

    // ✅ Relative (still works)
    { "path": "documents/report.pdf" }

Other Improvements:

  • 🐛 Fixed Zod validation error handling
  • 📦 Updated all dependencies to latest versions
  • ✅ 103 tests passing, 94%+ coverage maintained
📋 View Full Changelog

v1.2.0 - Content Ordering

  • Y-coordinate based text and image ordering
  • Natural reading flow for AI models
  • Intelligent line grouping

v1.1.0 - Image Extraction & Performance

  • Base64-encoded image extraction
  • 10x speedup with parallel processing
  • Comprehensive test coverage (94%+)

View Full Changelog →


📖 API Reference

read_pdf Tool

The single tool that handles all PDF operations.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| sources | Array | List of PDF sources to process | Required |
| include_full_text | boolean | Extract full text content | false |
| include_metadata | boolean | Extract PDF metadata | true |
| include_page_count | boolean | Include total page count | true |
| include_images | boolean | Extract embedded images | false |

Source Object

    {
      path?: string;             // Local file path (absolute or relative)
      url?: string;              // HTTP/HTTPS URL to PDF
      pages?: string | number[]; // Pages to extract: "1-5,10" or [1,2,3]
    }
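
The parameter table and source object above can be expressed as TypeScript types with the documented defaults applied. This is a sketch under stated assumptions rather than the server's own schema: `normalizeArgs` and the interface names are hypothetical, and the rules that `sources` is non-empty and that each source names exactly one of `path` or `url` are assumptions, not something the documentation states.

```typescript
interface PdfSource {
  path?: string;
  url?: string;
  pages?: string | number[];
}

interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Apply the defaults from the parameter table and run basic source checks.
function normalizeArgs(args: ReadPdfArgs): Required<ReadPdfArgs> {
  // Assumption: an empty source list is an error.
  if (args.sources.length === 0) throw new Error("sources must not be empty");
  args.sources.forEach((source, i) => {
    const hasPath = typeof source.path === "string";
    const hasUrl = typeof source.url === "string";
    // Assumption: each source names exactly one location.
    if (hasPath === hasUrl) {
      throw new Error(`sources[${i}] needs exactly one of "path" or "url"`);
    }
  });
  return {
    sources: args.sources,
    include_full_text: args.include_full_text ?? false,
    include_metadata: args.include_metadata ?? true,
    include_page_count: args.include_page_count ?? true,
    include_images: args.include_images ?? false,
  };
}

console.log(normalizeArgs({ sources: [{ path: "documents/report.pdf" }] }));
```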

Examples

Metadata only (fast):

{ "sources": [{ "path": "large.pdf" }], "include_metadata": true, "include_page_count": true, "include_full_text": false }

From URL:

{ "sources": [{ "url": "https://arxiv.org/pdf/2301.00001.pdf" }], "include_full_text": true }

Page ranges:

    // "1-5,10-15,20" selects pages 1,2,3,4,5,10,11,12,13,14,15,20
    { "sources": [{ "path": "manual.pdf", "pages": "1-5,10-15,20" }] }
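
One way a range string such as "1-5,10-15,20" could be expanded into page numbers is sketched below. It is an illustration under stated assumptions (1-based pages, duplicates collapsed, malformed tokens rejected); `parsePageRanges` is a hypothetical helper, and the server's own parser may behave differently.

```typescript
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    // Accept either a single page ("10") or an inclusive range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${token}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  // Return pages in ascending order without duplicates.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parsePageRanges("1-5,10-15,20"));
// [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 20]
```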

🔧 Advanced Usage

Y-Coordinate Content Ordering

Content is returned in natural reading order based on Y-coordinates:

    Document Layout:
    ┌─────────────────────┐
    │ [Title]    Y:100    │
    │ [Image]    Y:150    │
    │ [Text]     Y:400    │
    │ [Photo A]  Y:500    │
    │ [Photo B]  Y:550    │
    └─────────────────────┘

    Response Order:
    [
      { type: "text", text: "Title..." },
      { type: "image", data: "..." },
      { type: "text", text: "..." },
      { type: "image", data: "..." },
      { type: "image", data: "..." }
    ]

Benefits:

  • AI understands spatial relationships
  • Natural document comprehension
  • Perfect for vision-enabled models
  • Automatic multi-line text grouping
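
The ordering and line grouping described above can be approximated with a sort followed by a merge of neighbouring text fragments. This is a sketch, not the package's algorithm: `ContentItem`, `orderByY`, and the `tolerance` parameter are hypothetical, and Y is assumed to grow down the page as in the diagram. Raw PDF coordinates normally have their origin at the bottom-left, so a real implementation may need to sort in the opposite direction.

```typescript
type ContentItem =
  | { type: "text"; y: number; text: string }
  | { type: "image"; y: number; data: string };

// Sort items by Y, then merge text fragments whose Y values are within
// `tolerance` of the current line into a single text item.
function orderByY(items: ContentItem[], tolerance = 2): ContentItem[] {
  const sorted = items.slice().sort((a, b) => a.y - b.y);
  const ordered: ContentItem[] = [];
  for (const item of sorted) {
    const last = ordered[ordered.length - 1];
    if (
      item.type === "text" &&
      last !== undefined &&
      last.type === "text" &&
      Math.abs(item.y - last.y) <= tolerance
    ) {
      // Same visual line: append the fragment instead of emitting a new item.
      ordered[ordered.length - 1] = { ...last, text: `${last.text} ${item.text}` };
    } else {
      ordered.push(item);
    }
  }
  return ordered;
}

console.log(
  orderByY([
    { type: "image", y: 150, data: "..." },
    { type: "text", y: 401, text: "world" },
    { type: "text", y: 100, text: "Title" },
    { type: "text", y: 400, text: "Hello" },
  ]),
);
```

This sketch ignores X-coordinates, so fragments on the same line are joined in the order the sort leaves them; multi-column layouts would need an additional horizontal pass.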

🖼️ Image Extraction

Enable extraction:

{ "sources": [{ "path": "manual.pdf" }], "include_images": true }

Response format:

{ "images": [{ "page": 1, "index": 0, "width": 1920, "height": 1080, "format": "rgb", "data": "base64-encoded-png..." }] }

Supported formats: RGB, RGBA, Grayscale
Auto-detected: JPEG, PNG, and other embedded formats

📂 Path Configuration

Absolute paths (v1.3.0+) - Direct file access:

    { "path": "C:\\Users\\John\\file.pdf" }
    { "path": "/home/user/file.pdf" }

Relative paths - Workspace files:

    { "path": "docs/report.pdf" }
    { "path": "./2024/Q1.pdf" }

Configure working directory:

{ "mcpServers": { "pdf-reader-mcp": { "command": "npx", "args": ["@sylphx/pdf-reader-mcp"], "cwd": "/path/to/documents" } } }
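
How absolute and relative paths might be told apart, with relative ones resolved against the configured working directory, is sketched below. It is a simplified illustration that uses plain string handling so the example stays self-contained; `isAbsolutePdfPath` and `resolveAgainstCwd` are hypothetical names, and a real server would more likely rely on Node's `path` module.

```typescript
// Matches Unix-style paths ("/home/...") and drive-letter paths ("C:\..." or "C:/...").
const ABSOLUTE_PATH = /^(?:\/|[A-Za-z]:[\\/])/;

function isAbsolutePdfPath(p: string): boolean {
  return ABSOLUTE_PATH.test(p);
}

// Absolute paths pass through unchanged; relative paths are joined onto cwd.
function resolveAgainstCwd(p: string, cwd: string): string {
  if (isAbsolutePdfPath(p)) return p;
  const base = cwd.replace(/[\\/]+$/, ""); // drop trailing separators
  const rel = p.replace(/^\.[\\/]/, "");   // drop a leading "./"
  return `${base}/${rel}`;
}

console.log(resolveAgainstCwd("docs/report.pdf", "/path/to/documents"));
console.log(resolveAgainstCwd("C:\\Users\\John\\file.pdf", "/path/to/documents"));
```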

📊 Large PDF Strategies

Strategy 1: Page ranges

{ "sources": [{ "path": "big.pdf", "pages": "1-20" }] }

Strategy 2: Progressive loading

    // Step 1: Get page count
    { "sources": [{ "path": "big.pdf" }], "include_full_text": false }

    // Step 2: Extract sections
    { "sources": [{ "path": "big.pdf", "pages": "50-75" }] }

Strategy 3: Parallel batching

{ "sources": [ { "path": "big.pdf", "pages": "1-50" }, { "path": "big.pdf", "pages": "51-100" } ] }
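
Strategies 2 and 3 can be combined: once the page count is known, the `sources` array for a batched request can be generated instead of written by hand. The helper below is a sketch; `batchSources` and `BatchedSource` are hypothetical names, and the batch size is a tuning choice rather than a documented setting.

```typescript
interface BatchedSource {
  path: string;
  pages: string;
}

// Split pages 1..totalPages into consecutive ranges of at most batchSize pages.
function batchSources(path: string, totalPages: number, batchSize: number): BatchedSource[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const sources: BatchedSource[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    // A one-page batch is written as "101" rather than "101-101".
    sources.push({ path, pages: start === end ? `${start}` : `${start}-${end}` });
  }
  return sources;
}

// Builds the same two sources as the Strategy 3 request above.
console.log(JSON.stringify({ sources: batchSources("big.pdf", 100, 50) }));
```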

🔧 Troubleshooting

"Absolute paths are not allowed"

Solution: Upgrade to v1.3.0+

npm update @sylphx/pdf-reader-mcp

Restart your MCP client completely.


"File not found"

Causes:

  • File doesn't exist at path
  • Wrong working directory
  • Permission issues

Solutions:

Use absolute path:

{ "path": "C:\\Full\\Path\\file.pdf" }

Or configure cwd:

{ "pdf-reader-mcp": { "command": "npx", "args": ["@sylphx/pdf-reader-mcp"], "cwd": "/path/to/docs" } }

"No tools showing up"

Solution:

    npm cache clean --force
    rm -rf node_modules package-lock.json
    npm install @sylphx/pdf-reader-mcp@latest

Restart MCP client completely.


🏗️ Architecture

Tech Stack

| Component | Technology |
| --- | --- |
| Runtime | Node.js 22+ ESM |
| PDF Engine | PDF.js (Mozilla) |
| Validation | Zod + JSON Schema |
| Protocol | MCP SDK |
| Language | TypeScript (strict) |
| Testing | Vitest (103 tests) |
| Quality | Biome (50x faster) |
| CI/CD | GitHub Actions |

Design Principles

  • 🔒 Security First - Flexible paths with secure defaults
  • 🎯 Simple Interface - One tool, all operations
  • ⚡ Performance - Parallel processing, efficient memory
  • 🛡️ Reliability - Per-page isolation, detailed errors
  • 🧪 Quality - 94%+ coverage, strict TypeScript
  • Type Safety - No any types, strict mode
  • 🔄 Backward Compatible - Smooth upgrades always

🧪 Development

Setup & Scripts

Prerequisites:

  • Node.js >= 22.0.0
  • pnpm (recommended) or npm

Setup:

    git clone https://github.com/SylphxAI/pdf-reader-mcp.git
    cd pdf-reader-mcp
    pnpm install && pnpm build

Scripts:

    pnpm run build      # Build TypeScript
    pnpm run test       # Run 103 tests
    pnpm run test:cov   # Coverage (94%+)
    pnpm run check      # Lint + format
    pnpm run check:fix  # Auto-fix
    pnpm run benchmark  # Performance tests

Quality:

  • ✅ 103 tests
  • ✅ 94%+ coverage
  • ✅ 98%+ function coverage
  • ✅ Zero lint errors
  • ✅ Strict TypeScript

Contributing

Quick Start:

  1. Fork repository
  2. Create branch: git checkout -b feature/awesome
  3. Make changes and run the tests: pnpm test
  4. Format: pnpm run check:fix
  5. Commit: Use Conventional Commits
  6. Open PR

Commit Format:

    feat(images): add WebP support
    fix(paths): handle UNC paths
    docs(readme): update examples

See CONTRIBUTING.md



🗺️ Roadmap

✅ Completed

  • Image extraction (v1.1.0)
  • 5-10x parallel speedup (v1.1.0)
  • Y-coordinate ordering (v1.2.0)
  • Absolute paths (v1.3.0)
  • 94%+ test coverage (v1.3.0)

🚀 Next

  • OCR for scanned PDFs
  • Annotation extraction
  • Form field extraction
  • Table detection
  • 100+ MB streaming
  • Advanced caching
  • PDF generation

Vote at Discussions


🏆 Recognition

Trusted worldwide • Enterprise adoption • Battle-tested


🀝 Support

GitHub Issues Discord

Show Your Support: ⭐ Star β€’ πŸ‘€ Watch β€’ πŸ› Report bugs β€’ πŸ’‘ Suggest features β€’ πŸ”€ Contribute


📊 Stats

103 Tests • 94%+ Coverage • Production Ready


📄 License

MIT © Sylphx


🙏 Credits

Built with:

  • PDF.js - Mozilla PDF engine
  • Bun - Fast JavaScript runtime

Special thanks to the open source community ❤️

Powered by Sylphx



Built with ❤️ by Sylphx