pdf-extraction

Here are 147 public repositories matching this topic...

opendataloader-project / opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

html markdown pdf json ocr ai accessibility a11y pdf-converter tables ocr-recognition pdf-parser rag bounding-box eaa pdf-extraction tagged-pdf document-parsing pdf-accessibility pdf-ua

Updated Mar 25, 2026
Java

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 88+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

ruby python java rust golang php node elixir csharp ffi wasm tesseract text-extraction metadata-extraction table-extraction bun pdfium rag pdf-extraction document-intelligence

Updated Mar 25, 2026
Rust

24eme / signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Mar 13, 2026
JavaScript

pytr-org / pytr

Use TradeRepublic in terminal and mass download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Feb 18, 2026
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated Mar 23, 2026

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Feb 26, 2026
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

heleninsights-dot / phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.

python pdf workflow research academic obsidian literature-review pdf-extraction

Updated Mar 6, 2026
Python

wszqkzqk / qt-web-extractor

Web content extraction engine backed by Qt WebEngine.

chromium web-scraping qtwebengine content-extraction headless-browser pdf-extraction pyside6 open-webui

Updated Mar 24, 2026
Python

aidalinfo / extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

pdf document-processing ai-sdk pdf-extraction vision-llm

Updated Sep 14, 2025
TypeScript

jessevanwyk1 / claude-scholar

🚀 Simplify your research workflow with Claude Scholar, the complete configuration for Claude Code in data science, AI, and academic writing.

search mcp academic pubmed summarization research-tool reading-list arxiv ai-safety literature-review scientific-literature semantic-scholar pdf-extraction streamlit academic-papers academic-research research-tools mcp-server claude-code

Updated Mar 25, 2026
TeX

MarkShawn2020 / video2ppt

Extract presentation slides from videos with accurate timestamps

python opencv video-processing cli-tool frame-extraction pdf-extraction video-to-slides presentation-extraction

Updated Aug 25, 2025
Shell

adobe / pdftools-extract-java-sdk-samples

This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.

java pdf extract pdf-extraction

Updated Apr 8, 2024
Java

aakashsharan / research-vault

AI research assistant that extracts structured patterns from papers using RAG, LangGraph, and Claude. Query across your research library with natural language.

Updated Mar 21, 2026
Python

TrueLipstick / TrueLipstick-pdf-image-ocr-extractor

Open WebUI tool for extracting text from PDFs and images using Tesseract OCR. Supports text-based and scanned PDFs, multi-language OCR (English + Swedish), fully offline.

multilingual python docker pdf ocr tool tesseract text-extraction fitz pymupdf pypdf pdf-extraction image-ocr open-webui

Updated Mar 8, 2026
Python

martymcenroe / RCA-PDF-extraction-pipeline

Automated document extraction pipeline using AI vision models for invoice and form data capture

python computer-vision pdf-extraction document-ai

Updated Mar 15, 2026
Python

inhyeoklee / paper2slides-skill

Turn a scientific paper PDF into a presentation slide deck. An Antigravity / Claude Code agent skill.

reveal-js pdf-extraction claude-code-skill agent-skill antigravity-skill paper-to-slides scientific-presentation

Updated Feb 12, 2026
HTML

danribes / pdf_xtractor

Cross-platform desktop app for extracting text, tables, and structured data from PDFs using IBM Docling AI. Export to JSON, Markdown, CSV, Excel, HTML.

desktop-app python pdf ocr document-processing pdf-extraction pyside6 ibm-docling

Updated Feb 23, 2026
Python
