PDF Sword 🗡️

An intelligent PDF chapter splitter: it breaks a PDF into chapters and returns them as structured data that an LLM can read and analyze.

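Features

🔍 Preset pattern matching: built-in Chinese and English chapter patterns (第一章, 第1章, 1.1, and so on)
🤖 LLM analysis: analyzes the document structure automatically and generates a regex or code to split it
⚡ Automatic strategy selection: tries the preset patterns first and falls back to the LLM when they fail
💰 Cost control: calls the LLM only when necessary, with a configurable limit on the number of preview characters
📦 Multiple output formats: JSON, TXT, and Markdown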

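Use cases

📊 Financial documents: annual reports, offering circulars, prospectuses
📄 Legal documents: contracts, regulations, clauses
📚 Academic papers: chapter-by-chapter extraction and analysis
📑 Technical documentation: structured processing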

Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-sword.git
cd pdf-sword

# Install
pip install -e .

# Or install with the development dependencies
pip install -e ".[dev]"
```

Quick start

1. Command-line usage

```bash
# Basic usage (strategy is chosen automatically)
pdf-sword split input.pdf

# Save as JSON
pdf-sword split input.pdf -o output.json

# Save as Markdown
pdf-sword split input.pdf -o chapters.md --format md

# Save each chapter to its own file
pdf-sword split input.pdf -o chapters_dir/ --format txt

# Force LLM analysis
pdf-sword split input.pdf --strategy llm -o output.json

# Test how the preset patterns match
pdf-sword test input.pdf
```

2. Python API usage

```python
from pdf_sword import PDFChunker
from pdf_sword.models import ChunkerConfig, SplitStrategy

# Basic usage
chunker = PDFChunker()
result = chunker.chunk("path/to/document.pdf")

# Iterate over the chapters
for chapter in result.chapters:
    print(f"[{chapter.index}] {chapter.title}")
    print(f"Content length: {len(chapter.content)} characters")
    print("-" * 40)

# Convert to a dictionary
chapters_dict = result.to_dict()
# {"Chapter 1": "content...", "Chapter 2": "content...", ...}

# Look up a specific chapter
chapter = result.get_chapter("Financial data")
if chapter:
    print(chapter.content)
```

3. Configuring the LLM (for intelligent analysis)

```python
from pdf_sword import PDFChunker
from pdf_sword.models import ChunkerConfig, LLMConfig, SplitStrategy

config = ChunkerConfig(
    strategy=SplitStrategy.AUTO,  # Automatic: try pattern matching first, fall back to the LLM
    llm=LLMConfig(
        model="gpt-4o-mini",      # Or any other model served through an OpenAI-compatible API
        api_key="your-api-key",   # Or set the OPENAI_API_KEY environment variable
        base_url=None,            # Custom API endpoint (optional)
        max_preview_chars=10000,  # Number of characters the LLM gets to preview
    ),
)

chunker = PDFChunker(config)
result = chunker.chunk("path/to/document.pdf")
```

4. Adding custom patterns

```python
from pdf_sword import PDFChunker
from pdf_sword.models import PatternConfig

chunker = PDFChunker()

# Add a custom chapter pattern
chunker.add_pattern(PatternConfig(
    name="my_pattern",
    regex=r"Chapter\s+([A-Z]+):",  # Matches "Chapter I:", "Chapter II:", and so on
    level=1,
    description="Chapters labeled with uppercase letters",
))

result = chunker.chunk("path/to/document.pdf")
```
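Before registering a custom pattern, it can help to check the regex against a few sample headings with Python's standard `re` module. The sketch below exercises only the regex from the example above. The sample lines and the `matched_labels` helper are invented for illustration, and nothing here calls PDF Sword itself.

```python
import re

# Same regex as in the custom pattern example.
CHAPTER_RE = re.compile(r"Chapter\s+([A-Z]+):")

# Invented sample lines, for illustration only.
samples = [
    "Chapter I: Introduction",
    "Chapter II: Methods",
    "Chapter 3: Results",      # digits are not matched by [A-Z]+
    "chapter IV: Discussion",  # lowercase "chapter" is not matched
]


def matched_labels(lines):
    """Return the captured chapter label for every line the regex matches."""
    found = []
    for line in lines:
        match = CHAPTER_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


print(matched_labels(samples))
```

Note that `[A-Z]+` accepts any run of uppercase letters, not only valid Roman numerals, so a heading such as "Chapter ABC:" would match as well.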

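Preset patterns

| Pattern name | Description | Examples |
| --- | --- | --- |
| `chapter_cn_num` | Chapters with Chinese numerals | 第一章, 第二章, ... |
| `chapter_arabic` | Chapters with Arabic numerals | 第1章, 第2章, ... |
| `section_cn_num` | Sections with Chinese numerals | 第一节, 第二节, ... |
| `section_arabic` | Sections with Arabic numerals | 第1节, 第2节, ... |
| `chapter_dot` | Number followed by a dot | 1. , 2. , ... |
| `subsection_dot` | Multi-level numbers | 1.1 , 1.2 , ... |
| `chapter_cn_dun` | Chinese numeral followed by an enumeration comma (、) | 一、, 二、, 三、, ... |
| `section_cn_paren` | Chinese numeral in full-width parentheses | （一）, （二）, ... |
| `part_roman` | English "Part" | Part I, Part II, ... |
| `chapter_en` | English "Chapter" | Chapter 1, Chapter 2, ... |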
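To make the table more concrete, here is a minimal sketch of what regexes for a few of these presets could look like. The expressions in `ASSUMED_PATTERNS` and the `classify` helper are assumptions written for this example; the patterns that actually ship with PDF Sword may differ.

```python
import re

# Hypothetical regexes for four of the preset names listed above.
# These are illustrative guesses, not the library's real definitions.
CN_NUM = "一二三四五六七八九十百零"
ASSUMED_PATTERNS = {
    "chapter_cn_num": re.compile(rf"^第[{CN_NUM}]+章"),
    "chapter_arabic": re.compile(r"^第\d+章"),
    "subsection_dot": re.compile(r"^\d+\.\d+\s"),
    "chapter_en": re.compile(r"^Chapter\s+\d+"),
}


def classify(line):
    """Return the name of the first assumed pattern matching the line, or None."""
    for name, regex in ASSUMED_PATTERNS.items():
        if regex.search(line):
            return name
    return None


for heading in ["第一章 概述", "第2章 财务数据", "1.1 Background", "Chapter 3 Results", "plain text"]:
    print(heading, "->", classify(heading))
```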

Output formats

JSON format

```json
{
  "strategy": "pattern",
  "pattern_used": "chapter_arabic",
  "total_chars": 152345,
  "chapter_count": 8,
  "chapters": [
    {
      "index": 1,
      "title": "Overview",
      "level": 1,
      "content": "Chapter 1 content...",
      "metadata": {"pattern": "chapter_arabic"}
    }
  ]
}
```
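As a sketch of how the JSON output might be consumed downstream, the snippet below loads a trimmed copy of the structure shown above with the standard `json` module and summarizes each chapter. The inlined sample uses English placeholder values, and `summarize` is a hypothetical helper, not part of PDF Sword.

```python
import json

# A trimmed copy of the JSON structure shown above, inlined so the sketch
# runs on its own. In practice this would be read from the output file.
raw = """
{
  "strategy": "pattern",
  "pattern_used": "chapter_arabic",
  "chapter_count": 1,
  "chapters": [
    {"index": 1, "title": "Overview", "level": 1, "content": "Chapter 1 content..."}
  ]
}
"""


def summarize(document):
    """Return one (index, title, content length) tuple per chapter."""
    return [(c["index"], c["title"], len(c["content"])) for c in document["chapters"]]


doc = json.loads(raw)
print(summarize(doc))
```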

Markdown format

```markdown
# PDF chapter split results

- Strategy: pattern
- Total characters: 152345
- Chapters: 8

---

## Overview

Chapter 1 content...

---

## Financial data

Chapter 2 content...
```

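How it works

1. Text extraction: uses pdfplumber to extract the PDF text while preserving layout information
2. Pattern matching: tries the preset regex patterns and automatically selects the best match
3. LLM analysis (only when necessary):
   - Sends the first N characters to the LLM
   - The LLM analyzes the chapter structure and returns a regular expression
   - Alternatively, Python code generated by the LLM performs the split
4. Chapter extraction: splits the text at the match points and builds the structured data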
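The pattern-matching and chapter-extraction steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not PDF Sword's implementation: "best match" is assumed here to mean the pattern with the most heading matches, the two regexes and both function names are hypothetical, and text extraction and the LLM fallback are left out so the example runs on an in-memory string.

```python
import re


def pick_best_pattern(text, patterns):
    """Return (name, regex) for the pattern with the most matches, or None."""
    best, best_count = None, 0
    for name, regex in patterns.items():
        count = len(regex.findall(text))
        if count > best_count:
            best, best_count = (name, regex), count
    return best


def split_by_pattern(text, regex):
    """Split text at each heading match; each chapter runs to the next heading."""
    matches = list(regex.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append({
            "index": i + 1,
            "title": match.group(0).strip(),
            "content": text[match.end():end].strip(),
        })
    return chapters


# Hypothetical heading regexes; re.MULTILINE lets ^ and $ anchor on each line.
patterns = {
    "chapter_en": re.compile(r"^Chapter\s+\d+.*$", re.MULTILINE),
    "part_roman": re.compile(r"^Part\s+[IVX]+.*$", re.MULTILINE),
}

text = (
    "Preface\n"
    "Chapter 1 Overview\nIntro text.\n"
    "Chapter 2 Financial data\nNumbers here.\n"
)

name, regex = pick_best_pattern(text, patterns)
chapters = split_by_pattern(text, regex)
print(name, chapters)
```

Text that precedes the first heading ("Preface" in this sample) is dropped by this sketch; a real splitter would need to decide whether to keep it as a front-matter chapter.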

Environment variables

```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # Optional
```
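One plausible way to combine these variables with explicit settings is shown below. The precedence (explicit argument first, then environment variable, then a default URL) is an assumption made for this sketch, and `resolve_llm_settings` is a hypothetical helper rather than a PDF Sword function.

```python
import os


def resolve_llm_settings(api_key=None, base_url=None, env=os.environ):
    """Prefer explicit arguments, then fall back to environment variables."""
    key = api_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No API key: pass api_key or set OPENAI_API_KEY")
    # Assumed default: the endpoint used in the export example above.
    url = base_url or env.get("OPENAI_BASE_URL") or "https://api.openai.com/v1"
    return {"api_key": key, "base_url": url}


# Demo with a stand-in mapping so the sketch does not depend on the real environment.
demo_env = {"OPENAI_API_KEY": "your-api-key"}
print(resolve_llm_settings(env=demo_env))
```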

Development

```bash
# Run the tests
pytest

# Format and lint the code
black src/
ruff check --fix src/

# Type checking
mypy src/
```

License

MIT License

Contributing

Issues and PRs are welcome!

1. Fork this repository
2. Create a feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request
