Batch conversion

English | 中文


    import os
    from glob import glob

    from pdf2docx import Converter

    # Folder that contains the PDF files
    folder_path = r'C:\Users\username\Desktop\folder'

    # Use glob to find every PDF file in the folder
    pdf_files = glob(os.path.join(folder_path, '*.pdf'))

    # Convert each PDF that was found
    for pdf_file in pdf_files:
        # Build the DOCX path by swapping only the file extension
        docx_file = os.path.splitext(pdf_file)[0] + '.docx'

        # Create a Converter object and convert all pages
        cv = Converter(pdf_file)
        try:
            cv.convert(docx_file)
        finally:
            cv.close()  # release the PDF even if conversion fails
        print(f'Converted: {pdf_file} to {docx_file}')

pdf2docx

Extract data from PDF with PyMuPDF, e.g. text, images and drawings
Parse layout with rules, e.g. sections, paragraphs, images and tables
Generate docx with python-docx
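
The three steps above run behind a single `convert` call. The following is a minimal sketch for converting a page range of one file, assuming `Converter.convert` accepts zero-based `start` and `end` keyword arguments (with `end` exclusive). The helper names `default_docx_path` and `convert_pages` are hypothetical and are not part of pdf2docx.

```python
from pathlib import Path


def default_docx_path(pdf_path: str) -> str:
    """Return a DOCX path next to the PDF: same name, '.docx' extension."""
    return str(Path(pdf_path).with_suffix('.docx'))


def convert_pages(pdf_path: str, start: int = 0, end=None) -> str:
    """Convert pages [start, end) of a PDF; end=None means up to the last page."""
    # Imported lazily so the path helper works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = default_docx_path(pdf_path)
    cv = Converter(pdf_path)
    try:
        # Assumption: start/end are zero-based page indexes.
        cv.convert(docx_path, start=start, end=end)
    finally:
        cv.close()
    return docx_path


# Usage (hypothetical input file): convert only the first two pages.
# convert_pages('sample.pdf', start=0, end=2)
```

Deriving the output path with `with_suffix` changes only the final extension, so a file name that happens to contain `.pdf` elsewhere is left intact.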

Features

Parse and re-create page layout
- page margin
- section and column (1 or 2 columns only)
- page header and footer [TODO]
Parse and re-create paragraph
- OCR text [TODO]
- text in horizontal/vertical direction: from left to right, from bottom to top
- font style, e.g. font name, size, weight, italic and color
- text format, e.g. highlight, underline, strike-through
- list style [TODO]
- external hyperlink
- paragraph horizontal alignment (left/right/center/justify) and vertical spacing
Parse and re-create image
- in-line image
- image in Gray/RGB/CMYK mode
- transparent image
- floating image, i.e. picture behind text
Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- vertical direction cell
- table with partly hidden borders
- nested tables
Parsing pages with multi-processing
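
Multi-processing is switched on through conversion settings. This is a sketch only, assuming `convert` accepts the `multi_processing` and `cpu_count` keyword arguments. The helpers `pick_cpu_count` and `convert_parallel` are hypothetical names introduced for illustration.

```python
import os


def pick_cpu_count(requested: int = 0) -> int:
    """Clamp the requested worker count to the CPUs available (0 means all)."""
    available = os.cpu_count() or 1
    if requested <= 0:
        return available
    return min(requested, available)


def convert_parallel(pdf_path: str, docx_path: str, workers: int = 0) -> None:
    """Convert a PDF while parsing its pages in several processes."""
    # Imported lazily so pick_cpu_count works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Assumption: these keyword arguments are passed through as settings.
        cv.convert(docx_path,
                   multi_processing=True,
                   cpu_count=pick_cpu_count(workers))
    finally:
        cv.close()


# Usage (hypothetical file names). On Windows, call this from inside an
# `if __name__ == '__main__':` block, as with any multiprocessing code.
# convert_parallel('sample.pdf', 'sample.docx', workers=4)
```

Spawning processes has a fixed cost, so this is likely to pay off only for documents with many pages.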

It can also be used as a tool to extract table contents, since both the table content and its format/style are parsed.
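
A minimal sketch of table extraction, assuming `Converter.extract_tables` returns a list of tables, each a list of rows of cell text, with `None` for cells that have no text of their own (for example inside a merged region). The helpers `table_to_csv` and `extract_tables_as_csv` are hypothetical.

```python
import csv
import io


def table_to_csv(table) -> str:
    """Serialize one table (a list of rows of cell text) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in table:
        # Cells without text are written as empty fields.
        writer.writerow(['' if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path: str, start: int = 0, end=None) -> list:
    """Return one CSV string per table found on pages [start, end)."""
    # Imported lazily so table_to_csv works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(table) for table in tables]


# Usage (hypothetical input file): tables on the first page only.
# for text in extract_tables_as_csv('sample.pdf', start=0, end=1):
#     print(text)
```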

Limitations

Text-based PDF file
Left to right language
Normal reading direction, with no transformed or rotated words
The rule-based method cannot reproduce every PDF layout exactly
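
Because only text-based PDFs are supported and OCR is still marked TODO, it can help to check for a text layer before converting. The sketch below is one possible pre-check, not part of pdf2docx: it assumes PyMuPDF is importable as `fitz` and that `page.get_text()` returns the plain text of a page. Both function names are hypothetical.

```python
def has_text_layer(page_texts, min_chars: int = 1) -> bool:
    """True if any page holds at least min_chars non-whitespace characters."""
    return any(len(''.join(text.split())) >= min_chars for text in page_texts)


def pdf_has_text_layer(pdf_path: str, min_chars: int = 1) -> bool:
    """Check whether a PDF carries extractable text (i.e. is not a pure scan)."""
    # Imported lazily so has_text_layer works without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return has_text_layer((page.get_text() for page in doc), min_chars)


# Usage (hypothetical input file): skip scanned documents.
# if not pdf_has_text_layer('sample.pdf'):
#     print('No text layer found; the result would probably be empty.')
```

Raising `min_chars` makes the check stricter for scans that contain only a few stray characters, such as a stamped page number.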
