I have a PDF file. I need the bookmarks in that file extracted to a text file or an Excel spreadsheet. I also need to validate the bookmarks from the large PDF file. How would I do that?
3 Answers
You can use pdftk to extract data (in particular, bookmarks) from PDF files.
Example: with pdftk 2.02,
    pdftk file.pdf dump_data_utf8 | grep '^Bookmark'
outputs the list of bookmarks, four lines per bookmark, in the form:
    BookmarkBegin
    BookmarkTitle: <title in UTF-8>
    BookmarkLevel: <number>
    BookmarkPageNumber: <number>
where, for instance, level 1 corresponds to sections, level 2 to subsections, and so on. Instead of dump_data_utf8, you can use dump_data, which gives you HTML/XML numeric entities for non-ASCII characters (e.g. &#232; for "è").
Note: Without the grep, you also get other interesting data, such as the metadata (creation date, author, keywords, title, etc.), the number of pages and the dimensions of each page. pdftk can perform other operations on PDF files as well; see its man page for a full description.
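To get from the four-line records described above to something a spreadsheet can open, a small script can regroup them into rows. The following is only a minimal sketch: the function names `parse_pdftk_bookmarks` and `bookmarks_to_csv` are made up for illustration, and it assumes the dump was produced with dump_data_utf8 and that each record carries all three fields.

```python
import csv
import io


def parse_pdftk_bookmarks(dump_text):
    """Group pdftk dump_data(_utf8) bookmark lines into one dict per bookmark."""
    records = []
    current = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line == "BookmarkBegin":
            # Every bookmark record starts with this marker line.
            current = {}
            records.append(current)
        elif current is not None and line.startswith("Bookmark") and ":" in line:
            # Split on the first colon only, so titles may contain colons.
            key, _, value = line.partition(":")
            current[key[len("Bookmark"):]] = value.strip()
    # Keep only complete records (assumption: incomplete ones are skipped).
    return [
        {"level": int(r["Level"]), "page": int(r["PageNumber"]), "title": r["Title"]}
        for r in records
        if {"Title", "Level", "PageNumber"} <= r.keys()
    ]


def bookmarks_to_csv(bookmarks):
    """Render the bookmark dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["level", "page", "title"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(bookmarks)
    return buffer.getvalue()
```

One way to use it: save the dump with `pdftk file.pdf dump_data_utf8 > dump.txt`, read that file as UTF-8, pass its contents to `parse_pdftk_bookmarks`, and write the result of `bookmarks_to_csv` to a .csv file, which Excel can import.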
- Just a note to self: pdftk can export as well as import bookmark data to and from a text file. pdflabs.com/blog/export-and-import-pdf-bookmarks (Prem, Sep 6, 2021)
with qpdf
This should get you started:
    qpdf --json your.pdf | jq '.objects' | grep -Po 'Title": \K.*'
That command will also yield the title of the PDF, though.
Have a look at the qpdf manual regarding its JSON output.
I'm pretty sure the command can be simplified, getting rid of grep, by using jq's wildcards.
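As an alternative to the jq/grep pipeline, the JSON can be walked with a few lines of Python. This is a hedged sketch, not a tested recipe for every qpdf version: it assumes the JSON contains a top-level "outlines" array whose entries have "title", "destpageposfrom1" and "kids" keys, as described in the qpdf manual's JSON section, and the function names are invented for illustration.

```python
import json


def flatten_outlines(outlines, level=1):
    """Walk the nested outline entries depth-first, yielding flat rows."""
    rows = []
    for item in outlines:
        # "destpageposfrom1" is assumed to be the 1-based target page;
        # .get() returns None when an entry has no such key.
        rows.append((level, item.get("title"), item.get("destpageposfrom1")))
        # Child bookmarks are assumed to be nested under "kids".
        rows.extend(flatten_outlines(item.get("kids", []), level + 1))
    return rows


def bookmarks_from_qpdf_json(json_text):
    """Parse qpdf JSON text and return (level, title, page) tuples."""
    document = json.loads(json_text)
    return flatten_outlines(document.get("outlines", []))
```

The input would be the text produced by something like `qpdf --json your.pdf`; because only the "outlines" key is read, the PDF's own title in the document info is not picked up.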
- Kudos for qpdf! A JSON output of bookmarks using `--json --json-key=outlines` could not be simpler. Easy to parse for further processing; this is what I had been searching for, for far too long. (CodeBrauer, Aug 25, 2020)
- Thanks! `qpdf --json --json-key=outlines test.pdf > test.json` works like a charm :) (Alex G, Dec 22, 2023)
You can use the CLI of jpdftweak to extract bookmarks in CSV format:
    java -jar -Xmx512M jpdftweak.jar "file.pdf" -savebookmarks "bmarks.csv" /dev/null
After validating and possibly modifying the bookmark data, you could load it back into the PDF file with the following command:
    java -jar -Xmx512M jpdftweak.jar "file.pdf" -loadbookmarks "bmarks.csv" "file_updated.pdf"
The -Xmx512M Java parameter is optional but can help with processing larger PDF files that require more memory.
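What "validating" means depends on your needs; the sketch below shows one possible set of sanity checks. Everything in it is an assumption for illustration: the function name `validate_bookmarks`, the choice of checks (non-empty title, page within the document, no skipped nesting level), and the input shape of (level, title, page) tuples. It does not parse the jpdftweak CSV file itself, so you would first need to load your bookmark data into that shape.

```python
def validate_bookmarks(bookmarks, page_count):
    """Return a list of human-readable problems; an empty list means no issues.

    bookmarks: (level, title, page) tuples in document order (assumed shape).
    page_count: total number of pages in the PDF.
    """
    problems = []
    previous_level = 0
    for index, (level, title, page) in enumerate(bookmarks, start=1):
        if not title or not title.strip():
            problems.append(f"bookmark {index}: empty title")
        if page is None or not 1 <= page <= page_count:
            problems.append(
                f"bookmark {index} ({title!r}): page {page} outside 1..{page_count}"
            )
        # A bookmark may go at most one level deeper than the one before it.
        if level < 1 or level > previous_level + 1:
            problems.append(
                f"bookmark {index} ({title!r}): level {level} after level {previous_level}"
            )
        previous_level = level
    return problems
```

The page count can come from any tool that reports it, for example the NumberOfPages line in pdftk's dump_data output.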
You might want to read this related Q&A as well.