On OS X, Homebrew's pdftotext works, but it does not honor paragraph breaks. I have experimented with -pagebrk, -eol mac, and -eol unix, but the paragraphs are never separated properly. Is this a typical problem?
2 Answers
PDFs are weird things, and the text in them isn't necessarily in any sane order.
Try pdftotext's -layout option.
Depending on the PDF, this may give you a multi-column text file, which is perfectly readable (esp. on a wide-screen display with more than 80 columns) but single-column text can be more useful.
--
I find the easiest way to convert multi-column text to single-column is to edit the text with vim, insert a TAB between the columns, and write a perl script to merge the columns into one column on each page (pages are separated by form-feed characters, ^L). This can be very time-consuming and tedious.
My first attempts at writing a perl script to do this tried to identify columns by the number of space characters between them. Unfortunately, this varies from as few as 1 or 2 space characters to 5 or more (and some columns are justified with additional spaces), so there's no automated way to distinguish the normal spacing between words from the spacing between columns. That approach also completely fails to deal with tables in the pdftotext output.
It's far easier to manually edit and insert TAB characters and split the columns on that, and vi/vim makes repetitive editing tasks like this fairly easy: find a handy cursor location to insert a TAB, press Ctrl-V and move the cursor down to the bottom of the page or section you're editing, then press rTAB to replace the selected vim-column with tab characters.
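As a rough sketch of the merge step described above (using awk here rather than the perl script the answer mentions), the function below takes text in which a TAB has already been inserted by hand between the columns and prints each page as one column. The name merge_columns is made up for illustration, and it assumes at most two columns per page, with the first TAB on a line marking the column break:

```shell
#!/bin/sh
# merge_columns (hypothetical name): read pdftotext -layout output in which a
# TAB was inserted by hand between the two columns, and print each page as a
# single column: all left-hand pieces first, then all right-hand pieces.
# Assumptions: the first TAB on a line is the column break, and pages are
# separated by form-feed characters (^L).
merge_columns() {
  awk '
    BEGIN { RS = "\f" }                  # one record per page
    {
      if (NR > 1) printf "\f"            # keep the page separators
      n = split($0, line, "\n")
      if (n > 0 && line[n] == "") n--    # ignore the empty piece after the last newline
      left = ""; right = ""
      for (i = 1; i <= n; i++) {
        tab = index(line[i], "\t")
        if (tab == 0) {                  # no TAB: full-width line, kept with the left column
          left = left line[i] "\n"
          continue
        }
        l = substr(line[i], 1, tab - 1)
        sub(/ +$/, "", l)                # drop the padding before the column gap
        left  = left l "\n"
        right = right substr(line[i], tab + 1) "\n"
      }
      printf "%s%s", left, right
    }
  '
}
```

It would be used as a filter, e.g. merge_columns < edited.txt > single-column.txt (both file names are placeholders). Full-width lines such as headings end up in the left-column pass, which may or may not be the order you want.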
Finally, in your comments you mentioned seeing Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019) in the output text. This is perfectly normal: many (most?) PDFs have Unicode characters (e.g. for smart-quotes, em-dashes, ellipses, etc.) embedded in them, as they're not limited to just ASCII characters.
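If plain ASCII output is preferred, these characters can be replaced after the fact. The following is only a sketch: the function name to_ascii_punct is made up, the list of characters is just the handful mentioned above, and it assumes the text is UTF-8 encoded:

```shell
#!/bin/sh
# to_ascii_punct (hypothetical name): replace a few common typographic
# characters with ASCII look-alikes. Each character gets its own expression,
# so the script does not depend on multibyte bracket expressions.
# Assumption: the input is UTF-8 text.
to_ascii_punct() {
  sed -e "s/’/'/g" -e "s/‘/'/g" \
      -e 's/“/"/g' -e 's/”/"/g' \
      -e 's/—/--/g' -e 's/…/.../g'
}
```

For example, pdftotext -layout input.pdf - | to_ascii_punct > output.txt, where the file names are placeholders and the lone "-" tells pdftotext to write to standard output.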
I created this simple one-liner which does some helpful things, but keep in mind that PDFs are weird and it won't always work.
sed 's/\.$/.\n/; s/• /\n/; /^[0-9]/ s/$/\n/' | perl -00 -pe 's/\n(?!\Z)/ /g'
- The sed command adds an extra newline after any line that ends with a full stop, as that is likely the end of a paragraph. (This assumption will often fail.)
- If a bullet (•) from an itemized list is encountered, it is replaced with a newline, so the item starts a new paragraph.
- If a line starts with a number, it is likely a title, so a newline is added after it.
Now any group of consecutive lines delimited by blank lines is likely a paragraph. The perl command joins each such group into a single line. This perl one-liner is explained here:
https://unix.stackexchange.com/a/479229/245582
NB. I used pdftotext from Debian's poppler-utils.
The -layout version works. But then another question arises: is there a way to find a malicious process, some kind of virus, that inserts the character U+2019 after the text while pdftotext is working? I suspect it might be a hacker intrusion, so I wonder if there is a way to check the security of the BSD terminal.