1

I sometimes need to export a few pages from a big PDF file.

The pages must then be "copyable", i.e. they must not be exported as images.

On my older Windows 7 computer, a 200-page PDF crashes LibreOffice, which I normally use for this task.

I thought I had finally found the solution with Foxit Reader… only to see that the text pastes as garbage:

[Screenshot: text copied from the exported pages pastes as garbage]

Is there a Windows/Linux application that can export a "text PDF" as a "text PDF" (for lack of a better term)?

FWIW, I tried the following apps before asking:

  • CutePDF Writer (3.2.0.1): exports as images
  • PDFSam Basic: can't use a page list such as "1,2,5,102-105"?
  • ImageMagick: only exports as images?
  • LibreOffice: crashes when handling the 200-page document
  • Acrobat Reader: can't print/export with its own driver (relies on the installed CutePDF)

Thank you.


Edit: I can search the original file with Ctrl+F, so a text layer must be present. Nevertheless, pdftotext failed:

apt-get install poppler-utils
pdftotext -layout -f 102 -l 105 big.pdf subset.pdf
Syntax Warning: Invalid Font Weight
Syntax Warning: Invalid Font Weight
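One thing worth checking about the command above, offered as a likely explanation rather than a confirmed diagnosis: pdftotext converts a PDF to plain text, so the output named subset.pdf would hold extracted text, not PDF data, and a PDF viewer would be expected to reject it. A well-formed PDF starts with the marker %PDF-, so a small Python check can tell the two apart (the helper name looks_like_pdf is made up for this sketch):

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    with open(path, "rb") as f:
        # A well-formed PDF begins with the bytes "%PDF-" followed by a version.
        return f.read(5) == b"%PDF-"
```

If the check returns False for subset.pdf, the file is most likely plain text with a misleading extension.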

Next, I copied subset.pdf to Windows and opened it in SumatraPDF:

"Error loading subset.pdf".
  • Linux: pdftotext (command line). I don't remember what package it came with, but it's in the standard repos. Oh, it even has its own Wikipedia page… so it's in the poppler-utils package. Of course it only works for text; it cannot extract text from graphics. For that, you'd first need to pdfsandwich the PDF ;) Commented Oct 4, 2019 at 18:51
  • Thanks, but it doesn't seem to do what I need: I do not want the raw text, I want the selected pages to keep their "selectable/copyable text", like the original. As shown in the screenshot, even Foxit Reader turns text into garbage when pasting it elsewhere. Apparently, a PDF contains multiple layers, where text is located in one of them and graphics in another. Commented Oct 5, 2019 at 8:49
  • I was afraid of that (you wanting a GUI with selectable copy-paste), which is why I made it a comment. And yes, you're correct about the layers. As for the "graphics" stuff: can you search for the text in that PDF? Because if not, there is no text layer – which would explain the garbage. That can be helped then by pdfsandwich, which OCRs the PDF and adds the missing text layer. With the resulting PDF, your copy/paste then should work. Commented Oct 5, 2019 at 10:40
  • I don't need a GUI to extract, a CLI is fine. The screenshot was just to show the problem. Why does it seem so hard to extract a few pages from a big PDF file without turning it into either garbage or an image (CutePDF, ImageMagick, etc.)? Yes, I can copy/paste text from the original file. Commented Oct 5, 2019 at 11:48
  • Yes, I can search the original file with Ctrl+F. Commented Oct 5, 2019 at 12:04
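
Since a CLI is acceptable, one possible route is a short Python script. The sketch below assumes the third-party pypdf package is installed (pip install pypdf); parse_page_spec and extract_pages are hypothetical helper names, and it has not been tried on the file from the question. It copies the selected page objects into a new file instead of rendering them, so a text layer that exists in the original should carry over.

```python
def parse_page_spec(spec):
    """Turn a spec like "1,2,5,102-105" into a list of 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            if first > last:
                raise ValueError(f"bad range: {part}")
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return pages


def extract_pages(src, dst, spec):
    """Copy the selected pages of src into a new PDF written to dst."""
    # pypdf is third-party; importing it here keeps the parser usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for number in parse_page_spec(spec):
        # reader.pages is 0-based, the page spec is 1-based.
        writer.add_page(reader.pages[number - 1])
    with open(dst, "wb") as out:
        writer.write(out)
```

Calling extract_pages("big.pdf", "subset.pdf", "1,2,5,102-105") would then write the seven selected pages to subset.pdf. If the pasted text is still garbage afterwards, the cause is more likely the font encoding inside the original PDF than the extraction step.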

2 Answers

0

I would suggest PDFChef. It is available for Windows, online, and on the App Store.

For a 200-page PDF, the cloud version would be impractical, so I would suggest the desktop version. I have been using it for a few years and it hasn't let me down yet. There is a free trial, so give it a go.

Desktop PDF app:

  • Viewing and creating PDFs
  • Text editing
  • Inserting signatures into PDF documents
  • Organizing pages
  • Multiple formats for conversion
  • Access to the cloud storage
  • 7-day trial with full functionality
  • Windows and macOS versions

At USD 40 for a lifetime license (unless upgrading), it's affordable. I am not affiliated with them in any way.

It's a full editor, but my primary use has been similar to yours: extracting a few pages. I have also used it twice to delete a few pages and once to edit a PDF.

-1

You can use LEADTOOLS Recognition SDK technology in your application: https://www.leadtools.com/sdk/engine/recognition

You can leverage the IOcrEngine interface, which will allow you to convert an image to a searchable PDF.

DISCLOSURE: I am an employee of the company offering this toolkit.

Here is some sample code:

string inputFile = @"C:\LEADTOOLS21\Resources\Images\ocr1.tif";
string outputFile = @"C:\LEADTOOLS21\Resources\Images\ocr1.PDF";

using (IOcrEngine _ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD))
{
    // Start up the LEADTOOLS OCR engine
    _ocrEngine.Startup(null, null, null, null);

    // Run the AutoRecognizeManager and specify PDF as the output format
    _ocrEngine.AutoRecognizeManager.Run(inputFile, outputFile, DocumentFormat.Pdf, null, null);
}
