3

I have just deleted a scanned pdf file.

I'm trying to recover it with scalpel.

The problem is that scalpel recovers many deleted files and names them numerically (e.g. 0001.pdf, 0002.pdf, ... 9999.pdf).

According to this tutorial I could use a grep command to search for some text from the deleted file.

However, the file is a scan, so it contains no text I could search for. All I know is the original file name.

3
  • It might not be so easy to grep text even if you did know what to look for, since there is some compression and other binary junk mixed in with text, if any. Did you enable OCR when scanning? Also do you know if scalpel names them numerically in the order of last modified or is it some arbitrary order? Commented Apr 7, 2016 at 22:05
  • @user454038 It was scanned without OCR. I don't know whether scalpel does any ordering. Commented Apr 7, 2016 at 23:34
  • You need to make backups. In addition, put your scans under version control. That's what I do. With distributed version control, you can easily push your repository to a remote location, so that's a form of backup, though not a substitute for proper backups. Commented Apr 8, 2016 at 7:53

3 Answers

5
+100

If you can scan the document again, you might be able to automatically compare the new scan against the recovered documents. But in that case you probably don't need to recover the file at all.

That leaves finding the right PDF by eye. Since opening the files one by one in a program like evince is cumbersome, I recommend you run the following in the directory where the .pdf files were recovered:

for i in *.pdf; do pdfimages -j -l 1 "$i" "${i%.pdf}"; done

This will leave you with JPEG files (the -j option; images that are not stored as JPEG inside the PDF are written in another format, but that is unlikely for a scan) taken from the first page only (-l 1), with names that start with the basename of the corresponding PDF.

Now you can use eog to quickly browse through the extracted images until you (visually) recognise the document you are looking for. Once you have found it, the name of the image file tells you which PDF file it came from.

4

Try running pdfinfo on your files.

The output may have Creator: Simple Scan or similar in it, so you can search for that.

You can also try using the CreationDate field if you know the approximate date of creation.

Of course pdfinfo will return an error if the file isn't a PDF file, so you'll need to send errors to /dev/null.

Try scanning a document using Simple Scan, and see what output pdfinfo returns for it.
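The pdfinfo check can be run over all recovered files in one go. The following is only a sketch: the function name `find_by_creator` is made up for illustration, and it assumes the scanner's name shows up in the Creator or Producer line, which you should confirm against a fresh scan first.

```shell
# Hypothetical helper: list the PDFs in DIR whose Creator/Producer
# metadata contains PATTERN (assumed default: "Simple Scan").
# Usage: find_by_creator DIR [PATTERN]
find_by_creator() {
    dir=$1
    pattern=${2:-Simple Scan}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue    # the glob matched nothing
        # Errors from damaged or non-PDF files go to /dev/null.
        if pdfinfo "$f" 2>/dev/null \
            | grep -Ei '^(Creator|Producer):' \
            | grep -qiF -- "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `find_by_creator . "Simple Scan"` prints the matching file names in the current directory. To filter on the date instead, the same loop could test the CreationDate line.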

1
  • Actually, pdfinfo found about 50 files, so I had to open them manually. But the one file I was looking for wasn't among them. I'm not sure if I can trust this tool (scalpel). Maybe this answer is worth accepting, but I would like to recover the file in the end. Commented Apr 8, 2016 at 2:33
1

The scan image data in the PDF file will most likely be preceded by something like

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 2480/K -1>>/Filter/CCITTFaxDecode/Height 3507/Length 96349/Name/Im0/Subtype/Image/Type/XObject/Width 2480>>stream 

I'd therefore start to narrow things down with grep -Fil 'subtype/image' filenames. This will at least rule out PDF files which do not contain an image.
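If you also know roughly how large the file should be, the grep can be combined with a size filter in find. This is a sketch only: the function name `find_scans` and the size bounds are illustrative, and a carved file will not necessarily have exactly the original size, so keep the range generous.

```shell
# Hypothetical helper: list PDFs under DIR that are larger than MIN_KIB
# and smaller than MAX_KIB (in KiB) and contain an image object.
# Usage: find_scans DIR MIN_KIB MAX_KIB
find_scans() {
    find "$1" -type f -name '*.pdf' -size +"$2"k -size -"$3"k \
        -exec grep -Fil 'subtype/image' {} +
}
```

For example, `find_scans . 300 400` lists only the image-bearing PDFs between roughly 300K and 400K in the current directory.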

3
  • This finds hundreds of scanned files. My PDF file is the output of the simple-scan application, so judging by the size of similar scanned documents, the resulting file (6 pages) should be around 354K. Is it possible to add some switch to the grep command that would narrow the results further? Commented Apr 7, 2016 at 23:30
  • btw. your command does not narrow the results to scanned documents. Commented Apr 7, 2016 at 23:31
  • I also know the creation date of the original file, if that helps. Commented Apr 7, 2016 at 23:38
