
I have page_1.pnm, …, page_6.pnm, which represent 6 pages of a scanned document, all in gray PNM produced by scanimage and manually postprocessed with GIMP. The command

convert $(for i in 1 2 3 4 5 6; do echo page_$i.pnm; done | xargs echo) -compress Zip -quality 100 document.hi-res.pdf 

produced a PDF file of size 15620554 bytes, whereas

tar cvf document.hi-res.tar $(for i in $(seq 1 6); do echo page_$i.pnm; done | xargs echo)
xz -9 -e -vv document.hi-res.tar

produced a .tar.xz file of size 12385312 bytes, which is about 79 % of the PDF size. This means that there is enough superfluous information in the document that the PDF+Zip combination doesn't or can't remove.

This raises the question: Is there a document format (for scanned stuff) for which Windows has a built-in viewer and Debian Linux has at least a standard, freely available viewer such that the documents in this format are generally smaller than PDFs without losing information? (Yes, I tried TIFF, and it was larger than PDF. I also produced a Postscript document with convert and then squeezed it via gzip --best, but the resulting .ps.gz file was even larger. I don't know how to produce usable DJVU documents from gray images in a lossless way. I don't know how to produce XPS files on Debian - GhostXPS/GhostPDL seems to have no package.)

By the way, is there any shorter and more elegant way to produce

page_1.pnm page_2.pnm page_3.pnm page_4.pnm page_5.pnm page_6.pnm 

than

$(for i in $(seq 1 6); do echo page_$i.pnm; done | xargs echo) 
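One shorter alternative, assuming a shell with brace expansion (bash or zsh, not plain POSIX sh):

```shell
# Brace expansion generates the same argument list without any loop:
echo page_{1..6}.pnm
# page_1.pnm page_2.pnm page_3.pnm page_4.pnm page_5.pnm page_6.pnm
```

so the first command would shorten to `convert page_{1..6}.pnm -compress Zip -quality 100 document.hi-res.pdf`.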

PS. I don't need lossy compression; if I allow myself to lose information, I'm pretty happy with

convert $(for i in $(seq 1 6); do echo page_$i.pnm; done | xargs echo) -compress JPEG2000 -quality 40 document.JPEG2000.40.pdf 

(replace 40 with your choice until your file is small enough for your application).

PPS. Unlike In ImageMagick, how to create a PDF file from an image with the best Flate compression ratio?, this question is NOT (or at least not necessarily) about the best compression ratio for a single-page PDF+Zip; it allows for many pages and a wider variety of compressors and formats.

  • Check out the answers using ghostscript on askubuntu. Commented Jan 30 at 21:15
  • @meuh Maybe I'm missing something, but your link seems to care mostly about lossy compression; it mentions XPS only once without details. Commented Jan 30 at 21:37
  • This question is similar to: In ImageMagick, how to create a PDF file from an image with the best Flate compression ratio?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented Jan 30 at 22:50
  • also, as you already read, JPEG2000 allows for lossless compression as well. I explicitly explained that before to you, as well: unix.stackexchange.com/questions/788365/… Commented Jan 30 at 22:52
  • @MarcusMüller I have not tested lossless JPEG2000 properly yet (no conclusive results so far; perhaps my options on the command line were inappropriate/wrong for the purpose). Commented Feb 1 at 2:42

1 Answer


With the restrictions you give, I think it is unlikely you will find something appropriate.

There are a couple of lossless image compression methods, for example:

  • Run-Length Encoding (RLE)
  • LZW
  • DEFLATE

If you look into the PDF specification, you will notice that it supports exactly those methods. You mentioned ZIP compression, but for convert, that is just an alias for Flate.
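A quick way to check which filter convert actually used (assuming, as holds for convert's output, that the object dictionaries are stored as plain text rather than inside compressed object streams) is to grep the PDF for its stream filter names:

```shell
# Count the /FlateDecode filter entries in the generated PDF; each
# image stream compressed with "Zip" shows up as one such entry.
grep -a -o '/FlateDecode' document.hi-res.pdf | wc -l
```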

While there are better general purpose data compression methods such as LZMA (xz), none of them have found their way into common document formats (at least none that I am aware of).

The reason for this is probably simple: People who absolutely need lossless compression do not care much about size. People who care about size are willing to make concessions.

There are, however, many lossy image compression methods which have "perceptually lossless" modes of operation. This means that after decompression, the bit-stream won't perfectly match the input, but you will have a hard time seeing a difference with your eyes. WebP is rather popular these days. Apple prefers HEIF. You will end up with a directory of image files rather than a single file, though. With a recent version of libtiff, you can put WebP in TIFF, but that is an experimental feature and not part of the official file format standard.
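WebP also has a genuinely lossless mode. A sketch, assuming the cwebp encoder from Debian's webp package (recent cwebp builds read PGM/PPM/PAM input directly; older ones may need a conversion to PNG first):

```shell
# Encode each page as lossless WebP, one output file per page.
command -v cwebp >/dev/null 2>&1 || exit 0   # skip if cwebp is not installed
for i in $(seq 1 6); do
    cwebp -lossless "page_$i.pnm" -o "page_$i.webp"
done
```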

As far as I know, DjVu is lossy, too. If you still want to try it, you can convert to lossless PDF first, then use pdf2djvu.
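That route might look like the following sketch (assuming ImageMagick and Debian's pdf2djvu package are installed and the page_*.pnm scans are present; the guards just skip the run otherwise):

```shell
# Build a lossless (Flate-compressed) PDF, then hand it to pdf2djvu.
command -v pdf2djvu >/dev/null 2>&1 || exit 0   # skip if pdf2djvu is missing
[ -e page_1.pnm ] || exit 0                     # skip if the scans are absent
convert page_1.pnm page_2.pnm page_3.pnm page_4.pnm page_5.pnm page_6.pnm \
    -compress Zip document.hi-res.pdf
pdf2djvu -o document.djvu document.hi-res.pdf
```

Note that pdf2djvu re-encodes the page images into DjVu's own coders, so whether the result stays bit-exact depends on the options you choose.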
