I keep getting comments and emails raising a very interesting point. Could it be that the compression algorithm in question, JBIG2, has a “lossless” flag that one just has to turn on within the implementation, and everything is in order? This press announcement (thanks, Flavio) suggests so. Money quote:
supports traditional “lossless” compression, but also a new “lossy” type of image compression, whereby the compression factor is increased on average by a factor of about 3 to 10, without noticeable visual differences compared with the lossless mode.
The “lossy” mode is what this whole affair is about. The important thing to note from that quote seems to be that the 3-10x factor applies to lossy JBIG2, not lossless. Lossless JBIG2 will generally create files that are a bit smaller than those of other forms of compression; lossy will create files that are massively smaller.
Reportedly, the JBIG2 lossless mode can still provide much better compression results in terms of image quality and document size than the possible counterparts hidden behind the Xerox terms “higher” and “high”, while still preserving an exact bit-for-bit image representation. It is stated that one just needs to pass a lossless flag to JBIG2 (translation for non-coders: it's just like flipping a switch) in order to resolve that problem once and for all. (Thanks, Nik!)
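To make the difference between the two modes a bit more tangible, here is a minimal toy sketch in Python. It is not a JBIG2 codec and certainly not the implementation Xerox uses; the glyph bitmaps, the function names (encode, decode, hamming) and the max_diff threshold are all made up for illustration. It only shows the general idea of a symbol dictionary: in the lossy case, a glyph that looks "similar enough" to one already stored is replaced by it, while the lossless case reuses an entry only on an exact pixel match.

```python
from typing import List, Tuple

# A glyph is a tiny bitmap: one string per pixel row, '#' = black, '.' = white.
Glyph = Tuple[str, ...]

# Hypothetical 3x5 bitmaps; they differ in exactly one pixel.
EIGHT: Glyph = ("###", "#.#", "###", "#.#", "###")
SIX: Glyph = ("###", "#..", "###", "#.#", "###")


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels in which two equally sized glyphs differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: List[Glyph], lossless: bool, max_diff: int = 2):
    """Store each glyph as an index into a symbol dictionary.

    lossless=True:  reuse an entry only if it is pixel-identical.
    lossless=False: reuse the first entry that differs in at most
                    max_diff pixels (an assumed, made-up threshold).
    """
    threshold = 0 if lossless else max_diff
    dictionary: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(glyph, known) <= threshold:
                indices.append(i)  # reuse the stored symbol
                break
        else:
            dictionary.append(glyph)  # no match: store a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary: List[Glyph], indices: List[int]) -> List[Glyph]:
    """Rebuild the page from the dictionary and the index list."""
    return [dictionary[i] for i in indices]


page = [EIGHT, SIX, EIGHT]

lossless_dict, lossless_idx = encode(page, lossless=True)
lossy_dict, lossy_idx = encode(page, lossless=False)

# Lossless keeps two symbols and reproduces the page exactly.
print(len(lossless_dict), decode(lossless_dict, lossless_idx) == page)
# Lossy keeps one symbol, so the 6 comes back as an 8.
print(len(lossy_dict), decode(lossy_dict, lossy_idx) == page)
```

In this toy, the lossy variant saves one dictionary entry and silently turns the 6 into an 8, whereas switching the flag to lossless costs a little space and returns the page unchanged. Whether a real encoder exposes such a switch, and how it decides what counts as "similar enough", depends entirely on the implementation.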
Now, we don't know the particular implementation Xerox uses. Maybe Nik was referring to some particular implementation. Also, JBIG2 is more of a decompression standard than a compression standard: it specifies how a file is decoded, not how an encoder has to produce it. Maybe they do not have that switch. But if this is true, the obvious question would be: Why wasn't lossless compression activated in the first place, as data integrity is a big deal in scan copiers? The only reason I can think of is to squeeze every single possible byte out of the mean file sizes in order to make things easier for the marketing guys. That may seem ridiculous in times where memory shortages are not common any more. On the other hand, a size reduction by a factor of 3 to 10 can open up applications that were otherwise impractical. I am unsure about this; we'll see. Please also have a look at the very interesting comment of William Rucklidge below!
Edit: It may be that the number substitutions also occur on compression settings apart from “normal”! Can't be true! I think this is too weird to confirm or take at face value myself, so do not take it at face value until Xerox confirms. Of course I told Xerox before posting this; I do not want to do them harm, as they have been very friendly to me, too. Also, I was not present when the PDFs were created.
Have a look at this PDF a reader sent to me. It contains a TIFF scan of my test numbers, and several other scans at 300 dpi across all the settings “normal”, “higher” and “high” with OCR switched on and off, reportedly made by a WorkCentre 7655. Open it with some PDF reader that can read comments, for example Acrobat Reader, so you can see which pages belong to which settings. Guess what? Look for false 8s again. Several of the scans seem to contain some 8s instead of 6s even on compression settings other than “normal”, which, if there is no other unknown issue, contradicts the statement in Xerox's Scanning Q&A document: “You will not see a character substitution issue when scanning with the factory default settings.” (from Question 6 in their document). But in this case, the default settings seem to be affected as well?
Edit2: Here is a similar file by a WorkCentre 5665, created by the same guy and commented afterwards. Again, these are scans of the same page with different settings in one file. Again, there are 6s replaced by 8s. Wow. No statement so far from Xerox. Be aware that the user changed the compression names from normal/higher/high to normal/high/highest, sorry for the confusion.
Oh dear… and they were so adamant that it couldn't happen on factory settings. Hope to get a statement from Xerox soon. I can't quite believe it and hope that it resolves itself … as long as it was just one machine, it was easy for me to say that there might be something wrong with that particular machine. But if it's two…? Is anybody able to replicate this?
Edit3: Another reader sent me a PDF created by a WorkCentre 7530 (he stated it was a 7535, however the PDF says it stems from a 7530). He claims to have scanned this on “high” mode, with 300 dpi. As you can see, these are three columns of test numbers with different text sizes. The fourth number in the left column contains an 8 that should be a 6. You can get the correct values out of the 8pt and 9pt columns. The reader placed an X as a PDF comment next to the wrong number. Still no statement from Xerox.
As there is still no statement from Xerox, I am now setting out to try to replicate the error myself. I will keep you posted on what comes out. Stand by.
I know all of you are waiting. At the moment, I am in close touch with Xerox, figuring things out. This is to let you know something's going on. Please stand by a bit more … Thanks!