Could it have been that easy, Xerox?

I keep getting comments and emails raising a very interesting point. Could it be that the compression algorithm in question, JBIG2, includes a “lossless” flag that one just has to turn on within the implementation, and everything's in order? This press announcement (thanks, Flavio) suggests so. Money quote:

supports traditional “lossless” compression, but also a new “lossy” type of image compression, whereby the compression factor is increased on average by a factor of about 3 to 10, without noticeable visual differences compared with the lossless mode.

The “lossy” mode is what this is all about. The important thing to note from that quote is that the 3-10x factor applies to lossy JBIG2, not lossless. Lossless JBIG2 will generally create files that are a bit smaller than other forms of compression, whereas lossy JBIG2 will create files that are massively smaller.

Reportedly, the JBIG2 lossless mode can still provide much better results in terms of image quality and document size than the possible counterparts hidden behind the Xerox terms “higher” and “high”, while still preserving an exact bit-for-bit image representation. It is said that one just needs to pass a lossless flag to JBIG2 (translation for non-coders: it's just like flipping a switch) in order to resolve that problem once and for all. (Thanks, Nik!)

Now, we don't know the particular implementation Xerox uses. Maybe Nik was referring to some particular implementation. Also, JBIG2 is more of a decompression standard than a compression standard. Maybe they do not have that switch. But if this is true, the obvious question would be: why wasn't lossless compression activated in the first place, as data integrity is a big deal in scan copiers? The only reason I can think of is to squeeze every single possible byte out of the mean file sizes in order to make things easier for the marketing guys. That may seem ridiculous in times when memory shortages are no longer common. On the other hand, a size reduction by a factor of 3 to 10 can open up applications that were otherwise impractical. I am unsure about this; we'll see. Please also have a look at the very interesting comment by William Rucklidge below!

Edit: Reportedly, compression modes apart from "normal" affected

Edit: It may be that the number substitutions also occur on compression settings apart from “normal”! Can't be true! 8-O I think this is too weird to confirm or take at face value myself, so do not take it at face value until Xerox confirms. Of course I told Xerox before posting this; I do not want to do harm to them, as they have been very friendly to me, too. Also, I was not present when the PDFs were created.

Have a look at this PDF a reader sent to me. It contains a TIFF scan of my test numbers, and several other scans at 300 dpi across all the settings “normal”, “higher” and “high” with OCR switched on and off, reportedly made by a WorkCentre 7655. Open it with a PDF reader that can display comments, for example Acrobat Reader, so you can see which pages belong to which settings. Guess what? Look for false 8s again. Several of the scans seem to contain 8s instead of 6s even on compression settings other than “normal”, which, if there is no other unknown issue, contradicts the statement in Xerox's Scanning Q&A document: “You will not see a character substitution issue when scanning with the factory default settings.” (from Question 6 in their document). So in this case, the default settings seem to be affected as well?

Edit2: Here is a similar file from a WorkCentre 5665, created by the same guy and commented afterwards. Again, these are scans of the same page with different settings, combined into one file. Here, too, 6s are replaced by 8s. Wow. No statement so far from Xerox. Be aware that the user changed the compression names from normal/higher/high to normal/high/highest; sorry for the confusion.

Oh dear… and they were so adamant that it couldn't happen on factory settings. I hope to get a statement from Xerox soon. I can't quite believe it and hope that it resolves itself … as long as it was just one machine, it was easy for me to say that there might be something wrong with that particular machine. But if it's two…? Is somebody able to replicate this?

Edit3: Another reader sent me a PDF created by a WorkCentre 7530 (he stated it was a 7535, but the PDF says it stems from a 7530). He claims to have scanned this in “high” mode at 300 dpi. As you can see, there are three columns of test numbers with different text sizes. The fourth number in the left column contains an 8 that should be a 6. You can get the correct values from the 8pt and 9pt columns. The reader placed an X as a PDF comment next to the wrong number. Still no statement from Xerox.

As there is still no statement from Xerox, I will now set out and try to replicate the error myself. I will keep you posted on what comes out of it. Stand by.

:!: I know all of you are waiting. At the moment, I am in close touch with Xerox, figuring things out. This is to let you know something's going on. Please stand by a bit more … Thanks! :-) :!:

Comments

Just a note:

Some readers may need to open that PDF with a reader that shows comments in order to see which settings were used to print each page. (My browser plug-in wouldn't show the comments, but Acrobat Pro shows them.)

1 |
J.Colbert
| 2013/08/08 22:09 | reply

Acrobat Reader shows them, too!

2 |
David Kriesel
| 2013/08/08 22:28 | reply

The important thing to note from that quote is that the 3-10x factor applies to lossy JBIG2, not lossless. Lossless JBIG2 will generally create files that are a bit smaller than other forms of compression… but lossy JBIG2 generates files that are massively smaller. Lossless JBIG2 might give, say, a 1.2-1.5x improvement in compression ratio - while lossy JBIG2 can give a 3-10x improvement. That's the kind of difference that can open up applications that were otherwise impractical. So the question “why not always use lossless?” is answered by “it's not useful in some places where lossy compression is - bearing in mind the risks of lossy compression”. In some ways (though not entirely), it's like offering the choice to scan at 200dpi rather than the higher resolution the machine is capable of - you lose image quality but get smaller files in return.

A few notes on how JBIG2 works… The JBIG2 standard specifies how to decompress data encoded in JBIG2 format. There are three kinds of encoded image data (generic, halftone, and symbol) and character substitution is only an issue for symbol data. Symbol data comes in two kinds: symbol dictionaries and symbol regions. Symbol dictionaries are just collections of small images - they're not of a fixed “patch” size, but can be of any mix of sizes. Ideally, each symbol image will be of a single character (of a particular typeface, at a particular size). A symbol region references one or more symbol dictionaries, and it contains basically just a list of (x, y, ID) coordinates - draw the symbol with this ID at this location. Lossy JBIG2 ends there. Lossless JBIG2 adds another kind of information (“refinement”) to each such use of a symbol - you can think of this as “for every pixel in the symbol, do I draw it or do I flip it?”. That's a lot more information than just (x, y, ID) and it explains why lossy JBIG2 can be so much smaller than lossless.
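
The symbol mechanism described above can be sketched in a few lines of Python. This is only a toy model, assuming tiny hand-made 3×5 glyphs and a per-use flip mask standing in for refinement data; it does not read or write the real JBIG2 bitstream, and all names in it (`parse`, `decode`, `xor_mask`, `crop`) are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = List[List[int]]  # rows of pixels, 1 = black, 0 = white


def parse(rows: str) -> Bitmap:
    """Turn 'row row row' text ('#' = black, '.' = white) into a bitmap."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows.split()]


def xor_mask(a: Bitmap, b: Bitmap) -> Bitmap:
    """Toy refinement data: 1 wherever the two bitmaps differ."""
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def decode(dictionary: Dict[int, Bitmap],
           region: List[Tuple[int, int, int]],
           width: int, height: int,
           refinements: Optional[Dict[int, Bitmap]] = None) -> Bitmap:
    """Draw each (x, y, symbol ID) entry; flip pixels where a mask says so."""
    page = [[0] * width for _ in range(height)]
    for index, (x, y, symbol_id) in enumerate(region):
        symbol = dictionary[symbol_id]
        mask = (refinements or {}).get(index)
        for dy, row in enumerate(symbol):
            for dx, pixel in enumerate(row):
                if mask is not None:
                    pixel ^= mask[dy][dx]  # "draw it or flip it"
                page[y + dy][x + dx] = pixel
    return page


def crop(page: Bitmap, x: int, y: int, width: int, height: int) -> Bitmap:
    """Cut a rectangle out of the decoded page."""
    return [row[x:x + width] for row in page[y:y + height]]


EIGHT = parse("### #.# ### #.# ###")
SIX = parse("### #.. ### #.# ###")

# The original page showed "6 8", but the dictionary only holds the "8" glyph.
dictionary = {0: EIGHT}
region = [(0, 0, 0), (4, 0, 0)]

lossy = decode(dictionary, region, 7, 5)
lossless = decode(dictionary, region, 7, 5,
                  refinements={0: xor_mask(EIGHT, SIX)})

print(crop(lossy, 0, 0, 3, 5) == EIGHT)   # True: the 6 came out as an 8
print(crop(lossless, 0, 0, 3, 5) == SIX)  # True: the flip mask restored the 6
```

In this sketch the two glyphs differ in a single pixel, so the mask that turns the stored 8 back into the original 6 is tiny, but it has to be stored for every use of a symbol, which is the extra information the lossless variant pays for.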

What the standard doesn't say is how to do the compression. Different encoders can choose their symbols in different ways; in particular, they can be more or less choosy about how similar two bits of the original image should be before being considered to be identical. Encoders have a lot of flexibility; in fact there's no actual “lossless” flag as such - it's just that encoders can choose to include, not include, or even partially include refinement information (and there's no requirement that the refinement data brings the image all the way to lossless… the encoder is in charge of the loss level, even if refinement is present). Encoders can also choose which kind of compression (the three I mentioned above) to apply to which parts of the page; can perform refinement at a per-symbol level or over the whole page, or regions of a page.
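
To illustrate how the encoder's "choosiness" matters, here is a minimal sketch of a greedy symbol matcher with a mismatch threshold. It is an assumption-laden toy, not how Xerox's encoder or any particular JBIG2 encoder works: the glyphs, the pixel-count distance and the names `hamming`, `encode` and `max_mismatch` are all invented for this example.

```python
import sys
from typing import List, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (black) and '.' (white)


def hamming(a: Glyph, b: Glyph) -> int:
    """Count differing pixels; glyphs of different size never match."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return sys.maxsize
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: List[Glyph], max_mismatch: int) -> Tuple[List[Glyph], List[int]]:
    """Reuse the first dictionary symbol within max_mismatch pixels, else add one."""
    dictionary: List[Glyph] = []
    ids: List[int] = []
    for glyph in glyphs:
        for symbol_id, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= max_mismatch:
                ids.append(symbol_id)  # treated as "identical"
                break
        else:
            dictionary.append(glyph)
            ids.append(len(dictionary) - 1)
    return dictionary, ids


EIGHT = ("###", "#.#", "###", "#.#", "###")
SIX = ("###", "#..", "###", "#.#", "###")
scanned = [EIGHT, SIX, EIGHT]  # the page reads "8 6 8"

strict_dict, strict_ids = encode(scanned, max_mismatch=0)
loose_dict, loose_ids = encode(scanned, max_mismatch=1)

print(len(strict_dict), strict_ids)  # 2 [0, 1, 0]
print(len(loose_dict), loose_ids)    # 1 [0, 0, 0]
```

With the strict setting the 6 gets its own dictionary entry. With a tolerance of one pixel it is folded into the 8, and without refinement data a decoder would draw "8 8 8". The smaller dictionary is exactly where the size savings come from.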

Full disclosure: I was the editor of JBIG2 and worked at Xerox, though I left long enough ago that I have no idea if any work I did there is embodied in these copiers.

3 |
William Rucklidge
| 2013/08/09 01:46 | reply

Thank you for this comment providing so much information. I added some of that to the article above (and the German version) and will get in touch by email!

4 |
David Kriesel
| 2013/08/09 08:45 | reply

Right now I'm on the road, therefore just a fast quick'n'dirty writeup about 'wc5665test.pdf':

1. The file's metadata claims that it comes directly from a 'Xerox WorkCentre 5665', created on Thu, Aug 8 at 14:52:43.

2. All pages are in Letter format (792×612 pt).

3. Each page is made up of one raster image with a color depth of 1 bit.

4. All raster images use JBIG2 compression.

5. All raster images on the pages use a resolution of 300 dpi.

5 | | 2013/08/09 09:15 | reply

Here are the metadata. (Resolutions were calculated by me.)

pdfinfo -f 1 -l 6 -meta wc5665test.pdf 
 
Creator:        Xerox WorkCentre 5665
Producer:       Xerox WorkCentre 5665
CreationDate:   Thu Aug  8 14:52:43 2013
ModDate:        Thu Aug  8 14:52:43 2013
Tagged:         yes
Form:           AcroForm
Pages:          6
Encrypted:      no
Page    1 size: 792 x 612 pts (letter)
Page    1 rot:  270
Page    2 size: 792 x 612 pts (letter)
Page    2 rot:  270
Page    3 size: 792 x 612 pts (letter)
Page    3 rot:  270
Page    4 size: 792 x 612 pts (letter)
Page    4 rot:  270
Page    5 size: 792 x 612 pts (letter)
Page    5 rot:  270
Page    6 size: 792 x 612 pts (letter)
Page    6 rot:  270
File size:      589999 bytes
Optimized:      yes
PDF version:    1.6
Metadata:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:ModifyDate>2013-08-08T14:52:43-04:00</xmp:ModifyDate>
         <xmp:CreateDate>2013-08-08T14:52:43-04:00</xmp:CreateDate>
         <xmp:MetadataDate>2013-08-08T14:52:43-04:00</xmp:MetadataDate>
         <xmp:CreatorTool>Xerox WorkCentre 5665</xmp:CreatorTool>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:7b2372c8-28f6-4ed4-ab47-ed2b56c5931d</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:dc071ad2-a635-4466-b088-12142e9c17d0</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Xerox WorkCentre 5665</pdf:Producer>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

6 | | 2013/08/09 09:16 | reply

pdfimages -list wc5665test.pdf

page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 mask     3296  2560  -       1   1  jbig2  no       117  0
   2     1 mask     3296  2560  -       1   1  jbig2  no         4  0
   3     2 mask     3296  2560  -       1   1  jbig2  no         9  0
   4     3 mask     3296  2560  -       1   1  jbig2  no        14  0
   5     4 mask     3296  2560  -       1   1  jbig2  no        19  0
   6     5 mask     3296  2560  -       1   1  jbig2  no        24  0
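
The resolution figure can be reproduced from the two listings above. The following is a minimal sketch, assuming the image covers the whole page and its long edge lies along the page's long edge; `parse_listing_row` and `effective_dpi` are helper names made up for this example, and the column positions are taken from the header line shown above.

```python
def parse_listing_row(row: str) -> dict:
    """Pick the fields needed here out of one data row of the image listing."""
    fields = row.split()
    return {
        "page": int(fields[0]),
        "width": int(fields[3]),
        "height": int(fields[4]),
        "enc": fields[8],
    }


def effective_dpi(pixels: int, points: float) -> float:
    """Pixels per inch when `pixels` span `points` PDF points (72 pt = 1 inch)."""
    return pixels / (points / 72.0)


# First data row of the listing and the page size reported by pdfinfo.
ROW = "   1     0 mask     3296  2560  -       1   1  jbig2  no       117  0"
PAGE_SIZE_PTS = (792, 612)

image = parse_listing_row(ROW)
dpi_x = effective_dpi(image["width"], PAGE_SIZE_PTS[0])
dpi_y = effective_dpi(image["height"], PAGE_SIZE_PTS[1])

print(f"{image['enc']}: {dpi_x:.1f} x {dpi_y:.1f} dpi")
# prints "jbig2: 299.6 x 301.2 dpi"
```

Both values round to the nominal 300 dpi; the small deviation comes from the pixel dimensions not being exact multiples of the page size.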

7 | | 2013/08/09 09:17 | reply

The file from the WorkCentre 5665 gave the impression of having been created 'all of a piece' by the WorkCentre. The red comments were of course added to the PDF pages after the fact, but they didn't modify the PDF's metadata (and, according to the PDF spec, they are not supposed to).

The file from the WorkCentre 7655 was certainly assembled after scanning (obviously so, because the statement was clear that the original TIFF had been inserted as the first page).

In any case, the 7655 file doesn't look like 'all of a piece', and analysing it is a bit more complex, because on each of pages 2 to 7 there are two raster images: one with the numbers, compressed with JBIG2, and another 'white' one using JPEG compression (possibly a sort of 'mask' covering the page area containing the numbers, used for OCR).

What can be said with some certainty is this:

1. The image with the numbers on page 2 has a size of 2882×1783 pixels. It is rendered at 300 ppi on the PDF page.

2. The same is true for all other pages (3-7).

3. There is no trace of OCR on pages 5-7, where the annotations claim that OCR was enabled while scanning.

8 | | 2013/08/09 09:51 | reply

I'm re-running tests with both the 7655 and 5665 to see if my results are the same. I'll upload the individual results from each test separately and post that link for review.

The version of pdfimages I have (win32) doesn't have the -list switch built into it for some reason. I'll have to leave it up to you all to review it for compression info.

9 |
j.colbert
| 2013/08/09 17:20 | reply

I have forwarded a zip file to David, and asked that he post it to the blog. (I don't have a good server to post the file to on my own right now.) The file contains the individual and untouched scans from my latest tests on the 7655, all of which still show substitution. The README file will offer more specific info on how I tested. I hope that someone can either spot something I'm doing wrong in my testing, or verify that my test results appear to be good.

10 |
j.colbert
| 2013/08/09 20:02 | reply

@j.colbert:

The 'pdfimages' that has the '-list' parameter is only available from the Poppler (http://poppler.freedesktop.org/releases.html) package.

Poppler is a fork of the original XPDF software (http://www.foolabs.com/xpdf/download.html).

My guess is that you're using the Windows version of XPDF's 'pdfimages'.

In principle, Poppler can be compiled on all platforms. Up until about 2 years ago, though, I found no Windows version, and I was unable to build it myself. Meanwhile, however, there are these:

* http://blog.alivate.com.au/poppler-windows/

* http://laconsigna.wordpress.com/2011/07/14/compiling-poppler-on-windows/

11 | | 2013/08/09 23:08 | reply

j.colbert: Thank you so much for helping. I'm sorry not to have used your zip file directly, but I thought this would be so big a deal that I had to reproduce it with my own hands to be sure before publishing.

12 |
David Kriesel
| 2013/08/10 11:38 | reply