Please see the “condensed time line” section (the next one) for a time line of how the Xerox saga unfolded. It for example depicts that I did not push the thing to the public right away, but gave Xerox a lot of time before I did so. Personally, I think this is important. I will post updates and links to the relevant blog posts in there in order not to make total crap out of this article's outline. In this way, I keep this article up-to-date for future visitors and also write new blog posts on the topic for RSS users.
In this article I present in which way scanners / copiers of the Xerox WorkCentre Line randomly alter written numbers in pages that are scanned. This is not an OCR problem (as we switched off OCR on purpose), it is a lot worse – patches of the pixel data are randomly replaced in a very subtle and dangerous way: The scanned images look correct at first glance, even though numbers may actually be incorrect. Without a fuss, this may cause scenarios like:
To make things even more worse: The copiers in question are the common Xerox WorkCentres, and Xerox seemed to be unaware of the issue until we found out about it last Wednesday. Whats more, not only one different WorkCentre model seems to be affected, as we tested at least two with this issue (Xerox WorkCentre 7535 and 7556). Additionally, the current software release, as installed by xerox support, did not solve the issue, thus, the issue existed on the very old release we had installed, as well as on a very new one. The error has been confirmed by a xerox rental firm in the meantime, and Xerox is investigating as well, so it does not seem to be some dumb handling error or something similar (if I was thinking this, I of course would not publish it here).
As a result, anyone using those WorkCentres has to ask himself:
Even though Xerox seems eager to solve the issue, because of the possible dangers an immediate publication of the issue is advisable. This is what I want to do with this article.
The rest of the article is organized as follows.
Due to the mass of the blog articles, I keep getting asked for a comprehensive time line. Here it is. The relevant blog posts are linked (barely visible, though, hm). This list will be updated as used, the current status is always marked with a .
|July 24th||Number mangling discovered|
|July 25th||Told Xerox about number mangling in pixel data. During the next days, neither the Xerox support levels we had phoned, nor the Xerox helpdesks visiting, did know that character substitutions could occur at all and were amazed to see this. The Xerox helpdesk guys even set out replicating the “bug” at their devices and managed to do so. I think this is important as it depicts that I first went to Xerox and gave them about a week of time, before I posted it on my blog. Later I was to learn that the number mangling was known to Xerox.|
|August 1||No solution from Xerox yet, nearly one week, across all support levels. They either do not believe us or look amazed. I think this is scary and write this blog post.|
|August 2||The blog post goes around the planet. From now on, I get emails of people being able to replicate the “bug” around the world, as well as informations that JBIG2 compression can cause this. I update this blog post a lot.|
|August 6, morning||A reader tells me there is a small notice in his copier's admin panel about character substitution. On his device, the “bug” can be avoided by setting compression from “normal” to higher. As a consequence, the issue must have been known by Xerox – so why was nobody telling us? As the notice only occurs when adjusting the setting in the admin panel, the implications, however, stay the same. People may create false figures without knowing. Anyway, I write a blog post presenting this workaround.|
|August 6, afternoon||Conference call with Rick Dastin and Francis Tse. Rick Dastin, Vice President at Xerox, is the first one actually working at Xerox being able to tell me that character substitutions actually can occur and Xerox knows (in contrast to their support). They also tell me that this is wanted. I criticize that this compression mode is called “normal”. I also learn that there are two notices in the (300 page) manuals also, telling about character substitution, also with respect to “normal”.|
|during the conference call||Xerox publishes their first press statement. “For data integrity purposes, we recommend the use of the factory defaults with a quality level set to “higher.”” Aside from this, they confirm the work around and tell the customers about the small notice.|
|August 7||Second Statement, Software patch announced, reads like the “normal” setting will be entirely eliminated by the patch. Xerox states most clearly: “You will not see a character substitution issue when scanning with the factory default settings.” In a further document, these factory settings are defined: Compression “higher”, at least 200 DPI.|
|August 9||With Compression “higher” (even advised in the Xerox press statements) and an even more generous resolution 300 DPI I reproduced number mangling on a Xerox device. As I think this is a big deal as anyone could be affected, I told Xerox and wait until they confirmed to see number mangling with the same settings, too. They confirm, and after that I wrote a detailed blog post about this.|
|August 11||On a Xerox WorkCentre 7545, all three compression modes seem to be affected. A user reports the same using a WC 7655. In these cases, numbers seem to be mangled independently from the compression mode, which makes the issue hard to avoid. See blog post.|
|August 12||Xerox scanning issue fully confirmed. They indeed implemented a software bug, eight years ago, and indeed, numbers could be mangled across all compression modes. They have to roll out a patch for hundreds of thousands of devices world-wide. On one hand, the implications might be of vital significance for Xerox now. On the other hand, I am glad not to go down in history as the guy too dumb to read the manual. Here is their press statement.|
|August 19||Pretested the Xerox number mangling patch today. Looks good. Pattern matching completely eliminated, hundreds of thousands of devices affected.|
|August 22||First patches for Xerox scanning bug releasted. Because of the large number of affected device types, the patches are going to be released in several waves.|
|Sept 11||I keep getting asked why I did not demand money from Xerox: There are reasons for this way of proceeding. Click here for a blog post on this.|
We got aware of the problem by scanning some construction plan last Wednesday and printing it again. Construction problems contain a boxed square meter number per room. Now in some rooms, we found beautifully lay-outed, but utterly wrong square meter numbers. You really have to read the numbers to find out; this is why it is so hard to find out. In the present case, we found out because one room in the construction plan was – as the copy told us – about 22 square meters large, whereas the next room, a lot larger, was assigned a label with 14 square meters.
Firstly, I present to you a complete original version of the affected construction plan part. After this, the wrong numbers will be presented. Click to enlarge. I added the yellow marks myself to show you where the errors will occur. Let us name the upper one “place 1”, the lower left “place 2” and the lower right “place 3”.
Now, let us scan the construction plan and get a PDF file from it. No OCR, just plain image. Then, we get wrong square meter numbers at the three places (Yeah, couldn't believe it, too). The screen shots of the erroneous places are organized in the below table. There is one additional line in the table for the original patches. The Xerox WorkCentre 7535 always produced the same errors; this is why we only need one line for it in the table. In contrast, the WorkCentre 7556 randomly produced different numbers, this is why I present three lines for three runs with different errors.
|Run / Machine||Place 1||Place 2||Place 3|
|Original, aus einem Tif-Scan entnommen, Korrektheit verifiziert|
|Xerox WorkCentre 7535|
|Xerox WorkCentre 7556, Run 1|
|Xerox WorkCentre 7556, Run 2|
|Xerox WorkCentre 7556, Run 3|
I know that the resolution is not too fine, but the numbers are clearly readable. Additionally, obviously, these are no simple wrong pixels, but whole image patches are mixed up or copied. I repeat: This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.
Next example: Some cost table, scanned on the WorkCentre 7535. As we are used to, a correct-looking scan at the first glance, but take a closer look. This error was found because usually, in such cost tables, the numbers are sorted ascending.
The 65 became an 85 (second column, third line). Edit: I'm getting emails telling me that also a 60 in the upper right region of the image became a 80. Thanks! This is not a simple pixel error either, one can clearly see the characteristic dent the 8 has on the left side in contrast to a 6. This scan is several weeks old – no one can say how many wrong documents have been produced by the Xerox machines in the mean time.
Here some tech detail in order to enable you reproduce the errors:
Note that I did not reproduce the Error on all of these myself, so except for the 7535 and 7556, take the information as hearsay.
After the cost table, I printed some numbers, scanned them, OCRed them and compared them to the original ones. As the OCR produces errors, by itself, one obviously has to check by hand for false positives when performing this. I took Arial, 7pt as a test font, and the WorkCentre 7535 with the newer of the aboved named Software version as a test machine. The scan settings were like above. And again, a lot of sixes were replaced by eights: (only a few of the errors are marked yellow for the sake for laziness):
Observe how the sixes around the false eights look correct. Also the false eights contain the characteristic dent again, so whole image patches have been replaced again.
In case you want to have a look for yourself:
The error does not occur if PDFs are scanned with OCR, or TIFs are scanned (the latter seems plausible, as the pure image data should be saved into the TIF). Additionally, there seems to be a correlation between font size, scan dpi used. I was able to reliably reproduce the error for 200 DPI PDF scans w/o OCR, of sheets with Arial 7pt and 8pt numbers. Overall it looks like some sort of compression algorithm using patches more than once (I think I could even identify some equally-pixeled eights).
Edit: It seems that the above thought was not that wrong at all. Several mails I got suggest that the xerox machines use JBIG2 for compression. Even though the specification only cover the JBIG2 decompression, in reality, there is often created a dictionary of image patches found to be “similar”. Those patches then get reused instead of the original image data, as long as the error generated by them is not “too high”. In this case, the pattern matching procedure for finding “similar” image patches seems to do its work too sloppy, which then seems to cause the classification of characters at similar, that actually are not the same. Makes sense.
While JGIB2 is a powerful (de) compression standard, Xerox seemed to neglect has to be most aware of the following important aspect: If using pattern matching upon compression there is no guarantee that parts of the scanned image actually come from the corresponding place on the paper. What comes out if pattern matching based compression is used at the wrong place, as it obviously is the case in business scan copiers, are documents that look perfectly right, but are incorrect in a subtle and dangerous way. For this reason, no JBIG2 document encoded by such pattern matching techniques can ever provide legal certainty. Is someone was to sue me given such documents, I would state “the document might be incorrect and you can't prove me wrong”. The issue gets even more dangerous if life-important documents are scanned, like medical prescriptions.
Of course, if Xerox would have chosen the patch size in a way enabling whole, readable letters to fit into the patches, this would be grossly negligent. Also, it would shed light on how these machines are tested, as when using some patch-based compression algorithm, it kind of suggests itself to test it with low-resolution, albeit still readable letters.
This is another thing I keep getting asked all the time. I'm not quite sure. JBIG2 is a powerful compression standard, but it can be pretty dangerous when misused, as we see. The misusing, in particular using pattern matching techniques for compression at all in the first place, and the bug disabling the ability to turn it off, seems to be a Xerox individual “feature”.
The big deal is that users are unable to see that's something wrong, for the outcoming documents look perfectly fine. Using a non-pattern matching based compression, one might recognize badly scanned numbers by for example black spots or other artifacts on them, but here, they are replaced by incorrect, despite nice-looking numbers. This makes the error almost impossible to detect. The issue has existed for eight years in Xerox machines, and *now* that people are aware, they all of a sudden are replicating it across the planet.