This evening, I had a conference call of about half an hour with
- Rick Dastin, Corporate Vice President Office and Solutions, and
- Francis Tse, Imaging System Architect at Xerox Corporation.
First, I'd like to point out that the atmosphere of the call was very relaxed and easy-going. Above all, both sides listened to each other, or at least that is my impression (Mr. Dastin, Mr. Tse, feel free to object). I highly appreciate the way Xerox is dealing with the issue, as not all enterprises would handle it in such a friendly way. We all know the stories of enterprises shooting the messenger over a blog post like this.
The main results in short:
- The suggestions of this blog concerning JBIG2 are right
- The suggested workaround is indeed a workaround, as it switches off JBIG2
- The main problem was, and still is, a support problem, which would not have occurred if Xerox support had known their machines.
Now for the finer-grained facts. The Xerox scanning design offers three quality levels: two standard levels (“high” and “higher”) and one that gives small file sizes but deliberately neglects data integrity (“normal”). The “normal” setting uses JBIG2 (as suggested) and therefore may indeed mangle characters. The “higher” and “high” levels use a different compression, which also explains why the image quality may actually decrease when switching from “higher” to “high”, another counter-intuitive thing we discussed.
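To illustrate why a patch-based scheme can swap characters at all, here is a toy Python sketch. It is neither the JBIG2 codec nor Xerox's implementation: the tiny bitmaps, the pixel-difference threshold `max_diff`, and all function names are assumptions made up purely for illustration. The idea is only that a patch which looks "similar enough" to one already stored gets replaced by the stored one.

```python
# Toy model of patch-based lossy compression. NOT real JBIG2 and not
# Xerox's code; glyph bitmaps and the threshold are invented for illustration.

GLYPHS = {
    "6": ["0110", "1000", "1110", "1001", "0110"],
    "8": ["0110", "1001", "0110", "1001", "0110"],
    "1": ["0010", "0110", "0010", "0010", "0111"],
}


def hamming(a, b):
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(patches, max_diff):
    """Store each patch once; reuse the first stored patch within max_diff pixels."""
    dictionary, indices = [], []
    for patch in patches:
        for i, stored in enumerate(dictionary):
            if hamming(patch, stored) <= max_diff:
                indices.append(i)  # "similar enough": reuse the stored patch
                break
        else:
            dictionary.append(patch)  # no match found: store a new patch
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    """Rebuild the page from the stored patches."""
    return [dictionary[i] for i in indices]


def read_back(patches):
    """Map bitmaps back to characters by exact lookup."""
    lookup = {tuple(rows): ch for ch, rows in GLYPHS.items()}
    return "".join(lookup[tuple(p)] for p in patches)


def scan(text, max_diff):
    """Render text as glyph patches, compress, decompress, and read it back."""
    patches = [GLYPHS[ch] for ch in text]
    dictionary, indices = compress(patches, max_diff)
    return read_back(decompress(dictionary, indices))


print(scan("6816", max_diff=0))  # exact matching only
print(scan("6816", max_diff=2))  # lenient matching
```

In this toy setup the "6" and the "8" differ in only two pixels, so with `max_diff=2` the "8" is silently replaced by the stored "6", and the output is a clean-looking but wrong number. With `max_diff=0` the text survives unchanged.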
Whether scanners need a compression level that neglects data integrity is open to debate; you have to form your own opinion on that. The key exchange concerning JBIG2 during the conference call was the following (quoted from memory, but I think these were the words):
David Kriesel: “If you give me a document encoded with JBIG2, and I claim it's incorrect, you can't prove me wrong.”
Francis Tse: “Yes, you're right, it's a probabilistic thing.”
To be entirely fair: the “normal” setting is not the default (in particular, in the company where the error first occurred to me, somebody must have set scanning to “normal”), and there is an (albeit small) warning message in the web interface; see also the screenshot in the workaround blog post.
Still, I don't think this lets Xerox off the hook. Personally, I would never ever implement patch-based image compression algorithms for text data that might possibly need legal certainty, which I also stated during the call. We did agree, however, that the names of the compression levels may be misleading. Somebody might always think “hey, the normal setting is enough for me”, overlook the small warning in the web interface about character substitutions, and go on with business (in fact, this is exactly what lots and lots of people seemingly did, but more on this later, when I also propose the only two solutions I can think of).
What Xerox seemingly didn't know at all was the apparently vast number of customers using the “normal” setting in the way I sketched above, not really knowing about the implications, which puts the data integrity of their scanned documents at risk. As a consequence, even though the character mangling is not a bug, the negative implications I stated in my first article are nevertheless possible. Additionally, in the web interface, one has to know what to look for in order to change the parameter.
A further small point: asked what I would have expected them to do, I answered that I would have expected them to drop a note to the public once they became aware of how much concern the issue was raising. They told me that, while this is right in general, it was not so easy because of legal reasons as well as long decision paths.
For this reason, they are now trying to make their customers aware of the issue. Unfortunately, due to their decentralized structure, they have no complete customer register, so they are trying to do it via mass media (they even published a statement while we were talking: see here). I am happy to contribute as long as I still have the page impressions on this issue, which is why I am writing this article.
Last but not least, the main problem. In our particular case, even though the JBIG2 issues are well known within Xerox, nobody at any support level knew about them. This is the one aspect nobody had an explanation for at all, but it sounded like they will be investigating it. If somebody had told the company right away “hey folks, change the setting to higher”, I wouldn't have written my blog post in the first place. We agreed that this lack of knowledge was a bit surprising.
Additionally, we talked a bit about their JBIG2 implementation, and I suggested that they might decrease their maximum image patch size. Whether this suggestion will be taken to heart, I of course cannot say.
Edit and conclusion: After some thinking, I want to point out that I can think of only two solutions that offer certainty with respect to this issue.
- Remove patch-based lossy compression from the machines entirely (memory usage is no longer an issue), or
- force the user to click away a notification every single time a PDF is created with this scanning mode, so they know that anything might happen. Additionally, add a watermark to the PDF to warn later readers the PDF might be passed on to.
A small, single, technical notification within the web interface doesn't seem to work at all; this is the lesson one can learn from the past days, and from the hits I keep getting on the issue (more than 200,000 within less than days). I'm still getting emails from people whose Xerox machines show the behavior I described. Now I can at least point them to my Workaround post.