Wednesday, September 7, 2011

Publishing on the Cheap

Casting about for a copy of Unbought Spirit to give a friend, I found that a press I hadn't heard of had brought John Jay Chapman's Memories and Milestones back into print. I imagined something like a Dover Publications edition: a bit boxy, covers with a coating that cracks, typography that looks dated. I gave the press too much credit.

The publishers had used optical character recognition software, and it has left a number of maddening traces:
  • footnotes embedded in the running text
  • captions for illustrations not included
  • at the foot of each chapter the heading for the next one, like the catchword in older books, but immediately beneath the last paragraph
  • the occasional odd character error
  • occasional bad line breaking
The character errors can be entertaining, as when, in a chapter that mentions the unfortunate influence of business on education, one reads of the "Board o$ Trustees", or when a personal name almost appears in "Cory bantic". The failures in the line breaks set me to wondering about the algorithms used: in a couple of cases one or the other enclosing quotation mark is on a different line from the words it encloses, leading me to suspect that somebody didn't think through a regular expression. In other places, a series of short lines, as if verses, appears inexplicably--maybe an illustration went there?
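For the curious, here is a rough sketch of how one might hunt for such stranded quotation marks in the plain text before printing it. It is only a guess at a proofreading check, not a reconstruction of whatever the publishers actually ran. The function name and both patterns are made up for illustration, and it assumes the text uses straight double quotes.

```python
import re

# Hypothetical heuristics, assuming straight double quotes:
# an opening quote left dangling at the end of a line ...
STRANDED_OPEN = re.compile(r'(?:^|\s)"$')
# ... or a closing quote pushed to the start of the next line.
STRANDED_CLOSE = re.compile(r'^"(?:\s|$)')


def stranded_quote_lines(text):
    """Return (line_number, line) pairs where a quote is cut off from its word."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if STRANDED_OPEN.search(stripped) or STRANDED_CLOSE.search(stripped):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    sample = 'He spoke of the "\nold school" at length.\nA fine phrase\n" he said.'
    for number, line in stranded_quote_lines(sample):
        print(number, line)
```

A check like this only flags suspects for a human to look at; it cannot tell a stranded quotation mark from a deliberate one, such as a line of dialogue that really does open with a quote followed by a space.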

Had the publishers simply found an old copy, photographed the plates and created image files to print from, they'd have produced a much better volume. There would have been the occasional smudge, but "Pa-gliacci" and his friends would not have intruded. But a high-quality image file can be pretty fat compared to plain text--were they economizing on storage?

1 comment:

  1. While looking for print editions of older books, I've come across booksellers who, according to their own claims, are selling bound paperback photographic reprints of the originals, just as you describe. They're not terribly expensive, but I haven't ordered any of them, because I haven't the foggiest idea what I'd actually get; I fear receiving exactly the sort of OCR mess you describe.
