Things I have learned about OCR scanning

What’s OCR scanning? OCR stands for optical character recognition, and it’s a method of getting printed text into digital form. This can be useful for authors who are thinking of self-publishing rights-reverted texts. If the rights have reverted, the book was probably published before digital publishing, in which case the easiest way to digitise your text is to OCR scan it.

You’ll need a scanner and OCR software. I use ABBYY FineReader Express. You scan the pages to PDF first, then run the PDF through the OCR software, which converts the text to rich-text format (.RTF).

The accuracy is very good in general, but the software is only software, and it sometimes gets confused.

Display font recognition is pretty poor – so this isn’t a method to use if your entire book is set like this…

This came out as CUfUiOi+e

But if your book is set in a font like the one I’m using here, it comes out just fine, with a few points to note:

  • I’s often become 1’s or J’s or even T’s – this is particularly likely in speech like this: ‘I becomes 1 or J or T.’ I think this is because the software merges the speech mark and the letter together.
  • Foreign accents are generally ignored. (They might work in different languages – I haven’t tested yet.)
  • Random full stops creep in sometimes. This might be because of particularly large or blobby serifs in a serif font or because of a printing error or mark.
  • An italic ! can be translated to /.
  • ? can be translated to /.
  • If text is tracked wide the software will add multiple spaces between words.
  • If text is tracked tight the software might close up spaces.
  • Occasionally the software will introduce rogue paragraph endings.
  • Ellipses can come out as dot dot dot or dot space dot space dot depending on the original setting. I always change all versions to … (alt and ; on a Mac). This is important for e-books because if you leave them as dots and spaces, you can get odd breaks such as two dots on one line and one on the next, and that looks really unprofessional. For the same reason I suggest that where an ellipsis comes at the end of a sentence you close up the space before it.
  • And remember, it’s only software – if there was a typo in the book, there will be a typo in the resulting text!
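
The most mechanical fixes in the list above (multiple spaces and the different ellipsis settings) could be automated once the text is out of the OCR software. What follows is only a rough sketch in Python, not something FineReader does for you: the function name tidy_ocr_text is made up for illustration, and it assumes you have saved the OCR output as plain text rather than .RTF.

```python
import re

def tidy_ocr_text(text: str) -> str:
    """Apply two mechanical fixes to OCR output (hypothetical helper)."""
    # Collapse runs of spaces, such as those added where text was tracked wide.
    text = re.sub(r" {2,}", " ", text)
    # Turn "..." or ". . ." (dots with optional single spaces between them)
    # into one ellipsis character.
    text = re.sub(r"\.(?: ?\.){2}", "\u2026", text)
    return text

print(tidy_ocr_text("Well . . . I  suppose   so..."))
```

It deliberately leaves everything else alone: closing up the space before an end-of-sentence ellipsis, or restoring spaces where tight tracking swallowed them, still needs a human eye.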

This means you will have to run through a load of search and replace queries and a spell check, so more on that another time.
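
In the meantime, here is one possible way to find the places worth checking before you start replacing anything. It is a sketch under my own assumptions: the patterns are guesses based on the misreadings listed above, the name find_suspects is hypothetical, and it only reports lines for you to review by hand rather than changing them.

```python
import re

# Hypothetical patterns based on the misreadings described in this post.
SUSPECT_PATTERNS = {
    "stand-alone 1, J or T (possibly a misread I)":
        re.compile(r"(?<![\w.,])[1JT](?![\w.,])"),
    "slash (possibly an italic ! or ?)":
        re.compile(r"/"),
    "full stop before a lower-case word (possibly a rogue full stop)":
        re.compile(r"\. [a-z]"),
}

def find_suspects(text):
    """Return (line number, description, line) tuples for manual review."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for description, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, description, line))
    return hits

sample = "\u20181 think so,\u2019 she said.\nWhat/ No. way.\nAll fine here."
for number, description, line in find_suspects(sample):
    print(number, description, line, sep=" | ")
```

Expect false alarms (a T-shirt or a genuine slash will be flagged too), which is why this only lists candidates and leaves the decision to you.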

