Tag Archives: OCR scanning text

Things I have learned about OCR scanning

What’s OCR scanning? It’s a method of getting printed text into digital form. This can be useful for authors who are thinking of self-publishing rights-reverted texts. If the rights have reverted, the book is likely to have been published before digital publishing, in which case the easiest way to digitise your text is to OCR scan it.

You’ll need a scanner and OCR software. I use ABBYY FineReader Express. You scan the pages to PDF first, then run the PDF through the OCR software, which converts the page images to text in rich-text format (.RTF).

The accuracy is very good in general, but the software is only software, and it sometimes gets confused.

Display font recognition is pretty poor – so this isn’t a method to use if your entire book is set like this…

This came out as CUfUiOi+e

But if you use a font like this I’m using here it comes out just fine with a few points to note:

  • I’s often become 1’s or J’s or even T’s – this is particularly likely in speech like this: ‘I becomes 1 or J or T.’ I think this is because the software merges the speech mark and the letter together.
  • Foreign accents are generally ignored. (They might work in different languages – I haven’t tested yet.)
  • Random full stops creep in sometimes. This might be because of particularly large or blobby serifs in a serif font or because of a printing error or mark.
  • An italic ! can be translated to /.
  • An italic ? can be translated to /.
  • If text is tracked wide the software will add multiple spaces between words.
  • If text is tracked tight the software might close up spaces.
  • Occasionally software will introduce rogue paragraph endings.
  • Ellipses can come out as dot dot dot or dot space dot space dot depending on the original setting. I always change all versions to … (alt and ; on a Mac). This is important for e-books because if you leave them as dots and spaces, you can get odd breaks such as two dots on one line and one on the next. And that looks really unprofessional. For the same reason I suggest that where an ellipsis comes at the end of a sentence you close up the space before it.
  • And remember, it’s only software – if there was a typo in the book, there will be a typo in the resulting text!

This means you will have to run through a load of search and replace queries and a spell check, so more on that another time.
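As a taste of what those search and replace queries might look like, here is a minimal Python sketch. It is not what FineReader or Word does internally, just one way to script three of the fixes listed above: collapsing extra spaces, turning dots (spaced or unspaced) into a single ellipsis character, and restoring a lone 1, J or T after an opening speech mark to I. The function name clean_ocr_text is made up for illustration, and the patterns are my assumptions about what the errors typically look like, so check the results by eye.

```python
import re

ELLIPSIS = "\u2026"  # the single-character ellipsis


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative search-and-replace fixes to OCR output."""
    # Runs of spaces (from widely tracked text) become one space.
    text = re.sub(r" {2,}", " ", text)
    # 'dot dot dot' or 'dot space dot space dot' becomes one ellipsis character.
    text = re.sub(r"\.(?: ?\.){2}", ELLIPSIS, text)
    # A lone 1, J or T straight after an opening speech mark is probably 'I'.
    # Assumption: it is followed by a space or an apostrophe (as in I'm).
    text = re.sub(r"(?<=[‘“])[1JT](?=[ ’'])", "I", text)
    return text


sample = "‘1 think  so . . . but wait..."
print(clean_ocr_text(sample))
```

The last rule will also change a genuine number 1 at the start of a line of speech, so treat this as a starting point rather than something to run unchecked.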


Uses for a scanner Pt 3: Digitising printed texts – a guide for authors and publishers

De-compiled!

Recently I’ve had enquiries from authors who are thinking about self-publishing their rights-reverted books as e-books, but don’t know where to start and what is involved.

Let’s assume that you are sure the rights have reverted to you, that you feel there’s a market for your books, and that you’ve got someone to market them (or you’re prepared to put the time in yourself). Let’s also assume you realise that converting your books will take some time, and probably some money if you don’t want to spend an inordinate amount of time on it.

So let’s start with first things first – how do you convert your printed book into an e-book?
First, the text must exist in an editable digital form. This means that if all you have is a printed book, you must somehow get it into a text document (Word or similar will do).

  • You could type it up again … or get someone else to type it up again…
  • You could check with your agent and/or publisher to see if they still hold files.
  • You could scan the book and use OCR scanning software. Which brings me to the point of this post.

You can use an ordinary flat-bed scanner that often comes with home printers these days. But it is a drawn-out process. Open book, place on scanner, scan spread (holding book down as flat as you can), take book out of scanner, turn page over and repeat … and repeat … and repeat. You can see how this will quickly become tedious, and you really do have to hold the pages as flat as possible or the OCR scanning software will struggle.

Another option is to use a sheet-feeder scanner. You will have to cut your book up – so book-lovers of the ‘weep to see a broken spine’ disposition look away now…

Step one: Thoroughly break the spine. Open and close in several places, bend and generally loosen up.

Step two: Carefully pull the cover away from the book block.

Gently pull the cover away from the book block

It’s not essential that the cover stays in one piece, but I think it makes it easier to handle.

Step three: Once the cover is off, carefully pull the book block apart into sections. Or if you have a huge guillotine about your person, use it to trim off the glued section.

Pull apart into manageable sections

Step four: Trim along the glued edges. You don’t have to worry about perfection here, just so long as you don’t cut into the text areas. I use scissors, but you can use a knife if it’s easier.

Trim off the glued edges

Step five: Fan through the pages several times to get rid of paper dust and to make sure all the pages are separate. Any still glued together will snarl up in the scanner. Books make a surprising amount of dust too.

Fan pages to separate and get rid of dust

Step six: Place about forty pages in the scanner (with my scanner it’s face down and pointing down). No need to count the pages – just experiment with how much it can cope with. Set the scanner going. Keep an eye out for snarl-ups or misfeeds. Make sure that you put the sections through in the correct order.

Pages going through the scanner

Step seven: Save the resulting scan as a PDF. This has created a series of page images. The text still isn’t editable at this point.

Step eight: Run your OCR software. I use ABBYY FineReader Express. Save the result as an RTF file.

You now have an editable file.

Step nine: You’ll need to check it through for OCR errors. The software is very good for reasonably normal text, but if your original has any fancy fonts, handwriting, etc, expect a lot of errors. I recently scanned a couple of books with chapter heads in a gothic blackletter font and they came out as complete gobbledygook. You’ll need to find and delete page numbers, running heads, etc. I’ve noticed errors on italics with ? or ! directly after them, and the letter I converting to the number 1 or the other way around. Foreign accents tend to be ignored too (if scanning in English, that is).
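Finding and deleting page numbers and running heads is the sort of job that can be scripted once the text is saved as plain text. The following is only a rough Python sketch under my own assumptions: that each page number or running head sits on a line of its own, and that you know the wording of the running heads in advance. The name strip_page_furniture and its parameters are invented for this example.

```python
import re


def strip_page_furniture(text: str, running_heads: list[str]) -> str:
    """Drop lines that are only a page number or a known running head."""
    heads = {head.strip().lower() for head in running_heads}
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # A line holding nothing but 1 to 4 digits is treated as a page number.
        if re.fullmatch(r"\d{1,4}", stripped):
            continue
        # A line matching a running head (ignoring case) is dropped.
        if stripped.lower() in heads:
            continue
        kept.append(line)
    return "\n".join(kept)


page = "It was a dark night.\n42\nMY NOVEL TITLE\nThe rain fell."
print(strip_page_furniture(page, ["My Novel Title"]))
```

Be aware that this would also remove a chapter number that sits alone on a line, so it is worth comparing the result against the printed book.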

Step ten: You now have digital text ready for formatting and converting – but that’s another story!

If this all sounds like a huge faff – I can do any or all parts of this process for you, and I can convert to e-book formats too. Just contact me for details.

Please note: you must own the rights to the work (or have permission from the owner).


A new scanner


Yay! I am now set up to OCR scan using this – the Fujitsu ScanSnap. Having played with it for a few days I’m really pleased with it.

It’s a neat little sheet-feed scanner – shown next to a Mac here. It’s permanently plugged in and you switch it on by opening it. Then you just feed in what you want to scan. It’ll take paper up to A4 size (and A3 if you wrap it round a carrier sheet). It scans both sides, but is clever enough to leave out blank pages.

But the magic is really in the software. OCR (Optical Character Recognition) software looks at the page image and extracts editable text from it. For printed text – such as a novel – the conversion is very nearly perfect. It’s not quite clever enough yet to work out page breaks and it has a bit of trouble with foreign accents (but it can be set to various languages so this may be a fix) and has the occasional inexplicable wobble.

OCRed text will always need checking, but it is the perfect solution for authors wanting to rerelease their backlist as e-books or print-on-demand titles.
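Part of that checking can be pointed in the right direction by a script. Here is a small Python sketch, not a feature of the ScanSnap software, that lists words worth a second look: ones mixing letters and digits (an I read as 1, say) and ones ending in a slash (which can be a misread ! or ?). The name find_suspects is hypothetical and the two rules are just my assumptions about common slips.

```python
def find_suspects(text: str) -> list[tuple[int, str]]:
    """Return (line number, word) pairs that look like possible OCR errors."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            # Ignore surrounding quotes and ordinary punctuation.
            word = token.strip("‘’“”'\".,;:()")
            has_alpha = any(ch.isalpha() for ch in word)
            has_digit = any(ch.isdigit() for ch in word)
            # Letters mixed with digits, or a word ending in a slash.
            if (has_alpha and has_digit) or (has_alpha and word.endswith("/")):
                suspects.append((line_no, token))
    return suspects


sample = "He said th1s was fine.\nStop/ she cried."
for line_no, word in find_suspects(sample):
    print(line_no, word)
```

It will throw up false alarms such as 1st or 2nd, and it cannot spot a missing accent, so it is a prompt for proofreading and not a replacement for it.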

I’ll post about the workflow later…

 

 
