(+) The Problems with OCR

The following is a Plus Edition article, written by and copyright by Dick Eastman. 

Much of the genealogy information available on the World Wide Web is obtained from old books, published many years ago. With today’s technology, vendors are finding it easy to scan the books and to convert the pages into computer text. The results are placed online and the text becomes searchable in Google and other search engines, as well as each site’s own “search box.” The conversion from printed pages to computer text can be performed at modest expense and the information derived can be valuable for many genealogists. There is but one problem: it doesn’t always work very well.

Scanning a page from a book creates a picture of the page. However, a picture is not easily searchable. The image is similar to a photograph taken with a digital camera: while a human can read it easily, the computer cannot “see” the words in the picture. A conversion process, called Optical Character Recognition, is required.

Optical Character Recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. For this article, I will ignore handwritten text as that is a much different process with its own challenges. Most genealogists are concerned with converting typeset books to computer text that can be searched.

The OCR process is simple in theory. When a printed page of text is scanned, the scanner delivers an image of the text to OCR software stored in the attached computer. The software then attempts to identify each letter of each word in the image in order to convert it to an editable text document or to convert the information into whatever format is needed.
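To make the "identify each letter" step concrete, here is a minimal toy sketch in Python. It is not how any particular OCR product works: it assumes each letter has already been cut out of the page as a tiny 3x5 bitmap, and it simply picks the known letter shape that differs in the fewest pixels. The three-letter alphabet, the `GLYPHS` table, and the function names are all made up for illustration.

```python
# Toy letter shapes: "#" is ink, "." is blank paper (hypothetical 3x5 font).
GLYPHS = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "R": ["##.", "#.#", "##.", "#.#", "#.#"],
}


def distance(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize_glyph(bitmap):
    """Return the known letter whose shape is closest to the bitmap."""
    return min(GLYPHS, key=lambda letter: distance(bitmap, GLYPHS[letter]))


def recognize_line(bitmaps):
    """Convert a sequence of letter bitmaps into a text string."""
    return "".join(recognize_glyph(b) for b in bitmaps)


# A cleanly printed word is recognized correctly.
clean = [GLYPHS["O"], GLYPHS["C"], GLYPHS["R"]]
print(recognize_line(clean))

# An "O" whose right edge has faded now looks more like a "C".
faded_o = ["###", "#..", "#..", "#.#", "###"]
print(recognize_line([faded_o, GLYPHS["C"], GLYPHS["R"]]))
```

In this sketch the faded "O" is two pixels away from a clean "O" but only one pixel away from a "C", so the software guesses the wrong letter. Real pages add smudges, broken type, and unusual fonts, which is why the conversion is harder than it sounds.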

Converting a picture of a word into the computer text equivalent of the same word is a much more complex process than one might think. If you are aware of the strengths and weaknesses of the conversion process, you can better understand the search process when looking for information. That understanding can lead to better search results, because you will know what works and what does not.
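As one illustration of why this matters for searching, the sketch below looks for words that are close to, but not exactly, the name being searched. It assumes a commonly cited OCR confusion, the letters "rn" being read in place of "m", and uses Python's standard `difflib` module to score similarity. The sample sentence, the `fuzzy_find` function, and the 0.75 cutoff are illustrative choices, not something a particular genealogy site is known to use.

```python
import re
from difflib import SequenceMatcher


def fuzzy_find(query, text, cutoff=0.75):
    """Return (word, score) pairs for words in text that resemble query."""
    q = query.lower()
    hits = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in order.
        score = SequenceMatcher(None, q, word.lower()).ratio()
        if score >= cutoff:
            hits.append((word, round(score, 2)))
    return hits


# Hypothetical OCR output in which "Eastman" was misread as "Eastrnan".
page_text = "Jonathan Eastrnan was born in Maine"

print("Eastman" in page_text)            # an exact search misses the name
print(fuzzy_find("Eastman", page_text))  # a fuzzy search still finds it
```

An exact search for "Eastman" finds nothing on this page, while the approximate search still turns up the misread spelling. When a site offers only exact matching, trying likely misreadings of a name by hand serves the same purpose.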

The remainder of this article is for Plus Edition subscribers only and will remain in the Plus Edition subscribers’ web site for several weeks. SUBSCRIBE NOW to read this article.
