This week I added a new software tool to my genealogy toolbox: ABBYY TextGrabber. Actually, this program uses a process that I have been using for several years: take a picture of a document and then later convert the image to text using OCR (optical character recognition). The one difference is that ABBYY TextGrabber "packages" everything together in one easy-to-use program. If you already own an Apple iPhone, you might want to add this low-cost program to your toolbox as well.
ABBYY TextGrabber allows you to take a picture of any text, such as genealogy information found in a book, and then converts that text to a computer-readable format by using OCR. The decoded text can then be sent by email or copied via iTunes to a Windows, Macintosh, or Chromebook computer, where you can easily copy and paste the information into any genealogy program, word processor, or almost any other program.
Using ABBYY TextGrabber can save you a LOT of manual data entry on a keyboard!
NOTE: Newsletter readers report that ABBYY TextGrabber only works with the iPhone 4 or 3GS, not the iPod Touch or the iPad. I didn't have an iPod Touch available to me so I tested ABBYY TextGrabber with an iPhone 4. I like it so well that I am keeping the program.
The Process
Actually, through a series of iPhones I have owned, I have been using the camera in my iPhone to take pictures of documents for years. The built-in camera works rather well at snapping pictures of all sorts of documents although it doesn't have the high resolution of a dedicated desktop scanner. Still, the resolution of the latest iPhone 4 is sufficient for most of my purposes.
For example, the image to the right is a picture I took with my iPhone 4 of a page from the History and Genealogy of the Eastman Family of America, published in 1901 by Guy S. Rix. Click on the small image to see a full-size copy of the picture that I snapped with the iPhone camera. Note that the resolution is rather good at 1,936 by 2,592 pixels, resulting in a 1.8 megabyte file.
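To put those pixel dimensions in perspective, here is a minimal Python sketch of the arithmetic. The pixel counts come from the photo described above. The page width of 6 inches is purely my assumption for illustration (I did not measure the book), and the calculation also assumes the page fills the entire frame. A commonly quoted rule of thumb for OCR is roughly 300 dots per inch, so treat the result as a rough sanity check only.

```python
# Pixel dimensions of the iPhone 4 photo, as reported in the article.
WIDTH_PX = 1936
HEIGHT_PX = 2592

# ASSUMPTION (not from the article): the printed page is about 6 inches
# wide and fills the whole width of the photo.
ASSUMED_PAGE_WIDTH_IN = 6.0


def megapixels(width_px: int, height_px: int) -> float:
    """Total pixel count expressed in millions of pixels."""
    return width_px * height_px / 1_000_000


def effective_dpi(pixels: int, inches: float) -> float:
    """Pixels captured per inch of paper along one dimension."""
    return pixels / inches


if __name__ == "__main__":
    print(f"{megapixels(WIDTH_PX, HEIGHT_PX):.1f} megapixels")
    dpi = effective_dpi(WIDTH_PX, ASSUMED_PAGE_WIDTH_IN)
    print(f"roughly {dpi:.0f} dpi across the assumed page width")
```

If the page occupies only part of the frame, the effective resolution drops accordingly, which is one reason to fill the viewfinder with the page when you snap the picture.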
The picture looks good to the human eye. It does have a few characteristics, however, that are typical of pictures of genealogy books. First, the camera was not perfectly vertical in relation to the text when I snapped the picture, so the text is slightly slanted. This is typical of using a handheld camera to take pictures of pages in a book. It is almost impossible to align a handheld camera exactly vertically in relation to the book. That won't bother the human eye, but it might make a big difference when converting the text by OCR technology. (I'll write more about OCR a bit later in this article.)
Next, there are "extra marks" on the page. Someone wrote a checkmark above and to the left of the word "John" in the first paragraph of text, just below the page title. Also look at the ninth line (counting the title as a line). Note the smudge in front of the letter "e" in "each of his children."
Next, in the middle of the page, someone crossed out the month of "Aug." and wrote in "Oct."
Another problem is the font sizes. At least this page has all the text in a serif font, but the superscript numbers at the end of each child's name will be difficult for a computer to interpret easily.
All in all, this picture looks rather "clean" although I do see one extra speck above the word "MASS." in the page title. Many pictures of genealogy books have a lot more speckles, smudges, and dots.
The OCR Conversion
You do not need ABBYY TextGrabber to perform OCR text conversion, although using this low-cost program certainly will simplify the process. If you don't have ABBYY TextGrabber, you can take the picture with any iPhone camera, using the built-in software. Then you will need to find OCR software and also determine the best method of transferring your picture to that software. ABBYY TextGrabber easily does all that for you for a one-time charge of $1.95.
OCR stands for "optical character recognition." Optical character recognition is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files. For genealogists, this usually means converting an image of a printed or typewritten book or document into text that can be treated like any other text in a computer. Text that has been converted by OCR can be used in a word processor, a text editor, a genealogy program, or most any other program that uses text.
For a detailed explanation of the workings of OCR, look at Wikipedia at http://en.wikipedia.org/wiki/Optical_character_recognition.
The Results
The big drawback of OCR is that it is not 100% accurate, even when clear images are available. Some errors almost always occur when attempting to convert the pictures of text to actual text. If you add in smudges, extra marks, hand-written corrections, and misaligned text, the number of errors quickly increases. Perhaps the biggest problem of all is the mix of different sized fonts. Information presented in columns, rather than as flowing sentences, also creates accuracy problems.
Using the image shown above as input, ABBYY TextGrabber used OCR technology to create the following text:
JOHN EASTMAK2, OF SALISBURY, MASS. 9
'2. John Eastman2 (Roger1), born .in Salisbury, Mass.,
Jan. 9, 1640; died there March 25, 1720; married ist,
Oct. 27, 1665, Hannah Heilie ; 2d, Nov. 5, 1670, Mary.
Boynton, born in Rowley, Mass., May 23, 1648, daughter of
William Boynton, of Rowley, Mass. Mr. Boynton was a school
teacher for many years, also a tailor and planter; owned
lands in various parts of Essex county. He gave a farm to
•each of his children, seven in number. Mr. Eastman took
the oath of allegiance in 1677, and was made freeman in 1690.
He represented Salisbury in the general court at Boston
in 1691.
CHILDREN.
i. Hannah', b. Nov. 23, 1673; d. Dec. 18, 1673.
ii. John8, b. Aug. 24, 1675.
iii. Zachariah8, b. Aftgx-24, 1679. ^) ^
iv. Roger*, b. Feb. 26, 1682.
v. Elizabeth', b. Sept. 26, 1685; m. 1st, April, 1705, George
Brown ; 2d, Dec. 10, 1713, Thomas Fellows.
vi. Thomas8, b. Feb. 14, 1688 ; d. Aug. 27, 1691.
vii. Thomas8, b. 1691.
16. viii. Joseph', b. June 23,1692.
3. Nathaniel Eastman2 (Roger*), born in Salisbury, Mass.,
March 18, 1643 ; died Nov. 30, 1709; married April 30,
1672, Elizabeth Hudson, daughter of Jared Hudson. She
died June 10, 1716. He lived in Salisbury, where he took
the oath of allegiance in 1678, and was made freeman in
1690. He was admitted to the church in Salisbury in
1698, and his wife in 1687. He was a cooper by trade.
His will was admitted to probate in 1710.
12.
13.
14.
15.
Created by ABBYY TextGrabber
I would suggest this output is good, but far from perfect. Notice the extra "dot" in front of "each of his children," and then look at the same words in the original picture. See the smudge in front of the word "each"? That created an OCR error.
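If you want to put a rough number on "good, but far from perfect," one option is to compare a line of OCR output against the same line after you have corrected it by hand. The sketch below uses Python's standard difflib module on the "each of his children" line. The function names are my own inventions for illustration, and the similarity ratio is only one of several possible measures, not an official accuracy figure from ABBYY.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, corrected_text: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between OCR output and a
    hand-corrected copy of the same text."""
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    return matcher.ratio()


def differences(ocr_text: str, corrected_text: str) -> list:
    """List every edit needed to turn the OCR text into the corrected
    text, as (operation, ocr_fragment, corrected_fragment) tuples."""
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    return [
        (tag, ocr_text[i1:i2], corrected_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The OCR line from the output above, and the same line with the stray
# dot removed by hand.
ocr_line = "•each of his children, seven in number. Mr. Eastman took"
fixed_line = "each of his children, seven in number. Mr. Eastman took"

if __name__ == "__main__":
    print(f"similarity: {similarity(ocr_line, fixed_line):.3f}")
    for operation, wrong, right in differences(ocr_line, fixed_line):
        print(operation, repr(wrong), "->", repr(right))
```

The list of differences is often more useful than the ratio itself, because it shows exactly which characters the OCR process got wrong.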
The column of text did not get decoded accurately. Entire numbers are missing from the list of children of John and Hannah Eastman although those numbers appear to have been added to the end of the page of text.
In the original book, Roger Eastman (person number 14) has a superscript 3 after his name (Roger3 Eastman) although the number appears to be fuzzy. In the OCR version, the fuzzy superscript 3 was decoded as an asterisk. An almost identical situation occurs on the next line with Elizabeth although the superscript 3 on her name was decoded as an apostrophe.
The crossed-out month of "Aug" in Zachariah's birth date totally confused the OCR software, and, of course, the hand-written "Oct" was not decoded. In fact, the state of the art in OCR today is not sufficient to work on hand-written text.
Such errors are common in any OCR product. Using a very powerful OCR program in a high-speed computer with lots of available memory can reduce the number of OCR errors, but it will never eliminate them. Unfortunately, a handheld iPhone or iPod Touch does not qualify as a "high-speed computer with lots of available memory."
The thing that amazes me is that ABBYY TextGrabber even works at all. In fact, I'd say it works rather well when compared to some other, much more expensive OCR programs I have seen and used. These results aren't bad for a program that costs $1.95! When you add in the simplicity of use, I'd say that ABBYY TextGrabber is a winner.
Anyone who is a touch typist might be able to perform data entry as fast by hand as by using OCR followed by manual "clean up" efforts. However, I am not a touch typist. Considering my typing speed, I find it faster to use OCR and then make a second pass to correct the errors manually. Your results may vary, depending on your own keyboard skills.
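That second pass can be made a little quicker by having the computer point out lines that deserve a closer look. The following Python sketch flags lines containing symbols that I would not expect in this kind of book, plus a period stuck to the front of a word. The two patterns are my own guesses based on the errors in the sample above, not a feature of ABBYY TextGrabber, and they will not catch everything.

```python
import re

# ASSUMPTION: bullets, carets, and asterisks do not normally appear in
# the printed text of this kind of book, so they likely mark OCR errors.
SUSPECT_SYMBOLS = re.compile(r"[•^*]")

# A period that follows a space and is glued to a lowercase letter,
# as in "born .in Salisbury".
STRAY_PERIOD = re.compile(r"\s\.[a-z]")


def flag_lines(text: str) -> list:
    """Return (line_number, line, reasons) for each line worth checking."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        reasons = []
        if SUSPECT_SYMBOLS.search(line):
            reasons.append("unexpected symbol")
        if STRAY_PERIOD.search(line):
            reasons.append("stray period")
        if reasons:
            flagged.append((number, line, reasons))
    return flagged


# A few lines copied from the OCR output shown earlier.
sample = """\
'2. John Eastman2 (Roger1), born .in Salisbury, Mass.,
•each of his children, seven in number. Mr. Eastman took
ii. John8, b. Aug. 24, 1675.
iii. Zachariah8, b. Aftgx-24, 1679. ^) ^
iv. Roger*, b. Feb. 26, 1682."""

if __name__ == "__main__":
    for number, line, reasons in flag_lines(sample):
        print(f"line {number}: {', '.join(reasons)}: {line}")
```

Note that a simple check like this only catches the Zachariah line because of the stray carets. It would not notice a garbled word such as "Aftgx-24" on its own, so it supplements a careful read against the original page rather than replacing it.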
Another thing I discovered about ABBYY TextGrabber is that it likes to have a lot of light available when taking the picture. The better the light, the better the picture and the lower the number of OCR errors. While a light bulb is good, it can also induce uneven light and glare, especially on glossy pages. Mother Nature produces the best lighting source of all: sunshine. I'd suggest always placing the book near a window before taking the picture. The above picture from the History and Genealogy of the Eastman Family of America was taken near a window on a cloudy day.
I am delighted with ABBYY TextGrabber. It is one of the most useful low-cost programs I have ever used. To be sure, better programs are available, but not with a purchase price of $1.95. For $1.95, I am impressed! Since I always have my cell phone with me, I always have a scanner with me as well. I'd call that another huge advantage: convenience.
The iPhone 4 will never create a picture as high-quality as a dedicated flatbed desktop scanner, but it can produce results that are sufficient for many purposes. Remember that you will almost always need to perform some manual "clean-up" effort on the text after the OCR process. For me and for many other genealogists, that is much faster and easier than manually re-entering all the information on the keyboard.
ABBYY TextGrabber costs $1.95 and is available in the iPhone App Store. Simply press the "App Store" icon on your iPhone, and then enter a search term of: ABBYY. This will find the application quickly. You can read more about ABBYY TextGrabber at http://goo.gl/G4BQB.
Republishing of this article in newsletters, blogs, and elsewhere is allowed and encouraged. Details may be found at http://goo.gl/hoHH1.
