Many genealogy records are indexed by a high-tech algorithm called the Soundex Code. Well, it was "high tech" in 1918 when it was invented by Robert Russell. In a nutshell, Soundex Codes provide a means of identifying words -- especially names -- by the way they sound. They were used extensively by the WPA crews working in the 1930s to organize Federal Census data from 1880 to 1920. Soundex has also been used for many state and local census records and is very popular in genealogy software and databases.
Motor vehicle bureaus in the District of Columbia, Maryland, Michigan, Minnesota, and Missouri employ Soundex for generating the initial characters of the identification numbers on driver's licenses. The Canadian Centre for Justice Statistics uses Soundex to encode names in its crime surveys and maintain the anonymity of individuals about whom data is collected.
In the days when nearly all of the data for the Census of Population was collected by actual enumerators and individuals who walked from door to door, it was discovered that many of these people spelled surnames phonetically. Thus, one might spell Smith as "Smith" while another might spell it as "Smyth" and still another "Smythe." The census records were to be indexed by the sound of each name rather than by its spelling, and Soundex was the code system used to organize this index.
If you search many records of interest to genealogists, sooner or later you will need to use Soundex Codes. Why? Well, you can often find a person's entry by his or her Soundex Code even when the names have been misspelled. This becomes important when you realize that many census takers did not speak the language of the people being enumerated. In fact, in the first 150 years of U.S. census records, the majority of Americans were illiterate and did not know how to write their own last names. The spelling of many family names also has changed over the years, but often the Soundex Code remains the same.
Spelling of names varies widely in early records, especially when language difficulties have intervened. For instance, I could not find my French-speaking great-grandparents listed in the U.S. Census. I searched and searched, but never found any entries for Joseph and Sophie Theriault. I then decided to do a Soundex search. The Soundex Code for Theriault is T643. When searching for Soundex Codes, I found several entries for T643 in Ashland, Maine, including one for the family of Joseph and Sophia Tahrihult -- improperly spelled, but with the same Soundex Code.
The census taker had a Scottish name, and he was listed on another census page in the same town as being born in Scotland. I am guessing that he did not speak French. I bet he had some difficulty when speaking with my great-grandparents, neither of whom spoke English and neither of whom could read or write. No wonder Theriault became Tahrihult!
The Soundex Code is not difficult to learn although I still use a small reference card when I go to the archives to look at records. Every Soundex Code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname, and the hyphen is optional. The numbers are assigned to the remaining letters of the surname according to the Soundex guide shown below. If necessary, zeroes are added at the end to produce a four-character code. Additional letters are disregarded.
Here is the Soundex Coding Guide:
Each number represents letters:
1 = B, F, P and V
2 = C, G, J, K, Q, S, X and Z
3 = D and T
4 = L
5 = M and N
6 = R
Disregard the letters A, E, I, O, U, H, W, and Y.
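For readers who like to see the guide as data, here is a minimal Python sketch of the table above. The names GROUPS, DIGIT_FOR, and digit_for are my own illustrative choices, not part of any official Soundex specification.

```python
# The Soundex coding guide, written as digit -> letters.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the guide so each letter maps to its digit.
DIGIT_FOR = {
    letter: digit
    for digit, letters in GROUPS.items()
    for letter in letters
}


def digit_for(letter):
    """Return the Soundex digit for a letter, or None if it is disregarded."""
    return DIGIT_FOR.get(letter.upper())


print(digit_for("s"))  # 2
print(digit_for("A"))  # None (A, E, I, O, U, H, W, and Y have no digit)
```

Eighteen letters carry a digit; the remaining eight are the disregarded ones.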
Here are some of the simpler examples:
Washington is coded W252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).
Lee is coded L000 (L, there is no Soundex Code for E so the numbers 000 are added).
Now let's move on to some of the more complex rules:
Any double letters in a name are treated as one letter. For example:
Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
If the surname has different letters side-by-side that have the same number in the Soundex coding guide, they are treated as one letter. Examples:
Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).
Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).
Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.
Names with Prefixes
If a surname has a prefix, such as Van, Con, De, Di, La, or Le, the code should ignore these prefixes. However, coders sometimes miss this rule, so they might assign the Soundex code either with or without the prefix. Because the surname might be listed under either code, a thorough search of the Soundex index should include both forms. Note, however, that Mc and Mac are not considered prefixes, according to the National Archives and Records Administration. Once again, however, not everyone knows this particular rule, so you might want to search both with and without the Mc or Mac coded.
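The "search both forms" advice above can be sketched in Python. This is only an illustration under several assumptions of mine: the function names soundex and search_codes are hypothetical, the prefix test is a plain string match (so it will also strip the "De" from a name like Dean, which is not really a prefix), and a prefix is stripped only when at least two letters remain.

```python
# Letter -> digit table from the Soundex coding guide.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

# Prefixes named in the paragraph above (Mc and Mac deliberately left out).
PREFIXES = ("VAN", "CON", "DE", "DI", "LA", "LE")


def soundex(name):
    """Compact Soundex sketch: first letter plus three digits, no hyphen."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0])  # digit of the most recent consonant seen
    for c in letters[1:]:
        d = CODES.get(c)
        if d is None:
            if c not in "HW":  # vowels (and Y, by assumption) break a run
                prev = None
            continue
        if d != prev:
            digits.append(d)
        prev = d
    return letters[0] + "".join(digits)[:3].ljust(3, "0")


def search_codes(surname):
    """Return every code worth searching: with and without a prefix."""
    name = "".join(c for c in surname.upper() if "A" <= c <= "Z")
    codes = {soundex(name)}
    for prefix in PREFIXES:
        # Heuristic: only strip when at least two letters are left over.
        if name.startswith(prefix) and len(name) - len(prefix) >= 2:
            codes.add(soundex(name[len(prefix):]))
    return codes


print(sorted(search_codes("VanDeusen")))  # ['D250', 'V532']
```

A name without one of the listed prefixes, such as Smith, simply yields its single code.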
VanDeusen might be coded two ways:
With the prefix included, V-532 (V, 5 for N, 3 for D, 2 for S)
or
With the prefix excluded, D-250 (D, 2 for the S, 5 for the N, 0 added).
Consonant Separators
If a vowel (A, E, I, O, U) separates two consonants that have the same Soundex Code, the consonant to the right of the vowel is coded. Example:
Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.
If "H" or "W" separates two consonants that have the same Soundex Code, the consonant to the right of the H or W is not coded. Example:
Ashcraft is coded A-261 (A, 2 for the S, C ignored because only an H separates it from the S, 6 for the R, 1 for the F). It is not coded A-226.
The Soundex Indexing System web page on the National Archives site has been updated to include this previously "lost" rule. Not all documents use this extra rule, however. Use the National Archives' Soundex page as your definitive source. The genealogical community owes a special thanks to Tony Burroughs who researched and rediscovered the original Soundex instructions used by the Census Bureau.
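For the programmers in the audience, here is a minimal Python sketch of the coding rules described so far, with the H and W rule as a switch so you can see both results. It is not an official implementation. The names soundex, hw_rule, and CODES are my own, the hyphen is left out, characters other than A to Z are dropped, and I have assumed that Y separates consonants the same way a vowel does.

```python
# Letter -> digit table from the Soundex coding guide.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name, hw_rule=True):
    """Return a four-character Soundex code such as 'W252'.

    hw_rule=True: consonants with the same digit separated only by
    H or W are treated as one letter (the rediscovered rule).
    hw_rule=False: H and W separate consonants just as vowels do.
    """
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The first letter is kept as a letter, but its digit still
    # suppresses an identical digit right after it (Pfister -> P236).
    prev = CODES.get(first)
    for c in letters[1:]:
        d = CODES.get(c)
        if d is None:
            # A disregarded letter. Vowels (and Y, by assumption) end
            # a run of same-digit consonants; H and W do not when
            # hw_rule is on.
            if not (hw_rule and c in "HW"):
                prev = None
            continue
        if d != prev:
            digits.append(d)
        prev = d
    # Keep three digits, padding with zeroes if the name runs out.
    return first + "".join(digits)[:3].ljust(3, "0")


print(soundex("Ashcraft"))                 # A261
print(soundex("Ashcraft", hw_rule=False))  # A226
print(soundex("Tymczak"))                  # T522
print(soundex("Washington"))               # W252
print(soundex("Lee"))                      # L000
```

Running a name through both settings shows at a glance whether the H and W rule matters for it; for most surnames the two codes are identical.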
American Indian and Asian Names
A phonetically spelled American Indian or Asian name was sometimes coded as if it were one continuous name. If a distinguishable surname was given, the name may have been coded in the normal manner. For example, Dances with Wolves might have been coded as Dances (D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as Shinka (S-520) or Sa (S-000).
While the rules sound a bit complex, they do become easier with a bit of practice. For those of us who are too lazy to go through the coding exercise, the computer age has brought many new tools. Most modern genealogy programs will tell you the Soundex Code of any name that you enter. In addition, a number of online Soundex Machines are available, including those at: http://www.eogn.com/soundex and http://resources.rootsweb.com/cgi-bin/soundexconverter. On any of these sites, you type in a last name, and then the site will display the correct Soundex Code. Yet Another Soundex Converter (YASC) at http://www.bradandkathy.com/genealogy/yasc.html will even convert a long list of names to their Soundex equivalents; you do not have to enter them one at a time.
NOTE: You can find many more Soundex converters online but many of them do not follow the "H & W" Rule. To test them, enter a name of Ashcraft. It should produce a Soundex code of A-261. If the software produces some other code, don't trust it.
The National Archives and Records Administration publishes a free brochure, entitled Using the Census Soundex. To obtain a copy, send an e-mail to inquire@nara.gov and ask for General Information Leaflet 55, usually referred to as GIL 55. Make sure that you include your name, postal address, and "GIL 55 please".
While Soundex is a great tool and in widespread use, it certainly is not perfect. For example, it fails when the first letters are different. For instance, Knowles is coded as K542 while both Noles and Nolles are N420. Likewise, Cantor is C536 while the similar sound of Kantor is K536.
Soundex also has a number of shortcomings when dealing with Eastern European Jewish names. Two Jewish genealogists, Randy Daitch and Gary Mokotoff, developed a more sophisticated system, more suitable for Jewish genealogy. The Daitch-Mokotoff Soundex is becoming the de facto standard for on-line lookups on Jewish-related web sites. You can read more about the Daitch-Mokotoff Soundex in an article written by Gary Mokotoff at http://www.avotaynu.com/soundex.html.
Numerous other improved Soundex methods have been developed in recent years and are in widespread use on numerous computer databases. The accuracy of the newer methods is much improved. These new and improved Soundex systems typically use more than one letter and three numbers. However, they have never seen much use in genealogy databases.
Now, have fun with census records!
Dick, your excellent soundex article seems similar to your previous soundex article which I can't date at the moment -- to which I posted a comment that got no response. Anyway, I'm going to again post this observation because it involves what is arguably the most popular employer of soundex coding that we genealogists use daily, and that is the Ancestry.Com census databases.
My testing indicates that Ancestry's soundex searches employ an index that does *not* use the "H & W" rule. My test case uses the Brooklyn, N. Y. census for 1870, and the surname of our friend, Tony Burroughs (who last Saturday gave us an excellent Saturday night banquet speech at the NERGC conference) -- and who bears the problematic surname case where his correct soundex code is B-620, but without the H & W rule, is B-622. An Ancestry.com "exact" search for "Burroughs" yields 25 hits, whereas a "soundex" search yields 80 hits -- but *zero* of them are "Burroughs" or its variants. Checking a few of these 80 hits shows they all seem to match B-622. If we then fake the search out by dropping the final "s" and do a soundex search on "Burrough" -- which codes as B-620 using the incorrect coding rule -- we get 2,182 hits, including the missing 25 "Burroughs" plus numerous useful variants like "Burrows", etc., which is exactly what we want from a soundex search.
Tony, in his original article, I'm sure explained all this, but did he point out that this hugely popular census database suffers this coding problem? And even if he did, we genealogists need to know what we're dealing with and how to cope with it, so it probably deserves being spotlighted again. Are there other popular databases out there with the problem?
Posted by: Mac Young | April 06, 2005 at 11:38 AM
The only way to be sure is to search for both the "correct" Soundex code and the "incorrect" soundex code.
I discovered this H/W anomaly quite a few years ago (1998/1999 I think it was) while trying to determine all the rules so that I could write the calculations to have FileMaker Pro do soundex coding on databases I was developing for our genealogy society.
At that time the only online reference I could find to the H/W rule was at the Clayton County, Texas Library's web site. Nothing at our local library, or at the State of Michigan Library mentioned the H/W rule, yet it was obvious this rule had been used when the US Censuses were indexed in the 1930s since Ashcroft was coded in these indexes as A261, rather than A226 which the common rules would determine.
A discussion on soc.genealogy.computing followed and the problem became more publicised.
The databases I create now calculate Soundex both ways - using the H/W rule and ignoring it, so for example if you go to
http://data.wmgs.org:591/KentCountyObits/FMPro?-db=KentCountyObituaries&-lay=Listing&-format=search.htm&-view
and search for Last Name Begins with Burroughs you'll see in the results that the Soundex could be either B620 or B622, so conversely you'd find Burroughs by searching for either B620 or B622.
Cheers
Roger
Posted by: Roger Moffat | April 06, 2005 at 04:35 PM
Actually it's "older" than I thought. Google turned up the archive of my first post to GENCMP-L about this in January 1997
http://archiver.rootsweb.com/th/read/GENCMP/1997-01/0853821393
but I can't find the archive of the discussion that involved Tony Burroughs and I think Bob Velke (author of TMG) was involved also.
Roger
Posted by: Roger Moffat | April 06, 2005 at 04:43 PM
Hi,
Does anybody know the names of these "numerous other improved Soundex methods".
Posted by: Bioubiou | January 15, 2008 at 04:49 AM
Hi,
I understand how to convert a name to a soundex code. But after getting the code, how do I get the similar names from the code?
That is, name to code, then code to some similar names. I need help going from a code to names which are phonetically similar.
Please help me.
Bye.
Posted by: Md. Rakibul haque | October 17, 2008 at 12:01 AM
To Md. Rakibul Haque - converting from a soundex code back to names. To be practical, you would first need a list of all possible/reasonable names. Then the program would pick out from the list, and display, all those names that fit the soundex code provided.
If you are a doctor (Md ?) you may be wanting to convert from a soundex code to a medicine. Your data base would contain all possible medicines. You could type in the soundex code and the program would display the few medicines that fit that code.
Posted by: Bob Richardson | January 14, 2009 at 12:42 PM
I am a data analyst profiler, and this article and its rich use of links to sources have been a delight to find. Thank you for your research and sharing of information.
Posted by: Scott Alan Johnson | April 02, 2009 at 01:18 PM