Amazon launched a new service this week, called the Public Data Sets service. The project encourages developers, researchers, universities, and businesses to upload large (non-confidential) data sets to Amazon and then let others integrate that data into their own applications hosted on Amazon. Amazon is a leader in "cloud computing": through Amazon Web Services (AWS), it provides programs that run on Amazon's servers instead of on your own computer but are accessed in a web browser as if you were running them locally.
Amazon has already loaded several huge databases, including U.S. census databases, and is seeking more. However, genealogists shouldn't get too excited just yet: the census databases are not the ones that we all want to use.
The U.S. Census Bureau compiles all sorts of information, not only names. The databases provided by the Census Bureau include such things as how many bathrooms are installed in homes, how many television sets, average income of American residents, and more. In other words, these are databases of demographics data, not of individuals.
Specifically, Amazon already has the following United States demographic databases from various US Censuses, summary information about Business and Industry, and Household Profiles:
- 2000 US Census (Linux/UNIX): snap-92d333fb (Demographics data only)
- 2000 US Census (Windows): snap-36ce2e5f (Demographics data only)
- 1990 US Census (Linux/UNIX): snap-33f8185a (Demographics data only)
- 1990 US Census (Windows): snap-8cf818e5 (Demographics data only)
- 1980 US Census (Linux/UNIX): snap-9df717f4 (Demographics data only)
- 1980 US Census (Windows): snap-b6f818df (Demographics data only)
- 2003-2006 Economic Data (Linux/UNIX): snap-0bdf3f62
- 2003-2006 Economic Data (Windows): snap-4edd3d27
- Business and Industry Summary Data (Linux/UNIX): snap-5cf81835
- Business and Industry Summary Data (Windows): snap-8af818e3
Some other non-census databases now hosted on Amazon include:
- Annotated Human Genome Data provided by ENSEMBL - An annotated form of the Human Genome, perfect for biological research
- PubChem Library provided by the National Center for Biotechnology Information - A data set of information on the biological activities of small molecules
- Various Labor Statistics Databases provided by The Bureau of Labor Statistics - Statistics on Inflation & Prices, Employment, Unemployment, Pay & Benefits, Spending & Time Use, Productivity, Workplace Injuries, International Comparisons, Employment Projections, and Regional Resources
- Various Transportation Databases provided by The Bureau of Transportation Statistics - Data and statistics from the US Department of Transportation on Aviation, Maritime, Highway, Transit, Rail, Pipeline, Bike/Pedestrian and other modes of transportation
Amazon Web Services (AWS) is hosting the public data sets at no charge for the community. As with all AWS services, users pay only for the computing power and storage they consume with their own applications.
The data is available now but is not yet easily accessible to the general public. Someone has to first create the programs that read the data and then present it in a manner that people wish to see. Amazon faced a "chicken and egg" problem: nobody would create applications until the databases were available and yet nobody would spend money to place huge databases online if there were no applications that would read them. Amazon elected to spend its own money first by placing huge public domain databases online, hoping that application developers would soon follow with suitable applications. As a result, many large databases have been put online and made available to developers this week, but there are almost no applications available just yet.
If you are a programmer, you can write an application today and start accessing the data immediately. The rest of us will have to wait until some energetic programmer(s) write the applications for us.
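For programmers who want a starting point, here is a minimal sketch in Python. It assumes the identifiers in the census list above are Amazon EBS snapshot IDs (as the "snap-" prefix suggests) and that they are still published, which may not be the case. The function name `volume_request` and the availability zone are illustrative choices of mine, not anything from Amazon's announcement. The sketch only builds the request. Actually sending it would require an AWS account plus a client library such as boto3, which is shown only in a comment.

```python
# Snapshot identifiers copied from the list earlier in this article.
# "linux" stands for the entries labeled Linux/UNIX.
SNAPSHOTS = {
    ("2000 US Census", "linux"): "snap-92d333fb",
    ("2000 US Census", "windows"): "snap-36ce2e5f",
    ("1990 US Census", "linux"): "snap-33f8185a",
    ("1990 US Census", "windows"): "snap-8cf818e5",
    ("1980 US Census", "linux"): "snap-9df717f4",
    ("1980 US Census", "windows"): "snap-b6f818df",
    ("2003-2006 Economic Data", "linux"): "snap-0bdf3f62",
    ("2003-2006 Economic Data", "windows"): "snap-4edd3d27",
    ("Business and Industry Summary Data", "linux"): "snap-5cf81835",
    ("Business and Industry Summary Data", "windows"): "snap-8af818e3",
}


def volume_request(dataset, platform, availability_zone):
    """Build the arguments for creating a storage volume from a listed snapshot.

    This is a hypothetical helper: it looks up the snapshot ID and returns a
    plain dictionary. It does not contact Amazon.
    """
    key = (dataset, platform.lower())
    if key not in SNAPSHOTS:
        raise KeyError(f"No snapshot listed for {dataset!r} on {platform!r}")
    return {
        "SnapshotId": SNAPSHOTS[key],
        "AvailabilityZone": availability_zone,  # assumed zone, chosen by the caller
    }


if __name__ == "__main__":
    params = volume_request("2000 US Census", "Linux", "us-east-1a")
    print(params)
    # Assuming boto3 is installed and AWS credentials are configured,
    # the request could then be sent with something like:
    #   import boto3
    #   ec2 = boto3.client("ec2", region_name="us-east-1")
    #   volume = ec2.create_volume(**params)
```

Once a volume created this way is attached to a running server, the census files would appear as an ordinary disk, and the real work of reading and presenting the data could begin.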
This week's announcement is monumental from a technology viewpoint, but I don't see much of immediate interest to genealogists. First, most of us do not care about compiled economic and demographic data. Second, no applications are available today to access this data, although I am sure that will change in the coming months.
I suppose it is possible that records naming individuals from the 1930 and earlier censuses could be contributed in the future, but who is going to do that? The records contributed so far are all public domain databases contributed by the U.S. Census Bureau and other providers of free information. However, the Census Bureau does not have any computerized databases containing names of residents in the 1930 and earlier census records. To be sure, the U.S. Census Bureau does have that information on paper and on microfilm, but not in computer format.
Ancestry.com and HeritageQuest Online (a division of ProQuest) have spent millions of dollars converting those records to computerized databases. Footnote.com has done the same for the 1860 census only, and some other commercial companies may have computerized small segments of census records. In each case, the company regards its computer databases as assets and is not about to give the data away free of charge to Amazon!
One interesting exception might be FamilySearch, which is financed by the Church of Jesus Christ of Latter-day Saints. This non-profit organization has already created several census databases and has more such projects underway at this time. In the past two or three years, FamilySearch management also has become very interested in cooperative projects with commercial and non-profit organizations alike. I could envision a future cooperative effort between Amazon and FamilySearch. I am not predicting such an alliance; I am simply saying that it is a possibility.
Even with the commercial organizations, business factors change frequently and the expenses involved in providing databases online continue to plummet. One of the existing holders of genealogy information could become interested in a future alliance with Amazon although I think it is more likely that a new, previously-unknown company could be formed to take advantage of the low-cost technology being offered by Amazon and others.
Competition is usually a good thing. Whatever happens, genealogists will benefit from increased competition and the lower prices that will follow.
As the expenses of "cloud computing" continue to drop, I can envision huge benefits to genealogists and to others who frequently access large databases.
While I see little of interest to genealogists in today's announcement, I can envision that changing within a very few years.
You can read more about the new public datasets available on Amazon Web Services at http://aws.amazon.com/publicdatasets/.
You can also read my earlier Plus Edition article entitled "Genealogy Software as a Service" that discusses many of the same concepts at http://plus.eogn.com/Default.aspx?pageId=113015&mode=PostView&bmi=61445 (a Plus Edition user name and password are required to access that article). If you do not yet have a Plus Edition user name and password, you may prefer to purchase the article for $2.00 at http://www.lulu.com/content/5227359.
