The British Library has started archiving the entire UK web, including one billion pages from 4.8 million websites, blogs, e-books, online newsletters, forums, and social media sites. The process will take five months, in a bid to preserve the nation's “digital memory.”
The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole.
The online "harvesting" of web sites started yesterday (6 April 2013) as an automatic "web harvest" of 4.8 million UK websites amounting to around one billion pages. The process is expected to take three months, followed by two months of processing the data to make it easily searchable in the British Library's database, which already holds 750 million pages of newsprint. The web harvesting will not stop, however. The five-month effort will only capture the web sites as they exist today. The British Library expects to continue capturing all changes and new web sites that appear in the future.
Along with the British Library, the National Libraries of Scotland and Wales, the Bodleian Libraries in Oxford, Cambridge University Library and Trinity College Library Dublin will all help to gather the data and make it available for visitors. These six are being referred to as "legal deposit libraries." Access to the material will be offered in reading rooms at each of the six libraries.
The archive will include everything from mainstream news websites - including access to content previously locked behind paywalls - to blogs, forums and eventually content from social networks, such as tweets and Facebook posts made by users with their privacy settings set to public.
NOTE: I have no idea how the British Library will be able to access "content previously locked behind paywalls."
The effort is an obvious duplication of the web archives presently saved by Archive.org's service called "The Wayback Machine," which has collected 240 billion pages since 1996 at http://archive.org/web/web.php. However, the British Library's system will be a bit different and will, as a result, capture many items not stored on The Wayback Machine. For instance, The Wayback Machine only takes periodic snapshots, typically once per month. To see old pages published by www.EOGN.com, look at http://web.archive.org/web/*/http://eogn.com. The blue circles on the calendar represent the dates on which The Wayback Machine took a snapshot of the web site. For some reason, The Wayback Machine took many snapshots of www.EOGN.com in January 2013 but took none in June, July, or August of 2012. Articles and comments posted during those months were never stored by The Wayback Machine unless they were still available online when a snapshot was finally made in September.
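For readers who like to tinker, the gap described above can be checked with a few lines of Python. This is only a rough sketch: the function name months_without_snapshots and the capture dates are made up for illustration and are not taken from The Wayback Machine's actual records for any site. Given a list of dates on which snapshots were taken, it lists the calendar months in which no snapshot exists.

```python
from datetime import date


def months_without_snapshots(snapshot_dates, start, end):
    """Return (year, month) pairs from start to end, inclusive, with no snapshot.

    snapshot_dates is an iterable of datetime.date objects;
    start and end are (year, month) tuples.
    """
    covered = {(d.year, d.month) for d in snapshot_dates}
    gaps = []
    year, month = start
    while (year, month) <= end:
        if (year, month) not in covered:
            gaps.append((year, month))
        # Advance to the next calendar month, rolling over the year if needed.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


# Hypothetical capture dates, invented for illustration only.
example_snapshots = [
    date(2012, 5, 14),
    date(2012, 9, 3),
    date(2012, 10, 21),
]

gaps = months_without_snapshots(example_snapshots, (2012, 5), (2012, 10))
print(gaps)  # [(2012, 6), (2012, 7), (2012, 8)]
```

Anything published and then removed during one of the listed months would be missing from an archive that relies on periodic snapshots.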
The Library of Congress in Washington, DC, preserves American digital content such as e-books and e-journals, and archives online content in collections built around themes and events, but does not routinely save all websites.
In contrast, the British Library says that it will initially capture most UK web sites only once a year, but hundreds of thousands of fast-changing sites, such as those of newspapers and magazines, will be archived as often as once a day. However, captures eventually will be made more often. Copies of every public tweet and Facebook entry in the UK could eventually be included.
Of course, the other big difference is that the British Library will only capture UK websites. The project initially will capture only .uk domains but will later seek to identify UK sites in the .org and .com domains and capture those as well.
To make sure the collection is not destroyed by computer failure or by natural disasters, there will be multiple self-replicating copies on servers around the country. To make sure the collection is available for centuries, the British Library staff will transfer files into updated formats as technology evolves.
British Library spokesman Ben Sanderson acknowledged that this is new territory for an institution more used to documents written on parchment, paper and the fine calfskin known as vellum. "Vellum - you don't need an operating system to read that," he said.
You can learn more in a video on the Guardian's web site at http://www.guardian.co.uk/books/video/2013/apr/06/british-library-digital-video