There are a number of reasons for copying an entire web site to your own hard drive. Perhaps you want to save a copy of all the information available on the site. You can then disconnect from the Internet and peruse the site at your leisure, for example while on a train or airplane.
Another reason, and one I use myself, is searching: even though I have a broadband Internet connection, I find it much faster to search a copy of a web site on my own hard drive than to search the original pages online with Google or other search tools. A final reason is that you own the site and want to make a backup copy in case of disaster.
A number of programs are available that will simplify the copying of web sites to your local hard drive. I have experimented with three or four of them and have now settled on HTTrack.
HTTrack is a powerful and easy-to-use tool that can back up your entire web site for off-line browsing and archival purposes. The program will make a mirror image of an entire site or (optionally) of part of a site. The program will recursively build all directories, getting HTML, images, and other files from the server to your computer. As part of the process, HTTrack arranges the original site's relative link-structure.
You can later open a web browser to view the web pages stored on your hard drive. It will look just like the original web site except that performance is much better. As you move from web page to web page, the new pages appear almost instantly.
HTTrack's default settings will copy an entire web site. I suggest you not try that with Ancestry.com, Google, Switchboard.com, or other sites with huge databases. You might need a few terabytes of disk storage to accomplish that! However, HTTrack's menus make it easy to narrow the parameters down to copy only a subset of the web site, such as only the message board or only the pages in the "Canadian genealogy" subfolders.
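As a rough illustration of what narrowing the parameters means, the sketch below tests whether a URL falls inside a chosen subfolder of a site. It is not HTTrack's own code: the function name in_scope and the example.com addresses are hypothetical, and it assumes the rule is simply "same host, path under a given prefix."

```python
from urllib.parse import urlparse


def in_scope(url: str, host: str, path_prefix: str) -> bool:
    """Return True if url is on the given host and under path_prefix.

    Hypothetical helper for illustration only; a real mirroring tool
    applies its own, richer filter rules.
    """
    parts = urlparse(url)
    # Reject anything on a different host (comparison is case-insensitive).
    if (parts.hostname or "").lower() != host.lower():
        return False
    # Strip a trailing slash so "/boards" does not also match "/boardsX".
    base = path_prefix.rstrip("/")
    path = parts.path or "/"
    return path == base or path.startswith(base + "/")


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/boards/canada/index.html",
        "http://www.example.com/search/index.html",
    ):
        print(candidate, in_scope(candidate, "www.example.com", "/boards"))
```

A copier that applied a test like this to every link it found would follow only the links inside the chosen subfolder and skip the rest of the site.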
None of the web site copying (mirroring) programs will work on interactive web pages. That is, when copying sections of Ancestry.com or Switchboard.com (a popular telephone directory web site), these programs will have difficulties with any pages that ask the user to input a name to be searched. For instance, a web page on Ancestry.com that says, "Enter first and last names" will be copied. That is, the entire page WITH BLANK SPACES will be copied. However, none of these programs will let you enter names to be searched, nor will they record the results of such searches. As a result, you cannot use any of these programs to copy an entire database from Ancestry.com or Switchboard.com or other database-driven interactive sites.
Another limitation is that web pages which require a program to be executed on the web server will not function properly when copied to your hard drive. Many web pages ending in the letters .cgi, .php, or .pl specify that a program must be run on the web server in order to supply information to the viewer. HTTrack and other web mirroring programs will typically copy the original page but have no method of running the required programs that are not installed on your local PC. Therefore, any expected results from these programs cannot be displayed on your local screen.
For instance, you might be copying data from www.weather.com and encounter a page ending in .cgi that queries the current weather report for your town. The original page will be copied to your hard drive. However, if you read that page two or three days later while offline, there is no way to execute that CGI script and display the current weather on your screen.
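To get a feel for which pages of a site are likely to be affected, a short script can flag URLs that end in one of the extensions mentioned above. This is only a sketch under stated assumptions: the function name needs_server and the example.com addresses are hypothetical, and the file extension is merely a hint, since a page can depend on a server-side program without showing any of these endings.

```python
import posixpath
from urllib.parse import urlparse

# Extensions named in the article as typical of server-side programs.
SERVER_SIDE_EXTENSIONS = {".cgi", ".php", ".pl"}


def needs_server(url: str) -> bool:
    """Return True if the URL path ends in a known server-side extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in SERVER_SIDE_EXTENSIONS


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/weather.cgi?town=Boston",
        "http://www.example.com/index.html",
    ):
        print(candidate, needs_server(candidate))
```

Pages flagged this way will still be copied, but what is saved is a snapshot of the output at the moment of copying, not a working program.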
The "mirror" also is not an exact duplicate of the original. One case in point is that all the internal links are changed to make them work properly from the hard drive. For instance, let's say I copy my web site (www.eogn.com) to my hard drive. Now the link on my web page that points to http://www.eogn.com/index.html (a second page on my site) gets converted so that it points to C:\My Web Sites\EOGN\blog.eogn.com\index.html. This way, when I later view the page on my own hard drive, this link takes me to the correct document as stored on my own hard drive rather than trying to connect to the equivalent page on the web.
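The general idea behind this link conversion can be sketched in a few lines. The code below is not HTTrack's actual algorithm: the function names local_path and relative_link, the mirror folder name, and the example.com addresses are all assumptions made for illustration, and it handles only absolute http URLs.

```python
import posixpath
from urllib.parse import urlparse


def local_path(url: str, mirror_root: str) -> str:
    """Map an absolute URL to a file path under mirror_root."""
    parts = urlparse(url)
    path = parts.path
    # A bare host or a folder URL is stored as that folder's index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"
    return posixpath.join(mirror_root, parts.hostname, path.lstrip("/"))


def relative_link(from_url: str, to_url: str, mirror_root: str) -> str:
    """Return the link to write into the saved copy of from_url
    so that it points at the saved copy of to_url."""
    source = local_path(from_url, mirror_root)
    target = local_path(to_url, mirror_root)
    return posixpath.relpath(target, start=posixpath.dirname(source))


if __name__ == "__main__":
    print(local_path("http://www.example.com/", "mirror"))
    print(relative_link(
        "http://www.example.com/articles/page.html",
        "http://www.example.com/index.html",
        "mirror",
    ))
```

Because every rewritten link points at a file on disk rather than at the web server, the saved pages differ from the originals even when the visible text is identical.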
HTTrack adds some information to the beginning and end of each page stored on the hard drive to show that it is a copy of the original, created by HTTrack. This is a good idea because it shows that the page is not the original. However, all this prevented me from performing one of my original plans: I had thought that I could quickly edit the pages stored on my hard drive and then upload the results back to the web server. Because of all the changes HTTrack makes inside the copied web pages, my original plan is not practical.
Even with these minor limitations, HTTrack will properly copy most genealogy web pages since perhaps 99% of such pages online do not depend on .cgi, .php, or .pl scripts. Just bear in mind that HTTrack will not be effective for sites created with The Next Generation or for any other sophisticated interactive genealogy web site. It also will not work well with online interactive databases, such as www.weather.com, stock market reports, or other sites that display constantly changing data.
All in all, I am pleased with HTTrack, even with these minor limitations. I found the program to be easy to download, install, and operate. For most of the simpler web sites, all the user needs to do is run the program, supply the URL of the site to be copied, and then sit back while the program copies everything. You later use a standard web browser to open the main index file of the copied site (usually index.html) as it exists on your hard drive and use it as if you were on the web. The one major difference will be speed: web pages appear on your browser's screen much faster when retrieved from your own hard disk rather than from the web.
The best part of HTTrack is its price: FREE.
This one is a keeper. I use it periodically to make backup copies of web sites that I own. I also occasionally copy other sites to read while I am traveling and disconnected from the Internet. That is a great way to pass the time on a long flight.
For more information about HTTrack for Windows, Linux, and UNIX, or to download the program, go to http://www.httrack.com
And how does copying an entire website without the owner's (who is usually the copyright holder) permission comply with current copyright law?
It violates it big time; it even violates the fair use sections. It goes far beyond what copyright law would consider 'fair' when it refers to "amount and substantiality of the portion used in relation to the copyrighted work as a whole."
Dick, if you own the site you should have the original source that you can republish if you need to. I can see a use for the tool if you want to make an offline copy of your own web site to give to someone.
Making copies of an entire web site to use later when you see fit is absolutely no different than going to the library and xeroxing an entire book.
Posted by: Dino (All Dino, All the Time) | December 02, 2005 at 11:23 AM
Disclaimer: I am not a lawyer.
However, as I understand it, there are no copyright issues with making copies of publicly-available information for your own personal use. However, there may be an issue with what you do with that copy.
In other words, if you make a copy and keep it on your own hard drive and simply read it yourself, I believe you are perfectly within legal boundaries. Anybody can make a copy of eogn.com for their own use and read it while riding the commuter train without worrying about legalities.
Should you later republish or reprint or otherwise redistribute the information to others, you might then be in danger of violating copyright laws.
- Dick Eastman
Posted by: Dick Eastman | December 02, 2005 at 11:47 AM
I'm also puzzled since you said you can't use it to edit and then upload because of all the changes it makes to the pages, but later say you use it for backing up your websites. I'd rather do a straight, quick, FTP to my HD of my folders and files from my website directory and get them clean, than use this program that messes them up and have to clean them up later...
Posted by: Trishymouse | December 03, 2005 at 05:08 PM
Dick,
I'm happy that you allow so much freedom with copies of your web site. Unfortunately (or fortunately, depending on your views) web sites have exactly the same copyright protection granted to books and other printed matter.
Just as it is copyright infringement for you to go to a library and xerox (or scan) an entire book without the consent of the copyright holder, it is also copyright infringement to make a copy of an entire web site.
Didn't you review a book on copyright not too long ago? The following is from Cyndi's List:
Web pages are protected by a copyright. Information contained on those web pages and all original information that is not in the public domain is protected by copyright. A compilation of works, including a set of links arranged into a compilation, IS protected by copyright.
http://www.cyndislist.com/copyrite.htm
Dick, thank you for allowing everyone to freely use your column. Not all authors are so generous.
Posted by: Dino (All Dino, All the Time) | December 03, 2005 at 11:07 PM
Dino, I think you need to re-read my earlier posts on this topic. In short, anyone is free to make copies FOR THEIR OWN PERSONAL USE. That is true of web sites, books, records and most anything else. Copyright becomes an issue only when someone tries to redistribute the information or to re-use it in something else.
As I wrote earlier, "if you make a copy and keep it on your own hard drive and simply read it yourself, I believe you are perfectly within legal boundaries. Anybody can make a copy of eogn.com for their own use ..."
and
"Should you later republish or reprint or otherwise redistribute the information to others, you might then be in danger of violating copyright laws."
In short, you can legally copy it all you wish but don't give it to anyone else or re-use the data in any manner without permission.
- Dick Eastman
Posted by: Dick Eastman | December 03, 2005 at 11:16 PM
The backups made by HTTrack and most other web site copiers are not suitable for re-uploading. They are not identical images and cannot be identical if they are to be viewed offline. However, all the text is backed up.
Posted by: Dick Eastman | December 03, 2005 at 11:18 PM
Reality check: No one is likely to be sued for copying a public website for personal offline browsing. That being said, the question whether one could be successfully sued will never be answered until someone actually is. That seems to be the way copyright law works.
Offline browsing is simply an extreme case of caching, which web browsers do all the time by default. I currently have over 50MB of website info stored on my hard drive. Have I infringed upon anyone's copyright? How about if I download a free utility that allows me to easily view these cached files?
Some software even allows you (or your ISP) to "precache" webpages you haven't yet viewed. All of this is intended to speed up the delivery of web content--not to deprive anyone of his livelihood.
In offline browsing, content is used as it was intended to be used, just at a more convenient time (called "time-shifting").
Copying an entire website is not really analogous to photocopying a book at the library. It's more analogous to recording an episode of Desperate Housewives for later viewing. The U.S. Supreme Court in the 1984 "Betamax case" ruled that recording a program to view later fell under "fair use":
"When one considers the nature of a televised copyrighted audiovisual work ... and that time-shifting merely enables a viewer to see such a work which he had been invited to witness in its entirety free of charge, the fact ... that the entire work is reproduced ... does not have its ordinary effect of militating against a finding of fair use."
Even this is not strictly analogous, since most television programs are not available "on demand" the way websites are. Again, Internet copyright law is still in its infancy, and future cases will undoubtedly clarify the fair-use provisions.
Also, notice the phrase "free of charge" in the court decision. I wouldn't try downloading large portions of the Ancestry.com website for later viewing--especially since this is expressly forbidden in the TOS.
Posted by: Chris Dunham | December 04, 2005 at 02:58 AM
The major reason that I would want to copy an entire web page (and I copy partial web pages all the time) is that in my 16 years of experience with a computer, I have seen an awful lot of web pages disappear on me.
And since I have saved all of my bookmarks from three different computers, I have a huge bookmark file.
There are probably plenty more pages that are gone that I have not had cause to revisit, but might some day.
Web pages disappear for various reasons, one of which would be the death of an owner with a survivor who isn't interested in the subject. This will happen more and more as we baby boomers (and older) get into our 60s and 70s. Another reason would be moving to another server and changing the page title.
Thanks for the tip, Gene.
Ray Marshall
Minneapolis
Posted by: Ray Marshall | December 05, 2005 at 01:18 AM
Keep in mind that many sites will be brought to their knees by the use of HTTrack. Any dynamically generated genealogy site has tens of thousands of "virtual pages". Most small genealogy websites will exhaust all of their allotted bandwidth for the month from a single site copy using HTTrack.
Posted by: KosherJava | December 05, 2005 at 12:15 PM
>>Keep in mind that many sites will be brought to their knees by the use of HTTrack.
Good point. This is especially true of sites using free webhosting.
Posted by: Chris Dunham | December 05, 2005 at 02:11 PM
I've got an online genealogy database of 14,000 persons with dynamic pages, and I hate it when I see these vultures in action. Letting people view your material online and letting them download your entire site in what looks like a DoS attack are two totally different things. Unless it is explicitly allowed, I feel that you should be very careful about doing it. There are ways to prevent it from happening, but that will require active steps from the site owner.
See the news thread at http://groups.google.com/group/lucky.freebsd.questions/browse_frm/thread/eb55e1d51cfebc97/da5d3b70664b6381?tvc=1&q=httrack++abuse&hl=en#da5d3b70664b6381 for an alternative to Dick's rosy view as well as some technical advice for dealing with site rippers.
At least HTTrack honors the "robots.txt" file by default. But that won't stop anyone who doesn't take 'no' for an answer and finds out how to change this setting.
Posted by: leifbk | December 06, 2005 at 02:34 AM
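For readers wondering how a "robots.txt" check works in practice, here is a minimal sketch using Python's standard urllib.robotparser module. It is not how HTTrack itself is implemented: the function name allowed, the "ExampleMirror" user agent, the sample rules, and the example.com addresses are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent
    to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text already in hand; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules a site owner might publish to keep copiers
    # out of a dynamically generated database.
    rules = "User-agent: *\nDisallow: /genealogy/\n"
    for candidate in (
        "http://www.example.com/genealogy/person1.html",
        "http://www.example.com/index.html",
    ):
        print(candidate, allowed(rules, "ExampleMirror", candidate))
```

A well-behaved copier runs a check like this before requesting each page and skips anything the site owner has disallowed. As noted above, it only works when the person running the copier leaves the check switched on.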