Almost two years ago, I wrote an article entitled, "Copy an Entire Web Site with HTTRACK." That article is still available at http://blog.eogn.com/eastmans_online_genealogy/2005/12/copy_an_entire_.html. It describes the operation of a Windows program that can download an entire web site for offline browsing or for backup purposes. It is a good method of backing up your web site.
Now a similar utility is available for Macintosh users. Best of all, it is a free program. The author does accept donations, however.
SiteSucker automatically downloads Web sites from the Internet. It does this by copying the site's Web pages, images, backgrounds, movies, and other files to your local hard drive. Just enter a URL (Uniform Resource Locator), press the Enter key, and SiteSucker can download an entire Web site.
NOTE: Programs such as SiteSucker or HTTRACK are excellent for downloading static web pages, that is, pages stored on the server as fixed files rather than built on request. However, they will not work on interactive sites, such as those that query online databases. Don't try to download www.FamilySearch.org or eBay, even if you do have the disk space available.
You can use SiteSucker to make local copies of your Web sites for easy maintenance. It can either download files unmodified or "localize" the files it downloads, allowing you to browse a site offline.
If SiteSucker is in the middle of a download when you choose the Save command, SiteSucker will pause the download and save its status with the document. When you open the document later, you can restart the download from where it left off by pressing the Resume button.
SiteSucker is a Universal application, which means that it's made to run on both Intel- and PowerPC-based Mac computers. SiteSucker requires Mac OS X 10.4.x Tiger or greater. Of course, to download files, your computer will also need an Internet connection. If it is a large site, you will also need plenty of available disk space.
There are several limitations. As mentioned earlier, the program cannot query databases and will not work on any web site that asks for user input and then builds pages "on the fly," based on the input. That leaves out many genealogy databases, as well as eBay and others.
SiteSucker totally ignores JavaScript. It will not see any link specified within JavaScript. (If the Log Warnings option is on in the download settings, SiteSucker will include a warning in the log file for any page that uses JavaScript.)
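To see why a link written only in JavaScript goes unnoticed, here is a minimal sketch in Python. It is not SiteSucker's code; it only assumes that a downloader reads the href and src attributes in a page's static markup. The sample page and the LinkCollector name are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect URLs found in href/src attributes of the static markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


# Hypothetical page: two ordinary links, plus one link that exists only
# after a browser runs the script.
SAMPLE_PAGE = """
<html><body>
<a href="family-tree.html">Family tree</a>
<img src="images/crest.gif">
<script>
  var page = "hidden" + "-page.html";
  document.write('<a href="' + page + '">Hidden</a>');
</script>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
found_links = collector.links
print(found_links)  # ['family-tree.html', 'images/crest.gif']
```

The parser treats everything inside the script element as plain text, so the page assembled by the script never appears in the list. A program that does not run JavaScript has no way to learn that hidden-page.html exists.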
SiteSucker does scan Flash (SWF) files for embedded plain text links, but it can only detect links to files that have one of the following extensions: html, swf, mp3, sit, zip, mov, gif, jpg, png, doc, or txt. SiteSucker cannot localize Flash files, and it does not examine other media files for embedded links.
By default, SiteSucker honors robots.txt exclusions and the Robots META tag. Therefore, it will not download any directories or pages disallowed by robot exclusions. However, you can override this behavior with the Ignore Robot Exclusions setting that's under the Advanced tab in the download settings.
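For readers curious about what a robots.txt check looks like, here is a small sketch using the robots.txt parser in Python's standard library. It illustrates the general idea only and says nothing about how SiteSucker implements it. The robots.txt contents, the example.com URLs, and the "ExampleCrawler" user-agent name are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real downloader would fetch this
# file from the root of the site it is about to copy.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="ExampleCrawler"):
    """Return True if the robots.txt rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://www.example.com/index.html"))         # True
print(is_allowed("http://www.example.com/private/notes.doc"))  # False
```

A downloader that honors robot exclusions would skip any URL for which a check like this returns False. Keep in mind that robots.txt is only a request to well-behaved programs; it does not password-protect anything.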
The free SiteSucker program is available at http://www.sitesucker.us/.
Safari does something similar, just Save As... Web Archive.
Posted by: Infinite Ancestors | September 13, 2007 at 10:16 PM
But Safari only gets the page being viewed. Software like SiteSucker can download an entire site - all the pages of a site as long as they're linked up to a page you start from, or linked to a page linked to the page you start from.
Roger
Posted by: theKiwi | September 13, 2007 at 11:46 PM
So we CAN legally copy entire websites? There's no copyright infringement? We don't have to ask permission? Has Steve Danko read this, yet?
I would like to do this legally, of course, because I could learn a lot from the structure of the files, XHTML, CSS, PHP, JavaScript and whatever else. I don't wish to steal -- just to learn.
Someone chime in here with the rules and protocol, please. Perhaps Mr. Manson?
Happy Dae.
http://www.ShoeStringGenealogy.com/ssg1.htm
Posted by: Happy Dae | September 14, 2007 at 01:52 AM
---> So we CAN legally copy entire websites? There's no copyright infringement? We don't have to ask permission?
Yes. Absolutely. 100% legal. The same is true for music, books, videos and more. Well, there is ONE hitch: All of the copies must be for your own personal use only.
That has been true for decades.
As soon as you copy or republish any information/music/videos and provide part or all of the copy to someone else, you may have the copyright police knocking at your door.
Disclaimer: I am not a lawyer. If you have questions, you are encouraged to seek professional legal help.
- Dick Eastman
Posted by: Dick Eastman | September 14, 2007 at 07:28 AM
Most of the time you can be safe downloading an entire site for personal use. However, there might be policies written by sites that expressly forbid the systematic download of their pages, sections or of the site as a whole. E.g. the digital library I worked for had the policy to prohibit the download of entire books.
Posted by: rdx | September 14, 2007 at 09:16 AM
BOY... I hope this works. My own website is on Google, but their backgrounds don't copy under Safari.
I'd like to get my website _off_ Googlepages and ON to my own ISP host.
==Marjorie
Posted by: Marjorie | September 14, 2007 at 11:10 AM
It worked when I tested it on a small web site I own, only about 6 or 8 web pages. I decided not to try it on eogn.com as there are thousands of pages there. I back that up using other methods.
- Dick Eastman
Posted by: Dick Eastman | September 14, 2007 at 11:34 AM
We all have our favorites for offloading genealogy info from the Web. I have used SiteSucker (Mac) once, with great success for static .html pages. On a PC I use EasyWebSave ($10) for saving single whole pages - just right-click & choose it, the page is saved, filed in its special directory, and I move on. For Interactive [a selection on a dynamic page] info-saving, I use Techsmith's Snagit 'text capture' to the clipboard, coupled with Fookes' NoteTab Pro, which has a 'Pasteboard Feature' that append-saves whatever is sent to the clipboard, semi-automatically building a .txt-file of just those snippets on 'Uncle Roger.' (This method is also great for interactive lookup comparison pages as in MyHeritage.com or GenCircles.com that often won't 'copy/paste'.) When I am done, I search the contents of my findings/files with Gaviri's PocketSearch, which not only covers all my hard-disks, but also zip drives, thumb-drives, etc. and highlights the search-word showing full-context. I have tried all the 'big' desktop search engines, but they are always grinding away in the background, slowing everything down to the point of aggravation. PocketSearch is the only one I know of that is small, nimble (quick returns) and searches beyond my regular hard disks. None of these are Free (except SS), but are low cost, and IMHO are worth the money in terms of time and effort saved.
Posted by: Ed | September 14, 2007 at 12:38 PM
When you say entire website, does that mean files that are being stored, but not shown, on my website? I have a website that I often use to store files I plan to access from remote locations. I upload with FTP a Word document or a video, for example, and then download it later by entering the URL address of the document on my website. For example: "www.mywebsite.com/mydocument.doc". I know there are other ways to do this, but it is so easy this way, and some of my files are too large for sending via email.
These are private files and I don't want to share them with other people. However, my html pages are in the same folder on my website and are visible on my site. Am I at risk of these being copied with this procedure you are talking about? Would burying them one more level deeper into a folder on my website help? For example: "www.mywebsite.com/mydrivewayfolder/mydocument.doc"???
Posted by: Marilyn | September 14, 2007 at 12:53 PM
On your recommendation this morning I tried out SiteSucker. WOW!
I am taking over as a County Coordinator for a GenWeb site and have been trying for 2 weeks to get all the files [18,500] downloaded to my Mac. Last Tuesday I got to the half way point. Yesterday, in 7 hours, I finally reached the 2/3's complete goal.
Today, with SiteSucker, the whole job [STARTING OVER!] is complete in 6 hours. AND, it gave me an error log to show which files were missing, apparently basing that determination on an analysis of links within the files.
The file count is almost exactly 1000 pages smaller than two different FTP programs found. I'm thinking that SiteSucker downloaded only those files linked to the webpage and did not download any that were formerly linked but aren't any more. That's my guess but I'm planning to contact Rick Cranisky at SiteSucker and double check.
I'm not throwing anything out until the revision is complete, just in case.
SiteSucker is GREAT! Thanks for the "heads up", Dick.
Posted by: Suzie | September 14, 2007 at 04:27 PM
You are correct: SiteSucker (and most other programs that download entire sites) can only find pages that have other pages linking to them. SiteSucker follows links. It has no other method of finding web pages.
Any pages that are "marooned" (unlinked) will not be found by the various web site download programs.
Actually, that is a benefit. If you have an FTP listing of ALL pages and then compare that list against the files that SiteSucker retrieves, you can easily identify marooned pages: they are the ones listed by FTP but not found by SiteSucker.
- Dick Eastman
Posted by: Dick Eastman | September 14, 2007 at 04:33 PM
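The comparison described in the reply above can be sketched in a few lines of Python. This is an illustration only: the page names, the link table, and the FTP listing are invented, and the crawl is a plain breadth-first walk over links, standing in for whatever a real downloader does.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "index.html": ["surnames.html", "cemeteries.html"],
    "surnames.html": ["index.html", "smith.html"],
    "cemeteries.html": ["index.html"],
    "smith.html": [],
    "old-census.html": ["index.html"],  # no other page links TO this one
}

# Hypothetical FTP listing: every file that is actually on the server.
FTP_LISTING = set(SITE_LINKS)


def reachable_pages(start, links):
    """Follow links breadth-first and return every page that can be reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


crawled = reachable_pages("index.html", SITE_LINKS)
marooned = sorted(FTP_LISTING - crawled)
print(marooned)  # ['old-census.html']
```

The marooned pages are simply the files in the FTP listing that the link-following crawl never reached.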
I have been keeping blogs various places for years, and today used SiteSucker to quickly archive them on my Mac, allowing me to search and edit everything... very basic example of the utility of this program with one's own web content.
Posted by: Lee | November 24, 2007 at 06:07 PM