Summary:
This project is a website spider and archiver. The archiver stores every website it crawls in our database. It then recrawls each website periodically, compares each page against our archived version, and stores a new copy of any page that has changed.
Details:
The database will hold the list of URLs for the spider to crawl. The URLs will look like: www.scriptlance.com, www.domain.com.
The spider’s purpose is to crawl each website and retrieve its HTML code for replication (similar to archive.org). Every page of the website should be archived (i.e., the spider scans each page for internal URLs, follows them, and logs each page's HTML to the database).
Once a website is fully crawled and archived in the database, subsequent crawls check for changes. If any page on the site has changed, the spider archives the latest version of that page by storing its current HTML code.
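The link-discovery step could be sketched roughly as follows. This is only an illustration in Python using the standard library; the names LinkExtractor and internal_links are hypothetical, and treating "internal" as "same host as the page being crawled" is an assumption, not a requirement stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the absolute URLs found in `html` that share page_url's host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so each page is queued once.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Assumption: only http(s) links on the same host count as internal.
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute)
    return links
```

The spider would call internal_links on each fetched page and add any URLs it has not yet seen to its crawl queue. With this same-host rule, www.domain.com and domain.com count as different sites, so the real rule may need to be looser.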
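One possible way to decide whether a page has changed is to compare a hash of the freshly fetched HTML against a hash of the archived HTML. The sketch below assumes Python and SHA-256; the function names are hypothetical, and an exact comparison like this is an assumption (it will flag pages whose HTML differs only in trivial ways, such as an embedded timestamp).

```python
import hashlib


def content_hash(html):
    """Return a SHA-256 hex digest of the page's HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(archived_html, current_html):
    """True if the current HTML differs from the archived version."""
    return content_hash(archived_html) != content_hash(current_html)
```

Storing the hash alongside each archived version would let later crawls compare against it without loading the full archived HTML.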
The archiver will store the following from each page:
Page Title, Page URL, Date Archived, HTML Code (for the purpose of replication on our website).
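To illustrate, the four fields above could map onto a single table. This sketch assumes SQLite through Python's sqlite3 module purely for demonstration; the table name, column names, and helper functions are hypothetical, and the real database may differ.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per archived version of a page.
SCHEMA = """
CREATE TABLE IF NOT EXISTS archived_pages (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    page_url      TEXT NOT NULL,
    page_title    TEXT,
    date_archived TEXT NOT NULL,
    html_code     TEXT NOT NULL
)
"""


def open_archive(path=":memory:"):
    """Open (or create) the archive database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_page(conn, page_url, page_title, html_code):
    """Insert a new archived version stamped with the current UTC time."""
    date_archived = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO archived_pages (page_url, page_title, date_archived, html_code) "
        "VALUES (?, ?, ?, ?)",
        (page_url, page_title, date_archived, html_code),
    )
    conn.commit()


def latest_version(conn, page_url):
    """Return (page_title, date_archived, html_code) of the newest version, or None."""
    return conn.execute(
        "SELECT page_title, date_archived, html_code FROM archived_pages "
        "WHERE page_url = ? ORDER BY id DESC LIMIT 1",
        (page_url,),
    ).fetchone()
```

Keeping every version as its own row, rather than overwriting, preserves the history of a page across crawls.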
Replication:
For our purposes, replication means that we can use the HTML code stored in the database to show a version of the website to our users. The replicated page should look nearly identical to the page the spider crawled. If images are no longer available, that is fine. Take a look at the Wayback Machine at archive.org to see what I mean by this.
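One way a stored page might be made to render close to the original is to insert a base tag pointing at the original page URL, so that relative links to images and stylesheets resolve against the source site. This is only a sketch of one option, assuming Python; add_base_tag is a hypothetical helper, and nothing here prescribes this approach.

```python
import re
from html import escape


def add_base_tag(html, page_url):
    """Insert <base href=...> so relative URLs resolve against the original site."""
    base = '<base href="%s">' % escape(page_url, quote=True)
    # Place the tag right after the opening <head ...> tag when one exists.
    new_html, count = re.subn(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + base,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # Assumption: with no <head>, prepending the tag is acceptable.
        new_html = base + html
    return new_html
```

If an image or stylesheet has since been removed from the original site, it simply fails to load, which matches the tolerance for missing images described above.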