Contact Scrape / Parse
We need you to build a crawler / scraper / parser that collects contact info from cached files of this web site and inserts it into a MySQL database.
The system must loop through folders (the files are split across folders to prevent system lag) and parse each file for the following information:
id: *the filename without .html, so the id for cvffj2.html is cvffj2
name:
alias: *also does business as / alternate business name
address:
city:
state:
zip:
phone:
url:
sic: *text not code
naics:
line_of_business: *line of business
started: *year started
state_of_inc: *state of incorporation
type: *location type
employees:
employees_at_location:
contact_name:
contact_title:
stock_symbol:
stock_exchange:
parent_companies: *There can be more than one; look for the id of each and separate them with a comma “,”
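To illustrate how the fields above might map to a database row, here is a minimal sketch. The table name `contacts`, the helper names, and the choice of PDO are assumptions and not part of this spec; the column names simply mirror the field list.

```php
<?php
// Column names mirror the field list in this brief.
const CONTACT_FIELDS = [
    'id', 'name', 'alias', 'address', 'city', 'state', 'zip', 'phone', 'url',
    'sic', 'naics', 'line_of_business', 'started', 'state_of_inc', 'type',
    'employees', 'employees_at_location', 'contact_name', 'contact_title',
    'stock_symbol', 'stock_exchange', 'parent_companies',
];

// Hypothetical helper: collapse a list of parent company ids into one
// comma-separated value, dropping blanks and duplicates.
function joinParentIds(array $ids): string
{
    $clean = array_filter(array_map('trim', $ids), 'strlen');
    return implode(',', array_unique($clean));
}

// Hypothetical helper: build a parameterized INSERT and its bound values.
// Fields missing from $record are bound as NULL. The table name is assumed.
function buildInsert(array $record, string $table = 'contacts'): array
{
    $columns = '`' . implode('`, `', CONTACT_FIELDS) . '`';
    $placeholders = ':' . implode(', :', CONTACT_FIELDS);
    $params = [];
    foreach (CONTACT_FIELDS as $field) {
        $params[':' . $field] = $record[$field] ?? null;
    }
    return ["INSERT INTO `$table` ($columns) VALUES ($placeholders)", $params];
}

// Hypothetical helper: run the insert through an existing PDO connection.
function saveContact(PDO $pdo, array $record): void
{
    [$sql, $params] = buildInsert($record);
    $pdo->prepare($sql)->execute($params);
}
```

Binding values through a prepared statement, rather than concatenating them into the SQL, keeps scraped text with quotes or other special characters from breaking the query.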
The script must be written in PHP. There are three different page layout styles (that I have found so far) which are broken down in the styles.txt attachment. The script should be able to detect if the page doesn’t match one of the given layouts and log the file so we can look at the page to see what is different.
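One possible shape for the layout check and the log of unmatched files is sketched below. The actual layout markers are described in styles.txt, so the marker strings, layout names, and function names here are placeholders invented for illustration only.

```php
<?php
// Placeholder detectors: one callable per layout, each returning true when
// the page matches. The real tests must come from styles.txt; these marker
// strings are made up.
function exampleDetectors(): array
{
    return [
        'layout_a' => function (string $html): bool {
            return strpos($html, '<!-- layout-a-marker -->') !== false;
        },
        'layout_b' => function (string $html): bool {
            return strpos($html, '<!-- layout-b-marker -->') !== false;
        },
        'layout_c' => function (string $html): bool {
            return strpos($html, '<!-- layout-c-marker -->') !== false;
        },
    ];
}

// Return the name of the first layout whose detector matches, or null
// when the page matches none of the known layouts.
function detectLayout(string $html, array $detectors): ?string
{
    foreach ($detectors as $name => $matches) {
        if ($matches($html)) {
            return $name;
        }
    }
    return null;
}

// Append the path of an unrecognized page to a log file for manual review.
function logUnmatched(string $path, string $logFile): void
{
    $line = date('c') . "\t" . $path . PHP_EOL;
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

With this arrangement, a page for which `detectLayout()` returns null is skipped by the parser and passed to `logUnmatched()`, and a newly discovered layout only needs one more entry in the detector list.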
The attached Sample.zip has the folder structure we are using with fewer files.
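The folder loop and the id rule (filename without .html) could look roughly like the following. This is a sketch that assumes the cached pages end in .html and sit at any depth under a single root folder; the function names and the root path in the usage comment are hypothetical.

```php
<?php
// Derive the record id from a cached file path: "cvffj2.html" -> "cvffj2".
function contactIdFromPath(string $path): string
{
    return pathinfo($path, PATHINFO_FILENAME);
}

// Walk every subfolder under $root and yield the path of each .html file,
// one at a time, so the full file list is never held in memory.
function cachedHtmlFiles(string $root): Generator
{
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'html') {
            yield $file->getPathname();
        }
    }
}

// Usage (the root folder name is an assumption):
// foreach (cachedHtmlFiles('/path/to/Sample') as $path) {
//     $id = contactIdFromPath($path);
//     $html = file_get_contents($path);
//     // ... detect layout, parse fields, insert row ...
// }
```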
Please send a small PHP code sample in the PMB, so I can get an idea of your coding style, and feel free to ask me any questions you need for clarification.
Thanks


