Hi,
First: I need a web crawler. It does not have to be written in Java; I only want to use it on Windows, and I want the crawler to be really fast, really really fast!
I have a list of required features (I even have a sketch of the desired user interface):
- The crawler should extract all links from a website
-> Required option: if I want, the crawler should only extract links that contain a term entered in a text field (if I write “web” in the field, it should only extract links containing the word “web”)
-> Required option: if I want, the crawler should only extract internal links (links on the same site, no external links)
- If I want, the crawler should extract the text of HTML tags (each activated by a checkbox)
-> Checkbox: title (extracts the text of the title tag)
-> Checkbox: description (extracts the text of the meta tag description)
-> Checkbox: keywords (extracts the text of the meta tag keywords)
-> Checkbox: Body (extracts all text between the body tags)
-> Checkbox: Body only text (extracts ONLY text, all html tags will be stripped)
-> Checkbox: H1 Tag
-> Checkbox: H2 Tag
-> Checkbox: H3 Tag
-> Checkbox: b (for bold) tag
- The crawler should take the links to crawl from:
-> A text field in the crawler's user interface
-> or a txt file
- The crawler should write the extracted links, and optionally the extracted tags (like title, description, keywords, h1…), to:
-> A txt file
-> or an HTML file
- I want to set the link depth in a text field
- I want to add my own HTML tags for extraction (besides h1, h2, h3, title, description), such as p, tr, div and so on, so that the crawler extracts the text of these tags. I want to add these tags manually in a text field
- I want statistics in the crawler showing:
-> Runtime (how long has the crawler been running since I pressed the start button?)
-> URLs indexed
-> How many pages does the crawler index per second?
- The crawler should also:
-> read the HTTP status code of each page (like 400, 404, 200…)
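To make the two link filter options from the list above more concrete, here is a rough sketch in Python of the behaviour I have in mind. It is only an illustration, not a required design or language: the names `LinkCollector`, `extract_links`, `keyword` and `internal_only` are made up for this example, and treating "internal" as "same host as the start URL" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, keyword="", internal_only=False):
    """Return absolute links found in html, optionally filtered.

    keyword: if non-empty, keep only links containing it (the text field).
    internal_only: if True, keep only links on the same host as base_url
                   (assumption: "internal" means "same host").
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if keyword and keyword not in absolute:
            continue
        if internal_only and urlparse(absolute).netloc != base_host:
            continue
        links.append(absolute)
    return links
```

With both options left off, every link is returned; with `keyword="web"` only links containing “web” remain, and with `internal_only=True` links to other hosts are dropped.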
I want two more features, but I don’t want to describe them now. More details if you get the job.
I hope someone can help me!