Posts Tagged ‘crawling’

Crawling Engine (java)

January 26th, 2012 Comments off

Looking for someone to IMPROVE an existing web crawling engine, coded in Java.

Data Crawling / Data Mining / Screen Scraping

January 6th, 2012 Comments off

I need you to download all the data in a local directory website. (Company names, addresses, telephone numbers, urls, emails etc.).

Data Crawling / Data Mining / Screen Scraping

December 22nd, 2011 Comments off

Similar to last time, I need to download all the data in a local directory website into an excel spreadsheet. (Company names, addresses, telephone numbers, urls, emails etc.).

Anbu, I am only inviting you for this because I want to work with you again. Let’s please do this for $40. (Your price was $50 last time.)

Web Data Scraping / Data Crawling / Screen Scraping

August 10th, 2011 Comments off

I need web/data scraping work from various websites:

- apartment/housing rental sites
- vacation rental sites
- I can also provide site examples via PMB

I would need the address, price, bedrooms, baths, features, property name, term, description, etc., and would need the data inserted into the proper categories/fields on my website. This could be a one-time job, or we could talk about options where data is sent automatically, etc. You can talk to me via PMB about options.

Depending on how things go and the bid amount accepted there could be a lot more of this work in the future.

Data Crawling / Data Mining / Screen Scraping

August 9th, 2011 Comments off

I need you to download all the data in a local directory website. (Company names, addresses, telephone numbers, urls, emails etc.).

I just need the data in an excel spreadsheet, nothing else.

PM me if you want to see the website. The website is in a foreign language, but should not be a problem.

You can keep your script, I may come back to you later to re-run it. Also, if everything works well, I will need to repeat this project with two other websites.
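For the spreadsheet deliverable, a minimal sketch in Python could write the scraped records as a CSV file that Excel opens directly. The field names and the `to_csv_bytes` helper are assumptions made for illustration; the UTF-8 byte-order mark is included because the post mentions a foreign-language site, and Excel generally relies on it to detect UTF-8.

```python
import csv
import io

# Hypothetical column names, based on the fields listed in the post.
FIELDS = ["company", "address", "telephone", "url", "email"]


def to_csv_bytes(records):
    """Serialise a list of dicts to CSV bytes with a UTF-8 BOM for Excel."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing fields are written as empty cells (DictWriter's default).
        writer.writerow(record)
    return buffer.getvalue().encode("utf-8-sig")


if __name__ == "__main__":
    sample = [{"company": "Ünal Ltd", "email": "info@example.com"}]
    with open("directory.csv", "wb") as handle:
        handle.write(to_csv_bytes(sample))
```

Writing CSV rather than a native .xlsx file keeps the sketch within the standard library; a real delivery might need a spreadsheet library instead.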

Data Crawling / Data Mining / Get Data In Local Directory

August 6th, 2011 Comments off

You will get all the data from eyp.ph, import it and export it to e-syndicat directory without errors. The script will also be sent to me for future use.

Web Crawling / Spidering Scripts

July 21st, 2011 Comments off

My client is looking for web crawling/spidering scripts in C# and ASP.NET. The script will crawl the Dice and Monster websites and search resumes by keyword (one text box). The script will then crawl the results and display them on our page (not on Dice or Monster).

Technology used: C#, ASP.NET
Time Frame: 15-20 Days

Modify Existing Crawling Script In Php Code

June 22nd, 2011 Comments off

I have a scraping script that crawls products, and I need to modify it.

1) The script should create category-subcategory files and image files in my DB.
2) The script should scrape the title properly into my SQL database. At the moment the title does not appear on my website. It should also convert it to a unique title.
3) It should not crawl products that are out of stock.
4) It should scrape 60 products per category-subcategory each time. When it completes the category-subcategory cycle, it should go on to scrape another 60 products, and so on.
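The batching in point 4 could be sketched as follows. The existing script is in PHP; this illustration uses Python, and the `batches` helper and its input shape (a mapping from category to its products) are assumptions rather than anything taken from that script.

```python
from itertools import islice


def batches(products_by_category, size=60):
    """Yield (category, batch) pairs, one batch per category on each pass.

    After every category has contributed a batch, the cycle starts again
    with the next `size` products, until all categories are exhausted.
    """
    iterators = {name: iter(items) for name, items in products_by_category.items()}
    while iterators:
        for name in list(iterators):
            batch = list(islice(iterators[name], size))
            if batch:
                yield name, batch
            else:
                # This category has no products left; drop it from the cycle.
                del iterators[name]


if __name__ == "__main__":
    catalogue = {"shoes": range(130), "hats": range(45)}
    for category, batch in batches(catalogue):
        print(category, len(batch))
```

Out-of-stock products (point 3) could be filtered out before they are passed to `batches`.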

Web Crawling And Data Retrieval In Excel

April 8th, 2011 Comments off

Hi,
I need to crawl data from a few websites and have it in Excel format. Anyone willing to do it for me is welcome. Please contact me by email for the list of websites that I want to crawl.

Regards,
timmy guerra

Crawling Job

November 7th, 2009 Comments off

Need to create an Australian list of URLs and crawl certain pages from them and extract this into a mysql database.

1. Upload all of the RDF dump from dmoz.org as a starting point.
2. Extract the URLs, descriptions and categories from this data dump.
3. This needs to be inserted into a MySQL DB table called rawurls.
4. We need to copy all URLs ending in .com.au, .net.au, .org.au, .gov.au, .edu.au into another table called australianurls.
5. We need to ping all URLs in rawurls and insert the resulting IP address into the rawurls table alongside each URL.
6. We then need to import all IP addresses that are currently assigned to Australia in order to find websites that are hosted in Australia but don't have .au extensions (i.e. there are many .com addresses owned by Australian companies and hosted in Australia). You can find GeoIP address tables here: http://www.maxmind.com/app/geolitecountry
7. Select all rows from rawurls whose IP addresses fall within the Australian IP ranges and copy them into australianurls.
8. Crawl each site in australianurls, extract the description, "About us" and "Contact us" pages, and insert them into the australianurls table.
9. While crawling, insert all external links found into rawurls.
10. Repeat steps 4-7 daily via cron.
11. We will need a basic admin web page which shows how many URLs have been found, crawled and inserted into australianurls each day.
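The two selection rules in the list above (an Australian domain suffix, or an IP address inside an Australian range) could be sketched like this. The network ranges below are reserved documentation blocks standing in for real GeoIP rows, and the helper names are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

# Suffixes named in the brief.
AU_SUFFIXES = (".com.au", ".net.au", ".org.au", ".gov.au", ".edu.au")

# Placeholder ranges (reserved documentation blocks), standing in for the
# Australian rows of a GeoIP country table.
AU_NETWORKS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")]


def hostname(url):
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def has_au_suffix(url):
    """True if the host ends in one of the listed Australian suffixes."""
    return hostname(url).endswith(AU_SUFFIXES)


def is_au_hosted(ip):
    """True if the IP address falls inside one of the Australian ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in AU_NETWORKS)


def select_australian(rows):
    """rows: iterable of (url, ip) pairs as they might be stored in rawurls."""
    return [url for url, ip in rows if has_au_suffix(url) or is_au_hosted(ip)]
```

In a real build the range check would more likely be done in MySQL against the imported GeoIP table; the in-memory list here only illustrates the logic.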

The crawler must be multi-headed/threaded, i.e. it must be able to run 1-100 crawlers at once.

It could be written in any suitable language (PHP, Python, Java, Ruby on Rails, etc.) as long as it scales.

Web Crawling Application

November 5th, 2009 Comments off

Task: Web Crawling application to build

Guidelines:

Important: Please read the whole document and then reply to the questions at the end of these guidelines along with your bid.

1. Introduction and Application Functions.

Our client is working on a research study on the etymology of the domain names used by businesses.
A business = any organization selling a product or service to other businesses and/or to customers. (B2B + B2C)

They hired us (now, it is your mission) to develop an automated application which will find (based on a list of 10 million domain names) which of those domains are being used by businesses and which are not. In other words, to differentiate the active “business websites” from the other domain names (by other domain names, we mean: non-active websites or non-used domains, news and general info sites, personal websites, school websites, non-profit organization websites, forums, blogs, directories, and so on…)

You need to develop an automated online (server based) application which will “crawl” the active websites from that list of domain names and analyze their “navigation elements” (you may know those as “menu” or “website categories”) to check if they (they = the “navigation elements”) contain the words commonly used by businesses.

In more details, the application will:

a) “validate” each URL to check if it corresponds to an active (online) website.

b) locate the “navigation elements” (menu or website categories) of each active website.

c) check if the “menu” of each website contains at least one of the words commonly used by businesses for their website (we will call them “key words”), such as:

Company.
Business.
Products.
Services.
Consulting.
Clients.
Pricing.
Customers.
Portfolio.
Reference.
Quote.
Career
Team.
Management.
Jobs
Partners

If one of those common business “key words” is used in their “menu”, this is enough to imply that there is a high chance that the website may belong to a business and is not a personal site, blog or directory,…

So, if at least one of those words is located in the menu of the website, the application will return a positive answer and add the URL to the “positive” list of “domains used by businesses”. It will then go back to step a) with the next URL.
The application database needs to include for each positive answer the “key word” which matched.

This needs to be a “broad match”. For example, if the app. is looking for “Service” and finds “Our Services” on the website's “menu”, it needs to be validated positively as well.
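A minimal sketch of this broad match, assuming the menu labels have already been extracted as plain strings, is a case-insensitive substring test that also reports which key word matched. The function name `match_menu` is hypothetical.

```python
# Key words taken from the list in the brief.
KEY_WORDS = [
    "Company", "Business", "Products", "Services", "Consulting", "Clients",
    "Pricing", "Customers", "Portfolio", "Reference", "Quote", "Career",
    "Team", "Management", "Jobs", "Partners",
]


def match_menu(menu_labels, key_words=KEY_WORDS):
    """Return the first key word found inside any menu label, or None."""
    for label in menu_labels:
        lowered = label.lower()
        for word in key_words:
            if word.lower() in lowered:
                return word
    return None
```

A plain substring test is deliberately loose: “Team” would also match a label such as “Steam”, so a production version would probably need word-boundary or stemming rules.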

2. “Crawling” Formula.

In order for the application to complete the task described previously in 1.c , we propose the following “formula”:

a) the application will collect all “inside links” on a homepage. By “inside links” we mean links going to a page on the same domain name, including javascript and dhtml links (everything using the <a> tag in .html and staying in the same domain)

b) the application will collect the target file names on those links. For example in
“http www. url . com / products.php”, “products” is the target file name on the link.
It will also collect the anchor text on those links, the folder name on those links if there is any and so on… (We will call them “link names”)

c) the application will check if 80% of the letters from at least one of the “key words” are included in the “link names”. If it matches, it will assign a “positive” response to the url.
For example, if we take the “key word”, “service”, all the following “link names” will return a positive match:
services.php
ourservices.html
/servicesweoffer/
The reason why we use 80% and not 100% (from the letters in the “key words”) is to include plurals or singulars, spelling mistakes and word declinations (mainly for other languages where accents are used)
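The 80% rule leaves some room for interpretation. One possible reading, sketched here in Python, is that a link name is positive when it shares a contiguous run of letters with the key word that is at least 80% of the key word's length. The function names are hypothetical, and this is only one candidate “formula”, not the definitive one.

```python
import math
from difflib import SequenceMatcher


def longest_shared_run(key_word, link_name):
    """Length of the longest contiguous run of characters common to both."""
    a, b = key_word.lower(), link_name.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def is_positive(key_word, link_name, threshold=0.8):
    """True if the shared run covers at least `threshold` of the key word."""
    needed = math.ceil(threshold * len(key_word))
    return longest_shared_run(key_word, link_name) >= needed


if __name__ == "__main__":
    for name in ("services.php", "ourservices.html", "/servicesweoffer/", "about.html"):
        print(name, is_positive("service", name))
```

For “service” (7 letters) this requires a shared run of 6 letters, so the three examples above pass and a truncated spelling such as “servic.php” still matches. Accented variants in other languages would need normalisation first, which this sketch does not attempt.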

This is a proposition. If you find a better “formula”, you are welcome to propose it.

You need to take a lot of time and to complete a lot of tests in order to design the best “formula” for the “crawling” as it is the most important part of the application process.

You need to develop a multi-threaded application which will be able to “crawl” millions of urls effectively.
It needs to integrate a powerful multi-threaded “crawling” process, a “queue” and “scheduler” management system and a strong database storage system.
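As a rough illustration of the multi-threaded part, the sketch below runs a per-URL check on a pool of worker threads and sorts the outcomes into positive, negative and failed lists. `check_url` is a hypothetical caller-supplied function that returns the matched key word or `None`; the queue, scheduler and database layers mentioned above are not shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, check_url, workers=20):
    """Run check_url(url) for every URL on a pool of worker threads.

    Failures are recorded rather than raised, so that one unreachable
    site does not stop the whole run.
    """
    positives, negatives, errors = {}, [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(check_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                word = future.result()
            except Exception as exc:
                errors[url] = repr(exc)
                continue
            if word:
                positives[url] = word  # remember which key word matched
            else:
                negatives.append(url)
    return positives, sorted(negatives), errors
```

Submitting every URL up front is fine for a small list but would hold millions of futures in memory; for the 10-million-domain list, the URLs would need to be fed to the pool in chunks from a persistent queue.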

3. “Introduction” Pages.

Many websites have intro pages (static or in flash) where the “menu” doesn’t appear. You usually need to click on a “Skip intro” or “Enter website” link to go to the main site.

In order to use the right elements for the “crawling”, the application needs to differentiate an “intro page” from a “main homepage”.

The “formula” we propose for that task is the following:
a) When it arrives on a website, the app. will count the number of “inside links” on the “landing” page. (first level)
b) If the number of “inside links” on the landing page is 3 or below, it will go to step c). If the number of those links is over 3, it will proceed with the normal “crawling formula” process.
c) It will go to one of the “inside pages” (if any) by following one of the “inside links” from the landing page. It will proceed with the “crawling” on both the “landing page” and one of the “inside page(s)”.
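Steps a) and b) could be sketched with the standard-library HTML parser: collect the `<a>` links that stay on the same host, then treat a page with three or fewer of them as a probable intro page. The class and function names are hypothetical, and comparing exact host names is a simplification (for example, `www.` and bare domains count as different hosts here).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def inside_links(base_url, html):
    """Return the distinct links that point to the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).hostname
    found = []
    for link in collector.links:
        if urlparse(link).hostname == host and link not in found:
            found.append(link)
    return found


def looks_like_intro_page(base_url, html, max_links=3):
    """Apply the proposed rule: 3 or fewer inside links suggests an intro page."""
    return len(inside_links(base_url, html)) <= max_links
```

Links that only exist after JavaScript runs, or inside Flash, are invisible to this kind of static parsing, so the threshold would need testing on real sites.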

Again, this is a proposal. If you find a better “formula”, you are welcome to propose it.

4. Web based Interface.

Along with the application process, you need to build a web-based interface which will help us to monitor the “crawling” and allow us to extract the lists.

The main functions of the web-based interface will be:

a) URL list Import. (lists will be in .txt or in .csv)
The application needs to provide 2 types of imports:
a) import by uploading the URL lists directly to the online application from the computer.
b) import by uploading the URL lists to a directory on the server by ftp (more convenient for large lists).

There needs to be a page where we can check the status of the import and see the number of URLs that were found in the imported lists.
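The import step could be sketched as a small parser that accepts the text of a .txt or .csv list, drops blanks and duplicates, and returns the URLs so that the status page can report how many were found. It assumes one URL per line for .txt files and the URL in the first column for .csv files; neither layout is specified above, and the function name is hypothetical.

```python
import csv
import io


def read_url_list(text, is_csv=False):
    """Return the de-duplicated URLs found in the text of an imported list."""
    if is_csv:
        # Assumption: the URL sits in the first column of each row.
        candidates = (row[0] for row in csv.reader(io.StringIO(text)) if row)
    else:
        candidates = text.splitlines()
    urls, seen = [], set()
    for raw in candidates:
        url = raw.strip()
        if url and url.lower() not in seen:
            seen.add(url.lower())
            urls.append(url)
    return urls


if __name__ == "__main__":
    found = read_url_list("example.com\n\nexample.org\nExample.com\n")
    print(len(found), "URLs found")
```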

b) “Key Words”

The “Key Words” are the words commonly used by businesses in the “menu” (the menu = the navigation elements) of their website.

We need to be able to add/modify/delete words in the list of “Key Words” which will be used by the application for the “crawling” process.

(The “Key Words” can be from different languages as sites are in different languages)

c) The “Crawling” Process.

We need to be able to start/pause/end and check the status of the “crawling” process from the online interface.

d) List Extraction and Statistics

Once the “crawling” has been completed, we need to be able to export the list of negative (non-match) and positive URLs.

We also need to be able to export the “positives” lists for each keyword match and have access to a detailed statistics page about the “crawling”.

5. Important Remarks:

- A dedicated Linux server will be provided.

- These are the “general” guidelines for the application you have to build. We did not go into many details, because we want to give you a lot of freedom to build the application the way you think will be most suitable for the task. Please develop a powerful, logical, efficient and easy-to-use application.

- If we assign the project to you, we will place the funds into escrow and then you will be required to submit to us a complete framework proposal (detailed specifications) about the way you will build the application, the “crawling” formula, the testing, the programming language and technology along with the list of features you will include in the application. This will allow us to check before you start “programming”, that your “solution” suits us.

- PLEASE REPLY TO THE FOLLOWING 4 QUESTIONS ALONG WITH YOUR BID:

1) Have you already created or worked with “crawling” applications?

2) What do you think about the “crawling formula” we suggested?

3) What programming language(s) would you use to develop the application?

4) In what timeframe can you
a) provide your framework proposal (detailed specifications about the “structure” of the application you intend to build) (in .pdf or .doc)
and b) develop the application? (please provide an estimate)

Thank you for your bid.

Crawling Script

July 21st, 2009 Comments off

We have a keyword-based do-follow-blog-inator

a) we give a list of blog sites, i.e. wordpress, disqus, etc
b) we crawl looking for anywhere that has “do follow” blog comments
c) we can narrow search for certain keywords
d) we also list page rank
e) the result is a list of “yes” blog posts with dofollow and PageRank

results saved so we can do different searches
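The existing tool is written in PHP/cURL; purely as an illustration of step b), the Python sketch below separates links whose `rel` attribute contains `nofollow` from those that do not. It assumes the caller passes in only the HTML of a post's comment section, and the names are hypothetical.

```python
from html.parser import HTMLParser


class FollowedLinkFinder(HTMLParser):
    """Sort <a href> links into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def classify_links(comment_html):
    """Return (followed, nofollow) link lists for a block of comment HTML."""
    finder = FollowedLinkFinder()
    finder.feed(comment_html)
    return finder.followed, finder.nofollow
```

A post would count as “do follow” when at least one commenter link lands in the first list. Locating the comment section itself differs per blog platform and is left out.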

WE WANT TO ADD NEW FEATURES:
- a) other sites/blogs/google crawl
- b) do “all” and no keyword?

It's written in PHP/cURL. Should be max 1 hour of work for an EXPERT!

Happy bidding!

Data Crawling From Web

July 14th, 2009 Comments off

Hi,

We have a small requirement: capture web content from a given URL in a .NET application and store the data in an Excel sheet. For example, the website http://www.nbed.nb.ca/schooldirectory lists school names and their addresses, and we need to capture that data into an Excel sheet.

Crawling Script

June 22nd, 2009 Comments off

I want a keyword-based dofollowblog-inator

a) we give a list of blog sites, i.e. wordpress, disqus, etc
b) we crawl looking for anywhere that has “do follow” blog comments
c) we can narrow search for certain keywords
d) we also list page rank
e) the result is a list of “yes” blog posts with dofollow and PageRank

results saved so we can do different searches

Web Crawling

May 29th, 2009 Comments off

I am looking for a developer to create a custom crawler/spider capable of continuously crawling 1000-2000 sites per week.

1. Search 2000 sites.
2. Option to set the crawl frequency for each site
3. Option to search the whole site or selected folders of a site
4. Option to add a username and password for a site
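The four options could be modelled as a per-site configuration record, with a helper that picks the sites due for a crawl. All names below (`SiteConfig`, `due_sites`, the field names) are hypothetical and only sketch one way to hold these settings.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class SiteConfig:
    url: str
    every: timedelta                                   # crawl frequency for this site
    folders: List[str] = field(default_factory=list)   # empty list = whole site
    username: Optional[str] = None                     # optional login for the site
    password: Optional[str] = None
    last_crawled: Optional[datetime] = None

    def is_due(self, now):
        """A site is due if never crawled or its interval has elapsed."""
        return self.last_crawled is None or now - self.last_crawled >= self.every

    def in_scope(self, path):
        """True if the path should be crawled under the folder restriction."""
        return not self.folders or any(path.startswith(f) for f in self.folders)


def due_sites(sites, now):
    """Return the sites whose crawl interval has elapsed at time `now`."""
    return [site for site in sites if site.is_due(now)]
```

A scheduler loop would call `due_sites` periodically, crawl each returned site within its `in_scope` paths, and update `last_crawled` afterwards.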

Site Crawling And Reporting

May 22nd, 2009 Comments off

So I have a site with lots of pages. I need a script that can crawl the site, build a sitemap – should be navigable (some , report broken links (404s, 301s, and 302s) comprehensively.

It should be runnable on any OS (Windows / Debian Linux) and must be simple to set up.
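The reporting part could be sketched as a grouping step over already-checked links. It assumes the crawler has produced `(source_page, link, status_code)` tuples; the function name and input shape are hypothetical, and fetching the pages is not shown.

```python
from collections import defaultdict

# Status codes the post asks to have reported.
REPORTED = {
    404: "not found",
    301: "moved permanently",
    302: "found (temporary redirect)",
}


def link_report(checked):
    """Group (source_page, link) pairs by reported HTTP status code."""
    report = defaultdict(list)
    for page, link, status in checked:
        if status in REPORTED:
            report[status].append((page, link))
    return dict(report)


if __name__ == "__main__":
    sample = [
        ("/index.html", "/old.html", 301),
        ("/index.html", "/missing.html", 404),
        ("/index.html", "/about.html", 200),
    ]
    for status, links in sorted(link_report(sample).items()):
        print(status, REPORTED[status], links)
```

Note that many HTTP clients follow redirects automatically, so capturing 301 and 302 responses usually requires turning that behaviour off in whichever client the crawler uses.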

Bear