I would like a program/script written in PHP and/or Java (or ANY other language) that will help an author locate instances of copyright infringement on the Internet involving documents that the author has written.
This project would be divided into THREE PHASES.
In PHASE ONE of this project, you would develop the core program code. The program would do the following:
1. Search through a specified folder (on a web server’s hard drive) for any Microsoft Word, RTF, PDF, or plain text documents that may exist in that folder (and, optionally, any sub-folders). The software would then take a user-defined number of random samples of contiguous text (i.e., a user-defined number of consecutive words) from EACH document in that folder/sub-folder.
a. The user should have the ability to specify the amount of “distance” (measured in words) between the samples that are taken, e.g., the user can specify that the samples are to be taken every 250 words, 500 words, 750 words, etc.
b. The user should be able to select the document types that are searched for (Microsoft Word, RTF, PDF, or plain ASCII text documents).
2. The software would then take those samples (strings of text) and query Google.com and/or Docstoc.com (as pre-selected by the user) to find any matches (instances of copyright infringement).
3. For any matches that are found, the software would then log:
a. the URL of the page on which the match is found
b. the TITLE of the HTML page on which the match is found
c. the file name of the source document from the user’s local hard drive
d. the date and time of the query in which the match was found on the offending web site
e. whether the match was found on Google.com or Docstoc.com
f. the exact string of text that was discovered on Google.com or Docstoc.com.
4. The log generated in #3 above should then be exportable as a comma-delimited text file (CSV file). It should also be displayable on screen. The user will select whether to view the report on screen or export it to CSV.
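To illustrate steps 1 and 4 above, here is a minimal sketch in Python (chosen only for brevity; the same logic carries over to PHP or Java). It is an illustration under stated assumptions, not a required design. The function names, the `EXTENSIONS` mapping, and the CSV column names are all hypothetical. The sketch takes samples at a fixed word distance rather than at random positions, assumes the document text has already been extracted (reading Word, RTF, and PDF files would need additional libraries), and leaves out the search-engine queries in step 2 entirely.

```python
import csv
import io
from pathlib import Path

# Hypothetical mapping from user-selectable document types to file extensions.
EXTENSIONS = {
    "word": {".doc", ".docx"},
    "rtf": {".rtf"},
    "pdf": {".pdf"},
    "text": {".txt"},
}

# Hypothetical column names covering log items 3a through 3f.
LOG_FIELDS = ["url", "title", "source_file", "queried_at", "engine", "matched_text"]


def find_documents(folder, doc_types, recursive=True):
    """Return files under `folder` whose extension matches the selected types."""
    wanted = set()
    for doc_type in doc_types:
        wanted |= EXTENSIONS[doc_type]
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(folder).glob(pattern)
        if p.is_file() and p.suffix.lower() in wanted
    )


def sample_text(text, sample_len, distance, max_samples):
    """Take up to `max_samples` runs of `sample_len` consecutive words,
    starting a new run every `distance` words."""
    words = text.split()
    samples = []
    for start in range(0, len(words), distance):
        chunk = words[start:start + sample_len]
        # Stop at the sample limit or when too few words remain for a full sample.
        if len(samples) >= max_samples or len(chunk) < sample_len:
            break
        samples.append(" ".join(chunk))
    return samples


def log_to_csv(rows):
    """Serialize match records (dicts keyed by LOG_FIELDS) as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For example, `sample_text(text, sample_len=8, distance=250, max_samples=10)` would return at most ten 8-word strings, one starting at every 250th word. Each string would then be sent to the selected search site as an exact-phrase query, and any match would be appended as a row for `log_to_csv`.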
*** REQUESTED PMB COMMENTS: I do not really care what programming language you use, although I tend to prefer PHP and Java. If you think another programming language would be better than PHP, please specify in the PMB what language you would use and WHY that would be better than PHP. NOTE: Please state in the PMB how many DAYS it would take you to complete PHASE ONE of this project. ***
In the SECOND PHASE of this project, I would like you to develop a Windows-compatible stand-alone application that the user could install/execute on his computer that will do the same thing as 1-4 above, except that the initial searching and sampling of the documents would take place ONLY on the user’s LOCAL hard drive. No text would be relayed back to the web server. The program would still search for the matches on the Internet in the same way and create a CSV report and screen report.
In the THIRD PHASE of the project, I would like for you to add the capability to the web site version so that the program/script can query the user’s local hard drive and retrieve the samples of the documents on the user’s local computer, and then relay them back to the web server for processing and querying on the Internet. The CSV file and screen report would then be accessible from the web server to the user.
NOTE: I am very price sensitive for this project, so please bid accordingly. If someone bids the right amount and can commit to finishing the job quickly, I will likely end the bidding process early and select that bidder. Please write the word “Excel” in your bid comments so that I know you have read these specs carefully and that you understand English. Thank you for your interest, and I look forward to working with you!