Hi, I’d like to hire someone to perform a “fuzzy” text match on two datasets. That is, I have a set of firm names in Data A (along with “state” and “month”) that I’d like to match to a set of firm names in Data B (along with “state”, “month”, and “day”). The goal is to grab the “day” from Data B for each (Firm Name, State, Month) entry in Data A.
All told, I have about 200-400k records in Data A and about 6,000 in Data B. Please note that the firm names in Data A are “messy”. That is, there may be odd spacing, stray punctuation, abbreviations, and misspelled words. In a few cases, the text field may inadvertently contain additional fields (such as a city name). Data B is fairly clean.
I’m flexible, but I’d like to have some kind of “probability score” that indicates the strength of each match, as well as the best-matched day. I’d also like some manual checks to be done to ensure that the matches with high scores look correct.
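To make this concrete, here is a minimal Python sketch of the kind of output I have in mind. It is only an illustration, not a required approach: the function names `build_index` and `best_match`, the dictionary keys, and the Data B rows shown (including their state, month, and day values) are all made up for the example, and the score from the standard library's `difflib.SequenceMatcher` is a 0-to-1 string similarity rather than a true probability.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(data_b):
    """Group Data B rows by (state, month) so each Data A record
    is only compared against firms in the same state and month."""
    index = defaultdict(list)
    for row in data_b:
        index[(row["state"], row["month"])].append(row)
    return index


def best_match(record_a, index):
    """Return (day, score, matched_name) for the most similar Data B firm,
    or (None, 0.0, None) if no firm shares the record's state and month."""
    best_day, best_score, best_name = None, 0.0, None
    for row in index.get((record_a["state"], record_a["month"]), []):
        # ratio() is a similarity in [0, 1], not a calibrated probability.
        score = SequenceMatcher(
            None, record_a["firm"].lower(), row["firm"].lower()
        ).ratio()
        if score > best_score:
            best_day, best_score, best_name = row["day"], score, row["firm"]
    return best_day, best_score, best_name


# Invented Data B rows, purely for illustration.
data_b = [
    {"firm": "Kerr Drug, Inc.", "state": "NC", "month": 3, "day": 14},
    {"firm": "Guest House Inn", "state": "NC", "month": 3, "day": 2},
]
index = build_index(data_b)

record_a = {"firm": "Kerr Drug Inc", "state": "NC", "month": 3}
day, score, name = best_match(record_a, index)
print(day, round(score, 3), name)
```

Restricting comparisons to firms in the same state and month keeps the work manageable for a few hundred thousand Data A records, and the returned score could be used to pick which matches get a manual check.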
Specifically, the data looks like this:
DATA A:
Firm Name (State and Month are also included, and those fields are clean)
123. ACEP
124. church ville inn
125. Metro Metropolitan State Hospital
126. WNA Wealth Advisors; Inc
127. Valley Plating Works;Inc; Commerce CA
128. Ferrara Fire Apparatus; Inc
129. Guest House Inn
130. BUOT STUDIO; L.L.C
131. AIAA DESIGN/BUILD/FLY TEAM AT AUBURN UNIVERSITY
132. G & B Anderson; Inc
133. Loss Mit Rep
134. Kerr Drug Inc
135. PHH Mortgage (Randstad)
I’d probably limit the firm fields to text fields of 50 characters or so.
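As a rough idea of the cleanup that might be needed before matching, here is a small Python sketch run on a few of the sample names above. The function name `normalize_name`, the list of legal suffixes, and the specific rules are my assumptions for illustration only; the real rules would need to be tuned against the data.

```python
import re

# Hypothetical suffix list; a real version would be extended after
# inspecting the data.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize_name(raw, max_len=50):
    """Lowercase, remove punctuation, collapse spacing, drop legal
    suffixes, and cap the result at max_len characters."""
    text = raw.lower()
    # Join dotted abbreviations first, e.g. "l.l.c" -> "llc".
    text = re.sub(r"(?<=\w)\.(?=\w)", "", text)
    # Turn any run of other punctuation or whitespace into one space.
    text = re.sub(r"[^a-z0-9&]+", " ", text)
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)[:max_len].strip()


samples = [
    "WNA Wealth Advisors; Inc",
    "Valley Plating Works;Inc; Commerce CA",
    "BUOT STUDIO; L.L.C",
    "G & B Anderson; Inc",
]
for name in samples:
    print(repr(normalize_name(name)))
```

Note that this sketch does not remove the stray city in “Valley Plating Works;Inc; Commerce CA”; handling extra fields like that would need a separate step.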
Finally, I’d like to have the underlying software code so that I can rerun the algorithm on future data.
Thanks!