Method and apparatus for assessing similarity between online job listings
US8099415B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 8, 2006 |
| Grant date | Jan 17, 2012 |
| Priority date | — |
| Expiry date | Oct 8, 2028 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/258
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Job listings retrieved from external sources are pre-processed before being stored in the search engine's production database. Inter-source and intra-source hash values are calculated for each job listing and compared. Job listings having the same intra-source hash value are judged to be duplicates of each other. Descriptions whose intra-source hash values do not match but whose inter-source hash values match are judged to be duplicate candidates and are subject to further processing. Suffixes for each such record are stored in a data structure such as a suffix array, and the records are searched and compared based on the suffix arrays. Records having a pre-determined number of contiguous words in common are judged to be duplicates. Duplicate records are thus identified before the data set is stored in the production database.
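The two-stage check described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say which fields feed each hash, which hash function is used, or what the word threshold is, so the field names (`source`, `title`, `company`, `location`, `description`), the use of SHA-1, and the default `min_words=8` are all assumptions made for illustration. A sorted list of word-level suffixes stands in for a real suffix array.

```python
import hashlib
import re
from collections import defaultdict


def _digest(*fields):
    # Normalize each field (lowercase, collapse whitespace) before hashing.
    norm = "\x1f".join(" ".join(f.lower().split()) for f in fields)
    return hashlib.sha1(norm.encode("utf-8")).hexdigest()


def intra_source_hash(job):
    # Assumption: covers the source and the full listing text.
    return _digest(job["source"], job["title"], job["company"],
                   job["location"], job["description"])


def inter_source_hash(job):
    # Assumption: covers only fields likely to agree across sources.
    return _digest(job["title"], job["company"], job["location"])


def _words(text):
    return re.findall(r"\w+", text.lower())


def _common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def share_run(text_a, text_b, min_words):
    """True if the two texts share at least min_words contiguous words."""
    wa, wb = _words(text_a), _words(text_b)
    # Naive word-level suffix list; each suffix is tagged with its document.
    suffixes = [(wa[i:], 0) for i in range(len(wa))]
    suffixes += [(wb[i:], 1) for i in range(len(wb))]
    suffixes.sort()
    # The longest shared run always shows up between two suffixes that are
    # adjacent in sorted order and come from different documents.
    for (s1, d1), (s2, d2) in zip(suffixes, suffixes[1:]):
        if d1 != d2 and _common_prefix_len(s1, s2) >= min_words:
            return True
    return False


def find_duplicates(jobs, min_words=8):
    """Return indices of listings judged duplicates of an earlier listing."""
    duplicates = set()
    seen_intra = set()
    candidates = defaultdict(list)
    for i, job in enumerate(jobs):
        h = intra_source_hash(job)
        if h in seen_intra:          # same intra-source hash: duplicate
            duplicates.add(i)
            continue
        seen_intra.add(h)
        candidates[inter_source_hash(job)].append(i)
    for group in candidates.values():  # same inter-source hash: candidates
        kept = []
        for i in group:
            if any(share_run(jobs[k]["description"], jobs[i]["description"],
                             min_words) for k in kept):
                duplicates.add(i)
            else:
                kept.append(i)
    return duplicates
```

Sorting whole word lists is quadratic in the worst case, so a production system would more plausibly use a proper suffix-array construction. The sketch only shows how the hash comparison narrows the set of records that need the more expensive contiguous-word comparison.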
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.