Generating similarity scores for matching non-identical data strings
US7814107B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | May 25, 2007 |
| Grant date | Oct 12, 2010 |
| Priority date | — |
| Expiry date | Jun 13, 2028 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/93
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A system and method for determining the likelihood that two documents describe substantially similar subject matter is presented. A set of tokens is obtained for each of the two documents, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising one token from each set. For each token pair in the matrix, a similarity score is determined. Those token pairs with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
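The steps in the abstract can be sketched in Python as follows. This is an illustrative sketch only, not the patented method: the abstract does not say how tokens are produced, how a token pair is scored, or how matched scores are combined. The whitespace tokenizer, the use of `difflib.SequenceMatcher` for pair scoring, the default threshold of 0.8, the one-to-one greedy selection, and the normalization by the longer token list are all assumptions made for the example, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def tokenize(text):
    # Assumption: tokens are lowercase, whitespace-separated strings.
    return text.lower().split()


def token_similarity(a, b):
    # Assumption: difflib's character-level ratio stands in for the
    # token-pair scoring, which the abstract leaves unspecified.
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(doc_a, doc_b, threshold=0.8):
    tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
    if not tokens_a or not tokens_b:
        return 0.0

    # Matrix of token pairs: one score per (token of A, token of B).
    matrix = {
        (i, j): token_similarity(a, b)
        for (i, a), (j, b) in product(enumerate(tokens_a), enumerate(tokens_b))
    }

    # Keep only the pairs whose score is above the threshold.
    candidates = sorted(
        ((score, pair) for pair, score in matrix.items() if score > threshold),
        reverse=True,
    )

    # Assumption: each token may appear in at most one matched pair,
    # chosen greedily from the highest score down.
    used_a, used_b, matched = set(), set(), []
    for score, (i, j) in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched.append(score)

    # Assumption: the document score is the sum of matched pair scores
    # divided by the length of the longer token list, giving a value in [0, 1].
    return sum(matched) / max(len(tokens_a), len(tokens_b))
```

Under these assumptions, `document_similarity("red widget", "red widget")` yields 1.0, two documents with no token pair above the threshold yield 0.0, and near-identical spellings such as "color" and "colour" score between those extremes.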
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.