Patent · US Active

Generating similarity scores for matching non-identical data strings

US7814107B1 · kind B1 · utility

Cited by: 72
References: 23
Claims: 29
Family size: 0

Assignee

Inventors

Key dates

Filing date: May 25, 2007
Grant date: Oct 12, 2010
Priority date
Expiry date: Jun 13, 2028

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/93
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A system and method for determining the likelihood that two documents describe substantially similar subject matter is presented. A set of tokens is obtained for each of the two documents, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising one token from each set. For each token pair in the matrix, a similarity score is determined. Those token pairs with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
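
The steps in the abstract can be sketched in Python roughly as follows. This is an illustrative sketch only, not the patented method: the abstract does not say how tokens are extracted, how a token pair is scored, or how the matched scores are combined. The whitespace tokenizer, the use of difflib's SequenceMatcher ratio as the token-pair score, the 0.8 threshold, and the aggregation rule (best matched score per token of the first document, averaged over the size of the larger token set) are all assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def tokenize(text):
    # Assumption: tokens are lowercase, whitespace-delimited, de-duplicated words.
    return sorted(set(text.lower().split()))


def token_similarity(a, b):
    # Assumption: difflib's ratio stands in for the token-pair similarity score.
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(doc_a, doc_b, threshold=0.8):
    """Return (document score in [0, 1], list of matched token pairs)."""
    tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
    if not tokens_a or not tokens_b:
        return 0.0, []

    # Matrix of token pairs: every token of one set against every token of the other.
    matched = []
    for ta, tb in product(tokens_a, tokens_b):
        score = token_similarity(ta, tb)
        if score > threshold:
            # Pairs above the threshold join the set of matched tokens.
            matched.append((ta, tb, score))

    # Assumption: combine matched scores by taking the best score for each
    # token of doc_a and averaging over the larger token set.
    best = {}
    for ta, _tb, score in matched:
        best[ta] = max(best.get(ta, 0.0), score)
    doc_score = sum(best.values()) / max(len(tokens_a), len(tokens_b))
    return doc_score, matched
```

Called as document_similarity("red apple", "red apple"), the sketch returns a document score together with the matched pairs. Normalizing by the larger token set keeps the score between 0 and 1 and penalizes documents of very different vocabulary size; other aggregation rules would fit the abstract equally well.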

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.