Patent · US Active

Generating similarity scores for matching non-identical data strings

US7814107B1 · kind B1 · utility

Cited by	72

References	23

Claims	29

Family size	0

Assignee

AMAZON TECHNOLOGIES, INC. · US

Inventors

Srikanth Thirumalai · Needham, US
Egidio Loch Terra · San Mateo, US
Vijai Mohan · Seattle, US
Mark J. Tomko · Seattle, US
Grant M. Emery · Seattle, US
Aswath Manoharan · Sunnyvale, US

Key dates

Filing date	May 25, 2007
Grant date	Oct 12, 2010
Priority date	—
Expiry date	Jun 13, 2028

Classification

Technology area (CPC G)	Physics
CPC primary	G06F16/93
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A system and method for determining the likelihood of two documents describing substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
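
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name a tokenizer, a per-pair metric, a threshold value, or an aggregation rule, so everything below is an assumption made for illustration. The function names `tokenize`, `token_similarity`, and `document_similarity` are hypothetical, whitespace tokenization and `difflib.SequenceMatcher` stand in for the unspecified tokenizer and metric, the greedy one-to-one matching is an added simplification, and the final normalization by the longer token count is one possible way to combine the matched scores.

```python
from difflib import SequenceMatcher
from itertools import product


def tokenize(text):
    """Assumed tokenizer: lowercase, whitespace-delimited strings."""
    return text.lower().split()


def token_similarity(a, b):
    """Assumed per-pair score in [0, 1]; the abstract names no metric."""
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(doc_a, doc_b, threshold=0.8):
    """Return (document score, matched token pairs) for two documents."""
    tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
    if not tokens_a or not tokens_b:
        return 0.0, []

    # Matrix of token pairs: one cell per (token from A, token from B).
    matrix = {
        (i, j): token_similarity(a, b)
        for (i, a), (j, b) in product(enumerate(tokens_a), enumerate(tokens_b))
    }

    # Select pairs scoring above the threshold, best first.
    # Assumption: each token joins at most one matched pair (greedy one-to-one).
    matched, used_a, used_b = [], set(), set()
    for (i, j), score in sorted(matrix.items(), key=lambda kv: -kv[1]):
        if score <= threshold:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched.append((tokens_a[i], tokens_b[j], score))

    # Assumed aggregation: sum of matched scores over the longer token count.
    total = sum(score for _, _, score in matched)
    return total / max(len(tokens_a), len(tokens_b)), matched
```

For example, `document_similarity("red running shoes", "red runing shoe")` would pair each token with its near-identical counterpart and return a score close to, but below, 1. Restricting each token to one matched pair keeps the document score within [0, 1]; a variant that keeps every pair above the threshold would need a different normalization.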

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.