Patent · US Active

Identifying potential duplicates of a document in a document corpus

US7895225B1 · kind B1 · utility

13 Cited by

23 References

39 Claims

0 Family size

Assignee

AMAZON TECHNOLOGIES, INC. · US

Inventors

Srikanth Thirumalai · Needham, US
Aswath Manoharan · Sunnyvale, US
Mark J. Tomko · Seattle, US
Grant M. Emery · Seattle, US
Vijai Mohan · Seattle, US

Key dates

Filing date	Dec 6, 2007
Grant date	Feb 22, 2011
Priority date	—
Expiry date	Feb 23, 2029

Classification

Technology area (CPC G)	Physics
CPC primary	G06F16/355
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

According to aspects of the disclosed subject matter, a method is provided for identifying a set of documents from a document corpus that are potential duplicates of a source document. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents is selected according to whether their generated document scores satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
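
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify the scoring formula, how scores from different results sets are combined, or the selection criteria, so the sketch assumes a reciprocal-rank score (1 / ordinal position), sums scores across results sets, and selects documents whose total meets a threshold. The names `potential_duplicates`, `run_query`, `min_score`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def potential_duplicates(
    queries: Iterable[str],
    run_query: Callable[[str], List[str]],
    min_score: float = 1.0,
) -> Dict[str, float]:
    """Return {doc_id: score} for documents whose total score meets min_score.

    run_query is any callable that executes one query against the corpus
    and returns document ids ordered best match first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for query in queries:
        results = run_query(query)  # one ordered results set per query
        for position, doc_id in enumerate(results, start=1):
            # Assumed scoring rule: reciprocal of the ordinal position,
            # summed over every results set in which the document appears.
            scores[doc_id] += 1.0 / position
    # Assumed selection criterion: total score at or above a threshold.
    return {doc: s for doc, s in scores.items() if s >= min_score}


# Toy lookup standing in for a real search index (hypothetical data).
_TOY_RESULTS = {
    "title:blue widget": ["doc7", "doc2", "doc9"],
    "brand:acme widget": ["doc7", "doc9"],
    "sku:12345": ["doc2", "doc7"],
}


def toy_run_query(query: str) -> List[str]:
    return _TOY_RESULTS.get(query, [])


selected = potential_duplicates(list(_TOY_RESULTS), toy_run_query, min_score=1.0)
for doc_id, score in sorted(selected.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 3))
```

In this toy run, a document that ranks near the top of several results sets accumulates a high total, while a document that appears only in lower positions falls below the threshold and is dropped. A different scoring rule or selection criterion could be substituted without changing the overall structure.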

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.