Patent · US Active

Identifying potential duplicates of a document in a document corpus

US9195714B1 · kind B1 · utility

Cited by	2

References	26

Claims	20

Family size	0

Assignee

AMAZON TECHNOLOGIES, INC. · US

Inventors

Srikanth Thirumalai · Needham, US
Aswath Manoharan · Sunnyvale, US
Mark J. Tomko · Seattle, US
Grant M. Emery · Seattle, US
Vijai Mohan · Seattle, US

Key dates

Filing date	Feb 17, 2011
Grant date	Nov 24, 2015
Priority date	—
Expiry date	Jul 10, 2031

Classification

Technology area (CPC G)	Physics
CPC primary	G06F16/355
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

According to aspects of the disclosed subject matter, a method is provided for identifying a set of documents from a document corpus that are potential duplicates of a source document. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
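The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify how queries are derived from the source document, how position maps to a score, how scores from different results sets are combined, or what the selection criteria are. The sketch therefore assumes a hypothetical `search` callable that returns document IDs in ranked order, a reciprocal-position score, summation of scores across results sets, and a simple minimum-score threshold (`min_score`). All of these names and choices are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical search interface (an assumption, not from the patent):
# takes a query string and returns document IDs ordered best match first.
SearchFn = Callable[[str], List[str]]


def position_score(position: int) -> float:
    """Assumed scoring rule: reciprocal of the 1-based ordinal position."""
    return 1.0 / position


def find_potential_duplicates(
    queries: List[str],
    search: SearchFn,
    min_score: float = 1.0,
) -> List[str]:
    """Return IDs of documents whose combined score meets the threshold."""
    totals: Dict[str, float] = defaultdict(float)
    for query in queries:
        results = search(query)  # ordered results set for this query
        for position, doc_id in enumerate(results, start=1):
            # Assumption: scores from different results sets are summed.
            totals[doc_id] += position_score(position)
    # Assumed selection criterion: combined score of at least min_score.
    selected = [doc_id for doc_id, score in totals.items() if score >= min_score]
    # Highest score first; ties broken by document ID for determinism.
    selected.sort(key=lambda doc_id: (-totals[doc_id], doc_id))
    return selected


# Toy usage with canned results standing in for a real search index.
canned_results = {
    "title:blue widget": ["doc_a", "doc_b", "doc_c"],
    "brand:acme": ["doc_b", "doc_a"],
    "sku:123": ["doc_a"],
}


def toy_search(query: str) -> List[str]:
    return canned_results.get(query, [])


candidates = find_potential_duplicates(list(canned_results), toy_search)
print(candidates)  # prints ['doc_a', 'doc_b']
```

In this toy run, `doc_a` accumulates 1 + 1/2 + 1 = 2.5 and `doc_b` accumulates 1/2 + 1 = 1.5, so both clear the assumed threshold of 1.0, while `doc_c` (1/3) does not. A real system would likely tune the scoring rule and threshold against known duplicate pairs.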

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.