Patent · US Active

Identifying potential duplicates of a document in a document corpus

US9195714B1 · kind B1 · utility

Cited by: 2
References: 26
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Feb 17, 2011
Grant date: Nov 24, 2015
Priority date
Expiry date: Jul 10, 2031

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/355
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

According to aspects of the disclosed subject matter, a method is provided for identifying a set of documents from a document corpus that are potential duplicates of a source document. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list is executed on the document corpus, and the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated based on that document's ordinal position in its results set. A subset of the identified documents is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
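
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and the abstract does not specify the details it fills in. The reciprocal-rank scoring function, the summing of scores across results sets, the `min_score` threshold used as the selection criterion, and the names `score_by_position`, `find_potential_duplicates`, and `search` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def score_by_position(rank: int) -> float:
    """Score a document from its 1-based ordinal position in a results set.

    Assumption: reciprocal rank. The abstract only says the score is
    based on ordinal position, not which function is used.
    """
    return 1.0 / rank


def find_potential_duplicates(
    queries: Sequence[str],
    search: Callable[[str], List[str]],
    min_score: float = 1.0,
) -> List[Tuple[str, float]]:
    """Run each query, score documents by rank, and select by threshold.

    `search` is a hypothetical callable that returns an ordered list of
    document ids for a query. Summing scores across results sets and the
    `min_score` threshold are illustrative assumptions.
    """
    scores: Dict[str, float] = defaultdict(float)
    for query in queries:
        results = search(query)  # ordered results set for this query
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += score_by_position(rank)

    # Keep documents whose score satisfies the selection criterion,
    # highest score first (ties broken by document id).
    selected = [(d, s) for d, s in scores.items() if s >= min_score]
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return selected


# Toy usage: a dictionary stands in for the document corpus index.
toy_index = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a"],
    "q3": ["a", "d"],
}
candidates = find_potential_duplicates(list(toy_index), toy_index.__getitem__)
print(candidates)  # [('a', 2.5), ('b', 1.5)]
```

In this toy run, documents that appear near the top of several results sets accumulate the highest scores, which matches the intuition that a near-duplicate of the source document should rank highly for most queries derived from it.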

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.