Patent · US Active

Identifying potential duplicates of a document in a document corpus

US7895225B1 · kind B1 · utility

Cited by: 13
References: 23
Claims: 39
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 6, 2007
Grant date: Feb 22, 2011
Priority date
Expiry date: Feb 23, 2029

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/355
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

According to aspects of the disclosed subject matter, a method is provided for identifying a set of documents from a document corpus that are potential duplicates of a source document. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list is executed on the document corpus, and the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated based on the document's ordinal position in its results set. A subset of the identified documents is selected: those whose generated document scores satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
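
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract states only that a score depends on a document's ordinal position in its results set, so the reciprocal-rank formula, the summing of scores across queries, and the threshold used as the selection criterion are all assumptions made for illustration. The names `potential_duplicates`, `search`, and `min_score` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def potential_duplicates(
    queries: Iterable[str],
    search: Callable[[str], Sequence[str]],
    min_score: float = 1.0,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs for likely duplicates, best first.

    `search` is a hypothetical callable that runs one query against the
    corpus and returns document ids in ranked order.
    """
    scores: Dict[str, float] = defaultdict(float)

    # Execute every query; each one yields an ordered results set.
    for query in queries:
        for position, doc_id in enumerate(search(query), start=1):
            # Assumed scoring: reciprocal of the ordinal position,
            # accumulated across all results sets the document appears in.
            scores[doc_id] += 1.0 / position

    # Assumed selection criterion: total score at or above a threshold.
    selected = [(doc_id, s) for doc_id, s in scores.items() if s >= min_score]

    # Highest score first; ties broken by document id for stable output.
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return selected


if __name__ == "__main__":
    # Toy corpus: precomputed ranked results for three queries.
    toy_results = {
        "q1": ["a", "b", "c"],
        "q2": ["a", "c"],
        "q3": ["b", "a"],
    }
    for doc_id, score in potential_duplicates(toy_results, toy_results.__getitem__):
        print(doc_id, score)
```

In this sketch a document that ranks highly for several queries derived from the source document accumulates a larger score, which is one plausible way to turn ordinal positions into a duplicate signal. Other rank-based formulas or selection criteria would fit the abstract equally well.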

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.