Identifying potential duplicates of a document in a document corpus
US7895225B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 6, 2007 |
| Grant date | Feb 22, 2011 |
| Priority date | — |
| Expiry date | Feb 23, 2029 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/355
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results sets is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
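The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several pieces are assumptions made only for illustration: the `toy_search` ranker (orders documents by term overlap), the reciprocal-rank scoring `1 / position`, the summing of scores across results sets, and a simple score `threshold` standing in for the "predetermined selection criteria". The abstract specifies only that the score depends on ordinal position in the results set.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a search function maps a query to an ordered list of document IDs.
SearchFn = Callable[[str], List[str]]


def toy_search(corpus: Dict[str, str]) -> SearchFn:
    """Build a toy ranker (assumption): order documents by shared query terms."""
    def search(query: str) -> List[str]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in corpus.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                # Negative overlap so that sorting puts the best match first;
                # ties are broken by document ID.
                hits.append((-overlap, doc_id))
        return [doc_id for _, doc_id in sorted(hits)]
    return search


def potential_duplicates(
    queries: List[str], search: SearchFn, threshold: float
) -> List[Tuple[str, float]]:
    """Score documents by ordinal position in each results set, then select."""
    scores: Dict[str, float] = defaultdict(float)
    for query in queries:
        # Each query yields an ordered results set.
        for position, doc_id in enumerate(search(query), start=1):
            # Assumed scoring: reciprocal of the ordinal position,
            # accumulated across all results sets.
            scores[doc_id] += 1.0 / position
    # Assumed selection criterion: total score at or above a threshold.
    selected = [(d, s) for d, s in scores.items() if s >= threshold]
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    corpus = {
        "a": "red widget with steel frame",
        "b": "red widget with steel frame and box",
        "c": "blue gadget",
        "d": "steel frame ladder",
    }
    # Queries assumed to have been derived from the source document beforehand.
    queries = ["red widget", "steel frame"]
    for doc_id, score in potential_duplicates(queries, toy_search(corpus), 1.0):
        print(doc_id, score)
```

In this sketch a document that ranks highly for many of the source document's queries accumulates a high score, which is the intuition behind treating it as a potential duplicate. How the queries are derived from the source document is left out, since the abstract does not describe it.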
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.