System and method for detecting duplicate and similar documents
US7139756B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jan 22, 2002 |
| Grant date | Nov 21, 2006 |
| Priority date | — |
| Expiry date | Aug 4, 2023 |
Classification
- Technology area (CPC Y)Emerging Cross-Sectional Technologies
- CPC primaryY10S707/99937
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
A system and a method are described for rapidly determining document similarity among a set of documents, such as a set of documents obtained from an information retrieval (IR) system. A ranked list of the most important terms in each document is obtained using a phrase recognizer system. The list is stored in a database and is used to compute document similarity with a simple database query. If the number of terms found to not be contained in both documents is less than some predetermined threshold compared to the total number of terms in the document, these documents are determined to be very similar. It is shown that these techniques may be employed to accurately recognize that documents, that have been revised to contain parts of other documents, are still closely related to the original document. These teachings further provide for the computation of a document signature that can then be used to make a rapid comparison between documents that are likely to be identical.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.