Duplicate document detection in a web crawler system
US7627613B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Jul 3, 2003 |
| Grant date | Dec 1, 2009 |
| Priority date | — |
| Expiry date | Sep 21, 2024 |
Classification
- Technology area (CPC Y): Emerging Cross-Sectional Technologies
- CPC primary: Y10S707/99954
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the identified set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
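
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Doc`, `DuplicateIndex`, and `max_set_size` are hypothetical, exact-content matching via a SHA-256 digest is an assumption, and "highest score wins, ties broken by URL" stands in for the unspecified "set of predefined conditions".

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    score: float  # assumed query-independent metric, e.g. a link-based rank


class DuplicateIndex:
    """Hypothetical in-memory index of duplicate sets, keyed by content digest."""

    def __init__(self, max_set_size: int = 3):
        # Assumption: each duplicate set keeps only the top-scoring members.
        self.max_set_size = max_set_size
        self._sets: dict[str, list[Doc]] = {}

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: "same content" means byte-identical text.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, url: str, content: str, score: float) -> str:
        """Merge a newly crawled document into its duplicate set.

        Returns the URL of the set's representative document.
        """
        key = self.fingerprint(content)
        # Identify the existing set, if any, that shares this content.
        merged = {d.url: d for d in self._sets.get(key, [])}
        # Merge the new document in; a re-crawled URL replaces its old entry.
        merged[url] = Doc(url, score)
        # Include or exclude members based on the query-independent score.
        ranked = sorted(merged.values(), key=lambda d: (-d.score, d.url))
        self._sets[key] = ranked[: self.max_set_size]
        # Assumed condition: the top-ranked member represents the set.
        return self._sets[key][0].url

    def members(self, content: str) -> list[Doc]:
        """Return the current duplicate set for the given content."""
        return list(self._sets.get(self.fingerprint(content), []))


index = DuplicateIndex(max_set_size=2)
index.add("http://a.example/page", "same body", 0.2)
index.add("http://b.example/page", "same body", 0.9)
representative = index.add("http://c.example/page", "same body", 0.5)
```

With `max_set_size=2`, the third call drops the lowest-scoring URL from the set and leaves `http://b.example/page` as the representative. A production crawler would likely need persistent storage and some form of near-duplicate matching, neither of which the abstract describes.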
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.