Representative document selection for sets of duplicate documents in a web crawler system
US8260781B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jul 19, 2011 |
| Grant date | Sep 4, 2012 |
| Priority date | — |
| Expiry date | Jul 19, 2031 |
Classification
- Technology area (CPC Y)Emerging Cross-Sectional Technologies
- CPC primaryY10S707/99954
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.