Patent · US Active

Representative document selection for sets of duplicate documents in a web crawler system

US7984054B2 · kind B2 · utility

Cited by: 3
References: 13
Claims: 15
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 1, 2009
Grant date: Jul 19, 2011
Priority date
Expiry date: Dec 1, 2029

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99954
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
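
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (Doc, DuplicateIndex, max_members) is hypothetical. It assumes that "same content" means an identical SHA-256 hash of the page text, that the query independent metric is one numeric score per document with higher being better, and that the predefined condition for the representative is simply "highest score, ties broken by URL". The abstract specifies none of these details.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # assumed query-independent metric; higher is better


class DuplicateIndex:
    """Hypothetical index mapping a content fingerprint to its duplicate set."""

    def __init__(self, max_members: int = 3) -> None:
        # Assumed cap on how many duplicates are retained per set.
        self.max_members = max_members
        self._sets: Dict[str, List[Doc]] = {}

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: exact-content match via a cryptographic hash.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, doc: Doc) -> Doc:
        """Merge a newly crawled document into its duplicate set and
        return the representative of the resulting set."""
        key = self.fingerprint(doc.content)
        existing = self._sets.get(key, [])
        # Merge: drop any stale entry for the same URL, then add the new one.
        merged = [d for d in existing if d.url != doc.url] + [doc]
        # Rank by the metric (descending), breaking ties by URL.
        merged.sort(key=lambda d: (-d.score, d.url))
        # Include the best-scoring documents, exclude the rest.
        kept = merged[: self.max_members]
        self._sets[key] = kept
        # Assumed condition: the top-ranked document is the representative.
        return kept[0]

    def members(self, content: str) -> List[Doc]:
        """Return the current duplicate set for the given content."""
        return list(self._sets.get(self.fingerprint(content), []))
```

For example, with max_members=2, adding three documents that share the same content with scores 0.2, 0.9 and 0.5 leaves the documents scored 0.9 and 0.5 in the set, with the 0.9 document as the representative. A production crawler would more likely use a near-duplicate fingerprint and a richer set of conditions; the exact hash and single-score ranking here are only the simplest stand-ins.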

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.