Patent · US Active

Representative document selection for sets of duplicate documents in a web crawler system

US8260781B2 · kind B2 · utility

Cited by: 5
References: 15
Claims: 15
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jul 19, 2011
Grant date: Sep 4, 2012
Priority date
Expiry date: Jul 19, 2031

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99954
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
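
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how shared content is detected, which query-independent metric is used, or what the predefined conditions are, so the sketch substitutes assumptions: exact-content matching via a SHA-256 fingerprint, a numeric `score` field standing in for the metric, a hypothetical `max_set_size` cap that decides inclusion and exclusion, and "highest score wins" as the representative rule. The names `Doc` and `DuplicateIndex` are illustrative only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # assumed stand-in for a query-independent metric


class DuplicateIndex:
    """Tracks sets of same-content documents and one representative per set."""

    def __init__(self, max_set_size=3):
        # Assumption: a size cap is what includes or excludes duplicates.
        self.max_set_size = max_set_size
        self._sets = {}  # content fingerprint -> list of Doc, best first

    @staticmethod
    def fingerprint(content):
        # Assumption: documents "share the same content" when hashes match.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, doc):
        """Merge a newly crawled document and return the set's representative."""
        key = self.fingerprint(doc.content)
        existing = self._sets.get(key, [])
        # Merge the new document with the existing set, replacing any
        # earlier entry recorded for the same URL.
        merged = [d for d in existing if d.url != doc.url] + [doc]
        # Rank by score, highest first; break ties by URL for determinism.
        merged.sort(key=lambda d: (-d.score, d.url))
        # Keep the top-ranked documents and exclude the rest.
        kept = merged[: self.max_set_size]
        self._sets[key] = kept
        # Assumed condition: the highest-scoring member represents the set.
        return kept[0]

    def members(self, content):
        """Return the documents currently kept for this content."""
        return list(self._sets.get(self.fingerprint(content), []))
```

In this sketch, each call to `add` performs the lookup, the merge, the metric-based filtering, and the representative selection in one step, so the representative can change whenever a higher-scoring duplicate is crawled.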

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.