Patent · US Expired

Duplicate document detection in a web crawler system

US7627613B1 · kind B1 · utility

Cited by: 77
References: 13
Claims: 30
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jul 3, 2003
Grant date: Dec 1, 2009
Priority date
Expiry date: Sep 21, 2024

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99954
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the identified set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
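
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices here are assumptions made for illustration only: "same content" is approximated by an exact SHA-256 hash of the document body, the query-independent metric is a caller-supplied numeric score, membership is capped at a hypothetical max_members, and the "predefined condition" for the representative is simply the highest score with ties broken by URL. The names DuplicateSet and DuplicateDetector are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DuplicateSet:
    # Maps URL -> query-independent score (hypothetical metric, e.g. a link-based rank).
    members: Dict[str, float] = field(default_factory=dict)

    def ranked(self):
        # Highest score first; ties broken by URL so the order is deterministic.
        return sorted(self.members, key=lambda url: (-self.members[url], url))

    def representative(self) -> str:
        # Assumed "predefined condition": the top-ranked member represents the set.
        return self.ranked()[0]


class DuplicateDetector:
    def __init__(self, max_members: int = 3):
        self.max_members = max_members           # assumed cap on set size
        self.sets: Dict[str, DuplicateSet] = {}  # content fingerprint -> duplicate set

    def add(self, url: str, content: str, score: float) -> str:
        """Register a newly crawled document and return its set's representative URL."""
        # Assumption: identical content is detected by an exact hash of the body.
        fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()

        # Find the existing set sharing this content, if any, and merge the new document in.
        dup_set = self.sets.setdefault(fingerprint, DuplicateSet())
        dup_set.members[url] = score

        # Include or exclude members based on the query-independent score.
        keep = dup_set.ranked()[: self.max_members]
        dup_set.members = {u: dup_set.members[u] for u in keep}

        return dup_set.representative()


detector = DuplicateDetector(max_members=2)
detector.add("http://a.example/page", "same body", 0.4)
detector.add("http://b.example/page", "same body", 0.9)
print(detector.add("http://c.example/page", "same body", 0.1))
```

In this sketch the third document has the lowest score, so it is excluded from the two-member set and the higher-scored http://b.example/page remains the representative. A real crawler would likely need near-duplicate detection and persistent storage, neither of which is shown here.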

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.