Patent · US Active

Near-duplicate document detection for web crawling

US8140505B1 · kind B1 · utility

Cited by: 23
References: 7
Claims: 27
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 31, 2005
Grant date: Mar 20, 2012
Priority date
Expiry date: May 12, 2030

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/9014
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
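
The two-stage comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the 64-bit hash width, the choice of the top 16 bits as the matched sequence, the Hamming-distance threshold of 3 as the test for "substantially similar", and all names (`NearDuplicateIndex`, `prefix`, `hamming`) are assumptions made for illustration. The hash function itself is left out, so hashes are passed in as plain integers.

```python
from collections import defaultdict
from typing import Optional

HASH_BITS = 64     # assumed hash width
PREFIX_BITS = 16   # assumed length of the matched bit sequence (fewer than all bits)
MAX_DISTANCE = 3   # assumed threshold for "substantially similar"


def prefix(h: int) -> int:
    """Return the top PREFIX_BITS bits of a HASH_BITS-wide hash."""
    return h >> (HASH_BITS - PREFIX_BITS)


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class NearDuplicateIndex:
    """Stores hashes in buckets keyed by their leading bits (hypothetical helper)."""

    def __init__(self) -> None:
        self._buckets = defaultdict(list)

    def add(self, h: int) -> None:
        """Store a hash under the bucket for its leading bit sequence."""
        self._buckets[prefix(h)].append(h)

    def find_near_duplicate(self, h: int) -> Optional[int]:
        """Return a stored hash that is substantially similar to h, or None."""
        # Stage 1: only stored hashes whose leading bits match h are candidates.
        for stored in self._buckets.get(prefix(h), []):
            # Stage 2: a candidate counts as similar if few bits differ overall.
            if hamming(stored, h) <= MAX_DISTANCE:
                return stored
        return None


# Usage with hand-picked hash values (not derived from real documents).
index = NearDuplicateIndex()
original = 0xABCD0000000000FF
index.add(original)

near = original ^ 0b101   # differs in 2 low-order bits
far = original ^ 0xFF     # same leading bits, but differs in 8 bits

print(index.find_near_duplicate(near) == original)  # True
print(index.find_near_duplicate(far))               # None
```

With a single matched bit sequence, a hash that differs from a stored hash only inside that sequence is never considered as a candidate; this sketch accepts that limitation for brevity.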

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.