Near-duplicate document detection for web crawling
US8140505B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Mar 31, 2005 |
| Grant date | Mar 20, 2012 |
| Priority date | — |
| Expiry date | May 12, 2030 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/9014
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify those stored hash values with a sequence of bit positions, less than all of the bit positions, that matches a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
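The two-stage comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not state the hash width, which bit positions are compared first, or what counts as "substantially similar", so the sketch assumes 64-bit hashes, uses the top 16 bits as the partial sequence, and treats a Hamming distance of at most 3 bits as substantially similar. The names `NearDuplicateIndex`, `HASH_BITS`, `PREFIX_BITS`, and `MAX_DIFF_BITS` are hypothetical, and hash generation itself is left out: hashes are passed in as plain integers.

```python
from collections import defaultdict

# All three constants are assumptions made for illustration only.
HASH_BITS = 64      # assumed width of a document hash
PREFIX_BITS = 16    # assumed partial bit sequence: the top 16 bits
MAX_DIFF_BITS = 3   # assumed threshold for "substantially similar"


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


class NearDuplicateIndex:
    """Stores hashes bucketed by their leading bits (hypothetical helper)."""

    def __init__(self) -> None:
        self._buckets = defaultdict(list)

    @staticmethod
    def _prefix(hash_value: int) -> int:
        # Keep only the top PREFIX_BITS bits of the hash.
        return hash_value >> (HASH_BITS - PREFIX_BITS)

    def add(self, doc_id: str, hash_value: int) -> None:
        self._buckets[self._prefix(hash_value)].append((doc_id, hash_value))

    def find_near_duplicate(self, hash_value: int):
        # Stage 1: select stored hashes whose partial bit sequence matches.
        candidates = self._buckets.get(self._prefix(hash_value), [])
        # Stage 2: check each candidate for overall similarity.
        for doc_id, stored in candidates:
            if hamming_distance(stored, hash_value) <= MAX_DIFF_BITS:
                return doc_id
        return None


index = NearDuplicateIndex()
index.add("doc-a", 0xABCD0000000000FF)
# Same top 16 bits, one differing bit overall: reported as a near-duplicate.
match = index.find_near_duplicate(0xABCD0000000000FE)  # "doc-a"
```

In this sketch, a single partial sequence misses hashes that differ inside the compared bits, even when they are otherwise close. The abstract does not say how that case is handled.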
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.