Method for clustering closely resembling data objects
US6349296B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 21, 2000 |
| Grant date | Feb 19, 2002 |
| Priority date | — |
| Expiry date | Aug 21, 2020 |
Classification
- Technology area (CPC Y): Emerging Cross-Sectional Technologies
- CPC primary: Y10S707/99944
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. For each of a plurality of pseudo-random permutations of the set of all fingerprints, a minimum element is selected from the image of the set of fingerprints associated with a document under that permutation; the selected elements form a sketch of the data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
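The steps in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the whitespace tokenizer, the shingle width `w`, the BLAKE2b-based `fingerprint`, the number of permutations, the group size, and all function names are hypothetical choices. In particular, a random linear map modulo a prime stands in for the pseudo-random permutations of the fingerprint space.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the stand-in "permutations" (assumption)

def tokenize(text):
    # Assumption: tokens are lowercase whitespace-separated words.
    return text.lower().split()

def shingles(tokens, w=4):
    # Overlapping runs of w consecutive tokens; a short document yields one shingle.
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def fingerprint(item):
    # 64-bit fingerprint; BLAKE2b is a stand-in for the fingerprinting scheme.
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_permutations(k, seed=0):
    # Each (a, b) pair defines f -> (a*f + b) mod PRIME, approximating a permutation.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def sketch(text, perms, w=4):
    # One minimum per permutation, taken over the images of the document's fingerprints.
    fps = {fingerprint(s) for s in shingles(tokenize(text), w)}
    return [min((a * f + b) % PRIME for f in fps) for a, b in perms]

def features(sk, group_size):
    # Partition the sketch into consecutive groups and fingerprint each group.
    return [fingerprint(tuple(sk[i:i + group_size]))
            for i in range(0, len(sk), group_size)]

def resemblance(sk1, sk2):
    # Fraction of permutations whose minima agree; estimates shingle-set overlap.
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

def shared_features(f1, f2):
    # Count of positions at which two documents have the same feature.
    return sum(x == y for x, y in zip(f1, f2))
```

In use, two documents would be sketched with the same permutation list, and a pair would be flagged as nearly identical when `shared_features` exceeds a chosen threshold. Comparing a handful of features per document is cheaper than comparing full sketches, which is the apparent motivation for the grouping step.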
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.