Patent · US Expired

Method for clustering closely resembling data objects

US6349296B1 · kind B1 · utility

Cited by: 194
References: 8
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Aug 21, 2000
Grant date: Feb 19, 2002
Priority date
Expiry date: Aug 21, 2020

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99944
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. For each of a plurality of pseudo-random permutations of the set of all fingerprints, the minimum element of the image of the document's fingerprint set under that permutation is selected; together these minima form a sketch of the data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
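
The pipeline described in the abstract (tokens, shingles, fingerprints, min-wise sketch, grouped features) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: whitespace tokenization, the shingle width `w`, the BLAKE2b hash used as the fingerprint, the linear maps modulo a prime standing in for the pseudo-random permutations, and the sketch and group sizes are all hypothetical choices that the abstract does not specify.

```python
import hashlib
import random

P = (1 << 61) - 1  # Mersenne prime; x -> (a*x + b) % P is a permutation of 0..P-1 when a != 0


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def shingles(tokens, w=4):
    # Overlapping runs of w consecutive tokens (w is a hypothetical choice).
    return {" ".join(tokens[i:i + w]) for i in range(max(1, len(tokens) - w + 1))}


def fingerprint(s):
    # Assumption: 64-bit BLAKE2b digest as a stand-in fingerprint function.
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def make_permutations(n, seed=0):
    # n random linear maps modulo P, standing in for pseudo-random permutations.
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]


def sketch(text, perms, w=4):
    # Minimum image of the fingerprint set under each permutation.
    fps = {fingerprint(s) % P for s in shingles(tokenize(text), w)}
    return [min((a * f + b) % P for f in fps) for a, b in perms]


def features(sk, group_size):
    # Partition the sketch into groups and fingerprint each group.
    return [
        fingerprint(f"{i}:" + ",".join(map(str, sk[i:i + group_size])))
        for i in range(0, len(sk), group_size)
    ]


def resemblance_estimate(sk1, sk2):
    # Fraction of sketch positions that agree.
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)


def shared_features(f1, f2):
    # Number of feature positions that agree.
    return sum(x == y for x, y in zip(f1, f2))


perms = make_permutations(12)
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "completely different words appear in this other sentence here"
sk_a, sk_b = sketch(doc_a, perms), sketch(doc_b, perms)
feat_a, feat_b = features(sk_a, 4), features(sk_b, 4)
```

With these helpers, two objects would be flagged as near-duplicates when `shared_features` exceeds a chosen threshold; the threshold, like the other parameters, is a tuning choice not given in the abstract.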

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.