Patent · US Expired

Method for clustering closely resembling data objects

US6119124A · kind A · utility

Cited by: 254
References: 7
Claims: 24
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 26, 1998
Grant date: Sep 12, 2000
Priority date
Expiry date: Mar 26, 2018

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99944
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. To generate a sketch of each data object, a minimum element is selected from each of the images of the set of fingerprints associated with a document under each of a plurality of pseudo-random permutations of the set of all fingerprints. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
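
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: tokens are whitespace-separated lowercase words, a shingle is a window of w = 4 tokens, fingerprints are 56-bit truncated BLAKE2b digests (the abstract does not name a fingerprint function), and each pseudo-random permutation is approximated by a random linear map modulo a prime. All function names and the sizes used (84 permutations, groups of 14) are hypothetical.

```python
import hashlib
import random

# Assumption: a Mersenne prime larger than any 56-bit fingerprint, so that
# x -> (a*x + b) % PRIME with a != 0 maps distinct fingerprints to distinct values.
PRIME = (1 << 61) - 1


def tokenize(text):
    """Partition a data object into a sequence of tokens (assumed: words)."""
    return text.lower().split()


def shingles(tokens, w=4):
    """Group tokens into overlapping windows of w consecutive tokens."""
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def fingerprint(text):
    """56-bit fingerprint; BLAKE2b is a stand-in chosen for this sketch."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=7).digest()
    return int.from_bytes(digest, "big")


def make_permutations(count, seed=0):
    """Draw (a, b) pairs that stand in for pseudo-random permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(count)]


def sketch(text, perms, w=4):
    """Keep the minimum image of the fingerprint set under each permutation."""
    fps = {fingerprint(" ".join(s)) for s in shingles(tokenize(text), w)}
    if not fps:
        raise ValueError("cannot sketch an empty data object")
    return [min((a * x + b) % PRIME for x in fps) for a, b in perms]


def features(sk, group_size):
    """Partition a sketch into groups and fingerprint each group."""
    groups = [sk[i:i + group_size] for i in range(0, len(sk), group_size)]
    return [fingerprint(",".join(map(str, g))) for g in groups]


def estimated_resemblance(sk1, sk2):
    """Fraction of permutations whose minimum elements agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)


def shared_features(f1, f2):
    """Count positions at which two feature lists agree."""
    return sum(x == y for x, y in zip(f1, f2))
```

To compare two objects, both must be sketched with the same list of permutations. Under these assumptions, the objects would be estimated to be nearly identical when shared_features exceeds a chosen threshold; the abstract does not state a threshold value, so any value used here would be an illustration only.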

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.