Patent · US Active

Clustering of near-duplicate documents

US9355171B2 · kind B2 · utility

Cited by: 1
References: 9
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Aug 27, 2010
Grant date: May 31, 2016
Priority date
Expiry date: May 24, 2031

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/355
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
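
The two-stage process in the abstract can be illustrated with a minimal sketch. It is not the patented method. Everything specific here is an assumption made for illustration: the hashed word-count vector (`doc_vector`), the use of L1 distance between vectors as a stand-in for edit distance (`vector_distance`), the greedy first-fit assignment to roots, the thresholds `d1` and `d2`, and the triangle-inequality test in `can_merge`.

```python
import zlib
from dataclasses import dataclass, field

DIM = 64  # assumed size of the low-dimensional vector space


def doc_vector(text, dim=DIM):
    """Hypothetical document vector: word counts hashed into `dim` buckets."""
    vec = [0] * dim
    for word in text.lower().split():
        # crc32 gives a hash that is stable across runs, unlike built-in hash()
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1
    return vec


def vector_distance(u, v):
    """Assumed stand-in for edit distance: L1 distance between vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


@dataclass
class Cluster:
    root: list                      # vector of the cluster's root document
    members: list = field(default_factory=list)
    radius: int = 0                 # largest root-to-member distance seen


def initial_clusters(docs, d1):
    """First constraint: each member lies within d1 of its cluster's root."""
    clusters = []
    for doc in docs:
        vec = doc_vector(doc)
        for cluster in clusters:
            dist = vector_distance(vec, cluster.root)
            if dist <= d1:
                cluster.members.append(doc)
                cluster.radius = max(cluster.radius, dist)
                break
        else:
            # No existing root is close enough, so this document starts a cluster
            clusters.append(Cluster(root=vec, members=[doc]))
    return clusters


def can_merge(a, b, d2):
    """Second constraint, checked from cluster structure (root, radius) only.

    By the triangle inequality, any two documents in the merged cluster are
    at most max(2*a.radius, 2*b.radius, root distance + both radii) apart.
    The check is conservative: it may reject merges that would be valid.
    """
    cross = vector_distance(a.root, b.root) + a.radius + b.radius
    return cross <= d2 and 2 * a.radius <= d2 and 2 * b.radius <= d2


docs = [
    "the quick brown fox jumps",
    "the quick brown fox leaps",
    "an unrelated and much longer text about patent claims and filing dates",
]
clusters = initial_clusters(docs, d1=2)
```

In this sketch the merge test never revisits individual documents: it reads only each cluster's root vector and radius, which is one way a constraint on the maximum pairwise distance could be decided by comparing cluster structures.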

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.