Clustering of near-duplicate documents
US9355171B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 27, 2010 |
| Grant date | May 31, 2016 |
| Priority date | — |
| Expiry date | May 24, 2031 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/355
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
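The two-stage process described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented method: the abstract does not specify the vector construction, the distance formula, or the cluster structure, so every one of those is an assumption here. The sketch assumes a hashed word-count vector of 16 dimensions (`doc_vector`), an L1 difference between vectors as the edit-distance estimate (`distance`), and a hypothetical `Cluster` record that stores a root vector, a radius, and a diameter bound. The merge test uses the triangle inequality, so it only compares cluster records and never revisits individual documents. It is conservative: it may refuse a merge that would in fact be valid. The diameter guarantee for unmerged initial clusters holds only if `d2 >= 2 * d1`.

```python
from dataclasses import dataclass, field
import zlib

DIM = 16  # assumed dimensionality; the abstract only says "relatively low-dimensional"


def doc_vector(text, dim=DIM):
    """Assumed representation: word counts hashed into `dim` buckets."""
    vec = [0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1
    return tuple(vec)


def distance(u, v):
    """Assumed edit-distance estimate: L1 difference between document vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


@dataclass
class Cluster:  # hypothetical cluster structure
    root: tuple                                   # vector of the root document
    members: list = field(default_factory=list)   # document indices
    radius: int = 0     # upper bound on distance from root to any member
    diameter: int = 0   # upper bound on distance between any two members


def initial_clusters(vectors, d1):
    """Stage 1: attach each document to the first root within d1, else start a new cluster."""
    clusters = []
    for i, vec in enumerate(vectors):
        for c in clusters:
            d = distance(c.root, vec)
            if d <= d1:
                c.members.append(i)
                c.radius = max(c.radius, d)
                c.diameter = 2 * c.radius  # triangle-inequality bound via the root
                break
        else:
            clusters.append(Cluster(root=vec, members=[i]))
    return clusters


def merge_clusters(clusters, d2):
    """Stage 2: merge clusters while the bound on any pairwise distance stays <= d2."""
    merged = []
    for c in clusters:
        for m in merged:
            gap = distance(m.root, c.root)
            # Any cross-cluster pair is at most gap + m.radius + c.radius apart.
            bound = max(m.diameter, c.diameter, gap + m.radius + c.radius)
            if bound <= d2:
                m.members.extend(c.members)
                m.radius = max(m.radius, gap + c.radius)
                m.diameter = bound
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius, c.diameter))
    return merged


# Usage with hand-written 2-dimensional vectors (made-up data for illustration).
vectors = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (3, 0)]
first = initial_clusters(vectors, d1=1)
final = merge_clusters(first, d2=4)
print([c.members for c in first])  # [[0, 1, 2], [3, 4], [5]]
print([c.members for c in final])  # [[0, 1, 2, 5], [3, 4]]
```

In the usage example, document 5 is too far from the first root to join it in stage 1, but the cluster-level bound (root gap 3 plus radii 1 and 0) shows that no pair in the combined cluster can exceed the second threshold of 4, so the two clusters merge without any document-to-document comparison.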
Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.