Estimating document similarity using bit-strings
US8594239B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Feb 21, 2011 |
| Grant date | Nov 26, 2013 |
| Priority date | — |
| Expiry date | Feb 16, 2032 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/316
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Each of a plurality of documents is divided into samples. Small bit-strings are generated for selected samples from each document and used to create a sketch for that document. Because the bit-strings are small (e.g., only one, two, or three bits in length), the resulting sketches are smaller than those produced by previous sketching methods and therefore use less storage space. The generated sketches are compared to determine which documents are near-duplicates of one another.
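The idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how samples are formed or selected, so this example assumes word 3-grams as samples, min-hash selection under seeded BLAKE2b hashes, and a standard chance-collision correction when comparing. All function names and parameters (`shingles`, `sketch`, `estimate_similarity`, `num_hashes`, `bits`, `k`) are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Split a document into samples (assumption: overlapping word k-grams)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(sample, seed):
    """64-bit hash of a sample under a given seed (assumption: BLAKE2b)."""
    data = f"{seed}:{sample}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, num_hashes=256, bits=1, k=3):
    """Build a sketch: for each seed, select the sample with the smallest
    hash and keep only the lowest `bits` bits of that hash."""
    mask = (1 << bits) - 1
    samples = shingles(text, k)
    return [min(_hash(s, seed) for s in samples) & mask
            for seed in range(num_hashes)]


def estimate_similarity(a, b, bits=1):
    """Estimate set similarity from two sketches of equal length.

    Short bit-strings agree by chance about 2**-bits of the time, so the
    raw match rate is rescaled to remove that baseline (an approximation).
    """
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    match = sum(x == y for x, y in zip(a, b)) / len(a)
    chance = 1.0 / (1 << bits)
    return max(0.0, (match - chance) / (1.0 - chance))
```

With these assumed defaults, each document is summarized by 256 one-bit values, which could be packed into 32 bytes. Smaller bit-strings save space but raise the chance-collision rate, so more of them are needed for a given accuracy; the numbers here are illustrative only.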
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.