Identification of high deduplication data
US9984092B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 16, 2017 |
| Grant date | May 29, 2018 |
| Priority date | — |
| Expiry date | Aug 16, 2037 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/1748
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computer-implemented method includes dividing a data set into a plurality of regions and dividing the plurality of regions into a plurality of chunks of fixed size. The computer-implemented method further includes determining a sample size of the plurality of chunks to be sampled for each region, wherein the sample size is determined based, at least in part, on an acceptance of a likelihood of identifying at least one collision between two regions corresponding to logical entities of a first cluster of logical entities. The computer-implemented method further includes sampling the plurality of chunks for each region based on the determined sample size. The computer-implemented method further includes generating a hash value for each chunk sampled and storing each hash value in an index. The computer-implemented method further includes identifying one or more collisions between the plurality of regions. A corresponding computer system and computer program product are also disclosed.
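The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation. The function names, the SHA-256 hash, the seeded choice of identical chunk positions in every region, and the sample-size model are all hypothetical. In particular, `sample_size_for` assumes two regions share a fraction `shared_fraction` of their chunks at matching positions, so that sampling k positions finds at least one collision with probability 1 - (1 - shared_fraction)^k. The abstract does not state the actual formula.

```python
import hashlib
import math
import random
from collections import defaultdict
from itertools import combinations


def sample_size_for(shared_fraction, target_probability):
    """Smallest k with 1 - (1 - shared_fraction)**k >= target_probability.

    Assumed model (not taken from the patent): each sampled position
    independently matches with probability `shared_fraction`.
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < target_probability < 1:
        raise ValueError("target_probability must be in (0, 1)")
    if shared_fraction == 1:
        return 1
    return math.ceil(
        math.log(1 - target_probability) / math.log(1 - shared_fraction)
    )


def find_collisions(data, region_size, chunk_size, sample_size, seed=0):
    """Return sorted pairs of region numbers that share a sampled chunk hash."""
    # Divide the data set into regions.
    regions = [
        data[i:i + region_size] for i in range(0, len(data), region_size)
    ]

    # Pick the same chunk positions for every region (an assumption), so
    # that identical regions yield identical sampled hashes.
    chunks_per_region = region_size // chunk_size
    rng = random.Random(seed)
    positions = rng.sample(
        range(chunks_per_region), min(sample_size, chunks_per_region)
    )

    # Hash each sampled fixed-size chunk and store it in an index that
    # maps hash value -> set of region numbers.
    index = defaultdict(set)
    for region_id, region in enumerate(regions):
        for pos in positions:
            chunk = region[pos * chunk_size:(pos + 1) * chunk_size]
            if len(chunk) < chunk_size:
                continue  # skip a short trailing chunk
            index[hashlib.sha256(chunk).hexdigest()].add(region_id)

    # A collision is any hash value seen in two or more distinct regions.
    collisions = set()
    for region_ids in index.values():
        collisions.update(combinations(sorted(region_ids), 2))
    return sorted(collisions)
```

For example, with a data set built from region A, a different region B, and a copy of A, `find_collisions(data, 1024, 64, sample_size_for(0.5, 0.9))` reports a collision between the first and third regions only. Sampling a handful of chunks per region keeps the index small compared with hashing every chunk, which is the trade-off the sample size controls.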
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.