Sampling-based deduplication estimation
US10198455B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jan 13, 2016 |
| Grant date | Feb 5, 2019 |
| Priority date | — |
| Expiry date | Dec 15, 2036 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/137
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the storage units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
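The first steps of the abstract (partitioning, sampling, hashing, and the first two histograms) can be illustrated with a minimal Python sketch. It is not the patented implementation: fixed-size units, SHA-256 as the hash, uniform random sampling without replacement, and the names `partition`, `sample_histograms`, and `ratio_from_histogram` are all assumptions made for illustration. The target function and its minimization are not specified in the abstract, so they are left out; `ratio_from_histogram` only shows how a ratio could be read off a histogram of duplication counts, assuming the ratio is defined as distinct units divided by total units.

```python
import hashlib
import random
from collections import Counter


def partition(dataset, unit_size):
    """Split a bytes object into fixed-size data units (the last may be shorter)."""
    return [dataset[i:i + unit_size] for i in range(0, len(dataset), unit_size)]


def sample_histograms(dataset, unit_size, sampling_ratio, seed=0):
    """Return (number of units, first histogram, second histogram) for a sample.

    Assumption: units are sampled uniformly without replacement.
    """
    units = partition(dataset, unit_size)
    n_sample = max(1, round(len(units) * sampling_ratio))
    sampled = random.Random(seed).sample(units, n_sample)

    # First histogram: hash value -> duplication count within the sample.
    first = Counter(hashlib.sha256(u).hexdigest() for u in sampled)

    # Second histogram: duplication count -> number of hash values
    # observed with exactly that count.
    second = Counter(first.values())
    return len(units), first, second


def ratio_from_histogram(histogram):
    """Assumed definition: distinct units / total units.

    `histogram` maps a duplication count to the number of distinct
    units that occur that many times.
    """
    distinct = sum(histogram.values())
    total = sum(count * freq for count, freq in histogram.items())
    return distinct / total


if __name__ == "__main__":
    data = b"AAAA" * 3 + b"BBBB" * 2 + b"CCCC"
    n_units, first, second = sample_histograms(data, unit_size=4, sampling_ratio=1.0)
    print(n_units, dict(second), ratio_from_histogram(second))
```

With a sampling ratio of 1.0 the second histogram describes the whole dataset, so the ratio can be computed from it directly. At lower sampling ratios the observed histogram underestimates duplication, which is why the method derives a third, predicted histogram for the full set of data units before computing the ratio.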
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.