Patent · US Active

Sampling-based deduplication estimation

US10198455B2 · kind B2 · utility

Cited by: 2
References: 5
Claims: 21
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jan 13, 2016
Grant date: Feb 5, 2019
Priority date
Expiry date: Dec 15, 2036

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/137
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the storage units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
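
The first steps of the abstract (partitioning, sampling, hashing, and the first two histograms) can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `sample_histograms`, the fixed-size partitioning, the use of SHA-256, and the seeded random sampler are all assumptions made for illustration. The target function and the third histogram that minimizes it are not specified in enough detail here to reproduce, so the sketch stops before that step.

```python
import hashlib
import random
from collections import Counter


def sample_histograms(data: bytes, unit_size: int, sampling_ratio: float, seed: int = 0):
    """Hypothetical helper: build the first and second histograms from a sample.

    Assumptions (not from the patent text): fixed-size data units,
    SHA-256 as the hash, and uniform sampling without replacement.
    """
    # Partition the dataset into a first number of data units.
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    n_total = len(units)

    # Select a second number of data units based on the sampling ratio.
    n_sample = min(n_total, max(1, round(n_total * sampling_ratio)))
    rng = random.Random(seed)
    sampled = rng.sample(units, n_sample)

    # First histogram: hash value -> duplication count within the sample.
    first_hist = Counter(hashlib.sha256(u).hexdigest() for u in sampled)

    # Second histogram: duplication count -> number of hash values
    # observed with that count.
    second_hist = Counter(first_hist.values())

    return n_total, first_hist, second_hist


if __name__ == "__main__":
    data = b"AAAA" + b"BBBB" + b"AAAA" + b"CCCC"
    n_total, first_hist, second_hist = sample_histograms(data, 4, 1.0)
    print(n_total, dict(second_hist))
```

With a sampling ratio of 1.0 the second histogram describes the whole dataset directly. With a smaller ratio it only describes the sample, which is why the abstract goes on to derive a third histogram for the full set of data units before computing the deduplication ratio.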

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.