Patent · US Active

Identification of high deduplication data

US9984092B1 · kind B1 · utility

Cited by	4
References	18
Claims	1
Family size	0

Assignee

International Business Machines Corporation · US

Inventors

Danny Harnik · Tel Mond, IL
Ety Khaitzin · Petah Tikva, IL
Sergey Marenkov · Yehud Monosson, IL
Dmitry Sotnikov · Rishon LeZion, IL

Key dates

Filing date	Aug 16, 2017
Grant date	May 29, 2018
Priority date	—
Expiry date	Aug 16, 2037

Classification

Technology area (CPC G)	Physics
CPC primary	G06F16/1748
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A computer-implemented method includes dividing a data set into a plurality of regions and dividing the plurality of regions into a plurality of chunks of fixed size. The computer-implemented method further includes determining a sample size of the plurality of chunks to be sampled for each region, wherein the sample size is determined based, at least in part, on an acceptance of a likelihood of identifying at least one collision between two regions corresponding to logical entities of a first cluster of logical entities. The computer-implemented method further includes sampling the plurality of chunks for each region based on the determined sample size. The computer-implemented method further includes generating a hash value for each chunk sampled and storing each hash value in an index. The computer-implemented method further includes identifying one or more collisions between the plurality of regions. A corresponding computer system and computer program product are also disclosed.
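
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of SHA-1, uniform random sampling, and the tiny chunk size are all assumptions made for illustration. The abstract's rule for deriving the sample size from an accepted collision likelihood is not reproduced here, so `sample_size` is simply passed in as a parameter, and the data set is assumed to be already divided into named regions.

```python
from __future__ import annotations

import hashlib
import random
from collections import defaultdict
from itertools import combinations


def split_fixed(data: bytes, chunk_size: int) -> list[bytes]:
    """Divide a region into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def find_collisions(
    regions: dict[str, bytes],
    sample_size: int,
    chunk_size: int = 4096,  # hypothetical value, not taken from the patent
    seed: int = 0,
) -> set[tuple[str, str]]:
    """Return pairs of region names that share at least one sampled chunk hash."""
    rng = random.Random(seed)
    index: dict[str, set[str]] = defaultdict(set)  # hash value -> region names

    for name, data in regions.items():
        chunks = split_fixed(data, chunk_size)
        # Sample at most `sample_size` chunks from this region (assumed: uniform).
        k = min(sample_size, len(chunks))
        for chunk in rng.sample(chunks, k):
            digest = hashlib.sha1(chunk).hexdigest()
            index[digest].add(name)

    # A collision is a hash value recorded for two different regions.
    collisions: set[tuple[str, str]] = set()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            collisions.add((a, b))
    return collisions
```

With a sample size smaller than the number of chunks per region, a shared chunk is found only if both regions happen to sample it, which is why the abstract ties the sample size to an accepted likelihood of seeing at least one collision.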

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.