Optimized subset processing for de-duplication
US10901996B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Feb 24, 2016 |
| Grant date | Jan 26, 2021 |
| Priority date | — |
| Expiry date | Jan 16, 2038 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06F16/285
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes generating a cluster of records from a group of records based on one or more keys; splitting the cluster of records into multiple subsets of records with each subset of records having fewer number of records than the cluster of records, wherein the splitting the cluster of records into multiple subsets of records is based on a number of records in the cluster of records exceeding a threshold; causing duplicate sets of records in each of the subsets of records to be identified, wherein a duplicate set of records includes one or more records, and wherein when a duplicate set of records includes two or more records, the two or more records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets of records forming a first group of duplicate sets of records; and forming a representative set of records based on selecting a representative record from each of the duplicate sets in the first group of duplicate sets of records.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.