Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques
US10409788B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jan 23, 2017 |
| Grant date | Sep 10, 2019 |
| Priority date | — |
| Expiry date | Sep 16, 2037 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/215
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
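The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the `id` field, the `"incoming"`/`"reference"` values used for the resource identification field, the string distance based on `difflib.SequenceMatcher`, the reading of the "filled pairs quote" as the share of attributes filled in both records, and the choice to divide by that share. The sketch stops at the possible-duplicate stage and does not cover grouping or key matching.

```python
from difflib import SequenceMatcher
from statistics import pstdev


def attribute_distance(a, b):
    """Distance in [0, 1] between two strings (assumed metric: 1 - similarity ratio)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_distance(rec_a, rec_b, attributes):
    """Standard deviation of per-attribute distances, scaled by the filled-pairs share."""
    filled = [(rec_a.get(k), rec_b.get(k)) for k in attributes
              if rec_a.get(k) and rec_b.get(k)]
    if not filled:
        return float("inf")  # nothing to compare, treat as maximally distant
    distances = [attribute_distance(x, y) for x, y in filled]
    # Assumption: "filled pairs quote" = fraction of attributes present in both records.
    quote = len(filled) / len(attributes)
    # Assumption: dividing penalizes pairs with few comparable attributes.
    return pstdev(distances) / quote


def possible_duplicates(incoming, reference, attributes, sort_attributes,
                        threshold, window=3):
    """Run one sorted-neighborhood pass per sort attribute and collect close pairs."""
    # Merge both sources; the resource_id values are hypothetical labels.
    neighborhood = ([dict(r, resource_id="incoming") for r in incoming]
                    + [dict(r, resource_id="reference") for r in reference])
    found = {}
    for attr in sort_attributes:
        ordered = sorted(neighborhood, key=lambda r: (r.get(attr) or "").lower())
        for i, rec in enumerate(ordered):
            # Compare only with the next (window - 1) records in sort order.
            for other in ordered[i + 1:i + window]:
                if rec["resource_id"] == other["resource_id"]:
                    continue  # only pairs with different resource ids are scored
                d = pair_distance(rec, other, attributes)
                if d <= threshold:
                    found[frozenset((rec["id"], other["id"]))] = d
    return found


incoming = [{"id": "n1", "name": "Jon Smith", "city": "Berlin"}]
reference = [
    {"id": "r1", "name": "John Smith", "city": "Berlin"},
    {"id": "r2", "name": "Maria Lopez", "city": "Berlin"},
]
result = possible_duplicates(incoming, reference, ["name", "city"],
                             sort_attributes=["name", "city"], threshold=0.1)
```

Here `result` holds the `n1`/`r1` pair only. Taken literally, a standard deviation is zero whenever all attribute distances of a pair are equal, so a real system would presumably combine this measure with further checks such as the grouping and key-matching steps the abstract mentions.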
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.