Patent · US Active

Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques

US10409788B2 · kind B2 · utility

Cited by: 0
References: 5
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jan 23, 2017
Grant date: Sep 10, 2019
Priority date
Expiry date: Sep 16, 2037

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/215
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
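
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and the abstract gives no formulas, so several choices here are assumptions: the record layout (`id`, `name`, `city`), the function names, the window size, and the threshold are all hypothetical; the per-attribute distance uses `difflib.SequenceMatcher` from the standard library; the "standard deviation of distances" is read as a root-mean-square deviation of the attribute distances from zero (a perfect match); and the "filled pairs" scaling is read as dividing by the fraction of attributes populated in both records. The final key-matching step is omitted.

```python
from difflib import SequenceMatcher
from math import sqrt


def attribute_distance(a, b):
    """Distance in [0, 1] between two strings (0 means identical). Assumed metric."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_distance(rec_a, rec_b, attributes):
    """RMS deviation of attribute distances from zero, scaled by the filled-pairs share.

    Assumption: only attributes populated in BOTH records are compared, and the
    result is divided by the fraction of attributes that were comparable.
    """
    distances = [
        attribute_distance(rec_a[k], rec_b[k])
        for k in attributes
        if rec_a.get(k) and rec_b.get(k)
    ]
    if not distances:
        return float("inf")  # nothing comparable, so never a candidate
    filled_share = len(distances) / len(attributes)
    rms = sqrt(sum(d * d for d in distances) / len(distances))
    return rms / filled_share


def possible_duplicates(incoming, reference, attributes, sort_keys,
                        window=3, threshold=0.3):
    """Multi-pass sorted-neighborhood search over merged records."""
    # Merge both sources into one neighborhood and tag each record's origin.
    neighborhood = (
        [dict(r, resource_id="incoming") for r in incoming]
        + [dict(r, resource_id="reference") for r in reference]
    )
    found = set()
    for key in sort_keys:  # one pass per sort key
        ordered = sorted(neighborhood, key=lambda r: (r.get(key) or "").lower())
        for i, rec_a in enumerate(ordered):
            # Compare only with the next (window - 1) records in sort order.
            for rec_b in ordered[i + 1:i + window]:
                if rec_a["resource_id"] == rec_b["resource_id"]:
                    continue  # only pairs from different resources are scored
                if pair_distance(rec_a, rec_b, attributes) < threshold:
                    found.add(tuple(sorted((rec_a["id"], rec_b["id"]))))
    return sorted(found)


incoming = [
    {"id": "n1", "name": "Jon Smith", "city": "Berlin"},
    {"id": "n2", "name": "Alice Wong", "city": "Paris"},
]
reference = [
    {"id": "r1", "name": "John Smith", "city": "Berlin"},
    {"id": "r2", "name": "Bob Brown", "city": "Madrid"},
]
candidates = possible_duplicates(
    incoming, reference, ["name", "city"], sort_keys=["name", "city"]
)
print(candidates)
```

Sorting by more than one key is what makes the search multi-pass: a pair separated by a typo in one attribute can still land in the same window when the records are ordered by another attribute.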

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.