Patent · US Active

Supercomputing environment for duplicate detection on web-scale data

US7389310B1 · kind B1 · utility

Cited by: 38
References: 2
Claims: 5
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 10, 2008
Grant date: Jun 17, 2008
Priority date
Expiry date: Mar 10, 2028

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99954
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A scale-out supercomputing environment includes a plurality of interconnected nodes arranged in a three-dimensional cubic grid and configured to perform a method of duplicate detection. The method includes at least computing a fingerprint of at least one document in the supercomputing environment to generate data packets from the at least one document and to generate a fixed size tuple of information from the at least one document, distributing the data packets to each node of the plurality of nodes to ensure all elements of the fixed size tuple fit into memory of the plurality of nodes, applying localized detection techniques to data packets on each node of the plurality of nodes to remove data packet duplicates, redistributing the data packets to each node of the plurality of nodes based on the document fingerprint, and performing a global merge of results of the localized detection techniques.
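
The sequence of steps in the abstract (fingerprint, distribute, local detection, redistribute by fingerprint, global merge) can be illustrated with a minimal single-process sketch. This is not the patented implementation: the nodes are simulated as Python lists, the three-dimensional grid interconnect is not modeled, each document yields a single packet, and the fixed size tuple is assumed to be `(doc_id, fingerprint)`. The SHA-1 fingerprint over normalized text, the round-robin initial distribution, and all function names are assumptions chosen for illustration.

```python
import hashlib


def fingerprint(text):
    """Hypothetical fingerprint: SHA-1 of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def make_packets(docs):
    """Reduce each (doc_id, text) pair to an assumed fixed-size tuple (doc_id, fingerprint)."""
    return [(doc_id, fingerprint(text)) for doc_id, text in docs]


def distribute(packets, num_nodes):
    """Spread packets round-robin so each simulated node holds a bounded share."""
    nodes = [[] for _ in range(num_nodes)]
    for i, packet in enumerate(packets):
        nodes[i % num_nodes].append(packet)
    return nodes


def local_dedupe(node_packets):
    """Within one node, keep only the lowest doc_id for each fingerprint."""
    best = {}
    for doc_id, fp in node_packets:
        if fp not in best or doc_id < best[fp]:
            best[fp] = doc_id
    return [(doc_id, fp) for fp, doc_id in best.items()]


def redistribute(nodes, num_nodes):
    """Route every packet to the node selected by its fingerprint, so that
    identical fingerprints from different nodes meet on the same node."""
    routed = [[] for _ in range(num_nodes)]
    for node in nodes:
        for doc_id, fp in node:
            routed[int(fp, 16) % num_nodes].append((doc_id, fp))
    return routed


def detect_duplicates(docs, num_nodes=4):
    """Return the sorted doc_ids that survive duplicate removal."""
    nodes = distribute(make_packets(docs), num_nodes)
    nodes = [local_dedupe(n) for n in nodes]   # first local pass
    nodes = redistribute(nodes, num_nodes)     # group by fingerprint
    nodes = [local_dedupe(n) for n in nodes]   # cross-node duplicates now collide
    # Global merge: gather the survivors from every node.
    return sorted(doc_id for n in nodes for doc_id, _ in n)


if __name__ == "__main__":
    sample = [
        (0, "Hello world"),
        (1, "hello   WORLD"),
        (2, "a different page"),
        (3, "Hello world"),
    ]
    print(detect_duplicates(sample))
```

In this sketch, routing by fingerprint is what makes the second local pass sufficient: two packets with the same fingerprint always land on the same node, so the final merge only needs to concatenate per-node results.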

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.