Supercomputing environment for duplicate detection on web-scale data
US7389310B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Mar 10, 2008 |
| Grant date | Jun 17, 2008 |
| Priority date | — |
| Expiry date | Mar 10, 2028 |
Classification
- Technology area (CPC Y): Emerging Cross-Sectional Technologies
- CPC primary: Y10S707/99954
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A scale-out supercomputing environment includes a plurality of interconnected nodes arranged in a three-dimensional cubic grid and configured to perform a method of duplicate detection. The method includes at least computing a fingerprint of at least one document in the supercomputing environment to generate data packets from the at least one document and to generate a fixed size tuple of information from the at least one document, distributing the data packets to each node of the plurality of nodes to ensure all elements of the fixed size tuple fit into memory of the plurality of nodes, applying localized detection techniques to data packets on each node of the plurality of nodes to remove data packet duplicates, redistributing the data packets to each node of the plurality of nodes based on the document fingerprint, and performing a global merge of results of the localized detection techniques.
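The steps named in the abstract (fingerprint, distribute, local detection, redistribute by fingerprint, global merge) can be illustrated with a minimal single-process sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the 2 x 2 x 2 grid size, SHA-1 as the fingerprint, round-robin initial placement, modulo routing, and all function names. Nodes are simulated as dictionary keys rather than real interconnected hardware.

```python
import hashlib
from itertools import product

GRID = 2  # hypothetical edge length: a 2 x 2 x 2 cubic grid of simulated nodes
NODES = list(product(range(GRID), repeat=3))  # (x, y, z) coordinates


def fingerprint(text):
    """Fixed-size digest standing in for the document fingerprint (assumed SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_packets(docs):
    """Reduce each (doc_id, text) pair to a fixed-size tuple (doc_id, fingerprint)."""
    return [(doc_id, fingerprint(text)) for doc_id, text in docs]


def distribute(packets):
    """Spread packets round-robin so each node holds a similar, bounded share."""
    shards = {node: [] for node in NODES}
    for i, packet in enumerate(packets):
        shards[NODES[i % len(NODES)]].append(packet)
    return shards


def local_dedupe(packets):
    """Localized detection: keep the first packet seen per fingerprint on one node."""
    first_seen = {}
    for doc_id, fp in packets:
        first_seen.setdefault(fp, doc_id)
    return [(doc_id, fp) for fp, doc_id in first_seen.items()]


def owner(fp):
    """Route a fingerprint to one node, so equal fingerprints always meet."""
    return NODES[int(fp, 16) % len(NODES)]


def redistribute(shards):
    """Move every surviving packet to the node that owns its fingerprint."""
    routed = {node: [] for node in NODES}
    for node in NODES:
        for doc_id, fp in shards[node]:
            routed[owner(fp)].append((doc_id, fp))
    return routed


def detect_duplicates(docs):
    """Return the sorted ids of documents kept after duplicate removal."""
    shards = distribute(make_packets(docs))
    shards = {node: local_dedupe(p) for node, p in shards.items()}
    shards = redistribute(shards)
    shards = {node: local_dedupe(p) for node, p in shards.items()}
    # Global merge: fingerprints are now disjoint across nodes, so concatenate.
    return sorted(doc_id for node in NODES for doc_id, _ in shards[node])
```

For example, `detect_duplicates([("a", "hello"), ("b", "world"), ("c", "hello")])` keeps `"a"` and `"b"` in this sketch. Routing by fingerprint is what makes the final merge cheap: after redistribution, all copies of a given fingerprint sit on the same node, so the second local pass removes cross-node duplicates and the merge only has to concatenate results.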
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.