Patent · US Expired

Identifying similarities within large collections of unstructured data

US6947933B2 · kind B2 · utility

Cited by	46

References	9

Claims	16

Family size	0

Assignee

Verdasys, Inc. · US

Inventor

Michael Smolsky · Brookline, US

Key dates

Filing date	Dec 17, 2003
Grant date	Sep 20, 2005
Priority date	—
Expiry date	Dec 17, 2023

Classification

Technology area (CPC Y)	Emerging Cross-Sectional Technologies
CPC primary	Y10S707/99956
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A technique for determining when documents stored in digital format in a data processing system are similar. A method compares a sparse representation of two or more documents by breaking the documents into “chunks” of data of predefined sizes. Selected subsets of the chunks are determined as being representative of data in the documents and coefficients are developed to represent such chunks. Coefficients are then combined into coefficient clusters containing coefficients that are similar according to a predetermined similarity metric. The degree of similarity between documents is then evaluated by counting clusters into which chunks of similar documents fall.
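
The pipeline described in the abstract (chunk, select, derive coefficients, cluster, count) can be illustrated with a toy sketch. This is not the patented method: the abstract does not specify the chunk size, the selection rule, the coefficient, or the similarity metric, so every constant and function name below (CHUNK_SIZE, SAMPLE_MOD, BUCKET_WIDTH, split_chunks, select_chunks, coefficient, cluster_ids, similarity) is a hypothetical stand-in chosen for readability. Here the coefficient is assumed to be the mean byte value of a chunk, clusters are assumed to be fixed-width buckets of that value, and the final score is assumed to be the share of clusters the two documents have in common.

```python
# Toy illustration only: all constants and rules below are assumptions,
# not details taken from the patent.
CHUNK_SIZE = 8     # assumed chunk size in bytes
SAMPLE_MOD = 2     # assumed rule: keep chunks whose byte sum is divisible by this
BUCKET_WIDTH = 16  # assumed width of one coefficient cluster


def split_chunks(data, size=CHUNK_SIZE):
    """Break a document into non-overlapping chunks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]


def select_chunks(chunks):
    """Keep a deterministic, content-based subset of chunks (assumed rule)."""
    return [c for c in chunks if sum(c) % SAMPLE_MOD == 0]


def coefficient(chunk):
    """Represent a chunk by one number: its mean byte value (assumed)."""
    return sum(chunk) / len(chunk)


def cluster_ids(data):
    """Map each selected chunk's coefficient to a fixed-width bucket id."""
    selected = select_chunks(split_chunks(data))
    return {int(coefficient(c) // BUCKET_WIDTH) for c in selected}


def similarity(doc_a, doc_b):
    """Share of clusters hit by both documents (0.0 when neither has any)."""
    a, b = cluster_ids(doc_a), cluster_ids(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    doc_a = bytes([10] * 8 + [200] * 8)
    doc_b = bytes([10] * 8 + [100] * 8)
    print(similarity(doc_a, doc_a))
    print(similarity(doc_a, doc_b))
```

In this sketch the two sample documents share their first chunk and differ in the second, so they land in one common cluster out of three distinct clusters. A mean-byte coefficient is far too coarse for real text; it is used only because it makes the clustering step easy to follow.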

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.