Patent · US Active

Method and apparatus for document clustering and document sketching

US8255397B2 · kind B2 · utility

3Cited by
100References
4Claims
0Family size

Assignee

Inventor

Key dates

Filing dateAug 26, 2008
Grant dateAug 28, 2012
Priority date
Expiry dateAug 12, 2029

Classification

  • Technology area (CPC Y)Emerging Cross-Sectional Technologies
  • CPC primaryY10S707/99943
  • WIPO fieldComputer technology
  • WIPO sectorElectrical engineering

Abstract

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.