Patent · US Expired

Method and apparatus for automatically generating hierarchical categories from large document collections

US5819258A · kind A · utility

Cited by: 226
References: 7
Claims: 46
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 7, 1997
Grant date: Oct 6, 1998
Priority date
Expiry date: Mar 7, 2017

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99938
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A top-down clustering method and apparatus recursively processes clusters of documents by first extracting features from the documents comprising the cluster, then using the extracted features to generate sub-clusters and finally using the generated sub-clusters to develop topics and identifiers for each sub-cluster. This process is repeated for each cluster and sub-cluster in a recursive manner so that clustering is performed using features extracted from each document in a cluster to perform sub-clustering. Feature extraction is performed by using frequency counts of terms taken from each document in a cluster and discarding terms falling outside of predetermined boundaries computed based on the total number of documents in the cluster. After bounding, the number of tokens is reduced prior to clustering by means of a correlation technique, such as a PCA model.
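
The feature-bounding step described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the function name `bounded_terms`, the whitespace tokenization, the reading of "frequency counts" as the number of documents containing a term, and the fractions `low_frac` and `high_frac` are all assumptions chosen for illustration. The abstract states only that the boundaries are computed from the total number of documents in the cluster. The later correlation-based reduction (for example PCA) and the sub-clustering itself are not shown.

```python
from collections import Counter


def bounded_terms(docs, low_frac=0.1, high_frac=0.8):
    """Return terms whose document frequency falls inside cluster-size-based bounds.

    low_frac and high_frac are hypothetical parameters, not values from the patent.
    """
    n_docs = len(docs)

    # Count each term once per document, so doc_freq[t] is the number of
    # documents in the cluster that contain term t (an assumed interpretation).
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))

    # The bounds scale with the total number of documents in the cluster.
    lower = low_frac * n_docs
    upper = high_frac * n_docs

    # Discard terms that are too rare or too common to separate sub-clusters.
    return sorted(t for t, count in doc_freq.items() if lower <= count <= upper)


cluster = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird flew over the log",
]
kept = bounded_terms(cluster, low_frac=0.5, high_frac=0.75)
print(kept)
```

In this toy cluster, "the" appears in every document and single-occurrence words such as "mat" appear in only one, so both fall outside the bounds. In a recursive scheme like the one the abstract describes, the same function would be re-run on each sub-cluster, so the bounds and the surviving terms change as the clusters shrink.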

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.