Patent · US Active

System and method for handling the confounding effect of document length on vector-based similarity scores

US9311390B2 · kind B2 · utility

Cited by: 4
References: 9
Claims: 6
Family size: 0

Assignee

Inventor

Key dates

Filing date: Jan 29, 2009
Grant date: Apr 12, 2016
Priority date
Expiry date: Oct 17, 2033

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/3347
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering confounding effects of document length. Vector-based methods for comparing the semantic similarity between texts (such as Content Vector Analysis and Random Indexing) have a characteristic which may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described, and suggests the application of a pivoted normalization method from information retrieval to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
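The abstract names pivoted normalization but does not give a formula. As a rough illustration only, the sketch below uses the form commonly seen in information retrieval, in which a vector's length norm is replaced by (1 - slope) * pivot + slope * norm before dividing. The bag-of-words vectors, the function names, and the default slope of 0.75 are assumptions made for this example. They are not taken from the patent, which applies the idea to Random Indexing vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Hypothetical helper: a raw term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())


def norm(vec):
    """Euclidean length of a sparse vector; it grows as the document grows."""
    return math.sqrt(sum(v * v for v in vec.values()))


def dot(a, b):
    """Sparse dot product, iterating over the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0) for k, v in a.items())


def cosine(a, b):
    """Standard cosine similarity, the uncorrected baseline."""
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na and nb else 0.0


def pivoted_norm(length_norm, pivot, slope):
    """Assumed pivoted form: tilt the normalization factor around the pivot.

    Norms above the pivot are shrunk and norms below it are enlarged,
    so long documents are penalized less and short ones more than
    under plain cosine normalization.
    """
    return (1.0 - slope) * pivot + slope * length_norm


def pivoted_similarity(a, b, pivot, slope=0.75):
    """Dot product divided by pivoted norms instead of raw norms.

    The slope value is an illustrative default, not a value from the patent.
    """
    da = pivoted_norm(norm(a), pivot, slope)
    db = pivoted_norm(norm(b), pivot, slope)
    return dot(a, b) / (da * db) if da and db else 0.0
```

With slope set to 1.0 the pivoted factor equals the raw norm, so the score reduces to ordinary cosine similarity. The pivot is typically set to something like the mean norm over the collection, and both parameters would need tuning on held-out data.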

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.