Patent · US Active

System and method for handling the confounding effect of document length on vector-based similarity scores

US9311390B2 · kind B2 · utility

Cited by	4
References	9
Claims	6
Family size	0

Assignee

Educational Testing Service · US

Inventor

Derrick Higgins · Highland Park, US

Key dates

Filing date	Jan 29, 2009
Grant date	Apr 12, 2016
Priority date	—
Expiry date	Oct 17, 2033

Classification

Technology area (CPC G)	Physics
CPC primary	G06F16/3347
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering confounding effects of document length. Vector-based methods for comparing the semantic similarity between texts (such as Content Vector Analysis and Random Indexing) have a characteristic which may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described, and suggests the application of a pivoted normalization method from information retrieval to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
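
The abstract names pivoted normalization without giving a formula, so the following is only a minimal sketch of the general idea as it is known from information retrieval, not the patented method. It assumes plain term-count vectors rather than Random Indexing vectors, and the function names and the `pivot` and `slope` parameters are illustrative choices. Standard cosine similarity divides the dot product by each vector's own norm. The pivoted variant divides by a blend of that norm and a fixed pivot value (for example, the average norm over a corpus), which changes how strongly the score depends on document length.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a simple term-count vector (an assumption; not Random Indexing)."""
    return Counter(text.lower().split())


def norm(vec):
    """Euclidean length of a sparse term vector."""
    return math.sqrt(sum(v * v for v in vec.values()))


def dot(a, b):
    """Dot product of two sparse term vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(k, 0) for k, v in a.items())


def cosine(a, b):
    """Standard cosine similarity: each vector is divided by its own norm."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return dot(a, b) / (na * nb)


def pivoted_norm(vec, pivot, slope):
    """Pivoted normalization factor: (1 - slope) * pivot + slope * norm.

    slope = 1.0 reproduces the ordinary norm; slope = 0.0 ignores the
    vector's own length and always divides by the pivot.
    """
    return (1.0 - slope) * pivot + slope * norm(vec)


def pivoted_similarity(a, b, pivot, slope):
    """Dot product divided by the pivoted normalization factors."""
    da = pivoted_norm(a, pivot, slope)
    db = pivoted_norm(b, pivot, slope)
    if da == 0 or db == 0:
        return 0.0
    return dot(a, b) / (da * db)


if __name__ == "__main__":
    long_doc = term_vector("the cat sat on the mat")
    short_doc = term_vector("the cat")
    print(cosine(long_doc, short_doc))
    print(pivoted_similarity(long_doc, short_doc, pivot=2.0, slope=0.5))
```

In practice, the pivot and slope would be tuned on held-out data. The values used above are arbitrary and serve only to show the call shape.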

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.