System and method for handling the confounding effect of document length on vector-based similarity scores
US9311390B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Jan 29, 2009 |
| Grant date | Apr 12, 2016 |
| Priority date | — |
| Expiry date | Oct 17, 2033 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/3347
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering confounding effects of document length. Vector-based methods for comparing the semantic similarity between texts (such as Content Vector Analysis and Random Indexing) have a characteristic which may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described, and suggests the application of a pivoted normalization method from information retrieval to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
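The abstract names pivoted normalization but does not give a formula, so the following is only a minimal sketch of the general idea as it is commonly stated in information retrieval. The usual cosine denominator (the vector length) is replaced by a blend of that length and a fixed "pivot" value. The function names, the `slope` default, and the use of the Euclidean norm as the length measure are all assumptions made for illustration. They are not taken from the patent's claims.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Standard cosine similarity: each vector is divided by its own length."""
    return dot(u, v) / (norm(u) * norm(v))


def pivoted_norm(u, pivot, slope):
    """Pivoted normalization factor (assumed form).

    Blends the vector's own length with a fixed pivot, typically the
    average length over the collection. slope=1.0 recovers the plain
    Euclidean norm; smaller slopes pull every factor toward the pivot.
    """
    return (1.0 - slope) * pivot + slope * norm(u)


def pivoted_similarity(u, v, pivot, slope=0.75):
    """Inner product scaled by pivoted factors instead of raw lengths.

    The slope default is a hypothetical starting value, not a tuned one.
    """
    return dot(u, v) / (pivoted_norm(u, pivot, slope) * pivoted_norm(v, pivot, slope))


if __name__ == "__main__":
    short_doc = [3.0, 4.0]   # toy vector, length 5
    long_doc = [6.0, 8.0]    # same direction, length 10
    pivot = (norm(short_doc) + norm(long_doc)) / 2.0  # average length
    print(cosine(short_doc, long_doc))
    print(pivoted_similarity(short_doc, long_doc, pivot, slope=0.5))
```

In this sketch `slope` controls how much of the original length normalization is kept. In practice both `slope` and `pivot` would need to be estimated from the document collection at hand.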
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.