Techniques for computing similarity measurements between segments representative of documents
US8166049B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | May 28, 2009 |
| Grant date | Apr 24, 2012 |
| Priority date | — |
| Expiry date | Apr 6, 2030 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/316
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Keyword frequency data for a plurality of document-derived segments is represented in matrix form, in which each segment is represented as a vector of dimensionality equal to the number of keywords. The matrix may be subdivided into a plurality of sub-matrices, each preferably corresponding to a non-overlapping portion of the plurality of keywords. When determining a similarity measurement between any pair of segments, at least a portion of the keyword frequency data for each sub-matrix's non-overlapping keywords is used to determine a sub-matrix dot product for the pair of segments. The resulting plurality of sub-matrix dot products is then summed to provide the similarity measurement. Keywords that are synonyms of each other may be accommodated through the modification of keyword frequency data. Where the keyword frequency data in the matrix representation is relatively sparse, compressed views of the matrix representation may be provided.
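The abstract's main steps can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name here (`split_columns`, `similarity`, `merge_synonyms`, `compress`, `sparse_dot`, and the sample `SEGMENTS` data) is hypothetical. The sketch assumes contiguous keyword blocks, an unnormalized dot product as the similarity measurement, synonym handling by folding one keyword's counts into another's, and a dictionary of nonzero entries as the "compressed view". The abstract does not specify any of these choices.

```python
from typing import Dict, List, Sequence

# Hypothetical data: 3 segments (rows) x 6 keywords (columns) of keyword counts.
SEGMENTS: List[List[int]] = [
    [2, 0, 1, 0, 3, 0],
    [1, 1, 0, 0, 2, 0],
    [0, 0, 0, 4, 0, 1],
]


def split_columns(num_keywords: int, num_blocks: int) -> List[range]:
    """Partition keyword indices into non-overlapping contiguous ranges."""
    base, extra = divmod(num_keywords, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        width = base + (1 if b < extra else 0)
        blocks.append(range(start, start + width))
        start += width
    return blocks


def block_dot(u: Sequence[int], v: Sequence[int], cols: range) -> int:
    """Dot product of two segment vectors restricted to one keyword block."""
    return sum(u[k] * v[k] for k in cols)


def similarity(matrix: Sequence[Sequence[int]], i: int, j: int,
               blocks: Sequence[range]) -> int:
    """Sum the per-block (sub-matrix) dot products for segments i and j."""
    return sum(block_dot(matrix[i], matrix[j], cols) for cols in blocks)


def merge_synonyms(matrix: Sequence[Sequence[int]], keep: int,
                   drop: int) -> List[List[int]]:
    """Assumed synonym handling: fold counts of keyword `drop` into `keep`."""
    merged = []
    for row in matrix:
        new_row = list(row)
        new_row[keep] += new_row[drop]
        new_row[drop] = 0
        merged.append(new_row)
    return merged


def compress(row: Sequence[int]) -> Dict[int, int]:
    """Assumed compressed view: keep only the nonzero keyword counts."""
    return {k: f for k, f in enumerate(row) if f != 0}


def sparse_dot(a: Dict[int, int], b: Dict[int, int]) -> int:
    """Dot product over compressed rows, iterating over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(f * b[k] for k, f in a.items() if k in b)


if __name__ == "__main__":
    blocks = split_columns(6, 3)
    print(similarity(SEGMENTS, 0, 1, blocks))
    print(sparse_dot(compress(SEGMENTS[0]), compress(SEGMENTS[1])))
```

With three blocks of two keywords each, the block dot products for segments 0 and 1 are 2, 0, and 6, so `similarity(SEGMENTS, 0, 1, blocks)` evaluates to 8, the same value as the full dot product. Because the blocks do not overlap, each block's partial result can be computed independently (for example on separate workers) before the final sum.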
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.