Phrase-based document clustering with automatic phrase extraction
US8392175B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | May 21, 2010 |
| Grant date | Mar 5, 2013 |
| Priority date | — |
| Expiry date | Jun 30, 2031 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06F40/237
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.