Clustering strings using N-grams
US7644076B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 12, 2003 |
| Grant date | Jan 5, 2010 |
| Priority date | — |
| Expiry date | Sep 12, 2023 |
Classification
- Technology area (CPC Y): Emerging Cross-Sectional Technologies
- CPC primary: Y10S707/99936
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method and computer program for clustering a string are described. The string includes a plurality of characters. R unique n-grams T_1 ... T_R are identified in the string. For every unique n-gram T_S, if the frequency of T_S in a set of n-gram statistics is not greater than a first threshold, the string is associated with a cluster associated with T_S. Otherwise, for every other n-gram T_V in the string (V = 1 ... R, except S), if the frequency of n-gram T_V is greater than the first threshold, and if the frequency of n-gram pair T_S-T_V is not greater than a second threshold, the string is associated with a cluster associated with the n-gram pair T_S-T_V. Otherwise, for every other n-gram T_X in the string (X = 1 ... R, except S and V), the string is associated with a cluster associated with the n-gram triple T_S-T_V-T_X. Otherwise, nothing is done.
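The decision procedure in the abstract can be sketched in Python as follows. This is only an illustrative reading of the abstract, not the patented implementation. The names `unique_ngrams` and `cluster_keys`, the dictionary-based statistics `ngram_freq` and `pair_freq`, and the choice to treat pairs and triples as unordered (stored as sorted tuples) are all assumptions made for the example. The nesting of the "otherwise" branches is also an interpretation: the triple step is taken when T_V is frequent and the pair T_S-T_V is also frequent, and nothing is done when T_V is not frequent.

```python
def unique_ngrams(s, n=3):
    """Return the unique character n-grams of s, in first-seen order."""
    return list(dict.fromkeys(s[i:i + n] for i in range(len(s) - n + 1)))


def cluster_keys(s, ngram_freq, pair_freq, t1, t2, n=3):
    """Return the set of cluster keys the string s is associated with.

    ngram_freq: hypothetical dict mapping an n-gram to its frequency.
    pair_freq:  hypothetical dict mapping a sorted (n-gram, n-gram) tuple
                to the frequency of that pair.
    t1, t2:     the first and second thresholds from the abstract.
    """
    grams = unique_ngrams(s, n)
    keys = set()
    for ts in grams:
        # Rare n-gram: it identifies a cluster on its own.
        if ngram_freq.get(ts, 0) <= t1:
            keys.add((ts,))
            continue
        for tv in grams:
            if tv == ts:
                continue
            # Assumed reading: if T_V is not frequent, nothing is done here
            # (T_V yields its own single n-gram cluster in the outer loop).
            if ngram_freq.get(tv, 0) <= t1:
                continue
            pair = tuple(sorted((ts, tv)))
            if pair_freq.get(pair, 0) <= t2:
                # Both n-grams are frequent, but the pair is rare.
                keys.add(pair)
            else:
                # The pair is also frequent: fall back to triples.
                for tx in grams:
                    if tx not in (ts, tv):
                        keys.add(tuple(sorted((ts, tv, tx))))
    return keys


# Usage with made-up statistics for bigrams (n=2) of the string "abcd".
freq = {"ab": 1, "bc": 10, "cd": 10}
rare_pair = cluster_keys("abcd", freq, {("bc", "cd"): 2}, t1=5, t2=3, n=2)
common_pair = cluster_keys("abcd", freq, {("bc", "cd"): 9}, t1=5, t2=3, n=2)
```

With these made-up counts, `rare_pair` contains the single n-gram key for "ab" and the pair key for "bc"-"cd", while `common_pair` replaces the pair key with the triple "ab"-"bc"-"cd". Using sorted tuples means the same cluster is reached whichever n-gram plays the role of T_S. An implementation that treats T_S-T_V and T_V-T_S as distinct clusters would keep the tuples in encounter order instead.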
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.