Efficient language identification
US8027832B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Feb 11, 2005 |
| Grant date | Sep 27, 2011 |
| Priority date | — |
| Expiry date | Feb 26, 2028 |
Classification
- Technology area (CPC Y): Emerging Cross-Sectional Technologies
- CPC primary: Y10T70/7057
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
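The comparison of actual and expected character counts described in the abstract might be sketched as follows. This is a minimal illustration, not the patented method: the abstract does not give a scoring formula, so the sketch assumes a mean squared z-score over per-character shares of the text. The names `PROFILES`, `score`, `identify`, the `threshold` value, and all profile numbers are hypothetical and invented for the example.

```python
from collections import Counter

# Toy profiles, invented for illustration only. For each language, each listed
# character maps to (expected share of letters, variance of that share).
# A real system would derive these from training text for each language.
PROFILES = {
    "lang_a": {"a": (0.50, 0.01), "b": (0.10, 0.01)},
    "lang_b": {"a": (0.10, 0.01), "b": (0.50, 0.01)},
}


def score(text, profile):
    """Mean squared z-score between actual and expected character shares.

    Lower is better; 0.0 means the sample matches the profile exactly.
    This scoring rule is an assumption, not taken from the patent.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return float("inf")  # nothing to compare against
    counts = Counter(letters)
    total = 0.0
    for ch, (expected, variance) in profile.items():
        actual = counts[ch] / len(letters)
        total += (actual - expected) ** 2 / variance
    return total / len(profile)


def identify(text, profiles=PROFILES, threshold=4.0):
    """Return every language whose score is within the threshold, best first."""
    scored = sorted((score(text, p), lang) for lang, p in profiles.items())
    return [lang for s, lang in scored if s <= threshold]


if __name__ == "__main__":
    print(identify("aaaaabcccc"))
```

Because `identify` can return more than one language, its output corresponds to the candidate set that the abstract says may be passed to downstream n-gram processing; an upstream Unicode-range filter would simply shrink the `profiles` argument before the call.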
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.