System and method for identifying language using morphologically-based techniques
US6415250B1 · kind B1 · utility
Assignee
Inventor
Key dates
| Filing date | Jun 18, 1997 |
| Grant date | Jul 2, 2002 |
| Priority date | — |
| Expiry date | Jun 18, 2017 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/268
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A language identification system for automatically identifying the language in which an input text is written, based upon a probabilistic analysis of predetermined portions of words sampled from the input text. The predetermined portions of words reflect morphological characteristics of natural languages. The automatic language identification system determines in which language of a plurality of represented languages a given text is written, based upon a value representing the relative likelihood that the text is written in a particular one of the represented languages due to the presence of a morphologically-significant word portion in the text. Preferably, the word portion is the last three characters of a word. The relative likelihood is derived from the relative frequency of occurrence of the fixed-length word ending in each of a plurality of language corpuses, with each language corpus corresponding to one of the plurality of represented languages. Specifically, the automatic language identification system includes a language corpus analyzer that generates, for each of a plurality of word endings extracted from at least one of the language corpuses, a plurality of probabilities associ…
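The approach the abstract describes can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`word_endings`, `build_ending_table`, `identify_language`) are hypothetical, and several details are assumptions made only for illustration. These are the regex tokenizer, the normalization of each ending's relative frequencies across languages to obtain a per-language likelihood, and the summation of those likelihoods over the sampled words. The abstract itself specifies only that three-character word endings are preferred and that the likelihoods derive from relative frequencies in per-language corpuses.

```python
import re
from collections import Counter, defaultdict


def word_endings(text, n=3):
    """Return the last n characters of every word with at least n letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    return [w[-n:] for w in words if len(w) >= n]


def build_ending_table(corpora, n=3):
    """Map each ending to {language: relative likelihood}.

    corpora: dict of language name -> corpus text.
    Assumption: the likelihood is the ending's relative frequency in a
    language's corpus, normalized so the values for one ending sum to 1.
    """
    table = defaultdict(dict)
    for lang, text in corpora.items():
        counts = Counter(word_endings(text, n))
        total = sum(counts.values())
        for ending, count in counts.items():
            table[ending][lang] = count / total  # relative frequency
    for per_lang in table.values():
        norm = sum(per_lang.values())
        for lang in per_lang:
            per_lang[lang] /= norm
    return dict(table)


def identify_language(text, table, n=3):
    """Sum the per-ending likelihoods and return the best-scoring language.

    Returns None when no ending in the text appears in the table.
    """
    scores = defaultdict(float)
    for ending in word_endings(text, n):
        for lang, likelihood in table.get(ending, {}).items():
            scores[lang] += likelihood
    return max(scores, key=scores.get) if scores else None
```

With toy corpora, an input whose words end in "ing" scores only for a language whose corpus contains that ending. A real system would need far larger corpuses, and probably smoothing for endings that occur in several languages.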
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.