Patent · US Active

Text language identification

US7689409B2 · kind B2 · utility

Cited by	302
References	32
Claims	9
Family size	0

Assignee

FRANCE TELECOM · FR

Inventor

Johannes Heinecke · Lannion, FR

Key dates

Filing date	Dec 11, 2003
Grant date	Mar 30, 2010
Priority date	—
Expiry date	Jan 13, 2028

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/268
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

After prestoring first character strings that occur frequently in words of languages and second character strings that are atypical therein, a device for automatically identifying the language of a text from a plurality of languages extracts words from the text and constructs all of the character strings contained in each extracted word. Each string in an extracted word is compared to the first and second strings of a particular language. If the word contains a first string, a score of the language is increased by a coefficient depending in particular on the position of the first string in the word. If the word contains a second string, the score is decreased by a coefficient associated with the second string. The highest of the scores corresponding to the predetermined languages identifies the language of the text.
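
The scoring scheme in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. The string tables (FREQUENT, ATYPICAL), all coefficient values, and the position rule (EDGE_BONUS, which doubles the coefficient when a frequent string sits at the start or end of a word) are hypothetical values chosen only for illustration. The abstract says only that the coefficient depends "in particular" on position.

```python
import re

# Hypothetical prestored tables: string -> coefficient, per language.
# Real tables would be far larger; these values are illustrative only.
FREQUENT = {
    "en": {"th": 1.0, "ing": 1.0, "wh": 1.0},
    "fr": {"eau": 1.0, "qu": 1.0, "ez": 1.0},
}
ATYPICAL = {
    "en": {"eau": 1.0, "ez": 1.0},
    "fr": {"wh": 1.0, "th": 1.0},
}
# Assumed position rule: a frequent string at a word edge counts double.
EDGE_BONUS = 2.0


def substrings_with_pos(word):
    """Yield every contiguous substring of word with its start and end index."""
    n = len(word)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield word[start:end], start, end


def score_text(text):
    """Return a score per language for the given text."""
    # Extract words as runs of letters (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: 0.0 for lang in FREQUENT}
    for word in words:
        for s, start, end in substrings_with_pos(word):
            at_edge = start == 0 or end == len(word)
            for lang in scores:
                if s in FREQUENT[lang]:
                    # Frequent string: raise the score, weighted by position.
                    factor = EDGE_BONUS if at_edge else 1.0
                    scores[lang] += FREQUENT[lang][s] * factor
                if s in ATYPICAL[lang]:
                    # Atypical string: lower the score by its coefficient.
                    scores[lang] -= ATYPICAL[lang][s]
    return scores


def identify_language(text):
    """Pick the language with the highest score."""
    scores = score_text(text)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(identify_language("the thing which"))
    print(identify_language("vous avez quelque beau chateau"))
```

Enumerating every substring of a word is quadratic in the word length, which is acceptable for natural-language words. A production system would presumably index the prestored strings rather than test each candidate against dictionary tables as done here.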

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.