Language identification for documents containing multiple languages
US8224641B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Nov 19, 2008 |
| Grant date | Jul 17, 2012 |
| Priority date | — |
| Expiry date | May 13, 2031 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/263
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. The language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. The language(s) of the document are identified based on the segment scores.
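The second embodiment in the abstract (segmenting at character-set transitions, then scoring each segment against a subset of candidate languages) can be illustrated with a minimal Python sketch. This is not the patented method. It assumes that a character's script can be approximated by the first word of its Unicode name, and it substitutes a toy stopword count for whatever scoring the patent actually uses. The names `char_script`, `segment_by_script`, `score_segment`, `identify_languages`, and the `CANDIDATES` table are all hypothetical.

```python
import unicodedata
from typing import Dict, List, Set, Tuple

# Hypothetical candidate languages per script, each with a tiny stopword list.
# A real system would use trained language models instead.
CANDIDATES: Dict[str, Dict[str, Set[str]]] = {
    "LATIN": {"en": {"the", "and", "of"}, "fr": {"le", "et", "de"}},
    "CYRILLIC": {"ru": {"и", "в", "не"}},
}


def char_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation belong to no single script
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"


def segment_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text wherever the script of the letters changes."""
    segments: List[Tuple[str, str]] = []
    current = None
    buf: List[str] = []
    for ch in text:
        script = char_script(ch)
        if script != "COMMON" and current is not None and script != current:
            # Transition between character sets: close the current segment.
            segments.append((current, "".join(buf)))
            buf = []
        if script != "COMMON":
            current = script
        buf.append(ch)
    if buf:
        segments.append((current or "COMMON", "".join(buf)))
    return segments


def score_segment(script: str, text: str) -> Dict[str, int]:
    """Score a segment only against the languages that use its script."""
    words = text.lower().split()
    return {
        lang: sum(word in vocab for word in words)
        for lang, vocab in CANDIDATES.get(script, {}).items()
    }


def identify_languages(text: str) -> List[str]:
    """Return the best-scoring language for each segment, in document order."""
    result: List[str] = []
    for script, segment in segment_by_script(text):
        scores = score_segment(script, segment)
        if scores:
            result.append(max(scores, key=scores.get))
    return result


print(identify_languages("the cat and the dog привет и мир"))
```

Restricting each segment to the languages written in its script keeps the candidate set small, which is the point of segmenting first. The same restriction means this sketch cannot separate two languages that share a script within one segment; the abstract's first embodiment, which compares single-language and two-language hypotheses, addresses the multi-language case differently.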
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.