Patent · US · Expired

Automatic language identification system for multilingual optical character recognition

US6047251A · kind A · utility

Cited by: 74
References: 7
Claims: 16
Family size: 0

Assignee

Inventors

Key dates

Filing date: Sep 15, 1997
Grant date: Apr 4, 2000
Priority date
Expiry date: Sep 15, 2017

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06V30/242
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

The disclosed invention utilizes a dictionary-based approach to identify languages within different zones in a multi-lingual document. As a first step, a document image is segmented into various zones, regions and word tokens, using suitable geometric properties. Within each zone, the word tokens are compared to dictionaries associated with various candidate languages, and the language that exhibits the highest confidence factor is initially identified as the language of the zone. Subsequently, each zone is further split into regions. The language for each region is then identified, using the confidence factors for the words of that region. For any language determination having a low confidence value, the previously determined language of the zone is employed to assist the identification process.
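The two-pass flow described in the abstract (zone-level language first, then region-level with a fallback to the zone's language) can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes segmentation has already produced zones as lists of regions, each region a list of word tokens. The abstract does not define the confidence factor, so the sketch uses the fraction of words found in a language's dictionary as a stand-in. The function names, the toy dictionaries, and the 0.5 threshold are all hypothetical.

```python
def language_confidences(words, dictionaries):
    """Score each candidate language for a list of word tokens.

    Assumption: confidence = fraction of tokens found in that language's
    dictionary. The abstract does not specify the actual measure.
    """
    if not words:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(w.lower() in vocab for w in words) / len(words)
        for lang, vocab in dictionaries.items()
    }


def identify_languages(zone_regions, dictionaries, threshold=0.5):
    """Return (zone_language, [language per region]) for one zone.

    zone_regions: list of regions, each a list of word tokens.
    threshold: hypothetical cutoff below which a region-level decision
               is treated as low confidence.
    """
    # Pass 1: pick the zone language from all words in the zone.
    zone_words = [w for region in zone_regions for w in region]
    zone_conf = language_confidences(zone_words, dictionaries)
    zone_lang = max(zone_conf, key=zone_conf.get)

    # Pass 2: pick a language per region, falling back to the zone
    # language when the best region-level confidence is too low.
    region_langs = []
    for region in zone_regions:
        conf = language_confidences(region, dictionaries)
        lang = max(conf, key=conf.get)
        if conf[lang] < threshold:
            lang = zone_lang
        region_langs.append(lang)
    return zone_lang, region_langs
```

With toy dictionaries for two languages, a region whose words match neither dictionary inherits the language chosen for its enclosing zone, which is the fallback behavior the abstract describes for low-confidence determinations.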

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.