Patent · US Active

Multi-language document search and retrieval system

US7369987B2 · kind B2 · utility

Cited by: 6
References: 9
Claims: 7
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 29, 2006
Grant date: May 6, 2008
Priority date
Expiry date: Dec 29, 2026

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/289
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A multi-lingual indexing and search system is presented that performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. The system includes a tokenizer that separates a string of text into individual word tokens, and eliminates predetermined types of tokens from further processing. The system also includes a stemmer that reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. The stemmer removes known word endings from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In an embodiment, the stemmer only removes those word endings which are associated with nouns. The system further includes an indexer that stores the stems in an index.
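
The pipeline described in the abstract (tokenizer, dictionary-free stemmer, indexer) can be illustrated with a minimal Python sketch. This is not the patented implementation. The ending list `NOUN_ENDINGS`, the minimum stem length `MIN_STEM_LEN`, the choice to discard purely numeric tokens, and the `search` helper are all assumptions made for illustration, since the abstract does not specify which endings or token types are used.

```python
import re
from collections import defaultdict

# Hypothetical ending list: the abstract does not enumerate the endings.
NOUN_ENDINGS = ("es", "s", "en", "e")
# Assumption: never strip an ending if fewer than this many characters remain.
MIN_STEM_LEN = 3


def tokenize(text):
    """Split text into lowercase word tokens and drop purely numeric ones.

    Dropping numeric tokens stands in for the "predetermined types of
    tokens" that the abstract says are eliminated (an assumption).
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not t.isdigit()]


def stem(token):
    """Strip the longest matching known ending, with no dictionary lookup."""
    for ending in sorted(NOUN_ENDINGS, key=len, reverse=True):
        if token.endswith(ending) and len(token) - len(ending) >= MIN_STEM_LEN:
            return token[: -len(ending)]
    return token


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every stemmed query term."""
    stems = [stem(t) for t in tokenize(query)]
    if not stems:
        return set()
    result = set(index.get(stems[0], set()))
    for s in stems[1:]:
        result &= index.get(s, set())
    return result


docs = {1: "Boxes of 12 apples", 2: "Die Frauen lesen"}
index = build_index(docs)
print(stem("apples"), stem("apple"))  # both reduce to the same non-word stem
print(search(index, "apple"))
print(search(index, "Frau"))
```

In this sketch "apples" and "apple" both reduce to "appl", which is not a dictionary word. Index entries and query terms still match because both pass through the same stemmer, which is the property the abstract emphasizes.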

Source: USPTO / EPO open patent data (objective bibliographic data and citation counts).