Multi-language document search and retrieval system
US7369987B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 29, 2006 |
| Grant date | May 6, 2008 |
| Priority date | — |
| Expiry date | Dec 29, 2026 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06F40/289
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
A multi-lingual indexing and search system is presented that performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. The system includes a tokenizer that separates a string of text into individual word tokens, and eliminates predetermined types of tokens from further processing. The system also includes a stemmer that reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. The stemmer removes known word endings from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In an embodiment, the stemmer only removes those word endings which are associated with nouns. The system further includes an indexer that stores the stems in an index.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.