Tokenizer for a natural language processing system
US7269547B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jul 15, 2005 |
| Grant date | Sep 11, 2007 |
| Priority date | — |
| Expiry date | Jul 15, 2025 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/268
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
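The propose-and-validate flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not specify the hierarchy, the validation interface, or how initial segmentations are proposed. The sketch assumes whitespace splitting for the initial proposal, a plain lexicon lookup standing in for the linguistic knowledge component, and a made-up precedence order (`/`, then `-`, then `.`). All names (`PUNCT_PRECEDENCE`, `make_validator`, `split_token`, `segment`) are hypothetical.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical language-specific data: punctuation marks in precedence order.
# Marks listed earlier are split on before marks listed later.
PUNCT_PRECEDENCE: Sequence[str] = ("/", "-", ".")


def make_validator(lexicon: Set[str]) -> Callable[[str], bool]:
    """Stand-in for the linguistic knowledge component (assumed: lexicon lookup)."""
    def is_valid(token: str) -> bool:
        return token.lower() in lexicon
    return is_valid


def split_token(
    token: str,
    is_valid: Callable[[str], bool],
    precedence: Sequence[str] = PUNCT_PRECEDENCE,
) -> List[str]:
    """Break a proposed token into subtokens, following the precedence order."""
    # A validated token is kept whole; so is one with no marks left to try.
    if is_valid(token) or not precedence:
        return [token]
    mark, rest = precedence[0], precedence[1:]
    if mark not in token:
        return split_token(token, is_valid, rest)
    subtokens: List[str] = []
    parts = token.split(mark)
    for i, part in enumerate(parts):
        if part:
            # Each piece is re-proposed and validated against lower-precedence marks.
            subtokens.extend(split_token(part, is_valid, rest))
        if i < len(parts) - 1:
            subtokens.append(mark)  # the punctuation mark becomes its own token
    return subtokens


def segment(text: str, is_valid: Callable[[str], bool]) -> List[str]:
    """Propose whitespace-delimited tokens, then validate or subdivide each one."""
    tokens: List[str] = []
    for candidate in text.split():
        tokens.extend(split_token(candidate, is_valid))
    return tokens


if __name__ == "__main__":
    validator = make_validator({"either", "or", "well-known", "e.g."})
    print(segment("either/or well-known e.g. foo-bar", validator))
```

With this toy lexicon, "well-known" and "e.g." are validated as whole tokens and left intact, while "either/or" and "foo-bar" fail validation and are broken at the highest-precedence mark they contain.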
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.