Tokenizer for a natural language processing system
US7092871B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Mar 30, 2001 |
| Grant date | Aug 15, 2006 |
| Priority date | — |
| Expiry date | Apr 29, 2024 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06F40/268
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.