Patent · US Expired

Tokenizer for a natural language processing system

US7092871B2 · kind B2 · utility

Cited by: 37
References: 11
Claims: 10
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 30, 2001
Grant date: Aug 15, 2006
Priority date
Expiry date: Apr 29, 2024

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/268
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
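
The two features named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The lexicon, the precedence order, and the names LEXICON, PUNCT_PRECEDENCE, is_valid, split_by_precedence, and segment are all hypothetical, chosen only for illustration. A small word set stands in for the linguistic knowledge component, and an ordered list of punctuation marks stands in for the language-specific precedence hierarchy.

```python
import re

# Hypothetical stand-in for the linguistic knowledge component:
# a token is "validated" if it appears in this small lexicon.
LEXICON = {"a", "well-known", "and", "or", "obscure", "e.g.", "case"}

# Hypothetical precedence hierarchy: marks listed earlier are split on first.
PUNCT_PRECEDENCE = ["/", "-", ".", ","]


def is_valid(token: str) -> bool:
    """Ask the (stand-in) knowledge component whether a token is acceptable."""
    return token.lower() in LEXICON


def split_by_precedence(token: str, level: int = 0) -> list[str]:
    """Break an unvalidated token into subtokens, highest precedence first."""
    if is_valid(token):
        return [token]
    for i in range(level, len(PUNCT_PRECEDENCE)):
        mark = PUNCT_PRECEDENCE[i]
        if mark in token:
            # The capturing group keeps the punctuation mark as its own piece.
            pieces = [p for p in re.split(f"({re.escape(mark)})", token) if p]
            result: list[str] = []
            for piece in pieces:
                if piece == mark:
                    result.append(piece)
                else:
                    # Remaining pieces may only be split on lower-precedence marks.
                    result.extend(split_by_precedence(piece, i + 1))
            return result
    # No punctuation left to split on: keep the token as proposed.
    return [token]


def segment(text: str) -> list[str]:
    """Propose whitespace-delimited tokens, then validate or subdivide each."""
    tokens: list[str] = []
    for proposed in text.split():
        tokens.extend(split_by_precedence(proposed))
    return tokens
```

Under these assumptions, segment("a well-known and/or obscure e.g. case.") keeps "well-known" and "e.g." whole because the lexicon validates them, while "and/or" and "case." are broken into subtokens at the punctuation.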

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.