Patent · US Active

System for tokenizing text in languages without inter-word separation

US10002128B2 · kind B2 · utility

1Cited by
0References
24Claims
0Family size

Assignee

Inventors

Key dates

Filing dateOct 29, 2015
Grant dateJun 19, 2018
Priority date
Expiry dateOct 29, 2035

Classification

  • Technology area (CPC G)Physics
  • CPC primaryG06F40/53
  • WIPO fieldComputer technology
  • WIPO sectorElectrical engineering

Abstract

A computerized system for transforming an input string includes a dictionary with tokens and associated scores. A chart parser generates a chart parse of the input string by, for each position within the input string, (i) identifying a string of at least one consecutive character in the input string that begins at that position and matches one of the tokens and (ii) unless the identified string is a single character matching the start character for another entry in the chart parse, creating an entry corresponding to the identified string. A partition selection module determines a selected partition of the input string. The selected partition includes an array of tokens selected from the chart parse such that their concatenation matches the input string. The selected partition is a minimum score partition, where the score is based on a sum of the tokens' associated scores from the dictionary.

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.