System for tokenizing text in languages without inter-word separation
US10002128B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Oct 29, 2015 |
| Grant date | Jun 19, 2018 |
| Priority date | — |
| Expiry date | Oct 29, 2035 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/53
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computerized system for transforming an input string includes a dictionary with tokens and associated scores. A chart parser generates a chart parse of the input string by, for each position within the input string, (i) identifying a string of at least one consecutive character in the input string that begins at that position and matches one of the tokens and (ii) unless the identified string is a single character matching the start character for another entry in the chart parse, creating an entry corresponding to the identified string. A partition selection module determines a selected partition of the input string. The selected partition includes an array of tokens selected from the chart parse such that their concatenation matches the input string. The selected partition is a minimum score partition, where the score is based on a sum of the tokens' associated scores from the dictionary.
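The two stages described in the abstract (building a chart of dictionary matches, then choosing the minimum-score partition) can be illustrated with a short sketch. This is not the patented implementation. The function names `build_chart` and `min_score_partition`, the toy dictionary `EXAMPLE_SCORES`, and its scores are all hypothetical. The single-character exclusion is read here as "drop a one-character match when a longer match starts at the same position", which is only one possible interpretation of the abstract's wording. The minimum is found with ordinary dynamic programming, which the abstract does not specify.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy dictionary: token -> score (lower is better).
EXAMPLE_SCORES: Dict[str, float] = {
    "the": 1.0, "cat": 1.0, "he": 2.0, "at": 2.0,
    "t": 5.0, "c": 5.0, "theca": 4.0,
}


def build_chart(text: str, scores: Dict[str, float]) -> List[List[str]]:
    """Return chart[i]: the dictionary tokens that match text starting at i."""
    max_len = max((len(tok) for tok in scores), default=0)
    chart: List[List[str]] = []
    for i in range(len(text)):
        end_limit = min(len(text), i + max_len)
        matches = [text[i:j] for j in range(i + 1, end_limit + 1)
                   if text[i:j] in scores]
        # Assumed reading of the single-character rule: skip a one-character
        # match when a longer match begins at the same position.
        if len(matches) > 1:
            matches = [m for m in matches if len(m) > 1]
        chart.append(matches)
    return chart


def min_score_partition(
    text: str, scores: Dict[str, float]
) -> Optional[Tuple[float, List[str]]]:
    """Return (total score, tokens) for the cheapest partition, or None."""
    chart = build_chart(text, scores)
    n = len(text)
    # best[i] holds the cheapest partition of text[i:], filled right to left.
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[n] = (0.0, [])
    for i in range(n - 1, -1, -1):
        for tok in chart[i]:
            rest = best[i + len(tok)]
            if rest is None:
                continue  # the remainder after this token cannot be covered
            total = rest[0] + scores[tok]
            current = best[i]
            if current is None or total < current[0]:
                best[i] = (total, [tok] + rest[1])
    return best[0]


print(min_score_partition("thecat", EXAMPLE_SCORES))
```

With the toy dictionary, the input `"thecat"` is partitioned into `the` and `cat` with a total score of 2.0. An input that the chart entries cannot cover yields `None`. Under this reading of the exclusion rule, pruning a single-character entry can remove the only valid partition for some inputs, so a real system would need a fallback that this sketch omits.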
Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.