Identification of reading order text segments with a probabilistic language model
US10372821B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Mar 17, 2017 |
| Grant date | Aug 6, 2019 |
| Priority date | — |
| Expiry date | Apr 16, 2037 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06V30/416
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Certain embodiments identify a correct structured reading-order sequence of text segments extracted from a file. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments, which include a first set with a first text segment and a first continuation text segment as well as a second set with the first text segment and a second continuation text segment, are provided to the probabilistic model. A score indicative of a likelihood of the set providing a correct structured reading-order sequence is obtained for each set of text segments.
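The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a toy add-one-smoothed bigram model in place of the patent's probabilistic language model, a three-sentence stand-in for the "large text corpus", and hypothetical names (`BigramModel`, `score_splice`, `best_continuation`) introduced only for illustration. Each set pairs the same first text segment with a different continuation, and the score is the average log-probability of the spliced phrase.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for this sketch)."""
    return text.lower().split()


class BigramModel:
    """Toy add-one-smoothed bigram model, standing in for the patent's language model."""

    def __init__(self, corpus_sentences):
        self.context_counts = Counter()  # how often each word precedes another word
        self.bigram_counts = Counter()   # how often each (previous, word) pair occurs
        vocab = set()
        for sentence in corpus_sentences:
            tokens = tokenize(sentence)
            vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, prev, word):
        """Smoothed log P(word | prev); unseen pairs still get nonzero probability."""
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + self.vocab_size
        return math.log(numerator / denominator)

    def score_splice(self, first_segment, continuation):
        """Average per-bigram log-probability of the spliced phrase."""
        tokens = tokenize(first_segment) + tokenize(continuation)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return float("-inf")
        return sum(self.log_prob(p, w) for p, w in pairs) / len(pairs)


def best_continuation(model, first_segment, candidates):
    """Score each (first segment, candidate) set and return the highest-scoring candidate."""
    scores = {c: model.score_splice(first_segment, c) for c in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical stand-in for the large text corpus.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
    "a quick brown fox runs fast",
]
model = BigramModel(corpus)
best, scores = best_continuation(
    model, "the quick brown", ["fox jumps over", "in the sun"]
)
print(best, scores)
```

Averaging over the number of bigrams keeps continuations of different lengths comparable. A production system would presumably use a much larger corpus and a stronger model, but the comparison step has the same shape: one score per set, with the highest score taken as the likely reading order.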
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.