Patent · US Active

Identification of reading order text segments with a probabilistic language model

US10372821B2 · kind B2 · utility

Cited by: 5
References: 11
Claims: 16
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 17, 2017
Grant date: Aug 6, 2019
Priority date
Expiry date: Apr 16, 2037

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06V30/416
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Certain embodiments identify a correct structured reading-order sequence of text segments extracted from a file. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments, which include a first set with a first text segment and a first continuation text segment as well as a second set with the first text segment and a second continuation text segment, are provided to the probabilistic model. A score indicative of a likelihood of the set providing a correct structured reading-order sequence is obtained for each set of text segments.
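The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which kind of probabilistic language model is used, so the add-one-smoothed bigram model below is an assumption chosen for brevity. The names `train_bigram`, `splice_score`, `rank_continuations`, and `TOY_CORPUS` are hypothetical, and a three-sentence toy corpus stands in for the "large text corpus". Each candidate set (first segment plus one continuation) is scored by the mean log-probability of the word pairs that span the splice point.

```python
import math
from collections import Counter

# Toy stand-in for the large text corpus (assumption, for illustration only).
TOY_CORPUS = [
    "the model assigns a score to each candidate",
    "the reading order of text segments is recovered",
    "each candidate is scored by the language model",
]

def train_bigram(sentences):
    """Count unigrams and bigrams observed in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def splice_score(first, continuation, unigrams, bigrams, window=2):
    """Mean add-one-smoothed log-probability of the bigrams around the splice."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen words
    tokens = first.lower().split()[-window:] + continuation.lower().split()[:window]
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("-inf")
    total = sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in pairs
    )
    return total / len(pairs)

def rank_continuations(first, candidates, unigrams, bigrams):
    """Return (score, continuation) pairs, most likely continuation first."""
    scored = [(splice_score(first, c, unigrams, bigrams), c) for c in candidates]
    return sorted(scored, reverse=True)

unigrams, bigrams = train_bigram(TOY_CORPUS)
ranked = rank_continuations(
    "the reading order of text",
    ["segments is recovered", "by the language model"],
    unigrams,
    bigrams,
)
```

With this toy corpus, `ranked[0][1]` is `"segments is recovered"`, because the pair `("text", "segments")` was observed while `("text", "by")` was not. A practical system would presumably use a far larger corpus and a stronger model; this sketch only shows how one score per set of text segments can be compared.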

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.