Patent · US Active

Methods and systems for efficient and accurate text extraction from unstructured documents

US10360294B2 · kind B2 · utility

Cited by: 2
References: 4
Claims: 25
Family size: 0

Assignee

Inventors

Key dates

Filing date: Apr 22, 2016
Grant date: Jul 23, 2019
Priority date
Expiry date: Apr 22, 2036

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06V30/414
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

According to one aspect, the subject matter described herein includes a method for extracting text from unstructured documents. The method includes creating a spatial index for storing information about words on a page of a document to be analyzed; using the spatial index to detect white space that indicates boundaries of columns within the page, aggregate words into lines, identify lines that are part of a header or footer of the page, and identify lines that are part of a table or a figure within the page; and joining lines together to generate continuous text flows. In one embodiment, the continuous text is divided into sections. In one embodiment, references within the document are identified. In one embodiment, inline citations within the document body are replaced with the corresponding reference information, or portions thereof.
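
The pipeline the abstract outlines (index words spatially, find column-separating white space, aggregate words into lines, join lines into a text flow) can be illustrated with a minimal Python sketch. This is not the patented implementation: the abstract does not say what kind of spatial index is used, so the uniform grid here is an assumption, as are the names `Word`, `GridIndex`, `find_gutter`, `group_lines`, and `extract_flow` and all thresholds. It handles at most two columns and omits header/footer and table/figure detection entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its bounding box (hypothetical structure, page coordinates)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class GridIndex:
    """Uniform-grid spatial index (an assumed choice): cell -> overlapping words."""

    def __init__(self, cell=50.0):
        self.cell = cell
        self.cells = defaultdict(list)

    def _span(self, lo, hi):
        return range(int(lo // self.cell), int(hi // self.cell) + 1)

    def insert(self, w):
        for cx in self._span(w.x0, w.x1):
            for cy in self._span(w.y0, w.y1):
                self.cells[(cx, cy)].append(w)

    def query(self, x0, y0, x1, y1):
        """Return the words whose boxes intersect the query rectangle."""
        found = {}  # dict keys deduplicate words stored in several cells
        for cx in self._span(x0, x1):
            for cy in self._span(y0, y1):
                for w in self.cells.get((cx, cy), []):
                    if w.x0 < x1 and w.x1 > x0 and w.y0 < y1 and w.y1 > y0:
                        found[w] = None
        return list(found)


def find_gutter(index, page_w, page_h, step=5.0, min_width=15.0):
    """Widest full-height empty band lying between text, or None."""
    best, run_start, seen_text = None, None, False
    x = 0.0
    while x < page_w:
        if not index.query(x, 0.0, x + step, page_h):
            if seen_text and run_start is None:
                run_start = x  # an empty band begins after some text
        else:
            if run_start is not None:
                width = x - run_start
                if width >= min_width and (best is None or width > best[1] - best[0]):
                    best = (run_start, x)
                run_start = None
            seen_text = True
        x += step
    return best  # a trailing empty band (right margin) is never closed, so it is ignored


def group_lines(words, tol=3.0):
    """Aggregate words with similar top edges into lines, ordered left to right."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        if lines and abs(lines[-1][0].y0 - w.y0) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(ln, key=lambda w: w.x0)) for ln in lines]


def extract_flow(words, page_w, page_h):
    """Join lines column by column into one continuous text flow."""
    index = GridIndex()
    for w in words:
        index.insert(w)
    gutter = find_gutter(index, page_w, page_h)
    if gutter is None:
        columns = [list(words)]
    else:
        mid = (gutter[0] + gutter[1]) / 2
        columns = [index.query(0.0, 0.0, mid, page_h),
                   index.query(mid, 0.0, page_w, page_h)]
    return " ".join(line for col in columns for line in group_lines(col))


# Made-up two-column page, 200 x 100 units.
sample_words = [
    Word("Alpha", 10, 10, 40, 20), Word("beta", 45, 10, 70, 20),
    Word("Delta", 110, 10, 150, 20), Word("gamma", 10, 30, 40, 40),
    Word("omega", 110, 30, 150, 40),
]
flow = extract_flow(sample_words, 200.0, 100.0)
print(flow)  # Alpha beta gamma Delta omega
```

On this sample, reading the page row by row would interleave the two columns ("Alpha beta Delta gamma omega"); detecting the gutter first keeps each column's lines together, which is the point of the white-space step in the abstract.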

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.