Methods and systems for efficient and accurate text extraction from unstructured documents
US10360294B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Apr 22, 2016 |
| Grant date | Jul 23, 2019 |
| Priority date | — |
| Expiry date | Apr 22, 2036 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06V30/414
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
According to one aspect, the subject matter described herein includes a method for extracting text from unstructured documents. The method includes creating a spatial index for storing information about words on a page of a document to be analyzed; using the spatial index to detect white space that indicates boundaries of columns within the page, aggregate words into lines, identify lines that are part of a header or footer of the page, and identify lines that are part of a table or a figure within the page; and joining lines together to generate continuous text flows. In one embodiment, the continuous text is divided into sections. In one embodiment, references within the document are identified. In one embodiment, inline citations within the document body are replaced with the corresponding reference information, or portions thereof.
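The abstract does not say how the spatial index is built or queried, so the following is only a minimal sketch of the general idea, not the patented method. It assumes a uniform grid as the index (the patent may use a different structure), words given as axis-aligned bounding boxes, and a single column boundary found as the widest full-height empty strip. The names `Word`, `GridIndex`, `find_column_gap`, and `lines_in_region`, and the thresholds `cell`, `min_gap`, `step`, and `tol`, are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


class GridIndex:
    """Hypothetical uniform-grid spatial index over word bounding boxes."""

    def __init__(self, cell=50.0):
        self.cell = cell
        self.cells = defaultdict(list)
        self.words = []

    def _span(self, lo, hi):
        # Grid cell indices covered by the interval [lo, hi].
        return range(int(lo // self.cell), int(hi // self.cell) + 1)

    def insert(self, w):
        self.words.append(w)
        for cx in self._span(w.x0, w.x1):
            for cy in self._span(w.y0, w.y1):
                self.cells[(cx, cy)].append(w)

    def query(self, x0, y0, x1, y1):
        """Return the set of words whose boxes overlap the rectangle."""
        found = set()
        for cx in self._span(x0, x1):
            for cy in self._span(y0, y1):
                for w in self.cells.get((cx, cy), ()):
                    if w.x0 < x1 and w.x1 > x0 and w.y0 < y1 and w.y1 > y0:
                        found.add(w)
        return found


def find_column_gap(index, page_h, min_gap=15.0, step=5.0):
    """Widest empty full-height strip between the outermost words, or None."""
    if not index.words:
        return None
    x = min(w.x0 for w in index.words)
    right = max(w.x1 for w in index.words)
    best, start = None, None
    while x < right:
        empty = not index.query(x, 0, x + step, page_h)
        if empty and start is None:
            start = x  # an empty run begins
        elif not empty and start is not None:
            if best is None or x - start > best[1] - best[0]:
                best = (start, x)  # the run ended at a word; keep the widest
            start = None
        x += step
    if best is not None and best[1] - best[0] >= min_gap:
        return best
    return None


def lines_in_region(index, x0, y0, x1, y1, tol=3.0):
    """Group the words in a region into text lines by similar top edge."""
    words = sorted(index.query(x0, y0, x1, y1), key=lambda w: (w.y0, w.x0))
    lines = []
    for w in words:
        if lines and abs(lines[-1][0].y0 - w.y0) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(ln, key=lambda w: w.x0))
            for ln in lines]


# Usage on a made-up two-column page, 200 units wide and 100 units tall.
index = GridIndex()
for w in [Word("Spatial", 10, 10, 45, 20), Word("index", 50, 10, 90, 20),
          Word("stores", 10, 25, 40, 35), Word("words", 45, 25, 80, 35),
          Word("Columns", 110, 10, 150, 20), Word("split", 155, 10, 190, 20),
          Word("here", 110, 25, 135, 35)]:
    index.insert(w)

gap = find_column_gap(index, page_h=100)
left = lines_in_region(index, 0, 0, gap[0], 100)
right = lines_in_region(index, gap[1], 0, 200, 100)
flow = " ".join(left + right)  # join lines column by column into one flow
print(gap, left, right, flow)
```

Reading the left column in full before the right one is what turns the page layout into a continuous text flow. Header, footer, table, and figure detection would need additional rules that the abstract does not describe, so they are left out of this sketch.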
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.