Heuristic domain targeted table detection and extraction technique
US10706228B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 1, 2017 |
| Grant date | Jul 7, 2020 |
| Priority date | — |
| Expiry date | Aug 4, 2038 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06V30/10
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method, system, and apparatus are provided for processing tables embedded within documents wherein a first table header is detected by using semantic groupings of table header terms to identify a minimum number of table header terms in a scanned line of a text document; a potential data zone is extracted by applying white space correlation analysis to a portion of the text document that is adjacent to the first table header; one or more data zone columns from the potential data zone are grouped and aligned with a corresponding header column in the first table header to form a candidate table; data cleansing is performed on the candidate table; and then one or more columns of the candidate table are evaluated using natural language processing to apply a specified table analysis.
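The first three steps of the abstract (header detection, data zone extraction, and column alignment) can be illustrated with a minimal Python sketch. It is not the patented implementation. The term groups, the `min_terms` threshold, the blank-line rule for ending the data zone, and every function name are hypothetical, and intersecting blank character columns across lines is a simplified stand-in for the white space correlation analysis the abstract names. Data cleansing and the natural language processing step are not covered.

```python
import re

# Hypothetical semantic groupings of header terms (illustrative, not from the patent).
HEADER_GROUPS = {
    "item": {"item", "product", "description"},
    "quantity": {"qty", "quantity", "units"},
    "price": {"price", "cost", "amount"},
}


def count_header_groups(line):
    """Count the semantic groups that have at least one term on the line."""
    words = set(re.findall(r"[a-z]+", line.lower()))
    return sum(1 for terms in HEADER_GROUPS.values() if words & terms)


def find_header(lines, min_terms=2):
    """Return the index of the first line matching >= min_terms groups, else None."""
    for i, line in enumerate(lines):
        if count_header_groups(line) >= min_terms:
            return i
    return None


def shared_gap_columns(lines, width):
    """Character positions that are blank in every line (assumed column gaps)."""
    padded = [line.ljust(width) for line in lines]
    return {c for c in range(width) if all(p[c] == " " for p in padded)}


def split_on_gaps(line, gaps, width):
    """Split one line into cells wherever a shared blank column occurs."""
    cells, current = [], []
    for c, ch in enumerate(line.ljust(width)):
        if c in gaps:
            if current:
                cells.append("".join(current).strip())
                current = []
        else:
            current.append(ch)
    if current:
        cells.append("".join(current).strip())
    return cells


def extract_candidate_table(lines, min_terms=2):
    """Header row plus adjacent non-blank lines, split into aligned cells."""
    h = find_header(lines, min_terms)
    if h is None:
        return None
    zone = []
    for line in lines[h + 1:]:
        if not line.strip():  # assumption: a blank line ends the data zone
            break
        zone.append(line)
    block = [lines[h]] + zone
    width = max(len(line) for line in block)
    gaps = shared_gap_columns(block, width)
    return [split_on_gaps(line, gaps, width) for line in block]


# Small fixed-width sample built with format() so the spacing is explicit.
row = "{:<12}{:<6}{}".format
sample = [
    "Order summary for March",
    row("Item", "Qty", "Price"),
    row("Widget", "2", "3.50"),
    row("Gadget", "10", "12.00"),
    "",
    "Thank you",
]
candidate = extract_candidate_table(sample)
```

Here `candidate` holds the header cells followed by one list of cells per data row. Requiring blank columns to be shared by every line keeps cells with internal spaces intact, but it fails on ragged or proportionally spaced text, which is presumably where a genuine correlation measure would be needed.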
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.