Computerized recognition and extraction of tables in digitized documents
US11182604B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Nov 26, 2019 |
| Grant date | Nov 23, 2021 |
| Priority date | — |
| Expiry date | Jun 10, 2040 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06V30/416
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Information contained in tables in a digitized document is extracted by retrieving table layout data regarding bounding boxes, each of which is auto-generated by the system and/or (re)generated by a user on the digitized image of a sample document. A row template is used to identify a first table by automatically scanning within the document. Upon detecting a possible row in the input image, a Row Possibility Confidence Value (RPCV) is generated that indicates the likelihood that the possible row corresponds to an actual row in the first table. The possible row is regarded as an actual row if the RPCV exceeds a predetermined threshold value. For repeated tables in a document, only the first table needs to be identified via bounding boxes. Related tables can also be linked so that linked data can be extracted to a structured file. In addition, only the primary column in a readable, existing table header is required to extract table values across columns.
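The thresholding step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the RPCV is computed, so the scoring rule here (the share of template columns whose cell text matches an expected pattern) is purely an assumption for illustration, as are the names `ColumnSpec`, `rpcv`, `extract_rows`, the sample data, and the 0.7 threshold.

```python
import re
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    """Hypothetical row-template column: a name plus a regex its cells should match."""
    name: str
    pattern: str


def rpcv(cells, template):
    """Illustrative confidence in [0, 1]: share of template columns whose cell matches.

    This scoring rule is an assumption, not the method claimed in the patent.
    """
    if not template or len(cells) != len(template):
        return 0.0
    hits = sum(
        1
        for cell, col in zip(cells, template)
        if re.fullmatch(col.pattern, cell.strip())
    )
    return hits / len(template)


def extract_rows(candidates, template, threshold=0.7):
    """Treat a possible row as an actual row only if its score exceeds the threshold."""
    return [cells for cells in candidates if rpcv(cells, template) > threshold]


# Hypothetical template for a three-column table: date, description, amount.
template = [
    ColumnSpec("date", r"\d{2}/\d{2}/\d{4}"),
    ColumnSpec("description", r".+"),
    ColumnSpec("amount", r"-?\d+\.\d{2}"),
]

# Possible rows found while scanning; the second is a page footer, not a table row.
candidates = [
    ["01/02/2020", "Widget", "19.99"],
    ["Page 1 of 3", "", ""],
    ["01/03/2020", "Gadget", "5.00"],
]

rows = extract_rows(candidates, template)
```

With these inputs, the footer line scores 0.0 and is discarded, while the two data rows score 1.0 and are kept. A real system would presumably also weigh geometric cues from the bounding boxes, which this sketch leaves out.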
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.