Patent · US Active

Document structure extraction using machine learning

US11769072B2 · kind B2 · utility

Cited by	2
References	5
Claims	16
Family size	0

Assignee

Adobe Inc. · US

Inventor

Michael Kraley · Lexington, US

Key dates

Filing date	Aug 8, 2016
Grant date	Sep 26, 2023
Priority date	—
Expiry date	Aug 16, 2040

Classification

Technology area (CPC G)	Physics
CPC primary	G06V30/414
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
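
The workflow in the abstract (build feature vectors from tagged parts, train a model, predict categories for untagged parts) can be illustrated with a minimal sketch. The abstract permits any suitable machine learning algorithm; the categorical naive Bayes classifier below is chosen only for illustration and is not taken from the patent. The training data, the "size" feature, and the names `train`, `predict`, and `tag_document` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged corpus: each document part is a
# (category, feature-value pairs) record, echoing the abstract's examples.
TRAINING_PARTS = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Times", "alignment": "centered", "size": "large"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "normal"}),
    ("body paragraph", {"font": "Times", "alignment": "justified", "size": "normal"}),
    ("body paragraph", {"font": "Arial", "alignment": "left", "size": "normal"}),
]


def train(parts):
    """Count categories and feature-value pairs per category."""
    category_counts = Counter()
    pair_counts = defaultdict(Counter)  # category -> (feature, value) -> count
    values_seen = defaultdict(set)      # feature -> distinct values observed
    for category, features in parts:
        category_counts[category] += 1
        for name, value in features.items():
            pair_counts[category][(name, value)] += 1
            values_seen[name].add(value)
    return category_counts, pair_counts, values_seen


def predict(model, features):
    """Return the most probable category for one untagged document part."""
    category_counts, pair_counts, values_seen = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # log prior of the category
        for name, value in features.items():
            # Laplace (add-one) smoothing so unseen values never zero out a category.
            n_values = len(values_seen.get(name, set()) | {value})
            seen = pair_counts[category][(name, value)]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def tag_document(model, untagged_parts):
    """Attach a predicted tag to each part, yielding simple tag metadata."""
    return [{"tag": predict(model, part), "features": part} for part in untagged_parts]


model = train(TRAINING_PARTS)
untagged = [
    {"font": "Arial", "alignment": "centered", "size": "large"},
    {"font": "Times", "alignment": "left", "size": "normal"},
]
tagged = tag_document(model, untagged)
```

In this sketch the list returned by `tag_document` plays the role of the generated tag metadata. A practical system would presumably use richer features and a larger corpus than this toy data.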

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.