Patent · US Active

Document structure extraction using machine learning

US11769072B2 · kind B2 · utility

Cited by: 2
References: 5
Claims: 16
Family size: 0

Assignee

Inventor

Key dates

Filing date: Aug 8, 2016
Grant date: Sep 26, 2023
Priority date
Expiry date: Aug 16, 2040

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06V30/414
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
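
The workflow in the abstract can be illustrated with a minimal sketch. The abstract leaves the choice of learning algorithm open, so this example assumes a categorical naive Bayes classifier with add-one smoothing purely for illustration; it is not the method claimed in the patent. The training data, the feature names ("font", "alignment", "size"), and the functions train, predict, and tag_document are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged corpus: each entry is a document part feature vector,
# i.e. a category paired with feature-value pairs. Invented for illustration.
TRAINING = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "normal"}),
    ("body paragraph", {"font": "Times", "alignment": "justified", "size": "normal"}),
    ("body paragraph", {"font": "Arial", "alignment": "left", "size": "normal"}),
]


def train(examples):
    """Count categories and (feature, value) pairs per category."""
    category_counts = Counter()
    pair_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for category, features in examples:
        category_counts[category] += 1
        for name, value in features.items():
            pair_counts[category][(name, value)] += 1
            values_seen[name].add(value)
    return category_counts, pair_counts, values_seen


def predict(model, features):
    """Return the most probable category for one untagged document part."""
    category_counts, pair_counts, values_seen = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        # Log prior plus smoothed log likelihood of each feature-value pair.
        score = math.log(count / total)
        for name, value in features.items():
            n_values = len(values_seen[name] | {value})
            seen = pair_counts[category][(name, value)]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def tag_document(model, parts):
    """Generate tag metadata (one predicted category per part)."""
    return [{"part": i, "tag": predict(model, f)} for i, f in enumerate(parts)]


model = train(TRAINING)
untagged = [
    {"font": "Arial", "alignment": "centered", "size": "large"},
    {"font": "Times", "alignment": "left", "size": "normal"},
]
tags = tag_document(model, untagged)
```

Here tags holds one entry per document part, each carrying the predicted category. Any other classifier that accepts categorical feature-value pairs could be substituted for the naive Bayes scoring in predict without changing the surrounding train-then-tag flow.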

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.