Patent · US Active

Automatic extraction of a training corpus for a data classifier based on machine learning algorithms

US11409779B2 · kind B2 · utility

Cited by: 0
References: 1
Claims: 15
Family size: 0

Assignee

Inventors

Key dates

Filing date: May 11, 2018
Grant date: Aug 9, 2022
Priority date
Expiry date: Jan 27, 2039

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06N5/046
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

An iterative classifier for unsegmented electronic documents is based on machine learning algorithms. The textual strings in the electronic document are segmented using a composite dictionary that combines a conventional dictionary and an adaptive dictionary developed based on the context and nature of the electronic document. The classifier is built using a corpus of training and testing samples automatically extracted from the electronic document by detecting signatures for a set of pre-established classes for the textual strings. The classifier is further iteratively improved by automatically expanding the corpus of training and testing samples in real-time when textual strings in new electronic documents are processed and classified.
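
The two steps the abstract describes, segmenting with a composite dictionary and pulling labeled samples out of the document by detecting class signatures, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the dictionary contents, the greedy longest-match segmentation, the trailing `#dddd` signature format, and the function names `segment` and `extract_corpus`.

```python
import re

# Hypothetical dictionaries; the abstract does not list their contents.
CONVENTIONAL = {"steel", "pipe", "copper", "wire", "frozen", "fish"}
ADAPTIVE = {"kg", "pcs"}  # assumed domain terms derived from the documents
COMPOSITE = CONVENTIONAL | ADAPTIVE

# Assumed signature format: a trailing "#" followed by a 4-digit class code.
SIGNATURE = re.compile(r"#(\d{4})$")


def segment(text, dictionary):
    """Split an unsegmented string by greedy longest dictionary match."""
    text = text.lower()
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


def extract_corpus(strings, dictionary=COMPOSITE):
    """Separate strings that carry a class signature from those that do not.

    Returns (labeled, unlabeled): labeled is a list of (tokens, class_code)
    pairs usable as training/testing samples; unlabeled is a list of token
    lists still to be classified.
    """
    labeled, unlabeled = [], []
    for s in strings:
        match = SIGNATURE.search(s)
        if match:
            body = s[:match.start()]
            labeled.append((segment(body, dictionary), match.group(1)))
        else:
            unlabeled.append(segment(s, dictionary))
    return labeled, unlabeled


labeled, unlabeled = extract_corpus(
    ["steelpipe#7304", "copperwire", "frozenfishkg#0303"]
)
```

In this sketch, strings that carry a signature become labeled samples without manual annotation, while the remaining strings are only segmented. The iterative expansion mentioned in the abstract would correspond to calling `extract_corpus` again on new documents and appending the newly labeled pairs to the existing corpus before retraining.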

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.