Automatic extraction of a training corpus for a data classifier based on machine learning algorithms
US11409779B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | May 11, 2018 |
| Grant date | Aug 9, 2022 |
| Priority date | — |
| Expiry date | Jan 27, 2039 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N5/046
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
An iterative classifier for unsegmented electronic documents is based on machine learning algorithms. The textual strings in the electronic document are segmented using a composite dictionary that combines a conventional dictionary and an adaptive dictionary developed based on the context and nature of the electronic document. The classifier is built using a corpus of training and testing samples automatically extracted from the electronic document by detecting signatures for a set of pre-established classes for the textual strings. The classifier is further iteratively improved by automatically expanding the corpus of training and testing samples in real time when textual strings in new electronic documents are processed and classified.
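The first two steps in the abstract, segmentation with a composite dictionary and automatic corpus extraction by signature detection, can be illustrated with a minimal Python sketch. It is not the patented implementation. The word lists, the class names, the regex signatures, and the helper names `segment` and `extract_corpus` are all hypothetical, and greedy longest-match segmentation is only one plausible way to apply a dictionary to unsegmented text.

```python
import re

# Hypothetical dictionaries: neither word list comes from the patent.
CONVENTIONAL = {"steel", "pipe", "frozen", "fish", "machine", "parts"}
ADAPTIVE = {"hs", "code", "kg"}  # domain terms assumed to be derived from the documents

# Hypothetical class signatures: a regex whose match labels the string.
SIGNATURES = {
    "metal": re.compile(r"\bsteel\b"),
    "food": re.compile(r"\bfish\b"),
}


def segment(text, dictionary):
    """Greedy longest-match segmentation of an unsegmented string."""
    words, i = [], 0
    max_len = max(map(len, dictionary))
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


def extract_corpus(strings, dictionary, signatures):
    """Label strings whose segmented form matches exactly one signature."""
    corpus, unlabeled = [], []
    for s in strings:
        segmented = " ".join(segment(s.lower(), dictionary))
        labels = [c for c, pat in signatures.items() if pat.search(segmented)]
        if len(labels) == 1:
            corpus.append((segmented, labels[0]))
        else:
            unlabeled.append(segmented)
    return corpus, unlabeled


# The composite dictionary is the union of the two word lists.
composite = CONVENTIONAL | ADAPTIVE
corpus, unlabeled = extract_corpus(
    ["steelpipe", "frozenfish", "machineparts"], composite, SIGNATURES
)
```

In this sketch `corpus` holds the automatically labeled samples and `unlabeled` holds the strings that carry no signature. Under the iterative scheme the abstract describes, a classifier trained on `corpus` would then classify the remaining strings, and those results would feed back into the corpus; that loop is left out here.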
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.