System and method to generate a labeled dataset for training an entity detection system
US11681944B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 9, 2018 |
| Grant date | Jun 20, 2023 |
| Priority date | — |
| Expiry date | Nov 12, 2041 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/295
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
“Semi-supervised” machine learning relies on less human input than a supervised algorithm to train a machine learning algorithm to perform named entity recognition (NER). Starting with a known entity value or known pattern value for a specific entity type, phrases in a training data corpus are identified that include the known entity value. Context-value patterns are generated to match selected phrases that include the known entity value. One or more context-value patterns may be validated based on human input. The validated patterns identify additional entity values. A subset of the additional entity values may also be validated based on human input. Occurrences of validated entity values may be labeled in the training corpus. Sample phrases from the labeled training dataset may be extracted to form a reduced-size training set for a supervised machine learning model, which may be further used in production to label data for any named entity recognition application.
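The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy corpus, the `DRUG` entity type, the one-token context window, and every function name (`generate_patterns`, `apply_patterns`, `label_corpus`, `human_approves`) are assumptions made for illustration. Human validation is stood in for by a function that approves everything, and the "reduced-size training set" is simplified to the sentences that contain at least one labeled entity.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", sentence)

def generate_patterns(corpus, known_value, window=1):
    """Collect (left context, right context) token pairs around the known value."""
    patterns = set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == known_value:
                left = tuple(tokens[max(0, i - window):i])
                right = tuple(tokens[i + 1:i + 1 + window])
                if left or right:  # a pattern with no context would match every token
                    patterns.add((left, right))
    return patterns

def apply_patterns(corpus, patterns):
    """Return every token that appears between a pattern's left and right context."""
    found = set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for left, right in patterns:
            for i in range(len(left), len(tokens) - len(right)):
                if (tuple(tokens[i - len(left):i]) == left
                        and tuple(tokens[i + 1:i + 1 + len(right)]) == right):
                    found.add(tokens[i])
    return found

def label_corpus(corpus, values, entity_type):
    """Tag each token with the entity type if it is a validated value, else 'O'."""
    return [[(t, entity_type if t in values else "O") for t in tokenize(s)]
            for s in corpus]

def human_approves(item):
    """Hypothetical stand-in for the human validation step; approves everything."""
    return True

# Toy corpus and seed value (illustrative only).
corpus = [
    "The patient was prescribed aspirin for headaches.",
    "He was prescribed ibuprofen for back pain.",
    "She was prescribed metformin for diabetes.",
    "The clinic opens at nine.",
]
seed = "aspirin"

# 1. Generate context-value patterns from phrases containing the seed, then validate.
patterns = {p for p in generate_patterns(corpus, seed) if human_approves(p)}
# 2. Use validated patterns to find additional entity values, then validate a subset.
candidates = apply_patterns(corpus, patterns) - {seed}
validated_values = {seed} | {v for v in candidates if human_approves(v)}
# 3. Label occurrences of validated values in the corpus.
labeled = label_corpus(corpus, validated_values, "DRUG")
# 4. Keep only sentences with at least one entity as a reduced training set.
training_set = [s for s in labeled if any(tag != "O" for _, tag in s)]
```

In this toy run the single seed `aspirin` yields one pattern (`prescribed _ for`), which in turn surfaces `ibuprofen` and `metformin`. A real system would presumably replace `human_approves` with an actual review step and use richer patterns than exact token matches.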
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.