Generating a consistently labeled training dataset by automatically generating and displaying a set of most similar previously-labeled texts and their previously assigned labels for each text that is being labeled for the training dataset
US10789533B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Event | Date |
| --- | --- |
| Filing date | Jul 26, 2017 |
| Grant date | Sep 29, 2020 |
| Priority date | — |
| Expiry date | Jun 21, 2039 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N20/00
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Technology for generating a consistently labeled training dataset. For each of multiple previously labeled texts, a distance between that text and a current text to be labeled is generated in two steps. First, the list of tokens for the previously labeled text is compared to the list of tokens for the current text to determine an overlap value equal to the number of tokens that match between the two lists. Second, the overlap value is used to calculate a distance between the previously labeled text and the current text that is inversely correlated to the overlap value. The previously labeled texts most similar to the current text are identified as those having the shortest distances to it, and are displayed with their previously assigned labels in a label selection user interface.
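The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation; several details are assumptions made only for illustration. The abstract does not say how texts are tokenized, how repeated tokens are counted, or which formula turns overlap into distance. Here, tokenization is lowercase whitespace splitting, matches are counted as a multiset intersection, and the distance is `1 / (1 + overlap)`, one of many functions that is inversely correlated with the overlap. The names `tokenize`, `overlap`, `distance`, and `most_similar` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization; the abstract does not
    # specify a tokenizer.
    return text.lower().split()


def overlap(tokens_a: List[str], tokens_b: List[str]) -> int:
    # Number of matching tokens, counted as a multiset intersection
    # (assumption: a repeated token matches at most as often as it
    # appears in both lists).
    return sum((Counter(tokens_a) & Counter(tokens_b)).values())


def distance(tokens_a: List[str], tokens_b: List[str]) -> float:
    # Assumption: 1 / (1 + overlap) is one simple choice of a distance
    # that shrinks as the overlap grows.
    return 1.0 / (1.0 + overlap(tokens_a, tokens_b))


def most_similar(
    current_text: str,
    labeled_texts: List[Tuple[str, str]],
    k: int = 3,
) -> List[Tuple[float, str, str]]:
    # Score every previously labeled (text, label) pair against the
    # current text and return the k pairs with the shortest distances.
    current_tokens = tokenize(current_text)
    scored = [
        (distance(tokenize(text), current_tokens), text, label)
        for text, label in labeled_texts
    ]
    scored.sort(key=lambda item: item[0])  # stable sort, shortest first
    return scored[:k]
```

A labeling interface could call `most_similar` for each new text and show the returned texts next to their labels, so the person labeling sees how comparable texts were labeled before choosing a label.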
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.