Dictionary based deduplication of training set samples for machine learning based computer threat analysis
US11373065B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Jan 17, 2018 |
| Grant date | Jun 28, 2022 |
| Priority date | — |
| Expiry date | Sep 19, 2040 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N20/20
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Presence of malicious code can be identified in one or more data samples. A feature set extracted from a sample is vectorized to generate a sparse vector. A reduced dimension vector representing the sparse vector can be generated. A binary representation vector of the reduced dimension vector can be created by converting each value of a plurality of values in the reduced dimension vector to a binary representation. The binary representation vector can be added as a new element in a dictionary structure if the binary representation vector is not equal to an existing element in the dictionary structure. A training set for use in training a machine learning model can be created to include one vector whose binary representation corresponds to each of a plurality of elements in the dictionary structure.
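The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name a vectorization scheme, a dimension reduction method, or a binarization rule, so feature hashing, a Gaussian random projection, and sign thresholding are assumptions chosen for illustration. All function names, `dim`, `reduced_dim`, and the sample data are hypothetical.

```python
import hashlib
import random


def vectorize(features, dim=1024):
    """Hash each feature string into a sparse {index: count} vector (assumed scheme)."""
    sparse = {}
    for feat in features:
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % dim
        sparse[idx] = sparse.get(idx, 0.0) + 1.0
    return sparse


def make_projection(dim=1024, reduced_dim=16, seed=0):
    """Build a fixed Gaussian random projection matrix (assumed reduction method)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(reduced_dim)]


def reduce_dimension(sparse, projection):
    """Project the sparse vector down to len(projection) dense values."""
    return [sum(row[i] * v for i, v in sparse.items()) for row in projection]


def binarize(reduced):
    """Convert each value to one bit by its sign (assumed binarization rule)."""
    return tuple(1 if x > 0 else 0 for x in reduced)


def deduplicate(samples, projection, dim=1024):
    """Keep one sample per distinct binary representation vector.

    `samples` is an iterable of (name, feature_list) pairs. The returned
    dictionary maps each binary vector to the first sample that produced it.
    """
    dictionary = {}
    for name, features in samples:
        key = binarize(reduce_dimension(vectorize(features, dim), projection))
        if key not in dictionary:  # add only if no equal element exists
            dictionary[key] = name
    return dictionary


if __name__ == "__main__":
    projection = make_projection()
    samples = [
        ("a", ["import:kernel32", "section:.text", "str:cmd.exe"]),
        ("b", ["import:kernel32", "section:.text", "str:cmd.exe"]),  # duplicate of "a"
        ("c", ["import:ws2_32", "section:.rsrc"]),
    ]
    training_set = deduplicate(samples, projection)
    print(sorted(training_set.values()))
```

Samples whose feature sets produce the same binary vector collapse to a single dictionary entry, so the training set holds at most one representative per entry. With a sign-based binarization, near-duplicates may also collapse, and how aggressively they do depends on the assumed `reduced_dim`.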
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.