Clustering analysis for deduplication of training set samples for machine learning based computer threat analysis
US11620471B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Nov 1, 2017 |
| Grant date | Apr 4, 2023 |
| Priority date | — |
| Expiry date | Dec 22, 2040 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F21/564
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method, a system, and a computer program product for performing analysis of data to detect presence of malicious code are disclosed. Reduced dimensionality vectors are generated from a plurality of original dimensionality vectors representing features in a plurality of samples. The reduced dimensionality vectors have a lower dimensionality than an original dimensionality of the plurality of original dimensionality vectors. A first plurality of clusters is determined by applying a first clustering algorithm to the reduced dimensionality vectors. A second plurality of clusters is determined by applying a second clustering algorithm to one or more clusters in the first plurality of clusters using the original dimensionality. An exemplar for a cluster in the second plurality of clusters is added to a training set, which is used to train a machine learning model for identifying a file containing malicious code.
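The two-stage pipeline described in the abstract can be illustrated with a short sketch. The abstract does not name the dimensionality reduction method, either clustering algorithm, or how an exemplar is chosen, so everything concrete here is an assumption made for illustration: Gaussian random projection for the reduction, a plain k-means for the first (coarse) clustering, a distance-threshold "leader" clustering for the second (fine) clustering in the original dimensionality, and the medoid as the exemplar. All function names, parameters, and the synthetic data are hypothetical and are not taken from the patent.

```python
import numpy as np

def reduce_dimensionality(X, k, rng):
    # Assumed technique: Gaussian random projection down to k dimensions.
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

def kmeans(Z, n_clusters, rng, n_iter=20):
    # Assumed first clustering algorithm: basic Lloyd-style k-means.
    centers = Z[rng.choice(len(Z), n_clusters, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

def leader_clustering(X, radius):
    # Assumed second clustering algorithm: a sample joins the first leader
    # within `radius` of it, otherwise it becomes a new leader.
    leaders = []
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        for j, lead in enumerate(leaders):
            if np.linalg.norm(x - X[lead]) <= radius:
                labels[i] = j
                break
        else:
            leaders.append(i)
            labels[i] = len(leaders) - 1
    return labels

def medoid(X, idx):
    # Assumed exemplar choice: the member with the smallest total distance
    # to the other members of its cluster.
    sub = X[idx]
    d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    return int(idx[d.sum(axis=1).argmin()])

def select_exemplars(X, reduced_dim, n_coarse, radius, seed=0):
    rng = np.random.default_rng(seed)
    Z = reduce_dimensionality(X, reduced_dim, rng)   # reduced vectors
    coarse = kmeans(Z, n_coarse, rng)                # first plurality of clusters
    exemplars = []
    for c in np.unique(coarse):
        idx = np.flatnonzero(coarse == c)
        fine = leader_clustering(X[idx], radius)     # original dimensionality
        for f in np.unique(fine):
            exemplars.append(medoid(X, idx[fine == f]))
    return sorted(exemplars)

# Synthetic demo: 3 groups of 20 near-duplicate samples in 50 dimensions.
data_rng = np.random.default_rng(42)
base = 10.0 * np.eye(3, 50)
groups = np.repeat(np.arange(3), 20)
X = base[groups] + 0.01 * data_rng.normal(size=(60, 50))

exemplars = select_exemplars(X, reduced_dim=8, n_coarse=2, radius=1.0)
training_set = X[exemplars]  # deduplicated samples used to train a model
```

In this sketch the 60 near-duplicate samples collapse to a handful of exemplars, at least one per underlying group. The coarse pass in the reduced space keeps the pairwise distance work of the fine pass confined to one coarse cluster at a time, which is the apparent motivation for clustering twice.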
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.