Patent · US Active

Dictionary based deduplication of training set samples for machine learning based computer threat analysis

US11373065B2 · kind B2 · utility

Cited by: 1
References: 0
Claims: 16
Family size: 0

Assignee

Inventor

Key dates

Filing date: Jan 17, 2018
Grant date: Jun 28, 2022
Priority date
Expiry date: Sep 19, 2040

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06N20/20
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Presence of malicious code can be identified in one or more data samples. A feature set extracted from a sample is vectorized to generate a sparse vector. A reduced dimension vector representing the sparse vector can be generated. A binary representation vector of the reduced dimension vector can be created by converting each value of a plurality of values in the reduced dimension vector to a binary representation. The binary representation vector can be added as a new element in a dictionary structure if it is not equal to an existing element in the dictionary structure. A training set for use in training a machine learning model can be created to include one vector whose binary representation corresponds to each of a plurality of elements in the dictionary structure.
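
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify how features are vectorized, how dimensionality is reduced, or how values are converted to binary, so the sketch assumes feature hashing, a seeded Gaussian random projection, and a sign threshold respectively. All names and constants (SPARSE_DIM, REDUCED_DIM, vectorize, make_projection, reduce_dim, binarize, deduplicate) and the sample feature strings are hypothetical.

```python
import hashlib
import random

SPARSE_DIM = 1024   # assumed size of the sparse feature space
REDUCED_DIM = 16    # assumed size of the reduced dimension vector


def vectorize(features):
    """Hash each feature string to an index (assumed scheme); return {index: count}."""
    sparse = {}
    for feat in features:
        digest = hashlib.sha256(feat.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % SPARSE_DIM
        sparse[idx] = sparse.get(idx, 0.0) + 1.0
    return sparse


def make_projection(seed=0):
    """Build a fixed Gaussian random projection matrix (assumed reduction method)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(SPARSE_DIM)]
            for _ in range(REDUCED_DIM)]


def reduce_dim(sparse, projection):
    """Project the sparse vector down to REDUCED_DIM values."""
    indices = sorted(sparse)  # fixed order keeps the float sums reproducible
    return [sum(row[i] * sparse[i] for i in indices) for row in projection]


def binarize(reduced):
    """Convert each value to one bit by its sign (assumed binarization rule)."""
    return tuple(1 if x > 0.0 else 0 for x in reduced)


def deduplicate(samples, projection):
    """Keep one reduced vector per distinct binary representation."""
    dictionary = {}
    for features in samples:
        reduced = reduce_dim(vectorize(features), projection)
        key = binarize(reduced)
        if key not in dictionary:  # add only if no equal element exists
            dictionary[key] = reduced
    return dictionary


# Hypothetical feature sets; the second repeats the first in a different order.
samples = [
    ["import:kernel32.dll", "section:.text", "entropy:high"],
    ["section:.text", "entropy:high", "import:kernel32.dll"],
    ["import:ws2_32.dll", "section:.rsrc"],
]
projection = make_projection()
dictionary = deduplicate(samples, projection)
training_set = list(dictionary.values())  # one vector per dictionary element
```

In this sketch the first two samples share the same feature set, so they map to the same binary representation and contribute only one vector to the training set. Using the binary tuple as a dictionary key makes the "not equal to an existing element" check a constant-time lookup on average.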

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.