Patent · US Active

Techniques for unsupervised learning embeddings on source code tokens from non-local contexts

US10901708B1 · kind B1 · utility

Cited by: 2
References: 2
Claims: 18
Family size: 0

Assignee

Inventors

Key dates

Filing date: Nov 23, 2018
Grant date: Jan 26, 2021
Priority date
Expiry date: Nov 23, 2038

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06N20/00
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Techniques for unsupervised learning of embeddings on source code from non-local contexts are described. Code can be processed to generate an abstract syntax tree (AST) which represents syntactic paths between tokens in the code. Once the AST(s) have been generated, the paths in the AST(s) can be crawled to identify terminals (e.g., leaf nodes in the AST) and paths between terminals can be identified. The pairs of tokens identified at the ends of each path can then be used to generate a cooccurrence matrix. For example, if X number of unique terminals are identified, a matrix of size X by X can be generated to indicate a frequency at which pairs of terminals cooccur. This cooccurrence matrix can then be used as input to existing techniques for learning vector-space embeddings, such as word2vec, GloVe, Swivel, etc.
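
The pipeline the abstract describes (parse code into an AST, collect terminals, pair terminals joined by a path, count the pairs in an X by X matrix) can be illustrated with a minimal Python sketch. This is not the patented implementation. It assumes that "terminals" are identifier and constant nodes from Python's standard `ast` module, and that a pair counts as cooccurring when the tree path between the two terminals has at most `max_path_len` edges. The function names `terminals_with_paths` and `cooccurrence` and the `max_path_len` parameter are hypothetical and chosen only for illustration.

```python
import ast
from collections import Counter
from itertools import combinations


def terminals_with_paths(tree):
    """Return (token, root_to_node_path) for each terminal-like node.

    Assumption: terminals are identifiers (ast.Name, ast.arg) and constants.
    """
    found = []

    def visit(node, path):
        path = path + [id(node)]  # path of node identities from the root
        if isinstance(node, ast.Name):
            found.append((node.id, path))
        elif isinstance(node, ast.arg):
            found.append((node.arg, path))
        elif isinstance(node, ast.Constant):
            found.append((repr(node.value), path))
        for child in ast.iter_child_nodes(node):
            visit(child, path)

    visit(tree, [])
    return found


def cooccurrence(source, max_path_len=8):
    """Build a symmetric terminal-by-terminal cooccurrence matrix."""
    terms = terminals_with_paths(ast.parse(source))
    counts = Counter()
    for (tok_a, path_a), (tok_b, path_b) in combinations(terms, 2):
        # Number of leading nodes shared by both root paths; the last
        # shared node is the lowest common ancestor.
        shared = 0
        for x, y in zip(path_a, path_b):
            if x != y:
                break
            shared += 1
        # Edges from terminal A up to the common ancestor, then down to B.
        length = (len(path_a) - shared) + (len(path_b) - shared)
        if length <= max_path_len:
            counts[(tok_a, tok_b)] += 1
            counts[(tok_b, tok_a)] += 1

    vocab = sorted({tok for tok, _ in terms})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), n in counts.items():
        matrix[index[a]][index[b]] = n
    return vocab, matrix


vocab, matrix = cooccurrence("def add(x, y):\n    return x + y\n")
```

Because each path is counted in both directions, the matrix is symmetric, and a token paired with another occurrence of itself adds 2 to its diagonal entry. A matrix of this shape is the kind of input the abstract says can be passed to embedding methods such as GloVe or Swivel; that training step is outside this sketch.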

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.