Techniques for unsupervised learning embeddings on source code tokens from non-local contexts
US10901708B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Event | Date |
| --- | --- |
| Filing date | Nov 23, 2018 |
| Grant date | Jan 26, 2021 |
| Priority date | — |
| Expiry date | Nov 23, 2038 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N20/00
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Techniques for unsupervised learning of embeddings on source code from non-local contexts are described. Code can be processed to generate an abstract syntax tree (AST) which represents syntactic paths between tokens in the code. Once the AST(s) have been generated, the paths in the AST(s) can be crawled to identify terminals (e.g., leaf nodes in the AST) and paths between terminals can be identified. The pairs of tokens identified at the ends of each path can then be used to generate a cooccurrence matrix. For example, if X number of unique terminals are identified, a matrix of size X by X can be generated to indicate a frequency at which pairs of terminals cooccur. This cooccurrence matrix can then be used as input to existing techniques for learning vector-space embeddings, such as word2vec, GloVe, Swivel, etc.
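The pipeline in the abstract (parse to an AST, find terminals, pair terminals joined by a path, count pairs in an X by X matrix) can be illustrated with a minimal sketch. This is not the patented implementation. It assumes Python's standard `ast` module as the parser, treats identifier names, argument names, and constants as the "terminals", and bounds paths with a hypothetical `max_path_len` parameter. The function names are invented for illustration.

```python
import ast
from itertools import combinations


def terminals_with_paths(tree):
    """Collect (token, root-to-leaf path) for each terminal in the AST.

    Assumption: "terminals" are identifier names, argument names, and
    constants. Other leaf nodes (operators, load/store contexts) are ignored.
    """
    found = []

    def visit(node, path):
        path = path + [id(node)]  # node identities along the way from the root
        if isinstance(node, ast.Name):
            found.append((node.id, path))
        elif isinstance(node, ast.arg):
            found.append((node.arg, path))
        elif isinstance(node, ast.Constant):
            found.append((repr(node.value), path))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, path)

    visit(tree, [])
    return found


def path_length(p, q):
    """Number of AST edges between two leaves, via their lowest common ancestor."""
    shared = 0
    for a, b in zip(p, q):
        if a != b:
            break
        shared += 1
    return (len(p) - shared) + (len(q) - shared)


def cooccurrence(source, max_path_len=8):
    """Return (vocab, matrix): a symmetric X-by-X count of terminal pairs
    connected by an AST path of at most max_path_len edges."""
    tree = ast.parse(source)
    terms = terminals_with_paths(tree)
    vocab = sorted({tok for tok, _ in terms})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (t1, p1), (t2, p2) in combinations(terms, 2):
        if path_length(p1, p2) <= max_path_len:
            i, j = index[t1], index[t2]
            matrix[i][j] += 1
            if i != j:
                matrix[j][i] += 1  # keep the matrix symmetric
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = cooccurrence("def add(a, b):\n    return a + b\n")
    print(vocab)   # ['a', 'b']
    print(matrix)  # [[1, 4], [4, 1]]
```

Because pairs are defined by AST paths rather than by a sliding window over the token stream, two tokens that sit far apart in the text (an argument and its use deep in the body, for instance) can still co-occur. The resulting counts could then be passed to a co-occurrence-based embedding method such as GloVe or Swivel, which the abstract names; that step is outside this sketch.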
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.