Patent · US Active

Automated error detection and recovery for GPU computations in a service environment

US9836354B1 · kind B1 · utility

Cited by: 16
References: 0
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Apr 28, 2014
Grant date: Dec 5, 2017
Priority date
Expiry date: Oct 24, 2034

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F11/1629
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A service provider system may implement ECC-like features when executing computations on GPUs that do not include sufficient error detection and recovery for computations that are sensitive to bit errors. During execution of critical computations on behalf of customers, the system may automatically instrument program instructions received from the customers to cause each computation to be executed using multiple sets of hardware resources (e.g., different host machines, processor cores, or internal hardware resources). The service may provide APIs with which customers may instrument their code for execution using redundant resource instances, or specify parameters for applying the ECC-like features. The service or customer may instrument code to perform (or cause the system to perform) checkpointing operations at particular points in the code, and to compare intermediate results produced by different hardware resources. If the intermediate results do not match, the computation may be restarted from a checkpointed state.
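
The checkpoint-and-compare loop described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation. It assumes that each "hardware resource" can be modeled as a plain Python callable and that intermediate results are compared for exact equality. The names run_redundant, healthy, and FlakyReplica are hypothetical and do not come from the patent.

```python
import copy
from typing import Callable, List, Sequence

State = List[float]
Step = Callable[[State], State]
Replica = Callable[[Step, State], State]  # stand-in for one set of hardware resources


def run_redundant(steps: Sequence[Step], initial: State,
                  replicas: Sequence[Replica], max_retries: int = 3) -> State:
    """Run every step on all replicas; commit a checkpoint only when they agree."""
    checkpoint = copy.deepcopy(initial)
    for step in steps:
        for _attempt in range(max_retries + 1):
            # Each replica starts from its own copy of the last agreed state.
            results = [rep(step, copy.deepcopy(checkpoint)) for rep in replicas]
            if all(r == results[0] for r in results[1:]):
                checkpoint = results[0]  # intermediate results match: new checkpoint
                break
            # Mismatch: discard the results and retry from the checkpointed state.
        else:
            raise RuntimeError("replicas disagreed on every attempt")
    return checkpoint


def healthy(step: Step, state: State) -> State:
    """Hypothetical replica that always computes correctly."""
    return step(state)


class FlakyReplica:
    """Hypothetical replica that corrupts its output on one chosen call."""

    def __init__(self, fail_on_call: int) -> None:
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, step: Step, state: State) -> State:
        self.calls += 1
        out = step(state)
        if self.calls == self.fail_on_call:
            out[0] += 1.0  # simulated bit error in the result
        return out


steps = [lambda s: [x * 2 for x in s], lambda s: [x + 1 for x in s]]
flaky = FlakyReplica(fail_on_call=2)
result = run_redundant(steps, [1.0, 2.0], [healthy, flaky])
```

In this example the second step is corrupted on its first attempt, so the loop reruns that step from the last agreed checkpoint rather than restarting the whole computation. A real service would compare results produced on separate hosts, cores, or GPU resources, and might use a tolerance or a voting scheme instead of exact equality.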

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.