Automated error detection and recovery for GPU computations in a service environment
US9836354B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Apr 28, 2014 |
| Grant date | Dec 5, 2017 |
| Priority date | — |
| Expiry date | Oct 24, 2034 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F11/1629
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A service provider system may implement ECC-like features when executing computations on GPUs that do not include sufficient error detection and recovery for computations that are sensitive to bit errors. During execution of critical computations on behalf of customers, the system may automatically instrument program instructions received from the customers to cause each computation to be executed using multiple sets of hardware resources (e.g., different host machines, processor cores, or internal hardware resources). The service may provide APIs with which customers may instrument their code for execution using redundant resource instances, or specify parameters for applying the ECC-like features. The service or customer may instrument code to perform (or cause the system to perform) checkpointing operations at particular points in the code, and to compare intermediate results produced by different hardware resources. If the intermediate results do not match, the computation may be restarted from a checkpointed state.
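The control flow the abstract describes (run each stage on several resource instances, compare the intermediate results, and restart from the last checkpoint on a mismatch) can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `run_redundant`, `reliable`, and `FlakyOnce` are hypothetical, plain Python callables stand in for GPU hardware resources, and the retry limit is an assumed parameter that the abstract does not specify.

```python
import copy


def run_redundant(initial_state, steps, executors, max_retries=3):
    """Run each step on every executor; commit only when all results agree.

    Hypothetical sketch: `executors` stand in for separate hardware
    resources (hosts, cores, GPUs). `max_retries` is an assumed limit.
    Returns (final_state, number_of_restarts).
    """
    checkpoint = copy.deepcopy(initial_state)  # last known-good state
    restarts = 0
    for step in steps:
        for _attempt in range(max_retries + 1):
            # Each executor starts from its own copy of the checkpoint.
            results = [ex(step, copy.deepcopy(checkpoint)) for ex in executors]
            if all(r == results[0] for r in results[1:]):
                checkpoint = results[0]  # results match: take a new checkpoint
                break
            restarts += 1  # mismatch: redo this step from the checkpoint
        else:
            raise RuntimeError("results still disagree after retries")
    return checkpoint, restarts


def reliable(step, state):
    """Executor that always computes the step correctly."""
    return step(state)


class FlakyOnce:
    """Executor that flips the lowest bit of one result to simulate a bit error."""

    def __init__(self, fail_on_call):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, step, state):
        self.calls += 1
        result = step(state)
        if self.calls == self.fail_on_call:
            return result ^ 1  # injected single-bit error
        return result


# Usage: three integer stages, with one injected error during the second stage.
stages = [lambda x: x + 3, lambda x: x * 7, lambda x: x - 5]
final, restarts = run_redundant(1, stages, [reliable, FlakyOnce(fail_on_call=2)])
print(final, restarts)  # prints "23 1"
```

With two executors, a mismatch shows only that an error occurred, not which result is wrong, so the sketch recomputes the stage rather than voting. Taking a checkpoint after every stage is also a simplification. The abstract leaves the choice of checkpoint locations to the service or the customer.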
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.