Smart online link repair and job scheduling in machine learning supercomputers
US12289196B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 8, 2022 |
| Grant date | Apr 29, 2025 |
| Priority date | — |
| Expiry date | Dec 8, 2042 |
Classification
- Technology area (CPC H): Electricity
- CPC primary: H04L41/122
- WIPO field: Digital communication
- WIPO sector: Electrical engineering
Abstract
Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
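The disable, reroute, and re-enable cycle described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the class `NetworkController`, its methods `report_broken`, `mark_repaired`, and `route`, the `LinkState` enum, and the four-node ring topology are all hypothetical names chosen for illustration. Plain breadth-first search stands in for whatever topology-aware routing the patent actually claims.

```python
from collections import deque
from enum import Enum


class LinkState(Enum):
    HEALTHY = "healthy"
    DISABLED = "disabled"  # out of service while the repair workflow runs


class NetworkController:
    """Toy centralized controller (hypothetical, for illustration only)."""

    def __init__(self, links):
        # Each link is an undirected pair of node names, keyed as a frozenset.
        self.state = {frozenset(link): LinkState.HEALTHY for link in links}

    def report_broken(self, a, b):
        """A node agent reports a failure (pre-flight or during a job run)."""
        self.state[frozenset((a, b))] = LinkState.DISABLED

    def mark_repaired(self, a, b):
        """The repair workflow has fixed and tested the link; enable it again."""
        self.state[frozenset((a, b))] = LinkState.HEALTHY

    def route(self, src, dst):
        """Breadth-first search over healthy links only; None if unreachable."""
        neighbors = {}
        for link, state in self.state.items():
            if state is LinkState.HEALTHY:
                u, v = tuple(link)
                neighbors.setdefault(u, []).append(v)
                neighbors.setdefault(v, []).append(u)
        queue, parent = deque([src]), {src: None}
        while queue:
            node = queue.popleft()
            if node == dst:
                # Walk parent pointers back to the source to rebuild the path.
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(neighbors.get(node, [])):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Assumed four-node ring: A-B-C-D-A.
ctl = NetworkController([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
print(ctl.route("A", "B"))   # ['A', 'B']
ctl.report_broken("A", "B")  # link disabled; traffic must avoid it
print(ctl.route("A", "B"))   # ['A', 'D', 'C', 'B']
ctl.mark_repaired("A", "B")  # link re-enabled for future jobs
print(ctl.route("A", "B"))   # ['A', 'B']
```

In this sketch a disabled link is simply excluded from the graph that routing sees, so running jobs keep a path through the same network while the repair proceeds. A real controller would also have to weigh job placement and topology constraints, which the sketch omits.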
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.