Patent · US Active

Smart online link repair and job scheduling in machine learning supercomputers

US12289196B2 · kind B2 · utility

Cited by: 0
References: 7
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 8, 2022
Grant date: Apr 29, 2025
Priority date
Expiry date: Dec 8, 2042

Classification

  • Technology area (CPC H): Electricity
  • CPC primary: H04L41/122
  • WIPO field: Digital communication
  • WIPO sector: Electrical engineering

Abstract

Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high-performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during job run time via a distributed failure detection and mitigation software stack, which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, at which point the repaired links are enabled again by the network controller for future user jobs.
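
The disable, reroute and re-enable cycle described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the class name `NetworkController`, its methods, the ring topology and the breadth-first routing are all assumptions made for illustration. The abstract only says that a centralized controller reroutes jobs to healthy links and re-enables a link once the repair workflow has fixed and tested it.

```python
from collections import deque


class NetworkController:
    """Hypothetical centralized controller that tracks link health."""

    def __init__(self, links):
        # links: iterable of (node_a, node_b) pairs, treated as undirected.
        self.links = {frozenset(link) for link in links}
        self.disabled = set()

    def report_broken(self, a, b):
        """Disable a link reported by an agent (pre-flight or at run time)."""
        link = frozenset((a, b))
        if link not in self.links:
            raise KeyError(f"unknown link {a}-{b}")
        self.disabled.add(link)

    def mark_repaired(self, a, b, test_passed):
        """Re-enable a link only if the repair workflow's test passed."""
        if test_passed:
            self.disabled.discard(frozenset((a, b)))

    def route(self, src, dst):
        """Return a shortest path over healthy links, or None if unreachable."""
        # Build adjacency from enabled links only (assumed routing policy).
        adjacency = {}
        for link in self.links - self.disabled:
            a, b = tuple(link)
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)

        # Plain breadth-first search, recording each node's predecessor.
        parents = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for neighbor in adjacency.get(node, []):
                if neighbor not in parents:
                    parents[neighbor] = node
                    queue.append(neighbor)
        return None
```

On a hypothetical four-node ring A-B-C-D-A, `route("A", "B")` uses the direct link while it is healthy. After `report_broken("A", "B")`, the route goes the long way around through D and C. The direct path returns only after `mark_repaired("A", "B", test_passed=True)`. A failed repair test leaves the link disabled, which matches the abstract's requirement that links be both fixed and tested before reuse.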

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.