Patent · US Active

Reducing latency by processing parts of a language model query in parallel

US12287816B1 · kind B1 · utility

0Cited by
2References
20Claims
0Family size

Assignee

Inventors

Key dates

Filing dateOct 31, 2023
Grant dateApr 29, 2025
Priority date
Expiry dateOct 31, 2043

Classification

  • Technology area (CPC G)Physics
  • CPC primaryG06N20/00
  • WIPO fieldComputer technology
  • WIPO sectorElectrical engineering

Abstract

A technique partitions a user's original query into plural smaller component queries, each of which has a common part and an instance-specific part. The technique distributes the component queries to plural processor instances of a processor. The plural processor instances transform the respective component queries into query-component responses by acting in parallel, independent of each other. The technique generates a final response based on the query-component responses, e.g., by assembling the component-query responses into the final response. The technique reduces latency because the processor instances work on parts of the user's original query at the same time, rather than as a single stream of consecutive tokens. The plural processor instances have access to a shared cache memory, and utilize relevant data that has been computed in response to previous queries.

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.