Reducing latency by processing parts of a language model query in parallel
US12287816B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Oct 31, 2023 |
| Grant date | Apr 29, 2025 |
| Priority date | — |
| Expiry date | Oct 31, 2043 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N20/00
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A technique partitions a user's original query into plural smaller component queries, each of which has a common part and an instance-specific part. The technique distributes the component queries to plural processor instances of a processor. The plural processor instances transform the respective component queries into query-component responses by acting in parallel, independent of each other. The technique generates a final response based on the query-component responses, e.g., by assembling the query-component responses into the final response. The technique reduces latency because the processor instances work on parts of the user's original query at the same time, rather than as a single stream of consecutive tokens. The plural processor instances have access to a shared cache memory, and utilize relevant data that has been computed in response to previous queries.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.