Patent · US Active

Reducing latency by processing parts of a language model query in parallel

US12287816B1 · kind B1 · utility

Cited by	0

References	2

Claims	20

Family size	0

Assignee

MICROSOFT TECHNOLOGY LICENSING, LLC · US

Inventors

Sayan Dev PATHAK · Kirkland, US
Osama ABUELSOROUR · Menlo Park, US
Christopher Hakan BASOGLU · Everett, US
Harini Kesavamoorthy · Bengaluru, IN
Girish Milind MAHAJAN · Redmond, US
Salman Quazi · Mountain View, US
Valeriy Viktorovich Kirshin · Kirkland, US

Key dates

Filing date	Oct 31, 2023
Grant date	Apr 29, 2025
Priority date	—
Expiry date	Oct 31, 2043

Classification

Technology area (CPC G)	Physics
CPC primary	G06N20/00
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A technique partitions a user's original query into plural smaller component queries, each of which has a common part and an instance-specific part. The technique distributes the component queries to plural processor instances of a processor. The processor instances transform the respective component queries into query-component responses by acting in parallel, independently of each other. The technique then generates a final response based on the query-component responses, e.g., by assembling them into the final response. The technique reduces latency because the processor instances work on parts of the user's original query at the same time, rather than processing the query as a single stream of consecutive tokens. The processor instances also have access to a shared cache memory, and utilize relevant data that has been computed in response to previous queries.
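
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`partition_query`, `process_component`, `answer`), the thread pool standing in for the processor instances, the dictionary standing in for the shared cache memory, and the placeholder string transformations are all assumptions made for illustration. The sketch only shows the general shape: split a query into component queries that share a common part, process them in parallel, reuse cached work on the common part, and assemble the responses in order.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Hypothetical stand-in for the shared cache memory: maps a common part
# to the data computed for it, so that work is done once and then reused.
_shared_cache: dict[str, str] = {}
_cache_lock = Lock()
stats = {"common_computations": 0}  # how often the common part was computed


def encode_common_part(common: str) -> str:
    """Placeholder for expensive processing of the common part."""
    with _cache_lock:
        if common not in _shared_cache:
            stats["common_computations"] += 1
            _shared_cache[common] = common.strip().upper()
        return _shared_cache[common]


def partition_query(common: str, items: list[str]) -> list[tuple[str, str]]:
    """Build component queries: (common part, instance-specific part)."""
    return [(common, item) for item in items]


def process_component(component: tuple[str, str]) -> str:
    """Placeholder for one processor instance handling one component query."""
    common, specific = component
    prefix = encode_common_part(common)
    # Reversing the text stands in for a real model response (assumption).
    return f"{prefix}: {specific[::-1]}"


def answer(common: str, items: list[str], max_workers: int = 4) -> str:
    """Process component queries in parallel and assemble a final response."""
    components = partition_query(common, items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so assembly is a join.
        responses = list(pool.map(process_component, components))
    return "\n".join(responses)


if __name__ == "__main__":
    print(answer("Summarize", ["alpha", "beta", "gamma"]))
```

In this sketch the common part is processed only once, under a lock, and every later component query reads the cached result, which loosely mirrors the abstract's point that processor instances reuse data already computed. A real system would replace the thread pool and placeholder functions with actual language model instances.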

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.