System and method for policy optimization using quasi-Newton trust region method
US11650551B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Oct 4, 2019 |
| Grant date | May 16, 2023 |
| Priority date | — |
| Expiry date | Nov 15, 2041 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N3/006
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computer-implemented learning method for optimizing a control policy that controls a system is provided. The method includes receiving states of the system being operated for a specific task; initializing the control policy as a function approximator including neural networks; collecting state transition and reward data using a current control policy; estimating an advantage function and a state visitation frequency based on the current control policy; updating the current control policy by quasi-Newton trust region policy optimization, using a second-order approximation of the objective function and a second-order approximation of the KL-divergence constraint on the permissible change in the policy; and determining an optimal control policy, for controlling the system, based on the average reward accumulated using the updated current control policy.
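The update step described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the patent text: it assumes the surrogate objective is written as a loss to minimize, with gradient `g` and a BFGS-style Hessian approximation `B`, and that the KL-divergence constraint is approximated as `0.5 * d @ F @ d <= delta` with a positive-definite matrix `F`. The dogleg solver, the function names `bfgs_update` and `dogleg_kl_step`, and all parameters are illustrative assumptions.

```python
import numpy as np


def bfgs_update(B, s, y, eps=1e-8):
    """BFGS update of a Hessian approximation B (hypothetical helper).

    s is the parameter step, y is the change in gradient over that step.
    """
    sy = float(s @ y)
    if sy <= eps:
        # Curvature condition fails; skip the update so B stays positive definite.
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(y, y) / sy


def dogleg_kl_step(g, B, F, delta):
    """Approximately minimize g@d + 0.5*d@B@d subject to 0.5*d@F@d <= delta.

    Assumes B and F are symmetric positive definite (an assumption of this sketch).
    """
    if not np.any(g):
        return np.zeros_like(g)
    # Change variables p = L.T @ d with F = L @ L.T, so the KL bound
    # becomes an ordinary Euclidean ball ||p|| <= r.
    L = np.linalg.cholesky(F)
    r = np.sqrt(2.0 * delta)
    g_p = np.linalg.solve(L, g)
    B_p = np.linalg.solve(L, np.linalg.solve(L, B).T)

    p_newton = -np.linalg.solve(B_p, g_p)
    if np.linalg.norm(p_newton) <= r:
        p = p_newton  # full quasi-Newton step fits inside the trust region
    else:
        p_cauchy = -(g_p @ g_p) / (g_p @ B_p @ g_p) * g_p
        if np.linalg.norm(p_cauchy) >= r:
            # Even the steepest-descent minimizer is outside: stop at the boundary.
            p = -r * g_p / np.linalg.norm(g_p)
        else:
            # Walk from the Cauchy point toward the Newton point until ||p|| = r.
            diff = p_newton - p_cauchy
            a = diff @ diff
            b = 2.0 * (p_cauchy @ diff)
            c = p_cauchy @ p_cauchy - r * r
            tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
            p = p_cauchy + tau * diff
    # Map the step back to the original parameter space.
    return np.linalg.solve(L.T, p)
```

In a training loop, one would presumably call `dogleg_kl_step` once per batch of collected transitions, apply the step to the policy parameters, and then refresh `B` with `bfgs_update` from the observed change in gradient. How the patented method accepts or rejects steps and adapts the trust region is not stated in the abstract and is left out here.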
Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.