Average reward actor-critic with deterministic policy search (Record no. 429608)
000 - LEADER | |
---|---|
fixed length control field | 01935nam a22002417a 4500 |
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION | |
fixed length control field | 230803b |||||||| |||| 00| 0 eng d |
041 ## - LANGUAGE CODE | |
Language code of text/sound track or separate title | en |
082 ## - DEWEY DECIMAL CLASSIFICATION NUMBER | |
Classification number | 600 |
Item number | NAM |
100 ## - MAIN ENTRY--PERSONAL NAME | |
Personal name | Saxena, Naman |
245 ## - TITLE STATEMENT | |
Title | Average reward actor-critic with deterministic policy search |
260 ## - PUBLICATION, DISTRIBUTION, ETC. (IMPRINT) | |
Place of publication, distribution, etc | Bangalore : |
Name of publisher, distributor, etc | IISc, |
Date of publication, distribution, etc | 2023. |
300 ## - PHYSICAL DESCRIPTION | |
Extent | viii, 143 p. |
Other physical details | col. ill. ; |
Dimensions | 29.1 cm × 20.5 cm |
Accompanying material | e-Thesis |
Size of unit | 3.477 MB |
500 ## - GENERAL NOTE | |
General note | Includes bibliographical references and index |
502 ## - DISSERTATION NOTE | |
Dissertation note | MTech (Res); 2023; Computer Science and Automation |
520 ## - SUMMARY, ETC. | |
Summary, etc | The average reward criterion is relatively less studied, as most existing works in the reinforcement learning literature consider the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but the off-policy average reward actor-critic setting is comparatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We first give an asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite-time analysis of the resulting stochastic approximation scheme with a linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. Finally, we compare the average reward performance of our proposed ARO-DDPG algorithm against state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments and observe better empirical performance. |
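As a pointer for readers of this summary, the average reward objective and the on-policy deterministic policy gradient it refers to take roughly the following form (a sketch only; the symbols $J(\theta)$, $\mu_\theta$, $d^{\mu_\theta}$ and the differential action-value function $Q_{\mu_\theta}$ are notational assumptions, not taken from the thesis record):

$$J(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\textstyle\sum_{t=0}^{T-1} r\big(s_t, \mu_\theta(s_t)\big)\Big], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\Big],$$

where $d^{\mu_\theta}$ denotes the stationary state distribution induced by the deterministic policy $\mu_\theta$ and $Q_{\mu_\theta}$ is the differential (average-reward-adjusted) action-value function.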
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Reinforcement Learning |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Actor-Critic Algorithm |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Stochastic Approximation |
700 ## - ADDED ENTRY--PERSONAL NAME | |
Personal name | Kolathaya, Shishir N Y, advisor |
700 ## - ADDED ENTRY--PERSONAL NAME | |
Personal name | Bhatnagar, Shalabh, advisor |
856 ## - ELECTRONIC LOCATION AND ACCESS | |
Uniform Resource Identifier | https://etd.iisc.ac.in/handle/2005/6175 |
942 ## - ADDED ENTRY ELEMENTS (KOHA) | |
Koha item type | Thesis |