000 | 01935nam a22002417a 4500 | ||
---|---|---|---|
008 | 230803b |||||||| |||| 00| 0 eng d | ||
041 | _aen | ||
082 | _a600 _bNAM | ||
100 | _aSaxena, Naman | ||
245 | _aAverage reward actor-critic with deterministic policy search | ||
260 | _aBangalore : _bIISc, _c2023. | ||
300 | _aviii, 143 p. _bcol. ill. ; _c29.1 cm x 20.5 cm _ee-Thesis _g3.477 MB | ||
500 | _aIncludes bibliographical references and index | ||
502 | _aMTech (Res); 2023; Computer Science and Automation | ||
520 | _aThe average reward criterion is relatively less studied, as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are a few recent works that present on-policy average reward actor-critic algorithms, but the average reward off-policy actor-critic setting is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite-time analysis of the resulting stochastic approximation scheme with a linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments. | ||
650 | _aReinforcement Learning | ||
650 | _aActor-Critic Algorithm | ||
650 | _aStochastic Approximation | ||
700 | _aKolathaya, Shishir N Y advised | ||
700 | _aBhatnagar, Shalabh advised | ||
856 | _uhttps://etd.iisc.ac.in/handle/2005/6175 | ||
942 | _cT | ||
999 | _c429608 _d429608 | ||
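
The 520 abstract above refers to an average-reward, off-policy, deterministic-policy actor-critic update. The following is a minimal sketch of that style of update, assuming a toy linear-quadratic environment, a scalar-action linear actor, and a linear-in-features critic; every function name and constant here is a hypothetical illustration, not the thesis's ARO-DDPG algorithm (which, per the abstract, is a deep deterministic policy gradient method evaluated on MuJoCo environments).

```python
import numpy as np

# Illustrative, untuned toy sketch only -- not the thesis code.
# Average-reward off-policy actor-critic with a deterministic linear actor
# and a linear critic, run on a stand-in linear-quadratic environment.

rng = np.random.default_rng(0)
S_DIM = 3

def features(s, a):
    """Critic features phi(s, a); quadratic in the scalar action a."""
    return np.concatenate([s, a * s, [a, a * a]])

def grad_a_features(s, a):
    """d phi / d a, used for the deterministic policy gradient."""
    return np.concatenate([np.zeros_like(s), s, [1.0, 2.0 * a]])

def step(s, a):
    """Toy dynamics and reward (a stand-in for a MuJoCo task)."""
    s_next = 0.9 * s + 0.1 * a + 0.01 * rng.standard_normal(S_DIM)
    r = -float(s @ s) - 0.01 * a * a
    return s_next, r

theta = np.zeros(S_DIM)                   # deterministic actor: mu(s) = theta . s
w = np.zeros(2 * S_DIM + 2)               # linear critic weights
eta = 0.0                                 # running average-reward estimate
a_w, a_theta, a_eta = 1e-2, 1e-3, 1e-2    # step sizes (two-timescale flavour)

s = rng.standard_normal(S_DIM)
for t in range(20000):
    a_beh = float(theta @ s) + 0.3 * rng.standard_normal()  # exploratory behaviour action
    s_next, r = step(s, a_beh)

    # Average-reward TD error: the reward is compared against eta, not discounted.
    a_next = float(theta @ s_next)                           # target action from the actor
    delta = r - eta + w @ features(s_next, a_next) - w @ features(s, a_beh)

    eta += a_eta * (r - eta)                                 # track the average reward
    w += a_w * delta * features(s, a_beh)                    # critic TD update
    # Deterministic policy gradient: grad_theta mu(s) * grad_a Q(s, a) at a = mu(s)
    a_pi = float(theta @ s)
    theta += a_theta * s * (w @ grad_a_features(s, a_pi))

    s = s_next

print("estimated average reward:", eta)
```

The sketch only shows the shape of the average-reward updates (TD error relative to a reward-rate estimate, plus a deterministic policy gradient step); the thesis additionally establishes the corresponding policy gradient theorems, ODE-based and finite-time convergence guarantees, and a deep-network implementation.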