Abstract
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(logt2/3t ) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.
| Original language | English |
|---|---|
| Journal | Advances in Neural Information Processing Systems |
| Volume | 32 |
| State | Published - 2019 |
| Event | 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada Duration: Dec 8 2019 → Dec 14 2019 |
Fingerprint
Dive into the research topics of 'Two time-scale off-policy TD learning: Non-asymptotic analysis over markovian samples'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver