Skip to main navigation Skip to search Skip to main content

Two time-scale off-policy TD learning: Non-asymptotic analysis over markovian samples

  • Ohio State University

Research output: Contribution to journalConference articlepeer-review

63 Scopus citations

Abstract

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(logt2/3t ) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

Original languageEnglish
JournalAdvances in Neural Information Processing Systems
Volume32
StatePublished - 2019
Event33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: Dec 8 2019Dec 14 2019

Fingerprint

Dive into the research topics of 'Two time-scale off-policy TD learning: Non-asymptotic analysis over markovian samples'. Together they form a unique fingerprint.

Cite this