TY - GEN
T1 - Designing Near-Optimal Partially Observable Reinforcement Learning
AU - Shi, Ming
AU - Liang, Yingbin
AU - Shroff, Ness
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Partially observable Markov decision processes (POMDPs) have been widely applied in various real-world applications. However, existing results have shown that learning in POMDPs is intractable in the worst case. The main challenge lies in the lack of latent state information. For example, in wireless channel scheduling, due to energy and security constraints, it is usually difficult or impossible for the user to know the conditions/states of all channels. Thus, a key fundamental question here is: how much online state information (OSI) is sufficient to achieve tractability? In this paper, we make the first effort to establish fundamental conditions and methods for bridging the gap between partially observable reinforcement learning and networking with incomplete state information. Specifically, we establish a lower bound that reveals a surprising hardness result: unless we have full OSI, we need an exponentially scaling sample complexity to obtain an ϵ-optimal policy solution for POMDPs. Nonetheless, motivated by the structures of practical systems, we identify important subclasses of POMDPs that are tractable, even with only partial OSI. For two subclasses of POMDPs with partial OSI, we provide new algorithms that are proved to be near-optimal by establishing new regret upper and lower bounds.
AB - Partially observable Markov decision processes (POMDPs) have been widely applied in various real-world applications. However, existing results have shown that learning in POMDPs is intractable in the worst case. The main challenge lies in the lack of latent state information. For example, in wireless channel scheduling, due to energy and security constraints, it is usually difficult or impossible for the user to know the conditions/states of all channels. Thus, a key fundamental question here is: how much online state information (OSI) is sufficient to achieve tractability? In this paper, we make the first effort to establish fundamental conditions and methods for bridging the gap between partially observable reinforcement learning and networking with incomplete state information. Specifically, we establish a lower bound that reveals a surprising hardness result: unless we have full OSI, we need an exponentially scaling sample complexity to obtain an ϵ-optimal policy solution for POMDPs. Nonetheless, motivated by the structures of practical systems, we identify important subclasses of POMDPs that are tractable, even with only partial OSI. For two subclasses of POMDPs with partial OSI, we provide new algorithms that are proved to be near-optimal by establishing new regret upper and lower bounds.
KW - partial observability
KW - regret analysis
KW - reinforcement learning
KW - sample complexity
KW - wireless channel scheduling
UR - https://www.scopus.com/pages/publications/85214583376
U2 - 10.1109/MILCOM61039.2024.10773784
DO - 10.1109/MILCOM61039.2024.10773784
M3 - Conference contribution
AN - SCOPUS:85214583376
T3 - Proceedings - IEEE Military Communications Conference MILCOM
SP - 463
EP - 468
BT - 2024 IEEE Military Communications Conference, MILCOM 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Military Communications Conference, MILCOM 2024
Y2 - 28 October 2024 through 1 November 2024
ER -