TY - GEN
T1 - Cosine Similarity Based Adaptive Implicit Q-Learning for Offline Reinforcement Learning
AU - Han, Xinchen
AU - Afifi, Hossam
AU - Marot, Michel
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Offline Reinforcement Learning (RL) methods constrain the policy to align with the behaviour policy, mitigating extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL), a popular offline RL algorithm, leverages expectile regression and introduces an in-sample learning paradigm that enhances the policy evaluation stage without querying OOD actions. However, the crucial expectile-regression parameter τ in IQL is fixed, limiting both its performance and flexibility across diverse datasets. In this paper, we propose Cos-IQL, an improved IQL approach based on cosine similarity, which optimizes the policy evaluation function by measuring the cosine similarity between the policy and the behaviour policy. Cos-IQL is essentially a multi-step offline RL algorithm but retains the advantages of in-sample learning, thus avoiding the risks of OOD actions. In addition, Cos-IQL can adaptively adjust the parameter τ without elaborate fine-tuning. We evaluate Cos-IQL on D4RL benchmark datasets and compare its performance against recent competitive offline RL algorithms. Experimental results show that Cos-IQL achieves state-of-the-art performance.
AB - Offline Reinforcement Learning (RL) methods constrain the policy to align with the behaviour policy, mitigating extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL), a popular offline RL algorithm, leverages expectile regression and introduces an in-sample learning paradigm that enhances the policy evaluation stage without querying OOD actions. However, the crucial expectile-regression parameter τ in IQL is fixed, limiting both its performance and flexibility across diverse datasets. In this paper, we propose Cos-IQL, an improved IQL approach based on cosine similarity, which optimizes the policy evaluation function by measuring the cosine similarity between the policy and the behaviour policy. Cos-IQL is essentially a multi-step offline RL algorithm but retains the advantages of in-sample learning, thus avoiding the risks of OOD actions. In addition, Cos-IQL can adaptively adjust the parameter τ without elaborate fine-tuning. We evaluate Cos-IQL on D4RL benchmark datasets and compare its performance against recent competitive offline RL algorithms. Experimental results show that Cos-IQL achieves state-of-the-art performance.
KW - Cosine Similarity
KW - Implicit Q-Learning
KW - In-sample Learning
KW - Offline RL
UR - https://www.scopus.com/pages/publications/105006446632
U2 - 10.1109/WCNC61545.2025.10978817
DO - 10.1109/WCNC61545.2025.10978817
M3 - Conference contribution
AN - SCOPUS:105006446632
T3 - IEEE Wireless Communications and Networking Conference, WCNC
BT - 2025 IEEE Wireless Communications and Networking Conference, WCNC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE Wireless Communications and Networking Conference, WCNC 2025
Y2 - 24 March 2025 through 27 March 2025
ER -