TY - GEN
T1 - State Prediction for Offline Reinforcement Learning via Sequence-to-Sequence Modeling
AU - Ghanem, Abdelghani
AU - Ghogho, Mounir
AU - Ciblat, Philippe
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Recent offline reinforcement learning methods often frame the problem as a sequence modeling task, employing a decoder-only architecture to process states, actions, and a single scalar value representing the sum of future rewards (i.e., the return). However, the distinct characteristics of these modalities, such as the non-smoothness of action sequences and the scalar nature of returns, may hinder effective modeling and optimization when using a shared architecture. In this work, we propose a divide-and-conquer strategy, the Reward-Guided Decision Translator (RGDT), which leverages an encoder-decoder architecture by casting offline reinforcement learning as a sequence-to-sequence modeling problem. Our approach forgoes action prediction in favor of next-state prediction, mitigating the challenges posed by the non-smoothness of action sequences. Furthermore, our formulation enables direct conditioning of state generation on sequences of future returns, providing a more informative signal for the model. By disentangling the processing of different modalities, our approach addresses the limitations of shared decoder-only architectures. Empirical results demonstrate that our method significantly outperforms existing generative sequence modeling techniques and matches or surpasses state-of-the-art methods across a range of continuous control tasks from the D4RL benchmark.
AB - Recent offline reinforcement learning methods often frame the problem as a sequence modeling task, employing a decoder-only architecture to process states, actions, and a single scalar value representing the sum of future rewards (i.e., the return). However, the distinct characteristics of these modalities, such as the non-smoothness of action sequences and the scalar nature of returns, may hinder effective modeling and optimization when using a shared architecture. In this work, we propose a divide-and-conquer strategy, the Reward-Guided Decision Translator (RGDT), which leverages an encoder-decoder architecture by casting offline reinforcement learning as a sequence-to-sequence modeling problem. Our approach forgoes action prediction in favor of next-state prediction, mitigating the challenges posed by the non-smoothness of action sequences. Furthermore, our formulation enables direct conditioning of state generation on sequences of future returns, providing a more informative signal for the model. By disentangling the processing of different modalities, our approach addresses the limitations of shared decoder-only architectures. Empirical results demonstrate that our method significantly outperforms existing generative sequence modeling techniques and matches or surpasses state-of-the-art methods across a range of continuous control tasks from the D4RL benchmark.
KW - Offline Reinforcement Learning
KW - Sequence Modeling
KW - Transformer Architecture
UR - https://www.scopus.com/pages/publications/105022116812
U2 - 10.1109/MLSP62443.2025.11204264
DO - 10.1109/MLSP62443.2025.11204264
M3 - Conference contribution
AN - SCOPUS:105022116812
T3 - IEEE International Workshop on Machine Learning for Signal Processing, MLSP
BT - 35th IEEE International Workshop on Machine Learning for Signal Processing
PB - IEEE Computer Society
T2 - 35th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2025
Y2 - 31 August 2025 through 3 September 2025
ER -