TY - JOUR
T1 - Compositional Shield Synthesis for Safe Reinforcement Learning in Partial Observability
AU - Carr, Steven
AU - Bakirtzis, Georgios
AU - Topcu, Ufuk
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/1/1
Y1 - 2025/1/1
AB - Agents controlled by the output of reinforcement learning (RL) algorithms often transition to unsafe states, particularly in uncertain and partially observable environments. Partially observable Markov decision processes (POMDPs) provide a natural setting for studying such scenarios with limited sensing. Shields filter out undesirable actions to ensure safe RL by enforcing safety requirements on the agent's policy. However, synthesizing a single holistic shield is computationally expensive in complex deployment scenarios. We propose the compositional synthesis of shields, modeling safety requirements by parts and thereby improving scalability. Experiments on POMDP problem formulations with RL algorithms illustrate that an RL agent equipped with the resulting compositional shields, beyond being safe, converges to higher expected reward. By using subproblem formulations, we preserve and improve the advantage of shielded agents, which require fewer training episodes than unshielded agents, especially in sparse-reward settings. Concretely, we find that compositional shield synthesis allows an RL agent to remain safe in environments two orders of magnitude larger than those handled by other state-of-the-art model-based approaches.
KW - shielding
KW - compositionality
KW - reinforcement learning
KW - safety
KW - uncertainty
UR - https://www.scopus.com/pages/publications/105016734818
U2 - 10.1109/OJCSYS.2025.3611725
DO - 10.1109/OJCSYS.2025.3611725
M3 - Article
AN - SCOPUS:105016734818
SN - 2694-085X
VL - 4
SP - 373
EP - 384
JO - IEEE Open Journal of Control Systems
JF - IEEE Open Journal of Control Systems
ER -