TY - GEN
T1 - The TIP of the Iceberg
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Berezin, Sergey
AU - Farahbakhsh, Reza
AU - Crespi, Noel
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025/1/1
Y1 - 2025/1/1
AB - We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model's prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignment and underscore the urgent need for more sophisticated defence strategies.
UR - https://www.scopus.com/pages/publications/105021039451
U2 - 10.18653/v1/2025.acl-long.334
DO - 10.18653/v1/2025.acl-long.334
M3 - Conference contribution
AN - SCOPUS:105021039451
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 6716
EP - 6730
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -