TY - GEN
T1 - No offence, Bert - I insult only humans! Multiple addressees sentence-level attack on toxicity detection neural networks
AU - Berezin, Sergey
AU - Farahbakhsh, Reza
AU - Crespi, Noel
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023/1/1
Y1 - 2023/1/1
AB - We introduce a simple yet efficient sentence-level attack on black-box toxicity detection models. By appending several positive words or sentences to a hateful message, we change the neural network's prediction and bypass the toxicity detection check. We show that the approach works in seven languages from three different language families. We also describe a defence mechanism against this attack and discuss its limitations.
UR - https://www.scopus.com/pages/publications/85183291122
U2 - 10.18653/v1/2023.findings-emnlp.155
DO - 10.18653/v1/2023.findings-emnlp.155
M3 - Conference contribution
AN - SCOPUS:85183291122
T3 - Findings of the Association for Computational Linguistics: EMNLP 2023
SP - 2362
EP - 2369
BT - Findings of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
T2 - 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
Y2 - 6 December 2023 through 10 December 2023
ER -