No offence, Bert - I insult only humans! Multiple addressees sentence-level attack on toxicity detection neural networks

Sergey Berezin, Reza Farahbakhsh, Noel Crespi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We introduce a simple yet effective sentence-level attack on black-box toxicity detection models. By adding several positive words or sentences to the end of a hateful message, we are able to change the prediction of a neural network and pass the toxicity detection check. This approach is shown to work on seven languages from three different language families. We also describe a defence mechanism against this attack and discuss its limitations.
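The abstract's idea can be illustrated with a toy sketch: appending positive words to a toxic message dilutes the toxicity signal seen by a naive classifier. The word lists, scoring rule, and threshold below are hypothetical stand-ins for illustration only, not the authors' method or the actual neural detectors evaluated in the paper.

```python
# Toy bag-of-words "toxicity detector": the score is the fraction of
# tokens found in a small toxic word list. Real detectors are neural
# networks, but many share this sensitivity to dilution.
TOXIC_WORDS = {"idiot", "stupid", "hate"}

def toxicity_score(text: str) -> float:
    """Fraction of tokens that appear in the toxic word list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in TOXIC_WORDS for t in tokens) / len(tokens)

def is_toxic(text: str, threshold: float = 0.2) -> bool:
    return toxicity_score(text) >= threshold

# Sentence-level attack: pad the hateful message with positive words.
message = "you are a stupid idiot"
attacked = message + " love wonderful kind sunshine flowers friends"

print(is_toxic(message))   # the original message is flagged
print(is_toxic(attacked))  # the padded message slips under the threshold
```

For this toy scorer the original message scores 2/5 = 0.4 and is flagged, while the padded message scores 2/11 ≈ 0.18 and passes, mirroring the dilution effect the paper exploits against black-box detectors.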

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: EMNLP 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 2362-2369
Number of pages: 8
ISBN (Electronic): 9798891760615
DOIs
Publication status: Published - 1 Jan 2023
Event: 2023 Findings of the Association for Computational Linguistics: EMNLP 2023 - Hybrid, Singapore
Duration: 6 Dec 2023 - 10 Dec 2023

Publication series

Name: Findings of the Association for Computational Linguistics: EMNLP 2023

Conference

Conference: 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
Country/Territory: Singapore
City: Hybrid
Period: 6/12/23 - 10/12/23
