TY - GEN
T1 - An Adversarial Toxicity Prompt Generator Exploiting Multilingual Code-Switching in LLMs
AU - Wadgaonkar, Vinit Sudhir
AU - Ratha, Nalini
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - The widespread adoption of Large Language Models (LLMs) has heightened concerns about their potential to generate toxic, biased, or harmful content. These risks are magnified in multilingual contexts involving code-switching, culturally nuanced slang, and adversarial prompt engineering, which often evade conventional moderation. Existing detection models struggle due to static datasets and limited adaptability to evolving linguistic patterns. To address these challenges, we propose two innovations: a Dynamic Multilingual Toxicity Dataset that extends PolygloToxicityPrompts using subject-predicate permutations and translation averaging; and a novel Adversarial Prompt Generation Framework based on Reinforcement Learning with Adversarial Feedback (RLAF), designed to elicit toxic responses from target LLMs via reward-driven refinement. Enhanced with BLOOM fine-tuning, Model-Agnostic Meta-Learning (MAML), and dynamic few-shot learning, our framework offers robust stress-testing under multilingual, adversarial conditions [15], [16], [20].
KW - adversarial prompt generation
KW - BLOOM fine-tuning
KW - code-switching
KW - LLM safety
KW - MAML
KW - multilingual toxicity detection
KW - RLAF
KW - stress-testing LLMs
UR - https://www.scopus.com/pages/publications/105011290887
DO - 10.1109/CAI64502.2025.00156
M3 - Conference contribution
AN - SCOPUS:105011290887
T3 - Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
SP - 884
EP - 887
BT - Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE Conference on Artificial Intelligence, CAI 2025
Y2 - 5 May 2025 through 7 May 2025
ER -