An Adversarial Toxicity Prompt Generator Exploiting Multilingual Code-Switching in LLMs

  • SUNY Buffalo

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

The widespread adoption of Large Language Models (LLMs) has heightened concerns about their potential to generate toxic, biased, or harmful content. These risks are magnified in multilingual contexts involving code-switching, culturally nuanced slang, and adversarial prompt engineering, which often evade conventional moderation. Existing detection models struggle due to static datasets and limited adaptability to evolving linguistic patterns. To address these challenges, we propose two innovations: a Dynamic Multilingual Toxicity Dataset that extends PolygloToxicityPrompts using subject-predicate permutations and translation averaging; and a novel Adversarial Prompt Generation Framework based on Reinforcement Learning with Adversarial Feedback (RLAF), designed to elicit toxic responses from target LLMs via reward-driven refinement. Enhanced with BLOOM fine-tuning, Model-Agnostic Meta-Learning (MAML), and dynamic few-shot learning, our framework offers robust stress-testing under multilingual, adversarial conditions [15], [16], [20].

Original language: English
Title of host publication: Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 884-887
Number of pages: 4
ISBN (Electronic): 9798331524005
State: Published - 2025
Event: 3rd IEEE Conference on Artificial Intelligence, CAI 2025 - Santa Clara, United States
Duration: May 5 2025 to May 7 2025

Publication series

Name: Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025

Conference

Conference: 3rd IEEE Conference on Artificial Intelligence, CAI 2025
Country/Territory: United States
City: Santa Clara
Period: 05/5/25 to 05/7/25

Keywords

  • adversarial prompt generation
  • BLOOM fine-tuning
  • code-switching
  • LLM safety
  • MAML
  • multilingual toxicity detection
  • RLAF
  • stress-testing LLMs
