RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion

Wang, Ruofan; Zheng, Xiang; Wang, Xiaosen; Wang, Cong; Ma, Xingjun; Jiang, Yu-Gang

arXiv:2503.06223 (cs)

[Submitted on 8 Mar 2025 (v1), last revised 11 Nov 2025 (this version, v4)]

Abstract: Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, where adversaries bypass safety mechanisms to elicit harmful outputs. In this work, we examine an insidious variant of this threat: toxic continuation. Unlike standard jailbreaks that rely solely on malicious instructions, toxic continuation arises when the model is given a malicious input alongside a partial toxic output, resulting in harmful completions. This vulnerability poses a unique challenge in multimodal settings, where even subtle image variations can disproportionately affect the model's response. To this end, we propose RedDiffuser (RedDiff), the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations. RedDiffuser integrates a greedy search procedure for selecting candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence. Experiments demonstrate that RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. It also exhibits strong transferability, increasing toxicity rates on Gemini by 5.1% and on LLaMA-Vision by 26.83%. These findings uncover a cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming. We will release the RedDiffuser codebase to support future research.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.06223 [cs.CV]
	(or arXiv:2503.06223v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.06223

Submission history

From: Ruofan Wang
[v1] Sat, 8 Mar 2025 13:51:40 UTC (5,945 KB)
[v2] Tue, 22 Apr 2025 08:07:23 UTC (2,895 KB)
[v3] Sun, 3 Aug 2025 09:52:38 UTC (904 KB)
[v4] Tue, 11 Nov 2025 09:28:15 UTC (1,564 KB)
