Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

  • Zonghao Ying
  • Deyue Zhang
  • Zonglei Jing
  • Xiangzheng Zhang
  • Quanchen Zou*
  • Yisong Xiao
  • Siyuan Liang
  • Aishan Liu*
  • Xianglong Liu
  • Dacheng Tao
  • *Corresponding author for this work
  • Beihang University
  • 360 AI Security Lab
  • Nanyang Technological University

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs’ strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves an average ASR of 83.3% against leading commercial models, including Gemini 2.0 Flash Thinking and OpenAI o1, underscoring its potency.

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 17138-17157
Number of pages: 20
ISBN (electronic): 9798891763357
DOI
Publication status: Published - 2025
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/Territory: China
City: Suzhou
Period: 4/11/25 to 9/11/25
