
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

  • Qibing Ren
  • Hao Li
  • Dongrui Liu
  • Zhanxu Xie
  • Xiaoya Lu
  • Yu Qiao
  • Lei Sha
  • Junchi Yan
  • Lizhuang Ma*
  • Jing Shao*
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • Shanghai Artificial Intelligence Laboratory
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some tradeoffs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.

Original language: English
Title of host publication: Long Papers
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher: Association for Computational Linguistics (ACL)
Pages: 24763-24785
Number of pages: 23
ISBN (electronic): 9798891762510
DOI
Publication status: Published - 2025
Event: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume: 1
ISSN (print): 0736-587X

Conference

Conference: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/Territory: Austria
City: Vienna
Period: 27/07/25 to 1/08/25
