Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

  • Rui Li
  • Peiyi Wang
  • Jingyuan Ma
  • Di Zhang
  • Zhifang Sui*
  • Lei Sha

*Corresponding author for this work
Peking University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 3287-3301
Number of pages: 15
ISBN (electronic): 9798891761681
Publication status: Published - 2024
Event: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

Conference

Conference: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 12 Nov 2024 to 16 Nov 2024
