摘要
Large Language Models (LLMs) have gained widespread use across various fields due to their exceptional performance. However, these models are vulnerable to generating inappropriate or incorrect outputs when exposed to carefully designed jailbreak prompts, which has sparked significant concerns about their ethical implications and safety. Attackers can exploit these models by crafting specific prompt statements that elicit unintended or harmful content, without needing to understand the models' internal workings or security mechanisms. In fact, researchers in the field have identified the inherent vulnerabilities of LLMs and developed automated jailbreak methods that are both highly effective and difficult for humans to detect. To mitigate the risks associated with these malicious jailbreak attacks, researchers have proposed comprehensive defense strategies that encompass the entire lifecycle of LLMs, from their training phases to deployment, aiming to strengthen model security. However, existing reviews of LLM security are primarily focused on describing jailbreak attack methods, often neglecting a detailed examination of the characteristics and interrelationships of these various techniques. Moreover, the lack of a thorough evaluation framework has hindered the advancement of robust defenses in this area. This paper seeks to fill that gap by providing an exhaustive review of current jailbreak attacks and defense mechanisms for LLMs. It begins with an introduction to the fundamental concepts and principles related to LLMs and jailbreak attacks, highlighting their importance in the context of model security and the potential threats they pose. The paper then delves into existing attack generation strategies, evaluating their strengths and weaknesses, particularly in how they exploit specific vulnerabilities within the models. Additionally, it offers a comprehensive summary of defense strategies across the different stages of LLMs and presents a detailed evaluation benchmark, discussing the metrics and methodologies used to assess the effectiveness of these defenses. Finally, the paper addresses the current challenges and outlines potential future research directions, emphasizing the need for ongoing attention to key issues to enhance the security and reliability of large language models. By providing a detailed and structured overview, this paper aims to guide the development of more secure, trustworthy, and ethically aligned LLMs.
| 投稿的翻译标题 | A Review of Jailbreak Attacks and Defenses for Large Language Models |
|---|---|
| 源语言 | 繁体中文 |
| 页(从-至) | 56-86 |
| 页数 | 31 |
| 期刊 | Journal of Cyber Security |
| 卷 | 9 |
| 期 | 5 |
| DOI | |
| 出版状态 | 已出版 - 9月 2024 |
关键词
- deep learning
- jailbreak attack
- jailbreak defense
- large language model
- trustworthy artificial intelligence
指纹
探究 '面向大语言模型的越狱攻击与防御综述' 的科研主题。它们共同构成独一无二的指纹。引用此
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver