How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?

  • Jianian Gong
  • Nachuan Duan
  • Ziheng Tao
  • Zhaohui Gong
  • Yuan Yuan*
  • Minlie Huang
  • *Corresponding author for this work
  • Beihang University
  • Zhongguancun Laboratory
  • Tsinghua University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

The rapid advancement of large language models (LLMs) such as GPT-4 has revolutionized the landscape of software engineering, positioning these models at the core of modern development practices. To fully realize their potential in producing secure source code autonomously, LLMs must not only generate code but also identify and repair vulnerabilities in their outputs, thereby improving security iteratively. Despite their growing prominence, LLMs' effectiveness in performing such end-to-end tasks remains unexplored. This paper bridges this gap by systematically investigating the capability of LLMs to generate source code, evaluate their own outputs for vulnerabilities, and apply necessary repairs to improve the security of their self-generated code. Specifically, we studied the ability of GPT-3.5 and GPT-4 to identify and repair vulnerabilities in the code generated by four popular LLMs including themselves (GPT-3.5, GPT-4, Code Llama, and CodeGeeX2). By manually or automatically reviewing 4,900 pieces of code, our study reveals that: (1) more than 75% of the Python code that LLMs generate in the given scenarios is vulnerable; (2) LLMs such as GPT-3.5 and GPT-4 are unable to precisely identify vulnerabilities in the code they generated; (3) GPT-3.5 and GPT-4 can achieve success rates of 33.2% to 59.6% in repairing the insecure code produced by the four LLMs, but they both perform poorly when repairing self-produced code, indicating self-repair "blind spots". To address the limitation of a single round of repair, we developed a lightweight tool using LLMs as agents to construct safer source code through an iterative repair procedure based on the insights gained from our study. Experiments show that, assisted by semantic analysis engines, our tool significantly improves the success rates of repair to between 65.9% and 85.5%.
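The iterative repair procedure mentioned in the abstract can be pictured as a scan-and-repair loop. The sketch below is only an illustration of that general idea, not the authors' tool: `iterative_repair`, `toy_scan`, `toy_repair`, and the `max_rounds` parameter are all hypothetical names, and the toy scanner and repairer are simple regex stand-ins for what would be a semantic analysis engine and an LLM call in a real setting.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interfaces: a scanner returns a list of findings,
# a repairer returns a new version of the code given those findings.
Scanner = Callable[[str], List[str]]
Repairer = Callable[[str, List[str]], str]

# Matches a bare call to eval(...), but not ast.literal_eval(...).
_BARE_EVAL = re.compile(r"(?<![\w.])eval\(")


def iterative_repair(
    code: str, scan: Scanner, repair: Repairer, max_rounds: int = 3
) -> Tuple[str, List[str], int]:
    """Repeat scan -> repair until no findings remain or rounds run out."""
    findings = scan(code)
    rounds = 0
    while findings and rounds < max_rounds:
        code = repair(code, findings)  # stand-in for an LLM repair request
        rounds += 1
        findings = scan(code)          # re-check the repaired code
    return code, findings, rounds


def toy_scan(code: str) -> List[str]:
    """Toy stand-in for a static analyzer: flags bare eval() calls."""
    if _BARE_EVAL.search(code):
        return ["use of eval() on possibly untrusted input"]
    return []


def toy_repair(code: str, findings: List[str]) -> str:
    """Toy stand-in for an LLM repair: swaps eval() for ast.literal_eval()."""
    return _BARE_EVAL.sub("ast.literal_eval(", code)


if __name__ == "__main__":
    fixed, remaining, rounds = iterative_repair(
        "x = eval(user_input)", toy_scan, toy_repair
    )
    print(fixed, remaining, rounds)
```

Feeding the scanner's findings back into each repair request, and re-scanning after every round, is what distinguishes this loop from a single round of repair; the round limit keeps it from running indefinitely when a finding cannot be fixed.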

Original language: English
Title of host publication: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE, 2025 edition, EASE 2025
Editors: Muhammad Ali Babar, Ayse Tosun, Stefan Wagner, Viktoria Stray
Publisher: Association for Computing Machinery, Inc
Pages: 1004-1013
Number of pages: 10
ISBN (Electronic): 9798400713859
DOI
Publication status: Published - 24 Dec 2025
Event: 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025 - Istanbul, Turkey
Duration: 17 Jun 2025 to 20 Jun 2025

Publication series

Name: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE, 2025 edition, EASE 2025

Conference

Conference: 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025
Country/Territory: Turkey
City: Istanbul
Period: 17/06/25 to 20/06/25
