TY - GEN
T1 - How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?
AU - Gong, Jianian
AU - Duan, Nachuan
AU - Tao, Ziheng
AU - Gong, Zhaohui
AU - Yuan, Yuan
AU - Huang, Minlie
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/12/24
Y1 - 2025/12/24
AB - The rapid advancement of large language models (LLMs) such as GPT-4 has revolutionized the landscape of software engineering, positioning these models at the core of modern development practices. To fully realize their potential in producing secure source code autonomously, LLMs must not only generate code but also identify and repair vulnerabilities in their outputs, thereby improving security iteratively. Despite their growing prominence, LLMs' effectiveness in performing such end-to-end tasks remains unexplored. This paper bridges this gap by systematically investigating the capability of LLMs to generate source code, evaluate their own outputs for vulnerabilities, and apply necessary repairs to improve the security of their self-generated code. Specifically, we studied the ability of GPT-3.5 and GPT-4 to identify and repair vulnerabilities in the code generated by four popular LLMs, including themselves (GPT-3.5, GPT-4, Code Llama, and CodeGeeX2). By manually or automatically reviewing 4,900 pieces of code, our study reveals that: (1) over 75% of the Python code that LLMs generate in the given scenarios is vulnerable; (2) LLMs such as GPT-3.5 and GPT-4 are unable to precisely identify vulnerabilities in the code they generated; (3) GPT-3.5 and GPT-4 can achieve success rates of 33.2% to 59.6% in repairing the insecure code produced by the four LLMs, but both perform poorly when repairing self-produced code, indicating self-repair "blind spots". To address the limitation of a single round of repair, we developed a lightweight tool using LLMs as agents to construct safer source code through an iterative repair procedure based on the insights gained from our study. Experiments show that, assisted by semantic analysis engines, our tool significantly raises repair success rates to between 65.9% and 85.5%.
KW - code generation
KW - CWE
KW - end-to-end
KW - large language models
KW - software security
KW - vulnerability detection and repair
UR - https://www.scopus.com/pages/publications/105026975859
U2 - 10.1145/3756681.3756984
DO - 10.1145/3756681.3756984
M3 - Conference contribution
AN - SCOPUS:105026975859
T3 - Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE, 2025 edition, EASE 2025
SP - 1004
EP - 1013
BT - Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE, 2025 edition, EASE 2025
A2 - Babar, Muhammad Ali
A2 - Tosun, Ayse
A2 - Wagner, Stefan
A2 - Stray, Viktoria
PB - Association for Computing Machinery, Inc
T2 - 29th International Conference on Evaluation and Assessment in Software Engineering, EASE 2025
Y2 - 17 June 2025 through 20 June 2025
ER -