TY - GEN
T1 - Reducing False Positives of Static Bug Detectors Through Code Representation Learning
AU - Yang, Yixin
AU - Wen, Ming
AU - Gao, Xiang
AU - Zhang, Yuting
AU - Sun, Hailong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the increasing significance of software correctness and security, automatic static analysis tools (ASATs) play an increasingly important role in software development due to their capability and scalability. However, compared to dynamic analysis methods, static tools often suffer from high false positive rates due to their analysis mechanisms. To alleviate the false positive problem, many approaches have been proposed that manually extract features from code snippets and then prioritize real warnings by means of statistics or machine learning techniques. However, manually encoded features are insufficient to achieve satisfactory performance across different datasets. In this study, we explore the effectiveness of various code representation learning (CRL) techniques in understanding the semantics of warnings generated by ASATs. In particular, our large-scale empirical study not only reveals that CRL models can effectively differentiate buggy code snippets (i.e., those containing warnings detected by ASATs) from clean ones (the median F1-score reaches 87.3% for binary classification and 77.4% for multi-class classification), but also that they are promising in identifying false positive warnings (the F1-score of the best performer is 75.6%). Such findings drive us to further design a novel approach named PRISM, which PRIoritizes Static warnings by aggregating multiple CRL Models to reduce the false positives generated by existing ASATs. Extensive evaluations demonstrate that our approach significantly outperforms existing baselines.
AB - With the increasing significance of software correctness and security, automatic static analysis tools (ASATs) play an increasingly important role in software development due to their capability and scalability. However, compared to dynamic analysis methods, static tools often suffer from high false positive rates due to their analysis mechanisms. To alleviate the false positive problem, many approaches have been proposed that manually extract features from code snippets and then prioritize real warnings by means of statistics or machine learning techniques. However, manually encoded features are insufficient to achieve satisfactory performance across different datasets. In this study, we explore the effectiveness of various code representation learning (CRL) techniques in understanding the semantics of warnings generated by ASATs. In particular, our large-scale empirical study not only reveals that CRL models can effectively differentiate buggy code snippets (i.e., those containing warnings detected by ASATs) from clean ones (the median F1-score reaches 87.3% for binary classification and 77.4% for multi-class classification), but also that they are promising in identifying false positive warnings (the F1-score of the best performer is 75.6%). Such findings drive us to further design a novel approach named PRISM, which PRIoritizes Static warnings by aggregating multiple CRL Models to reduce the false positives generated by existing ASATs. Extensive evaluations demonstrate that our approach significantly outperforms existing baselines.
KW - Code Representation Learning
KW - False Positive Warnings
KW - Static Bug Detector
UR - https://www.scopus.com/pages/publications/85199794248
U2 - 10.1109/SANER60148.2024.00075
DO - 10.1109/SANER60148.2024.00075
M3 - Conference contribution
AN - SCOPUS:85199794248
T3 - Proceedings - 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024
SP - 681
EP - 692
BT - Proceedings - 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024
Y2 - 12 March 2024 through 15 March 2024
ER -