TY - GEN
T1 - CODE-MVP
T2 - 2022 Findings of the Association for Computational Linguistics: NAACL 2022
AU - Wang, Xin
AU - Wang, Yasheng
AU - Wan, Yao
AU - Wang, Jiawei
AU - Zhou, Pingyi
AU - Li, Li
AU - Wu, Hao
AU - Liu, Jin
N1 - Publisher Copyright:
© Findings of the Association for Computational Linguistics: NAACL 2022 - Findings.
PY - 2022
Y1 - 2022
N2 - Recent years have witnessed increasing interest in code representation learning, which aims to represent the semantics of source code into distributed vectors. Currently, various works have been proposed to represent the complex semantics of source code from different views, including plain text, Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., Control/Data Flow Graph). However, most of them only consider a single view of source code independently, ignoring the correspondences among different views. In this paper, we propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training, and name our model as CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework. Inspired by the type checking in compilation, we also design a fine-grained type inference objective in the pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines. For example, we achieve 2.4/2.3/1.1 gain in terms of MRR/MAP/Accuracy metrics on natural language code retrieval, code similarity, and code defect detection tasks, respectively.
AB - Recent years have witnessed increasing interest in code representation learning, which aims to represent the semantics of source code into distributed vectors. Currently, various works have been proposed to represent the complex semantics of source code from different views, including plain text, Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., Control/Data Flow Graph). However, most of them only consider a single view of source code independently, ignoring the correspondences among different views. In this paper, we propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training, and name our model as CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework. Inspired by the type checking in compilation, we also design a fine-grained type inference objective in the pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines. For example, we achieve 2.4/2.3/1.1 gain in terms of MRR/MAP/Accuracy metrics on natural language code retrieval, code similarity, and code defect detection tasks, respectively.
UR - https://www.scopus.com/pages/publications/85135499146
U2 - 10.18653/v1/2022.findings-naacl.80
DO - 10.18653/v1/2022.findings-naacl.80
M3 - Conference contribution
AN - SCOPUS:85135499146
T3 - Findings of the Association for Computational Linguistics: NAACL 2022 - Findings
SP - 1066
EP - 1077
BT - Findings of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
Y2 - 10 July 2022 through 15 July 2022
ER -