TY - GEN
T1 - MCodeSearcher
T2 - 14th Asia-Pacific Symposium on Internetware, Internetware 2023
AU - Li, Jia
AU - Liu, Fang
AU - Zhao, Yunfei
AU - Li, Ge
AU - Jin, Zhi
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/8/4
Y1 - 2023/8/4
N2 - Code search has been a critical software development activity in facilitating developers to retrieve a proper code snippet from open-source repositories given a user intent. In recent years, large-scale pre-trained models have shown impressive performance on code representation learning and have achieved state-of-the-art performance on code search task. However, it is challenging for these models to distinguish the functionally equivalent code snippets with dissimilar implementations or the non-equivalent code snippets that look similar. Due to the diversity of the code implementations, it is necessary for the code search engines to identify the functional similarities or dissimilarities of source code so as to return the functionally matched source code for a given query. Besides, existing pre-trained models mainly focus on learning the semantic representations of code snippets. The semantic correlation between the code snippet and natural language query is not sufficiently exploited. An effective code search tool not only needs to understand the relationship between queries and code snippets but also needs to identify the relationship between diversified code snippets. To address these limitations, we propose a novel multi-view contrastive learning model MCodeSearcher for code retrieval, aiming at sufficiently exploiting (1) the semantic correlation between queries and code snippets, and (2) the relationship between functionally equivalent code snippets. To achieve this, we design contrastive training objectives from three views and pre-train our model with these objectives. The experimental results on five representative code search datasets show that our approach significantly outperforms the state-of-the-art methods.
AB - Code search has been a critical software development activity in facilitating developers to retrieve a proper code snippet from open-source repositories given a user intent. In recent years, large-scale pre-trained models have shown impressive performance on code representation learning and have achieved state-of-the-art performance on code search task. However, it is challenging for these models to distinguish the functionally equivalent code snippets with dissimilar implementations or the non-equivalent code snippets that look similar. Due to the diversity of the code implementations, it is necessary for the code search engines to identify the functional similarities or dissimilarities of source code so as to return the functionally matched source code for a given query. Besides, existing pre-trained models mainly focus on learning the semantic representations of code snippets. The semantic correlation between the code snippet and natural language query is not sufficiently exploited. An effective code search tool not only needs to understand the relationship between queries and code snippets but also needs to identify the relationship between diversified code snippets. To address these limitations, we propose a novel multi-view contrastive learning model MCodeSearcher for code retrieval, aiming at sufficiently exploiting (1) the semantic correlation between queries and code snippets, and (2) the relationship between functionally equivalent code snippets. To achieve this, we design contrastive training objectives from three views and pre-train our model with these objectives. The experimental results on five representative code search datasets show that our approach significantly outperforms the state-of-the-art methods.
KW - Code search
KW - contrastive learning
KW - deep neural network
UR - https://www.scopus.com/pages/publications/85175738682
U2 - 10.1145/3609437.3609456
DO - 10.1145/3609437.3609456
M3 - 会议稿件
AN - SCOPUS:85175738682
T3 - ACM International Conference Proceeding Series
SP - 270
EP - 280
BT - 14th Asia-Pacific Symposium on Internetware, Internetware 2023 - Proceedings
PB - Association for Computing Machinery
Y2 - 4 August 2023 through 6 August 2023
ER -