TY - JOUR
T1 - Learning to represent code semantics
AU - Liu, Fang
AU - Li, Ge
AU - Zhao, Qianhui
AU - Zhang, Li
N1 - Publisher Copyright: © Science China Press 2025.
PY - 2025/7
Y1 - 2025/7
N2 - Code semantic learning is the basis of many program analysis tasks. Researchers have devoted considerable effort over the years to building robust and effective code representation models. One line of work incorporates code structure into the representations. To further improve the robustness of code representations, approaches based on compiler intermediate representations (IRs) have been proposed. However, these IR-based models suffer from heavy computational cost and memory overhead, so representing program semantics both effectively and efficiently remains a challenge. To this end, we propose EECS, an effective and efficient code semantic representation approach based on compiler IRs and a hybrid attention mechanism. For the input representation, to address the unbounded vocabulary size of IR, we propose a variable identification strategy that assigns each register variable a new ID reflecting its relative position. We also extract the data flow information among code blocks. We then build a hierarchical multi-layer Transformer encoder that captures both data dependency information and code semantics through the hybrid attention mechanism. To help EECS better learn code semantics and functionality, we jointly optimize three objectives during training. Experimental results on three code semantic understanding tasks show that EECS outperforms state-of-the-art techniques, demonstrating its strong capability in understanding program semantics.
KW - artificial intelligence
KW - code semantic learning
KW - compiler intermediate representation
KW - data dependency modeling
KW - software engineering
UR - https://www.scopus.com/pages/publications/105008715676
U2 - 10.1007/s11432-023-3898-5
DO - 10.1007/s11432-023-3898-5
M3 - Article
AN - SCOPUS:105008715676
SN - 1674-733X
VL - 68
JO - Science China Information Sciences
JF - Science China Information Sciences
IS - 7
M1 - 172101
ER -