Learning to represent code semantics

  • State Key Laboratory of Complex & Critical Software Environment
  • Peking University
  • Beihang University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Code semantic learning underlies many program analysis tasks, and researchers have devoted considerable effort over the years to building robust and effective code representation models. One line of work introduces code structure into the representations. To further improve robustness, approaches based on compiler intermediate representations (IRs) have been proposed. However, these IR-based models incur heavy computational cost and memory overhead, so representing program semantics both effectively and efficiently remains a challenge. To this end, we propose EECS, an effective and efficient code semantic representation approach based on compiler IRs and a hybrid attention mechanism. For the input representation, to address the unbounded vocabulary size of IR, we propose a variable identification strategy that assigns each register variable a new ID reflecting its relative position. We also extract the data flow information among the code blocks. We then build a hierarchical multi-layer Transformer encoder that captures both data dependency information and code semantics through the hybrid attention mechanism. To help EECS better learn code semantics and functionality, we jointly optimize three objectives during training. Experimental results on three code semantic understanding tasks show that EECS outperforms state-of-the-art techniques, demonstrating its capability for program semantics understanding.
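
The abstract does not spell out how the variable identification strategy assigns IDs. As a rough illustration only, the sketch below assumes LLVM-style textual IR and renames each register to an ID based on its order of first appearance, which keeps the register vocabulary bounded by the number of registers in a snippet. The function name `normalize_registers` and the `%vN` naming are hypothetical and are not taken from EECS.

```python
import re

# Assumption: registers look like LLVM local identifiers (%5, %tmp, %call.i).
REGISTER = re.compile(r"%[A-Za-z0-9._$-]+")


def normalize_registers(ir_lines):
    """Rename registers to position-based IDs (hypothetical scheme).

    Each distinct register gets the ID %vK, where K is the order in which
    the register first appears when scanning the lines left to right.
    Returns the rewritten lines and the original-name -> new-ID mapping.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"%v{len(mapping)}"
        return mapping[name]

    rewritten = [REGISTER.sub(rename, line) for line in ir_lines]
    return rewritten, mapping


if __name__ == "__main__":
    snippet = ["%5 = add i32 %3, %4", "%6 = mul i32 %5, %3"]
    lines, ids = normalize_registers(snippet)
    for line in lines:
        print(line)
```

Under this assumed scheme, two snippets that differ only in register names map to the same token sequence, so the model never sees an open-ended set of register tokens.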

Original language: English
Article number: 172101
Journal: Science China Information Sciences
Volume: 68
Issue: 7
DOI
Publication status: Published - July 2025
