
Learning to represent code semantics

  • State Key Laboratory of Complex & Critical Software Environment
  • Peking University
  • Beihang University

Research output: Contribution to journal › Article › peer-review

Abstract

Code semantic learning underlies many program analysis tasks, and researchers have devoted considerable effort over the years to building robust and effective code representation models. One line of work introduces code structure into the representations. To further improve robustness, approaches based on compiler intermediate representations (IRs) have been proposed. However, these IR-based models suffer from heavy computational cost and memory overhead, so representing program semantics both effectively and efficiently remains a challenge. To this end, we propose EECS, an effective and efficient code semantic representation approach based on compiler IRs and a hybrid attention mechanism. For input representation, to address the unbounded vocabulary size of IR, we propose a variable identification strategy that assigns each register variable a new ID representing its relative position. We also extract the data flow information among the code blocks. We then build a hierarchical multi-layer Transformer encoder that captures data dependency information as well as code semantics through a hybrid attention mechanism. To help EECS learn code semantics and functionality better, we jointly optimize three objectives during training. Experimental results on three code semantic understanding tasks show that EECS outperforms state-of-the-art techniques, demonstrating its capability for program semantics understanding.
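The abstract does not give the details of the variable identification strategy, so the following is only a minimal sketch of one plausible reading: register names in LLVM-style IR are replaced by IDs that reflect the order in which each register first appears, which keeps the vocabulary bounded regardless of the absolute register numbers. The function name `normalize_registers`, the `%v<k>` naming scheme, and the regular expression are all assumptions made for illustration, not the EECS implementation.

```python
import re

# Assumption: LLVM-style local identifiers such as %5, %tmp, %x.addr.
# This simple pattern would also match label operands (e.g. "label %if.then");
# a real pipeline would need to treat those separately.
REGISTER = re.compile(r"%[A-Za-z0-9_.]+")


def normalize_registers(ir_lines):
    """Rename each register to %v<k>, where k is its order of first appearance.

    Returns the rewritten lines and the original-name -> new-ID mapping.
    """
    ids = {}

    def rename(match):
        name = match.group(0)
        if name not in ids:
            ids[name] = f"%v{len(ids)}"  # next unused relative ID
        return ids[name]

    # re.sub visits matches left to right, so IDs follow textual order.
    return [REGISTER.sub(rename, line) for line in ir_lines], ids


# Hypothetical IR fragment used only to exercise the sketch.
snippet = [
    "%12 = load i32, i32* %x.addr",
    "%13 = add nsw i32 %12, 1",
    "store i32 %13, i32* %x.addr",
]
normalized, mapping = normalize_registers(snippet)
for line in normalized:
    print(line)
```

Under this scheme, two functions that differ only in their absolute register numbers map to the same token sequence, which is the property that keeps the IR vocabulary finite.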

Original language: English
Article number: 172101
Journal: Science China Information Sciences
Volume: 68
Issue number: 7
State: Published - Jul 2025

Keywords

  • artificial intelligence
  • code semantic learning
  • compiler intermediate representation
  • data dependency modeling
  • software engineering
