TY - GEN
T1 - UniCoder: Scaling Code Large Language Model via Universal Code
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Sun, Tao
AU - Chai, Linzheng
AU - Yang, Jian
AU - Yin, Yuwei
AU - Guo, Hongcheng
AU - Liu, Jiaheng
AU - Wang, Bing
AU - Yang, Liqun
AU - Li, Zhoujun
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Intermediate reasoning or acting steps have successfully improved large language models (LLMs) on various downstream natural language processing (NLP) tasks. When applying LLMs to code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then generate code from the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks, since the standard CoT differs from code in logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation: a description of algorithm steps using a mix of programming-language conventions, such as assignment operators, conditional operators, and loops. To this end, we collect an instruction dataset, UNICODER-INSTRUCT, to train our model UNICODER on multi-task learning objectives. UNICODER-INSTRUCT comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal-code representation and the final code solution significantly improves the quality of the generated code. Experimental results demonstrate that UNICODER with the universal code outperforms previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
AB - Intermediate reasoning or acting steps have successfully improved large language models (LLMs) on various downstream natural language processing (NLP) tasks. When applying LLMs to code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then generate code from the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks, since the standard CoT differs from code in logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation: a description of algorithm steps using a mix of programming-language conventions, such as assignment operators, conditional operators, and loops. To this end, we collect an instruction dataset, UNICODER-INSTRUCT, to train our model UNICODER on multi-task learning objectives. UNICODER-INSTRUCT comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal-code representation and the final code solution significantly improves the quality of the generated code. Experimental results demonstrate that UNICODER with the universal code outperforms previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
UR - https://www.scopus.com/pages/publications/85202614298
U2 - 10.18653/v1/2024.acl-long.100
DO - 10.18653/v1/2024.acl-long.100
M3 - Conference contribution
AN - SCOPUS:85202614298
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1812
EP - 1824
BT - Long Papers
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -