Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

  • Lifeng Qiao
  • Peng Ye*
  • Yuchen Ren
  • Weiqiang Bai
  • Chaoqi Liang
  • Xinzhu Ma
  • Nanqing Dong
  • Wanli Ouyang
  • *Corresponding author for this work
  • Shanghai Artificial Intelligence Laboratory
  • Shanghai Jiao Tong University
  • Chinese University of Hong Kong
  • The University of Sydney

Research output: Contribution to journal, conference article, peer-reviewed

Abstract

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, explicitly considering the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments. On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. MxDNA aims to provide a new perspective on DNA tokenization, potentially offering broad applications in various domains and yielding profound insights. Code is available at https://github.com/qiaoqiaoLF/MxDNA.
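To make the idea of a sparse mixture of convolution experts more concrete, the following toy sketch routes each nucleotide position to exactly one of several 1D convolution "experts". It is not the authors' implementation: the function names, the kernel sizes (1, 3, 5), the top-1 routing rule, and the random (untrained) weights are all assumptions made for illustration, and the deformable convolution mentioned in the abstract is omitted. In MxDNA the corresponding parameters are learned by gradient descent; the actual code is in the linked repository.

```python
import random

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def make_expert(kernel_size, rng):
    """Hypothetical expert: a 1D convolution kernel of shape (kernel_size, 4)."""
    return [[rng.uniform(-1, 1) for _ in NUCLEOTIDES] for _ in range(kernel_size)]


def conv_at(x, kernel, center):
    """Apply one kernel centred on position `center`, with zero padding."""
    half = len(kernel) // 2
    total = 0.0
    for offset, weights in enumerate(kernel):
        pos = center + offset - half
        if 0 <= pos < len(x):
            total += sum(w * v for w, v in zip(weights, x[pos]))
    return total


def sparse_conv_moe(seq, kernel_sizes=(1, 3, 5), seed=0):
    """Route every position to a single expert (top-1, hence 'sparse').

    Returns a list of (chosen_kernel_size, expert_output) pairs,
    one per input position. Weights are random, not trained.
    """
    rng = random.Random(seed)
    x = one_hot(seq)
    experts = [make_expert(k, rng) for k in kernel_sizes]
    # Assumed router: one linear scoring vector per expert.
    router = [[rng.uniform(-1, 1) for _ in NUCLEOTIDES] for _ in kernel_sizes]

    outputs = []
    for i, vec in enumerate(x):
        scores = [sum(w * v for w, v in zip(r, vec)) for r in router]
        chosen = max(range(len(experts)), key=lambda e: scores[e])
        outputs.append((kernel_sizes[chosen], conv_at(x, experts[chosen], i)))
    return outputs


if __name__ == "__main__":
    for base, (k, value) in zip("ACGTTGCA", sparse_conv_moe("ACGTTGCA")):
        print(base, "-> expert with kernel size", k, "output", round(value, 3))
```

Because only one expert is evaluated per position, the chosen kernel size can be read as the length of the segment that position proposes to group, which is one loose way to picture a model choosing its own token boundaries.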

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 37
Publication status: Published - 2024
Externally published
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 9 Dec 2024 to 15 Dec 2024
