TY - GEN
T1 - Learning to Sample Replacements for ELECTRA Pre-Training
AU - Hao, Yaru
AU - Dong, Li
AU - Bao, Hangbo
AU - Xu, Ke
AU - Wei, Furu
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
N2 - ELECTRA (Clark et al., 2020a) pre-trains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling. Despite its compelling performance, ELECTRA suffers from two issues. First, there is no direct feedback loop from the discriminator to the generator, which makes replacement sampling inefficient. Second, the generator's predictions tend to become over-confident as training progresses, which biases replacements toward the correct tokens. In this paper, we propose two methods to improve replacement sampling for ELECTRA pre-training. Specifically, we augment sampling with a hardness prediction mechanism, so that the generator can encourage the discriminator to learn what it has not yet acquired. We also prove that this efficient sampling reduces the training variance of the discriminator. Moreover, we propose using a focal loss for the generator to alleviate the oversampling of correct tokens as replacements. Experimental results show that our method improves ELECTRA pre-training on various downstream tasks. Our code and pre-trained models will be released at: https://github.com/YRdddream/electra-hp.
UR - https://www.scopus.com/pages/publications/85123947818
U2 - 10.18653/v1/2021.findings-acl.394
DO - 10.18653/v1/2021.findings-acl.394
M3 - Conference contribution
AN - SCOPUS:85123947818
T3 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
SP - 4495
EP - 4506
BT - Findings of the Association for Computational Linguistics
A2 - Zong, Chengqing
A2 - Xia, Fei
A2 - Li, Wenjie
A2 - Navigli, Roberto
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Y2 - 1 August 2021 through 6 August 2021
ER -