
Hierarchical and Pairwise Document Embedding for Plagiarism Detection

  • Ruitong Zhang
  • Lianzhong Liu
  • Jiaofu Zhang
  • Zihang Huang
  • Caiwei Yang
  • Liangxuan Zhao
  • Tongge Xu*
  • *Corresponding author for this work
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The rapid development of the Internet, and in particular the widespread use of search engines and machine translation, has made it easier to copy text. Most existing text plagiarism detection methods cannot cope with the growing number of plagiarism sources and with increasingly ambiguous plagiarized text. In this paper, we focus on the task of large-scale text deduplication and propose a multi-level distributed text computing model, which speeds up checking through multi-level latent semantic analysis (LSA) and incorporates BERT to judge plagiarized text more accurately. To further validate the model, we also used the latest fuzzy plagiarism techniques to construct a three-level data set. The experimental results show that our model performs well as both the volume of plagiarism data and the degree of plagiarism ambiguity increase.

Original language: English
Title of host publication: Advanced Data Mining and Applications - 16th International Conference, ADMA 2020, Proceedings
Editors: Xiaochun Yang, Chang-Dong Wang, Md. Saiful Islam, Zheng Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 148-156
Number of pages: 9
ISBN (Print): 9783030653897
State: Published - 2020
Event: 16th International Conference on Advanced Data Mining and Applications, ADMA 2020 - Foshan, China
Duration: 12 Nov 2020 to 14 Nov 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12447 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th International Conference on Advanced Data Mining and Applications, ADMA 2020
Country/Territory: China
City: Foshan
Period: 12/11/20 to 14/11/20

Keywords

  • BERT
  • LSA
  • Plagiarism detection
