A fast algorithm for plagiarism detection in large-scale data

Kensuke Baba

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)


This paper proposes a fast algorithm for detecting plagiarism in large-scale data. Superficial plagiarism, such as "copy and paste", can be detected with a simple document similarity based on string matching. The algorithm reduces the cost of computing this similarity by approximating it. The effects of the approximation on processing time and accuracy are evaluated in experiments on a data set generated from actual scholarly documents. The results show that the algorithm based on the approximated similarity takes less than one-third of the processing time of the straightforward algorithm based on the exact similarity, in exchange for a slight decrease in accuracy.
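To make the idea of a string-matching similarity concrete, the following is a minimal sketch of one plausible exact similarity: the largest number of agreeing characters over all relative shifts of two documents, normalized by the shorter length. The abstract does not define the similarity or the approximation, so the function names `match_counts` and `similarity`, the normalization, and the character-level comparison are all assumptions for illustration, not the paper's method.

```python
def match_counts(a: str, b: str) -> list[int]:
    """Count agreeing positions for every relative shift of b against a.

    Hypothetical helper: a shift s aligns b[0] with a[s]; negative shifts
    let b overhang the start of a.
    """
    counts = []
    for s in range(-(len(b) - 1), len(a)):
        lo = max(0, s)                    # first overlapping index in a
        hi = min(len(a), s + len(b))      # one past the last overlapping index
        counts.append(sum(1 for i in range(lo, hi) if a[i] == b[i - s]))
    return counts


def similarity(a: str, b: str) -> float:
    """Best alignment score, normalized by the shorter length (an assumption)."""
    if not a or not b:
        return 0.0
    return max(match_counts(a, b)) / min(len(a), len(b))


if __name__ == "__main__":
    print(similarity("the quick brown fox", "quick brown"))
    print(similarity("aaaa", "bbbb"))
```

This direct computation takes time proportional to the product of the two lengths. The same per-shift match counts can be obtained by convolution using a fast Fourier transform, which is one standard way to speed up such a similarity. Whether the paper's approximation works this way is not stated in the abstract.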

Original language: English
Pages (from-to): 331-338
Number of pages: 8
Journal: Journal of Digital Information Management
Issue number: 6
Publication status: Published - Dec 2017
Externally published: Yes


Keywords

  • Approximate string matching
  • Discrete Fourier transform
  • Plagiarism detection
  • Vector representation of words

ASJC Scopus subject areas

  • Management Information Systems
  • Information Systems
  • Library and Information Sciences

