Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection

Version 1.0 (December 30, 2021)
Developed by Gints Jekabsons (gints.jekabsons@rtu.lv)
Available at: http://www.cs.rtu.lv/jekabsons/nlp.html
Licensed under the GNU Lesser General Public License version 3.


This software was used for the experiments in the paper "Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection" (available at http://www.cs.rtu.lv/jekabsons/Files/Jek_ACSS2021.pdf and https://sciendo.com/article/10.2478/acss-2021-0022). If you use this software, please cite the paper. A link to the website would also be appreciated: http://www.cs.rtu.lv/jekabsons/nlp.html.


This software was developed to evaluate the effectiveness of fingerprint selection algorithms for two-stage (source retrieval + alignment) local text reuse detection. It implements the following fingerprint selection algorithms (for details, see the paper above or https://sciendo.com/article/10.2478/acss-2020-0002):
* Full fingerprinting;
* Every p-th;
* 0 mod p;
* Winnowing;
* Hailstorm;
* Frequency-Biased Winnowing (FBW);
* Modified Frequency-Biased Winnowing (MFBW).
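
To illustrate what three of the simpler algorithms in the list above select, here is a minimal Java sketch. It is not the code of this software: the class name FingerprintSelectionSketch and its method names are hypothetical, the hash sequence is assumed to be already computed from the text's n-grams, and the winnowing variant shown is the common one that breaks ties by taking the rightmost minimum in each window. Each method returns the positions of the selected hashes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the distributed software.
class FingerprintSelectionSketch {

    // Every p-th: keep the hashes at positions 0, p, 2p, ...
    static List<Integer> everyPth(List<Integer> hashes, int p) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < hashes.size(); i += p) {
            positions.add(i);
        }
        return positions;
    }

    // 0 mod p: keep the hashes whose value is divisible by p.
    static List<Integer> zeroModP(List<Integer> hashes, int p) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < hashes.size(); i++) {
            if (hashes.get(i) % p == 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    // Winnowing: slide a window of w consecutive hashes over the sequence and
    // select the minimum of each window (rightmost one on ties). A position is
    // recorded only once, even if it is the minimum of several windows.
    static List<Integer> winnow(List<Integer> hashes, int w) {
        List<Integer> positions = new ArrayList<>();
        int last = -1;
        for (int start = 0; start + w <= hashes.size(); start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes.get(j) <= hashes.get(minPos)) {
                    minPos = j;
                }
            }
            if (minPos != last) {
                positions.add(minPos);
                last = minPos;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Made-up hash values, used only for demonstration.
        List<Integer> hashes = Arrays.asList(
                77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98);
        System.out.println(everyPth(hashes, 4)); // [0, 4, 8, 12, 16]
        System.out.println(zeroModP(hashes, 7)); // [0, 2, 4, 7, 12, 14, 16]
        System.out.println(winnow(hashes, 4));   // [3, 6, 8, 11, 15]
    }
}
```

Every p-th and 0 mod p give no guarantee about the gap between two selected hashes within a shared passage, whereas winnowing selects at least one hash from every window of w consecutive hashes. Hailstorm, FBW, and MFBW are not sketched here; see the papers referenced above for their definitions.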

Indexing of the fingerprints is implemented using the Apache Lucene library (https://lucene.apache.org/).


To use the software, you need a dataset containing collection documents (directory 'src') and query documents (directory 'susp').
You can set the directory paths using the static variables of the FingerprintIndexing class.
You also need a directory named 'correct' that contains XML files for all correct text reuse document pairs.
The XML files should describe the overlapping text spans between the documents, using the syntax defined for the PAN shared tasks.
You will also need an aligner (e.g., the one developed by Sanchez-Perez: https://www.gelbukh.com/plagiarism-detection/PAN-2015/index.html) and an evaluator (e.g., the one developed for the PAN 2009-2014 shared tasks; see 'evaluate.ipynb').
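
As a rough guide to what the XML files in 'correct' are expected to hold, the following Java sketch reads the text spans from one PAN-style annotation. It is an illustration only, not the parser used by this software. The class name PanAnnotationSketch is hypothetical, and the element and attribute names ('feature', name="plagiarism", 'this_offset', 'this_length', 'source_offset', 'source_length') are assumed from the PAN text alignment corpora; please check them against your own dataset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of the distributed software.
class PanAnnotationSketch {

    // Returns one array per annotated span:
    // {this_offset, this_length, source_offset, source_length}.
    // Attribute names are an assumption based on the PAN corpora.
    static List<int[]> readSpans(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList features = doc.getElementsByTagName("feature");
            List<int[]> spans = new ArrayList<>();
            for (int i = 0; i < features.getLength(); i++) {
                Element f = (Element) features.item(i);
                // Skip metadata features; keep only the text reuse annotations.
                if (!"plagiarism".equals(f.getAttribute("name"))) {
                    continue;
                }
                spans.add(new int[] {
                        Integer.parseInt(f.getAttribute("this_offset")),
                        Integer.parseInt(f.getAttribute("this_length")),
                        Integer.parseInt(f.getAttribute("source_offset")),
                        Integer.parseInt(f.getAttribute("source_length"))
                });
            }
            return spans;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse annotation XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up example annotation for one suspicious/source document pair.
        String xml = "<document reference=\"suspicious-document00001.txt\">"
                + "<feature name=\"plagiarism\" this_offset=\"10\" this_length=\"200\" "
                + "source_reference=\"source-document00002.txt\" "
                + "source_offset=\"30\" source_length=\"180\"/>"
                + "</document>";
        for (int[] s : readSpans(xml)) {
            System.out.println("susp " + s[0] + "+" + s[1] + " <-> src " + s[2] + "+" + s[3]);
        }
    }
}
```

Offsets and lengths are assumed to be counted in characters, as in the PAN corpora.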


Java dependencies: Apache OpenNLP (https://opennlp.apache.org/), used with version 1.9.0, and Apache Lucene (https://lucene.apache.org/), used with version 8.8.0.
