Fingerprint selection algorithms for local text reuse detection

Version 1.0 (June 5, 2020)
Developed by Gints Jekabsons (gints.jekabsons@rtu.lv)
Available at: http://www.cs.rtu.lv/jekabsons/nlp.html
Licensed under the GNU Lesser General Public License version 3.


This software was used for the experiments in the paper "Evaluation of Fingerprint Selection Algorithms for Local Text Reuse Detection" (available at http://www.cs.rtu.lv/jekabsons/Files/Jek_ACSS2020.pdf and https://content.sciendo.com/view/journals/acss/25/1/article-p11.xml). If you use this software, please cite the paper. A link to the website would also be appreciated: http://www.cs.rtu.lv/jekabsons/nlp.html.


This software was developed to evaluate the effectiveness of fingerprint selection algorithms in the source retrieval stage of local text reuse detection. It implements the following fingerprint selection algorithms (see the paper for details):
* Full fingerprinting;
* Every p-th;
* 0 mod p;
* Winnowing;
* Hailstorm;
* Frequency-Biased Winnowing (FBW);
* Modified Frequency-Biased Winnowing (MFBW) - proposed in the paper.
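
For orientation, three of the simpler selection rules listed above can be sketched as follows. This is a minimal illustration, not the code of this software: it assumes the document has already been reduced to an array of n-gram hash values, it follows the definitions commonly given in the literature (the exact variants evaluated in the paper may differ), and the class name SelectionSketch and its method names are hypothetical. Each method returns the positions of the selected hashes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of this software.
final class SelectionSketch {

    // "Every p-th": keep the hashes at positions 0, p, 2p, ...
    static List<Integer> everyPth(long[] hashes, int p) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < hashes.length; i += p) {
            selected.add(i);
        }
        return selected;
    }

    // "0 mod p": keep every hash whose value is divisible by p.
    static List<Integer> zeroModP(long[] hashes, int p) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < hashes.length; i++) {
            if (hashes[i] % p == 0) {
                selected.add(i);
            }
        }
        return selected;
    }

    // Winnowing: in every window of w consecutive hashes, pick the minimum
    // (the rightmost one on ties) and record each picked position only once.
    static List<Integer> winnow(long[] hashes, int w) {
        List<Integer> selected = new ArrayList<>();
        int last = -1; // position recorded for the previous window
        for (int start = 0; start + w <= hashes.length; start++) {
            int min = start;
            for (int i = start + 1; i < start + w; i++) {
                if (hashes[i] <= hashes[min]) {
                    min = i; // "<=" makes the rightmost minimum win
                }
            }
            if (min != last) {
                selected.add(min);
                last = min;
            }
        }
        return selected;
    }
}
```

"Every p-th" depends on position, so an insertion shifts all later selections, whereas "0 mod p" and Winnowing depend on hash values only. Winnowing additionally guarantees at least one selection per window of w hashes.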


To use the software, you need a dataset containing collection documents and query documents. You can set the directories for the files using the static variables of the Fingerprinting class.
The text files in the "Query" directory (the query documents) should be numbered, i.e., "1.txt", "2.txt", etc. The text files in the "Collection" directory (the collection documents) should be named in the following way:
* If a collection document is considered a correct match for a query document, its filename should start with that query document's number, followed by an underscore and then any other characters, e.g., "1_fileA.txt", "1_fileB.txt", "2_fileC.txt". Note that for all other query documents, this collection document is considered a false match.
* If a collection document is considered a false match for all query documents, its filename should start with the character "F", e.g., "F_fileD.txt".
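
The naming convention above amounts to a simple prefix check. The following sketch shows one way to express it. It is an illustration only and not how this software parses filenames; the class name MatchLabelSketch and the method isCorrectMatch are hypothetical.

```java
// Hypothetical helper class; not part of this software.
final class MatchLabelSketch {

    // Returns true if the collection file counts as a correct match for the
    // query document with the given number, i.e. if the text before the first
    // underscore equals that number. Files starting with "F" never match.
    static boolean isCorrectMatch(String collectionFileName, int queryNumber) {
        int underscore = collectionFileName.indexOf('_');
        if (underscore <= 0) {
            return false; // no prefix before an underscore
        }
        String prefix = collectionFileName.substring(0, underscore);
        return prefix.equals(Integer.toString(queryNumber));
    }
}
```

Comparing the whole prefix, rather than testing whether the filename starts with the number, keeps "12_fileE.txt" from being counted as a match for query 1.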


Java dependency: Apache OpenNLP (https://opennlp.apache.org/) - used with version 1.9.0
