diff --git a/README.md b/README.md
index 3468212..566133a 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,15 @@ Code to reproduce the experiments reported in the papers
 and ["L’Epistola a Cangrande al vaglio della Computational Authorship Verification: Risultati preliminari (con una postilla sulla cosiddetta XIV Epistola di Dante Alighieri)"](https://www.academia.edu/42297516/L_Epistola_a_Cangrande_al_vaglio_della_Computational_Authorship_Verification_risultati_preliminari_con_una_postilla_sulla_cosiddetta_XIV_Epistola_di_Dante_Alighieri_in_Nuove_inchieste_sull_Epistola_a_Cangrande_a_c._di_A._Casadei_Pisa_Pisa_University_Press_pp._153-192)
 
+## Requirements:
+The experiments have been run using the following packages (older versions might work as well):
+* joblib==0.11
+* nltk==3.4.5
+* numpy==1.18.2
+* scikit-learn==0.22.2.post1
+* scipy==1.4.1
+
+
 ## Disclaimer:
 The dataset is not distributed in this version.
 We have asked the Editors for permission to publish the corpus.
 We are waiting for some of these responses to arrive.
@@ -13,18 +22,23 @@ We are waiting for some of these responses to arrive.
 The script in __./src/author_identification.py__ executes the experiments.
 This is the script syntax (--help):
 ```
+usage: author_identification.py [-h] [--loo] [--unknown PATH] [--log PATH]
+                                CORPUSPATH AUTHOR
+
 Authorship verification for Epistola XIII
 
 positional arguments:
-  PATH            Path to the directory containing the corpus (documents
-                  must be named _.txt)
-  positive        Positive author for the hypothesis (default "Dante"); set
-                  to "ALL" to check every author
+  CORPUSPATH      Path to the directory containing the corpus (documents must
+                  be named _.txt)
+  AUTHOR          Positive author for the hypothesis (default "Dante"); set to
+                  "ALL" to check every author
 
 optional arguments:
-  -h, --help      show this help message and exit
-  --loo           submit each binary classifier to leave-one-out validation
-  --unknown PATH  path to the file of unknown paternity (default None)
+  -h, --help      show this help message and exit
+  --loo           submit each binary classifier to leave-one-out validation
+  --unknown PATH  path to the file of unknown paternity (default None)
+  --log PATH      path to the log file where to write the results (default
+                  ./results.txt)
 ```
 
 The following command line:
@@ -42,6 +56,18 @@ to the positive class.
 Similarly, the command line:
 ```
 cd src
-python author_identification.py ../Corpora/CorpusI Dante --loo
+python author_identification.py ../Corpora/CorpusI ALL --loo
 ```
-will perform a cross-validation of the binary classifier for Dante using all training documents in a leave-one-out (LOO) fashion.
+will perform a cross-validation of the binary classifier for all authors using all training documents in a leave-one-out (LOO) fashion.
+
+The script reports the results both on standard output (in more detail) and in a log file. For example, the last command will produce a log file containing:
+```
+F1 for ClaraAssisiensis = 0.400
+F1 for Dante = 0.957
+F1 for GiovanniBoccaccio = 1.000
+F1 for GuidoFaba = 0.974
+F1 for PierDellaVigna = 0.993
+LOO Macro-F1 = 0.865
+LOO Micro-F1 = 0.981
+```
+(Note that small numerical variations with respect to the original papers might occur due to different software versions and as a result of any underlying stochastic process. These variations should not, in any case, alter the conclusions derived from the published results.)
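
As a sanity check on the example log, the reported macro-F1 can be recomputed from the per-author scores. The sketch below is not part of the repository. It assumes the log uses exactly the line format shown in the example, and that macro-F1 is the unweighted mean of the per-author F1 values (the usual definition; the script's own computation is not shown in this diff). `parse_log` and `macro_f1` are hypothetical helper names. Micro-F1 cannot be recovered this way, because it depends on the pooled per-document counts rather than on the per-author scores.

```python
import re

# Log content copied from the README example.
LOG_TEXT = """\
F1 for ClaraAssisiensis = 0.400
F1 for Dante = 0.957
F1 for GiovanniBoccaccio = 1.000
F1 for GuidoFaba = 0.974
F1 for PierDellaVigna = 0.993
LOO Macro-F1 = 0.865
LOO Micro-F1 = 0.981
"""

# Assumed line formats, inferred only from the example log.
AUTHOR_LINE = re.compile(r"^F1 for (\S+) = ([0-9.]+)$")
SUMMARY_LINE = re.compile(r"^LOO (Macro|Micro)-F1 = ([0-9.]+)$")


def parse_log(text):
    """Return (per_author_f1, summary) dictionaries parsed from the log text."""
    per_author, summary = {}, {}
    for line in text.splitlines():
        line = line.strip()
        match = AUTHOR_LINE.match(line)
        if match:
            per_author[match.group(1)] = float(match.group(2))
            continue
        match = SUMMARY_LINE.match(line)
        if match:
            # Keys are "macro" and "micro".
            summary[match.group(1).lower()] = float(match.group(2))
    return per_author, summary


def macro_f1(per_author):
    """Unweighted mean of the per-author F1 scores."""
    return sum(per_author.values()) / len(per_author)


per_author, summary = parse_log(LOG_TEXT)
recomputed = macro_f1(per_author)
print(f"recomputed macro-F1 = {recomputed:.3f}, logged = {summary['macro']:.3f}")
```

With the five values in the example, the mean is 4.324 / 5 = 0.8648, which rounds to the logged 0.865.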