Alejandro Moreo Fernandez 2020-04-03 11:21:34 +02:00
commit dc810272b2
1 changed file with 35 additions and 9 deletions

@@ -5,6 +5,15 @@ Code to reproduce the experiments reported in the papers
and
["L'Epistola a Cangrande al vaglio della Computational Authorship Verification: Risultati preliminari (con una postilla sulla cosiddetta XIV Epistola di Dante Alighieri)"](https://www.academia.edu/42297516/L_Epistola_a_Cangrande_al_vaglio_della_Computational_Authorship_Verification_risultati_preliminari_con_una_postilla_sulla_cosiddetta_XIV_Epistola_di_Dante_Alighieri_in_Nuove_inchieste_sull_Epistola_a_Cangrande_a_c._di_A._Casadei_Pisa_Pisa_University_Press_pp._153-192)
## Requirements:
The experiments have been run using the following packages (older versions might work as well):
* joblib==0.11
* nltk==3.4.5
* numpy==1.18.2
* scikit-learn==0.22.2.post1
* scipy==1.4.1
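To compare a local environment against the versions listed above, a check along these lines may help. It is only a sketch using the standard library (`importlib.metadata`, Python 3.8 or later); the helper names `parse_pins` and `compare_versions` are illustrative and are not part of this repository.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINNED = """
joblib==0.11
nltk==3.4.5
numpy==1.18.2
scikit-learn==0.22.2.post1
scipy==1.4.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def compare_versions(pins):
    """Return {name: (pinned, installed)}; installed is None if the package is missing."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report


if __name__ == "__main__":
    for name, (pinned, installed) in compare_versions(parse_pins(PINNED)).items():
        status = "ok" if installed == pinned else "differs"
        print(f"{name}: pinned {pinned}, installed {installed} ({status})")
```

A mismatch is not necessarily a problem, since other versions might work as well; the report only makes differences visible.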
## Disclaimer:
The dataset is not distributed in this version. We have asked the Editors for permission to publish the corpus.
We are waiting for some of these responses to arrive.
@@ -13,18 +22,23 @@ We are waiting for some of these responses to arrive.
The script in __./src/author_identification.py__ executes the experiments. This is the script syntax (--help):
```
usage: author_identification.py [-h] [--loo] [--unknown PATH] [--log PATH]
                                CORPUSPATH AUTHOR

Authorship verification for Epistola XIII

positional arguments:
-  PATH        Path to the directory containing the corpus (documents
-              must be named <author>_<texname>.txt)
-  positive    Positive author for the hypothesis (default "Dante"); set
-              to "ALL" to check every author
+  CORPUSPATH  Path to the directory containing the corpus (documents must
+              be named <author>_<texname>.txt)
+  AUTHOR      Positive author for the hypothesis (default "Dante"); set to
+              "ALL" to check every author

optional arguments:
  -h, --help      show this help message and exit
  --loo           submit each binary classifier to leave-one-out validation
  --unknown PATH  path to the file of unknown paternity (default None)
  --log PATH      path to the log file where to write the results (default
                  ./results.txt)
```
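For readers who want to see how such an interface is typically declared, the following `argparse` sketch produces a similar command line. It is a reconstruction from the help text above, not the script's actual source: the function name `build_parser` and the attribute names `corpuspath` and `author` are assumptions, and AUTHOR is treated as a required positional because the usage line shows it without brackets.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the --help text shown above."""
    parser = argparse.ArgumentParser(
        prog="author_identification.py",
        description="Authorship verification for Epistola XIII",
    )
    parser.add_argument(
        "corpuspath", metavar="CORPUSPATH",
        help="Path to the directory containing the corpus",
    )
    parser.add_argument(
        "author", metavar="AUTHOR",
        help='Positive author for the hypothesis; set to "ALL" to check every author',
    )
    parser.add_argument(
        "--loo", action="store_true",
        help="submit each binary classifier to leave-one-out validation",
    )
    parser.add_argument(
        "--unknown", metavar="PATH", default=None,
        help="path to the file of unknown paternity (default None)",
    )
    parser.add_argument(
        "--log", metavar="PATH", default="./results.txt",
        help="path to the log file where to write the results",
    )
    return parser


if __name__ == "__main__":
    # Parse a fixed example argument list rather than sys.argv.
    args = build_parser().parse_args(["../Corpora/CorpusI", "ALL", "--loo"])
    print(args)
```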
The following command line:
@@ -42,6 +56,18 @@ to the positive class.
Similarly, the command line:
```
cd src
-python author_identification.py ../Corpora/CorpusI Dante --loo
+python author_identification.py ../Corpora/CorpusI ALL --loo
```
-will perform a cross-validation of the binary classifier for Dante using all training documents in a leave-one-out (LOO) fashion.
+will perform a cross-validation of the binary classifier for all authors using all training documents in a leave-one-out (LOO) fashion.
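As a rough illustration of what leave-one-out validation of a binary classifier involves, the sketch below holds out each document in turn, trains on the rest, and scores the pooled predictions with F1. Everything in it is hypothetical: the 1-D features, the labels, and the toy nearest-neighbour classifier stand in for whatever features and learner the script really uses, and returning 0.0 when F1 is undefined is a convention chosen here.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; 0.0 when undefined (a convention chosen here)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def leave_one_out(X, y, fit_predict):
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predictions.append(fit_predict(train_X, train_y, X[i]))
    return predictions


def nearest_neighbour(train_X, train_y, x):
    """Toy stand-in classifier: return the label of the closest training point."""
    distances = [abs(x - t) for t in train_X]
    return train_y[distances.index(min(distances))]


# Hypothetical data: 1 = written by the positive author, 0 = anyone else.
X = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0, 0.45]
y = [1, 1, 1, 0, 0, 0, 0]

y_pred = leave_one_out(X, y, nearest_neighbour)
print(f"LOO F1 = {f1_score(y, y_pred):.3f}")
```

With AUTHOR set to "ALL", a procedure of this kind would be repeated once per author, each time treating that author's documents as the positive class.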
The script reports the results both on standard output (in more detail) and in a log file. For example, the last command produces a log file containing:
```
F1 for ClaraAssisiensis = 0.400
F1 for Dante = 0.957
F1 for GiovanniBoccaccio = 1.000
F1 for GuidoFaba = 0.974
F1 for PierDellaVigna = 0.993
LOO Macro-F1 = 0.865
LOO Micro-F1 = 0.981
```
(Note that small numerical variations with respect to the original papers might occur due to different software versions and to any underlying stochastic process. These variations should not, in any case, alter the conclusions drawn from the published results.)
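The log format shown above is simple enough to read back programmatically. The sketch below parses it and recomputes the Macro-F1 as the unweighted mean of the per-author F1 values, which gives 0.8648 and so agrees with the logged 0.865. The regular expression and the function name `parse_log` are assumptions based only on the sample above. The Micro-F1 cannot be recomputed this way, because it needs the pooled true-positive, false-positive and false-negative counts, which the log does not contain.

```python
import re

# Sample log copied from the example above.
LOG = """\
F1 for ClaraAssisiensis = 0.400
F1 for Dante = 0.957
F1 for GiovanniBoccaccio = 1.000
F1 for GuidoFaba = 0.974
F1 for PierDellaVigna = 0.993
LOO Macro-F1 = 0.865
LOO Micro-F1 = 0.981
"""

# Matches either "F1 for <author> = <value>" or "LOO <name> = <value>".
LINE = re.compile(
    r"^(?:F1 for (?P<author>\S+)|LOO (?P<summary>\S+)) = (?P<value>\d+\.\d+)$"
)


def parse_log(text):
    """Split a results log into per-author scores and summary scores."""
    per_author, summary = {}, {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a score line
        value = float(match["value"])
        if match["author"]:
            per_author[match["author"]] = value
        else:
            summary[match["summary"]] = value
    return per_author, summary


per_author, summary = parse_log(LOG)
macro = sum(per_author.values()) / len(per_author)
print(f"recomputed Macro-F1 = {macro:.4f}, logged = {summary['Macro-F1']:.3f}")
```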