diff --git a/README.md b/README.md
index a7d4140..53cb23d 100755
--- a/README.md
+++ b/README.md
@@ -11,10 +11,8 @@ The experiments have been run using the following packages (older versions might
 * scikit-learn==0.22.2.post1
 * scipy==1.4.1
 
-
-## Disclaimer:
-The dataset is not distributed in this version. We have asked the Editors of each document for permission to publish the corpus.
-We are waiting for some of these responses to arrive.
+## Dataset:
+The dataset can be downloaded from [http://hlt.isti.cnr.it/medlatin/](http://hlt.isti.cnr.it/medlatin/).
 
 ## Running the Experiments
 The script in __./src/author_identification.py__ executes the experiments. This is the script syntax (--help):
@@ -42,7 +40,7 @@ optional arguments:
 The following command line:
 ```
 cd src
-python author_identification.py ../Corpora/CorpusI Dante --unknown ../Epistle/EpistolaXIII_1.txt
+python author_identification.py ../Corpora/MedLatin1 Dante --unknown ../Epistle/EpistolaXIII_1.txt
 ```
 
 Will use all texts in ../Corpora/CorpusI as training documents to train a verificator for the
@@ -54,18 +52,18 @@ to the positive class.
 Similarly, the command line:
 ```
 cd src
-python author_identification.py ../Corpora/CorpusI ALL --loo
+python author_identification.py ../Corpora/MedLatin1 ALL --loo
 ```
 will perform a cross-validation of the binary classifier for all authors using all training documents in a leave-one-out (LOO) fashion.
 The script will report the results both in the standard output (more elaborated) and in a log file.
 For example, the last command will produce a log file containing:
 ```
-F1 for ClaraAssisiensis = 0.400
+F1 for ClaraAssisiensis = 0.571
 F1 for Dante = 0.957
 F1 for GiovanniBoccaccio = 1.000
-F1 for GuidoFaba = 0.974
+F1 for GuidoFaba = 0.980
 F1 for PierDellaVigna = 0.993
-LOO Macro-F1 = 0.865
-LOO Micro-F1 = 0.981
+LOO Macro-F1 = 0.900
+LOO Micro-F1 = 0.985
 ```
 (Note that small numerical variations with respect to the original papers might occur due to different software versions and as a result from any stochastic underlying process. Those changes should anyway not alter the conclusions derived from the published results.)
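
As a quick consistency check on the updated log excerpt, the sketch below recomputes the macro-F1 from the five per-author F1 values shown in the diff. It assumes that the script reports macro-F1 as the unweighted mean of the per-author scores; this is the usual definition, but the diff itself does not confirm it. The names `f1_by_author` and `macro_f1` are illustrative and are not taken from `author_identification.py`.

```python
# Per-author F1 values copied from the updated log excerpt in the diff.
f1_by_author = {
    "ClaraAssisiensis": 0.571,
    "Dante": 0.957,
    "GiovanniBoccaccio": 1.000,
    "GuidoFaba": 0.980,
    "PierDellaVigna": 0.993,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 values (assumed definition of macro-F1)."""
    return sum(scores.values()) / len(scores)


# Formatted the same way as the log line, for easy comparison.
print(f"LOO Macro-F1 = {macro_f1(f1_by_author):.3f}")
```

Under that assumption the result agrees with the `LOO Macro-F1 = 0.900` line in the new log. The micro-F1 cannot be rechecked this way, since it depends on the pooled per-document decisions rather than on the per-author scores.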