
readme updated

Alejandro Moreo Fernandez 2021-02-02 12:10:57 +01:00
parent b6f80077d7
commit 91f8d8f3e1
3 changed files with 68 additions and 12 deletions


@@ -11,6 +11,13 @@ used for evaluating quantification methods.
QuaPy also integrates commonly used datasets and offers visualization tools
for facilitating the analysis and interpretation of results.
## A quick example:
The following script fetches a Twitter dataset, trains an
_Adjusted Classify & Count_ model, and evaluates it in terms of the _Mean Absolute Error_ (MAE)
between the class prevalences estimated for the test set and the true prevalences
of the test set.
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression
@@ -21,14 +28,69 @@ dataset = qp.datasets.fetch_twitter('semeval16')
model = qp.method.aggregative.ACC(LogisticRegression())
model.fit(dataset.training)
-prevalences_estim = model.quantify(dataset.test.instances)
-prevalences_true = dataset.test.prevalence()
-error = qp.error.mae(prevalences_true, prevalences_estim)
-print(f'MAE={error:.3f}')
+estim_prevalences = model.quantify(dataset.test.instances)
+true_prevalences = dataset.test.prevalence()
+error = qp.error.mae(true_prevalences, estim_prevalences)
+print(f'Mean Absolute Error (MAE)={error:.3f}')
```
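To make the two ingredients of this example concrete, here is a minimal, library-free sketch of the binary Adjusted Classify & Count correction and of the MAE between two prevalence vectors. The helpers `acc_adjust` and `mean_absolute_error` are hypothetical names written for illustration and are not QuaPy's API; the true-positive and false-positive rates are assumed to have been estimated beforehand on held-out data, and the numbers are made up.

```python
def acc_adjust(cc_prevalence, tpr, fpr):
    """Binary ACC: correct a classify-and-count estimate using the classifier's tpr and fpr."""
    if tpr == fpr:
        # Degenerate classifier: no correction is possible, keep the raw estimate.
        return cc_prevalence
    adjusted = (cc_prevalence - fpr) / (tpr - fpr)
    # The correction can leave [0, 1], so clip it back to a valid prevalence.
    return min(1.0, max(0.0, adjusted))


def mean_absolute_error(true_prevs, estim_prevs):
    """Average absolute difference between two prevalence vectors of equal length."""
    return sum(abs(t - e) for t, e in zip(true_prevs, estim_prevs)) / len(true_prevs)


# Illustrative numbers: the classifier labels 40% of the test items as positive.
pos = acc_adjust(0.40, tpr=0.8, fpr=0.2)
estim = [1 - pos, pos]
error = mean_absolute_error([0.7, 0.3], estim)
print(f'adjusted positive prevalence={pos:.3f} MAE={error:.3f}')
```

The sketch covers only the binary case with a single pair of rates; it says nothing about how QuaPy estimates those rates or handles more than two classes.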
-binary, and single-label
+Quantification is useful in scenarios of distribution shift. In other
words, we would not need to estimate the class prevalences of the test set if
the IID assumption held, as this prevalence would simply coincide with the
class prevalence of the training set. For this reason, a quantification model
should be tested across samples characterized by different class prevalences.
QuaPy implements sampling procedures and evaluation protocols that automate this endeavour.
See the Wiki for detailed examples.
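The idea of testing across samples with different class prevalences can be sketched as follows. This is a minimal, standard-library illustration for the binary case and not QuaPy's sampling protocol: the helper `sample_at_prevalence` is hypothetical, it draws with replacement for simplicity, and the grid of target prevalences is an arbitrary choice.

```python
import random


def sample_at_prevalence(labels, prevalence, size, rng):
    """Return indices of a binary sample whose positive prevalence is close to `prevalence`."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(prevalence * size)
    # Draw with replacement so that any target prevalence can be reached.
    chosen = rng.choices(pos_idx, k=n_pos) + rng.choices(neg_idx, k=size - n_pos)
    rng.shuffle(chosen)
    return chosen


rng = random.Random(0)
labels = [1] * 30 + [0] * 70  # a pool whose natural positive prevalence is 0.30

observed_by_target = {}
for target in (0.0, 0.25, 0.5, 0.75, 1.0):
    idx = sample_at_prevalence(labels, target, size=100, rng=rng)
    observed_by_target[target] = sum(labels[i] for i in idx) / len(idx)
    print(f'target={target:.2f} observed={observed_by_target[target]:.2f}')
```

A quantifier would then be evaluated on each of these samples, so that its error is measured under prevalences that differ from the one seen during training.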
## Features
* Implementation of the most popular quantification methods (Classify-&-Count variants, Expectation-Maximization,
SVM-based variants for quantification, HDy, QuaNet, and Ensembles).
* Versatile functionality for performing evaluation based on artificial sampling protocols.
* Implementation of the most commonly used evaluation metrics (e.g., MAE, MRAE, MSE, NKLD).
* Popular datasets for Quantification (textual and numeric) available, including:
    * 32 UCI Machine Learning datasets.
    * 11 Twitter Sentiment datasets.
    * 3 Reviews Sentiment datasets.
* Native support for binary and single-label scenarios of quantification.
* Model selection functionality targeting quantification-oriented losses.
* Plotting routines ("error-by-drift", "diagonal", and "bias" plots).
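As a rough illustration of some of the evaluation metrics listed above, the following sketch implements MSE, MRAE, and NKLD following definitions that are common in the quantification literature. It is an assumption that QuaPy's own implementations match these details, in particular the additive smoothing and the choice of `eps`; the function names are hypothetical and the prevalence vectors are made up.

```python
import math


def smooth(prevs, eps):
    """Additive smoothing that keeps the vector summing to one."""
    return [(p + eps) / (1 + eps * len(prevs)) for p in prevs]


def mse(true_prevs, estim_prevs):
    """Mean squared error between two prevalence vectors."""
    return sum((t - e) ** 2 for t, e in zip(true_prevs, estim_prevs)) / len(true_prevs)


def mrae(true_prevs, estim_prevs, eps):
    """Mean relative absolute error; eps > 0 avoids dividing by a zero prevalence."""
    t, e = smooth(true_prevs, eps), smooth(estim_prevs, eps)
    return sum(abs(ti - ei) / ti for ti, ei in zip(t, e)) / len(t)


def nkld(true_prevs, estim_prevs, eps):
    """Kullback-Leibler divergence squashed into [0, 1) by a logistic transform."""
    t, e = smooth(true_prevs, eps), smooth(estim_prevs, eps)
    kld = sum(ti * math.log(ti / ei) for ti, ei in zip(t, e))
    exp_kld = math.exp(kld)
    return 2 * exp_kld / (1 + exp_kld) - 1


true_prevs, estim_prevs = [0.7, 0.3], [0.6, 0.4]
eps = 1 / (2 * 100)  # assumed smoothing for samples of 100 items
print(f'MSE={mse(true_prevs, estim_prevs):.4f}')
print(f'MRAE={mrae(true_prevs, estim_prevs, eps):.4f}')
print(f'NKLD={nkld(true_prevs, estim_prevs, eps):.4f}')
```

All three functions return 0 for a perfect estimate and grow as the estimated prevalences move away from the true ones.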
## Requirements
* scikit-learn, numpy, scipy
* pytorch (for QuaNet)
* svmperf patched for quantification (see below)
* joblib
* tqdm
* pandas, xlrd
* matplotlib
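A quick way to see which of the Python requirements above are importable in the current environment is sketched below. The helper `missing_modules` is hypothetical; the import names `sklearn` and `torch` are those of scikit-learn and pytorch, and svmperf is left out because it is an external binary rather than a Python module.

```python
from importlib.util import find_spec

REQUIRED = ['sklearn', 'numpy', 'scipy', 'joblib', 'tqdm', 'pandas', 'xlrd', 'matplotlib']
OPTIONAL = ['torch']  # only needed for QuaNet


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print('missing required modules:', missing_modules(REQUIRED))
print('missing optional modules:', missing_modules(OPTIONAL))
```

The check only tests whether each module can be located; it does not verify versions.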
## SVM-perf with quantification-oriented losses
In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD),
SVM(AE), or SVM(RAE), you have to first download the
[svmperf](http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
package, apply the patch
[svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch), and compile the sources.
The script [prepare_svmperf.sh](prepare_svmperf.sh) takes care of all these steps. Simply run:
```
./prepare_svmperf.sh
```
The resulting directory [svm_perf_quantification](./svm_perf_quantification) contains the
patched version of _svmperf_ with quantification-oriented losses.
The [svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch) extends the patch made available by
[Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0),
which allows SVMperf to optimize for
the _Q_ measure as proposed by [Barranquero et al. 2015](https://www.sciencedirect.com/science/article/abs/pii/S003132031400291X)
and for _KLD_ and _NKLD_ as proposed by [Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0).
The extension additionally allows SVMperf to optimize for
_AE_ and _RAE_.


@@ -10,14 +10,8 @@ an instance of single-label with 2 labels. Check
Add classnames to LabelledCollection ?
Check the overhead in OneVsAll for SVMperf-based (?)
Add HDy to QuaNet? if so, wrap HDy into OneVsAll in case the dataset is not binary.
Plots (one for binary -- the "diagonal", or for a specific class), another for the error as a function of drift.
Add datasets for topic.
Add other methods
Clarify whether QuaNet is an aggregative method or not.
Add datasets from Pérez-Gallego et al. 2017, 2019
Add ensemble models from Pérez-Gallego et al. 2017, 2019
Add plots models like those in Pérez-Gallego et al. 2017 (error boxes)
Add support for CV prediction in ACC and PACC for tpr, fpr
Add medium swap method
Explore the hyperparameter "number of bins" in HDy
Implement HDy for single-label?
@@ -25,4 +19,3 @@ Rename EMQ to SLD ?
How many times is the system of equations for ACC and PACC not solved? How many times is it clipped? Do they sum up
to one always?
Parallelize the kFCV in ACC and PACC
Requirements: xlrd for reading excel


@@ -195,6 +195,7 @@ class Dataset:
        print(f'Dataset={self.name} #tr-instances={tr_stats["instances"]}, #te-instances={te_stats["instances"]}, '
              f'type={tr_stats["type"]}, #features={tr_stats["features"]}, #classes={tr_stats["classes"]}, '
              f'tr-prevs={tr_stats["prevs"]}, te-prevs={te_stats["prevs"]}')
        return {'train': tr_stats, 'test': te_stats}

    @classmethod
    def kFCV(cls, data: LabelledCollection, nfolds=5, nrepeats=1, random_state=0):