
# QuaPy

QuaPy is an open source framework for quantification (a.k.a. supervised prevalence estimation, or learning to quantify)
written in Python.

QuaPy is based on the concept of "data sample", and provides implementations of the
most important aspects of the quantification workflow, such as (baseline and advanced)
quantification methods, 
quantification-oriented model selection mechanisms, evaluation measures, and evaluation protocols
used for evaluating quantification methods.
QuaPy also makes available commonly used datasets, and offers visualization tools 
for facilitating the analysis and interpretation of the experimental results.

### Last updates:

* Detailed documentation is now available [here](https://hlt-isti.github.io/QuaPy/)
* The developer API documentation is available [here](https://hlt-isti.github.io/QuaPy/build/html/modules.html)

### Installation

```commandline
pip install quapy
```

## A quick example:

The following script fetches a dataset of tweets, trains, applies, and evaluates a quantifier based on the 
_Adjusted Classify & Count_ quantification method, using, as the evaluation measure, the _Mean Absolute Error_ (MAE)
between the predicted and the true class prevalence values
of the test set.

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('semeval16')

# create an "Adjusted Classify & Count" quantifier
model = qp.method.aggregative.ACC(LogisticRegression())
model.fit(dataset.training)

estim_prevalence = model.quantify(dataset.test.instances)
true_prevalence  = dataset.test.prevalence()

error = qp.error.mae(true_prevalence, estim_prevalence)

print(f'Mean Absolute Error (MAE)={error:.3f}')
```
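
To make the idea behind _Adjusted Classify & Count_ concrete, here is a minimal, library-independent sketch for the binary case. It is not QuaPy's implementation: the function names and the toy data are hypothetical, and it assumes the classifier's true-positive and false-positive rates have been estimated on held-out labelled data. The plain classify-and-count estimate is corrected as `(cc - fpr) / (tpr - fpr)` and clipped to `[0, 1]`.

```python
def classify_and_count(predictions):
    """Fraction of items predicted as positive (label 1)."""
    return sum(predictions) / len(predictions)

def estimate_rates(y_true, y_pred):
    """True-positive and false-positive rates measured on held-out labelled data."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate using the classifier's error rates."""
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The correction is undefined here, so fall back to the unadjusted estimate.
        return cc
    adjusted = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))  # clip to a valid prevalence

# Hypothetical held-out labels and classifier predictions.
val_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
val_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = estimate_rates(val_true, val_pred)

# Hypothetical predictions on an unlabelled test sample.
test_pred = [1] * 13 + [0] * 7
cc = classify_and_count(test_pred)
acc = adjusted_classify_and_count(test_pred, tpr, fpr)
print(f'CC estimate={cc:.3f}, ACC estimate={acc:.3f}')
```

The correction matters most when the test prevalence is far from the training prevalence, which is where the unadjusted count tends to be biased.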

Quantification is useful in scenarios characterized by prior probability shift. In other
words, there would be little interest in estimating the class prevalence values of the test set if
the IID assumption could be expected to hold, since this prevalence would then be roughly equal to the
class prevalence of the training set. For this reason, any quantification model
should be tested across many samples, including ones whose class prevalence
values differ, even substantially, from those found in the training set.
QuaPy implements sampling procedures and evaluation protocols that automate this workflow.
See the [Wiki](https://github.com/HLT-ISTI/QuaPy/wiki) for detailed examples.
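
As a rough illustration of what testing across many samples means, the sketch below draws samples at controlled prevalence values from a labelled pool and averages the absolute error of a plain classify-and-count estimate. It uses only the Python standard library and does not reproduce QuaPy's own protocols; the function names, the toy pool, and the prevalence grid are all assumptions made for the example.

```python
import random

def sample_at_prevalence(labels, prevalence, sample_size, rng):
    """Return indices of a sample whose positive-class prevalence is (about) `prevalence`."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(prevalence * sample_size)
    n_neg = sample_size - n_pos
    # Sampling with replacement keeps extreme prevalences feasible on small pools.
    return rng.choices(pos_idx, k=n_pos) + rng.choices(neg_idx, k=n_neg)

def mean_error_over_prevalences(labels, predictions, prevalences, sample_size, repeats, rng):
    """Average absolute error of a classify-and-count estimate over many shifted samples."""
    errors = []
    for prevalence in prevalences:
        for _ in range(repeats):
            idx = sample_at_prevalence(labels, prevalence, sample_size, rng)
            true_prev = sum(labels[i] for i in idx) / sample_size
            estim_prev = sum(predictions[i] for i in idx) / sample_size
            errors.append(abs(true_prev - estim_prev))
    return sum(errors) / len(errors)

# Hypothetical labelled pool and classifier outputs (tpr = 0.8, fpr = 0.2).
labels = [1] * 50 + [0] * 50
predictions = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40

rng = random.Random(0)
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
error = mean_error_over_prevalences(labels, predictions, grid,
                                    sample_size=100, repeats=20, rng=rng)
print(f'Mean absolute error over shifted samples = {error:.3f}')
```

Evaluating on a grid of prevalence values, rather than on a single test set, exposes how an estimator behaves when the test distribution drifts away from the training one.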

## Features

* Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization,
quantification methods based on structured output learning, HDy, QuaNet, and quantification ensembles).
* Versatile functionality for performing evaluation based on artificial sampling protocols.
* Implementation of the most commonly used evaluation metrics (e.g., AE, RAE, SE, KLD, NKLD).
* Datasets frequently used in quantification (textual and numeric), including:
    * 32 UCI Machine Learning datasets.
    * 11 Twitter quantification-by-sentiment datasets.
    * 3 product-review quantification-by-sentiment datasets.
* Native support for binary and single-label multiclass quantification scenarios.
* Model selection functionality that minimizes quantification-oriented loss functions.
* Visualization tools for analysing the experimental results.
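
For readers unfamiliar with the evaluation metrics listed above, the following standard-library sketch shows how they are commonly defined for a single pair of true and estimated prevalence vectors. It is an illustration, not QuaPy's code: the function names are hypothetical, and the additive smoothing form and the `eps` value used for RAE, KLD, and NKLD are assumptions made for the example.

```python
import math

def ae(p, p_hat):
    """Absolute error: mean absolute difference across classes."""
    return sum(abs(a - b) for a, b in zip(p, p_hat)) / len(p)

def se(p, p_hat):
    """Squared error: mean squared difference across classes."""
    return sum((a - b) ** 2 for a, b in zip(p, p_hat)) / len(p)

def smooth(p, eps):
    """Additive smoothing, so that no prevalence is exactly zero (assumed form)."""
    return [(eps + a) / (eps * len(p) + 1) for a in p]

def rae(p, p_hat, eps=1e-3):
    """Relative absolute error, computed on smoothed prevalences."""
    p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return sum(abs(a - b) / a for a, b in zip(p, p_hat)) / len(p)

def kld(p, p_hat, eps=1e-3):
    """Kullback-Leibler divergence of the estimate from the true prevalence."""
    p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return sum(a * math.log(a / b) for a, b in zip(p, p_hat))

def nkld(p, p_hat, eps=1e-3):
    """KLD squashed into [0, 1) through a logistic-style transformation."""
    e = math.exp(kld(p, p_hat, eps))
    return 2 * e / (1 + e) - 1

true_prev = [0.5, 0.5]
estim_prev = [0.25, 0.75]
print(f'AE={ae(true_prev, estim_prev):.4f}  SE={se(true_prev, estim_prev):.4f}')
print(f'RAE={rae(true_prev, estim_prev):.4f}  KLD={kld(true_prev, estim_prev):.4f}  '
      f'NKLD={nkld(true_prev, estim_prev):.4f}')
```

Averaging any of these over many test samples yields the corresponding "mean" measure, such as the MAE used in the quick example.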

## Requirements

* scikit-learn, numpy, scipy
* pytorch (for QuaNet)
* svmperf patched for quantification (see below)
* joblib
* tqdm
* pandas, xlrd
* matplotlib

## SVM-perf with quantification-oriented losses
In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD),
SVM(AE), or SVM(RAE), you have to first download the 
[svmperf](http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) 
package, apply the patch 
[svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch), and compile the sources.
The script [prepare_svmperf.sh](prepare_svmperf.sh) performs all of these steps. Simply run:

```commandline
./prepare_svmperf.sh
```

The resulting directory [svm_perf_quantification](./svm_perf_quantification) contains the
patched version of _svmperf_ with quantification-oriented losses. 

The [svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch) extends the patch made available by
[Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0),
which allows SVMperf to optimize for
the _Q_ measure as proposed by [Barranquero et al. 2015](https://www.sciencedirect.com/science/article/abs/pii/S003132031400291X)
and for the _KLD_ and _NKLD_ measures as proposed by [Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0).
The extension additionally allows SVMperf to optimize for
_AE_ and _RAE_.
  
  
## Documentation

The [developer API documentation](https://hlt-isti.github.io/QuaPy/build/html/modules.html) is available [here](https://hlt-isti.github.io/QuaPy/build/html/index.html). 

Check out our [Wiki](https://github.com/HLT-ISTI/QuaPy/wiki), in which many examples
are provided:

* [Datasets](https://github.com/HLT-ISTI/QuaPy/wiki/Datasets)
* [Evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)
* [Methods](https://github.com/HLT-ISTI/QuaPy/wiki/Methods)
* [Model Selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
* [Plotting](https://github.com/HLT-ISTI/QuaPy/wiki/Plotting)