A Python framework for Quantification

QuaPy

QuaPy is an open source framework for Quantification (a.k.a. Supervised Prevalence Estimation) written in Python.

QuaPy is built around the concept of a data sample and provides implementations of the most important concepts in the quantification literature: the main quantification baselines, many advanced quantification methods, quantification-oriented model selection, and many of the evaluation measures and protocols used for evaluating quantification methods. QuaPy also integrates commonly used datasets and offers visualization tools to facilitate the analysis and interpretation of results.

A quick example:

The following script fetches a Twitter dataset, then trains an Adjusted Classify & Count model and evaluates it in terms of the Mean Absolute Error (MAE) between the class prevalences estimated for the test set and the true prevalences of the test set.

import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('semeval16')

# create an "Adjusted Classify & Count" quantifier
model = qp.method.aggregative.ACC(LogisticRegression())
model.fit(dataset.training)

estim_prevalences = model.quantify(dataset.test.instances)
true_prevalences  = dataset.test.prevalence()

error = qp.error.mae(true_prevalences, estim_prevalences)

print(f'Mean Absolute Error (MAE)={error:.3f}')
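
For intuition about what the "Adjusted Classify & Count" quantifier in the example above does, the binary case can be sketched in plain NumPy. This is only an illustration, not QuaPy's implementation: the function names are hypothetical, and the true positive rate (tpr) and false positive rate (fpr) are assumed to be known, whereas in practice they would be estimated on held-out labelled data.

```python
import numpy as np

def classify_and_count(predictions):
    """Fraction of items predicted as positive (binary labels in {0, 1})."""
    return float(np.mean(predictions))

def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate using the classifier's
    true positive rate (tpr) and false positive rate (fpr)."""
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The correction is undefined in this case; fall back to plain CC.
        return cc
    adjusted = (cc - fpr) / (tpr - fpr)
    # The adjusted value may fall outside [0, 1], so it is clipped.
    return float(np.clip(adjusted, 0.0, 1.0))

# Toy data: 38 of 100 items are predicted positive by a (hypothetical)
# classifier whose tpr is 0.8 and whose fpr is 0.1.
predictions = np.array([1] * 38 + [0] * 62)
cc_estimate = classify_and_count(predictions)
acc_estimate = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.1)
print(f'CC={cc_estimate:.2f}  ACC={acc_estimate:.2f}')
```

In this toy setting a true positive prevalence of 0.4 would yield 0.4 * 0.8 + 0.6 * 0.1 = 0.38 predicted positives, so the adjustment recovers the prevalence that plain counting underestimates.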

Quantification is useful in scenarios of distribution shift. If the IID assumption held, there would be no need to estimate the class prevalences of the test set, since they would simply coincide with the class prevalences of the training set. For this reason, a quantification model should be tested across samples characterized by different class prevalences. QuaPy implements sampling procedures and evaluation protocols that automate this endeavour. See the Wiki for detailed examples.
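
As a rough illustration of what testing across different class prevalences involves, the following NumPy sketch draws samples from a labelled pool at a grid of requested prevalences. It does not use QuaPy's own sampling API (see the Wiki for that): the function `sample_at_prevalence`, the synthetic labels, and the binary setting are all assumptions made for the example.

```python
import numpy as np

def sample_at_prevalence(labels, prevalence, size, rng):
    """Return indices of `size` items whose proportion of positives is
    (up to rounding) `prevalence`. Labels are assumed to be in {0, 1}."""
    n_pos = int(round(prevalence * size))
    n_neg = size - n_pos
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    # Sample with replacement only when a class has too few items.
    chosen_pos = rng.choice(pos_idx, size=n_pos, replace=n_pos > len(pos_idx))
    chosen_neg = rng.choice(neg_idx, size=n_neg, replace=n_neg > len(neg_idx))
    return np.concatenate([chosen_pos, chosen_neg])

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)  # synthetic binary labels

# Sweep a grid of prevalences, as an artificial sampling protocol would.
for prevalence in np.linspace(0.0, 1.0, 5):
    idx = sample_at_prevalence(labels, prevalence, size=100, rng=rng)
    print(f'requested={prevalence:.2f}  obtained={labels[idx].mean():.2f}')
```

A quantifier would then be applied to each sample in turn, and its estimates compared against the requested prevalences.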

Features

Implementation of the most popular quantification methods (Classify-&-Count variants, Expectation-Maximization, SVM-based variants for quantification, HDy, QuaNet, and Ensembles).
Versatile functionality for performing evaluation based on artificial sampling protocols.
Implementation of the most commonly used evaluation metrics (e.g., MAE, MRAE, MSE, NKLD).
Popular datasets for Quantification (textual and numeric) available, including:
- 32 UCI Machine Learning datasets.
- 11 Twitter Sentiment datasets.
- 3 Reviews Sentiment datasets.
Native support for binary and single-label scenarios of quantification.
Model selection functionality targeting quantification-oriented losses.
Plotting routines (“error-by-drift”, “diagonal”, and “bias” plots).
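
To make some of the evaluation metrics listed above concrete, here is a minimal NumPy sketch of the per-sample absolute, squared, and relative absolute errors between a true and an estimated prevalence vector; MAE, MSE, and MRAE are the means of these values across samples. The function names are hypothetical, and the additive smoothing used in the relative error (with `eps` chosen arbitrarily) is an assumption of this sketch, not necessarily what QuaPy does.

```python
import numpy as np

def absolute_error(true_prev, estim_prev):
    """Absolute difference between prevalences, averaged over classes."""
    true_prev, estim_prev = np.asarray(true_prev), np.asarray(estim_prev)
    return float(np.mean(np.abs(estim_prev - true_prev)))

def squared_error(true_prev, estim_prev):
    """Squared difference between prevalences, averaged over classes."""
    true_prev, estim_prev = np.asarray(true_prev), np.asarray(estim_prev)
    return float(np.mean((estim_prev - true_prev) ** 2))

def smooth(prev, eps):
    """Additive smoothing, so that no prevalence is exactly zero."""
    prev = np.asarray(prev)
    return (prev + eps) / (eps * len(prev) + 1.0)

def relative_absolute_error(true_prev, estim_prev, eps=1e-3):
    """Absolute error relative to the (smoothed) true prevalence."""
    true_s, estim_s = smooth(true_prev, eps), smooth(estim_prev, eps)
    return float(np.mean(np.abs(estim_s - true_s) / true_s))

true_prev = [0.5, 0.3, 0.2]
estim_prev = [0.4, 0.4, 0.2]
print(f'AE={absolute_error(true_prev, estim_prev):.4f}')
print(f'SE={squared_error(true_prev, estim_prev):.4f}')
print(f'RAE={relative_absolute_error(true_prev, estim_prev):.4f}')
```

The relative error penalizes a given deviation more heavily when the true prevalence of the class is small, which is why the smoothing step is needed for classes with zero prevalence.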

Requirements

scikit-learn, numpy, scipy
pytorch (for QuaNet)
svmperf patched for quantification (see below)
joblib
tqdm
pandas, xlrd
matplotlib

SVM-perf with quantification-oriented losses

In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD), SVM(AE), or SVM(RAE), you first have to download the svmperf package, apply the patch svm-perf-quantification-ext.patch, and compile the sources. The script prepare_svmperf.sh takes care of all these steps. Simply run:

./prepare_svmperf.sh

The resulting directory svm_perf_quantification contains the patched version of svmperf with quantification-oriented losses.

The svm-perf-quantification-ext.patch is an extension of the patch made available by Esuli et al. 2015 that allows SVMperf to optimize for the Q measure as proposed by Barranquero et al. 2015 and for the KLD and NKLD as proposed by Esuli et al. 2015 for quantification. This patch extends the former by also allowing SVMperf to optimize for AE and RAE.

Wiki

Check out our Wiki, which provides many examples.