A Python framework for Quantification
Go to file
Pablo González 948f63fade updating plot to center it better 2023-01-16 17:00:24 +01:00
docs documenting quanet 2021-12-15 16:39:57 +01:00
examples adding the possibility to estimate the training prevalence, instead of using the true training prevalence, as a starting point in emq 2022-12-12 09:34:09 +01:00
quapy updating plot to center it better 2023-01-16 17:00:24 +01:00
.gitignore data loading 2020-12-03 16:24:21 +01:00
LICENSE license updated 2020-12-15 13:36:24 +01:00
README.md Merge branch 'master' of github.com:HLT-ISTI/QuaPy 2021-12-15 16:50:22 +01:00
TODO.txt the heuristic exact_train_prev is performed via kFCV, using a new function qp.model_selection.cross_val_predict 2022-12-12 17:32:30 +01:00
prepare_svmperf.sh cleaning 2020-12-15 15:28:20 +01:00
setup.py pip package 2021-05-10 13:36:35 +02:00
svm-perf-quantification-ext.patch many aggregative methods added 2020-12-03 18:12:28 +01:00

README.md

QuaPy

QuaPy is an open source framework for quantification (a.k.a. supervised prevalence estimation, or learning to quantify) written in Python.

QuaPy is based on the concept of “data sample”, and provides implementations of the most important aspects of the quantification workflow, such as (baseline and advanced) quantification methods, quantification-oriented model selection mechanisms, evaluation measures, and evaluations protocols used for evaluating quantification methods. QuaPy also makes available commonly used datasets, and offers visualization tools for facilitating the analysis and interpretation of the experimental results.

Last updates:

  • A detailed documentation is now available here
  • The developer API documentation is available here

Installation

pip install quapy

A quick example:

The following script fetches a dataset of tweets, trains, applies, and evaluates a quantifier based on the Adjusted Classify & Count quantification method, using, as the evaluation measure, the Mean Absolute Error (MAE) between the predicted and the true class prevalence values of the test set.

import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('semeval16')

# create an "Adjusted Classify & Count" quantifier
model = qp.method.aggregative.ACC(LogisticRegression())
model.fit(dataset.training)

estim_prevalence = model.quantify(dataset.test.instances)
true_prevalence  = dataset.test.prevalence()

error = qp.error.mae(true_prevalence, estim_prevalence)

print(f'Mean Absolute Error (MAE)={error:.3f}')

Quantification is useful in scenarios characterized by prior probability shift. In other words, we would be little interested in estimating the class prevalence values of the test set if we could assume the IID assumption to hold, as this prevalence would be roughly equivalent to the class prevalence of the training set. For this reason, any quantification model should be tested across many samples, even ones characterized by class prevalence values different or very different from those found in the training set. QuaPy implements sampling procedures and evaluation protocols that automate this workflow. See the Wiki for detailed examples.

Features

  • Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization, quantification methods based on structured output learning, HDy, QuaNet, and quantification ensembles).
  • Versatile functionality for performing evaluation based on artificial sampling protocols.
  • Implementation of most commonly used evaluation metrics (e.g., AE, RAE, SE, KLD, NKLD, etc.).
  • Datasets frequently used in quantification (textual and numeric), including:
    • 32 UCI Machine Learning datasets.
    • 11 Twitter quantification-by-sentiment datasets.
    • 3 product reviews quantification-by-sentiment datasets.
  • Native support for binary and single-label multiclass quantification scenarios.
  • Model selection functionality that minimizes quantification-oriented loss functions.
  • Visualization tools for analysing the experimental results.

Requirements

  • scikit-learn, numpy, scipy
  • pytorch (for QuaNet)
  • svmperf patched for quantification (see below)
  • joblib
  • tqdm
  • pandas, xlrd
  • matplotlib

SVM-perf with quantification-oriented losses

In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD), SVM(AE), or SVM(RAE), you have to first download the svmperf package, apply the patch svm-perf-quantification-ext.patch, and compile the sources. The script prepare_svmperf.sh does all the job. Simply run:

./prepare_svmperf.sh

The resulting directory svm_perf_quantification contains the patched version of svmperf with quantification-oriented losses.

The svm-perf-quantification-ext.patch is an extension of the patch made available by Esuli et al. 2015 that allows SVMperf to optimize for the Q measure as proposed by Barranquero et al. 2015 and for the KLD and NKLD measures as proposed by Esuli et al. 2015. This patch extends the above one by also allowing SVMperf to optimize for AE and RAE.

Documentation

The developer API documentation is available here.

Check out our Wiki, in which many examples are provided: