Evaluation

Quantification is an appealing tool in scenarios of dataset shift, and particularly in scenarios of prior-probability shift. That is, the interest in estimating the class prevalences arises under the belief that those class prevalences might have changed with respect to the ones observed during training. Conversely, if such a change is assumed to be unlikely (as in standard machine learning scenarios governed by the iid assumption), one could simply return the training prevalence as a predictor of the test prevalence. For this reason, quantification requires dedicated evaluation protocols, which are implemented in QuaPy and explained here.

Error Measures

The module quapy.error implements the following error measures for quantification:

  • mae: mean absolute error

  • mrae: mean relative absolute error

  • mse: mean squared error

  • mkld: mean Kullback-Leibler Divergence

  • mnkld: mean normalized Kullback-Leibler Divergence

Functions ae, rae, se, kld, and nkld are also available; these return the individual errors (i.e., without averaging across samples).
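For reference, the following sketch illustrates how ae and rae are commonly defined in the quantification literature (mae and mrae are simply their averages across a collection of test samples). This is only illustrative; the implementations in quapy.error should be preferred:

import numpy as np

def ae(p, p_hat):
    # absolute error: mean absolute difference between true and estimated prevalences
    return np.abs(p_hat - p).mean()

def smooth(p, eps):
    # additive smoothing, as customary in the quantification literature
    return (p + eps) / (eps * len(p) + 1)

def rae(p, p_hat, eps):
    # relative absolute error: as ae, but each difference is divided by the (smoothed) true prevalence
    p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return (np.abs(p_hat - p) / p).mean()

p = np.asarray([0.5, 0.3, 0.2])
p_hat = np.asarray([0.1, 0.3, 0.6])
print(f'{ae(p, p_hat):.3f}')              # 0.267
print(f'{rae(p, p_hat, eps=1/200):.3f}')  # 0.914, matching the qp.ae_.mrae example below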

Some classification-oriented error measures are also available:

  • acce: accuracy error (1-accuracy)

  • f1e: F-1 score error (1-F1 score)

The error functions implement the following interface, e.g.:

mae(true_prevs, prevs_hat)

in which the first argument is an ndarray containing the true prevalences, and the second argument is another ndarray with the prevalence estimates produced by some method.

Some error functions, e.g., mrae, mkld, and mnkld, are smoothed for numerical stability. In those cases, there is a third argument, e.g.:

def mrae(true_prevs, prevs_hat, eps=None): ...

indicating the value for the smoothing parameter epsilon. In the literature, this value is typically set to 1/(2T), with T the sample size. One can either pass this value to the function on each call, or set QuaPy’s environment variable SAMPLE_SIZE once and omit the argument thereafter (recommended); e.g.:

import quapy as qp
import numpy as np

qp.environ['SAMPLE_SIZE'] = 100  # once for all
true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])
error = qp.ae_.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')

will print:

mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914

Finally, it is possible to instantiate QuaPy’s quantification error functions from strings using, e.g.:

error_function = qp.ae_.from_name('mse')
error = error_function(true_prev, estim_prev)
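This can be handy, for instance, for looping over several metrics specified by name (a small illustration reusing true_prev and estim_prev from above; recall that SAMPLE_SIZE must be set for the smoothed metrics):

for metric_name in ['mae', 'mrae', 'mkld']:
    # resolve the error function from its name and apply it to the prevalence vectors
    metric = qp.ae_.from_name(metric_name)
    print(f'{metric_name} = {metric(true_prev, estim_prev):.3f}')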

Evaluation Protocols

QuaPy implements the so-called “artificial sampling protocol”, according to which a test set is used to generate samples of a fixed size, drawn at desired prevalences that cover the full spectrum of prevalence values. This protocol is called “artificial” in contrast to the “natural prevalence sampling” protocol that, despite introducing some variability during sampling, approximately preserves the training class prevalence.

In the artificial sampling protocol, the user specifies the number of (equally spaced) prevalence points to be generated from the interval [0,1].

For example, if n_prevpoints=11 then, for each class, the prevalences [0., 0.1, 0.2, …, 1.] will be used. This means that, for two classes, the number of different prevalences will be 11 (since, once the prevalence of one class is determined, the other one is constrained). For 3 classes, the number of valid combinations can be obtained as 11 + 10 + … + 1 = 66. In general, the number of valid combinations that will be produced for a given value of n_prevpoints can be consulted by invoking quapy.functional.num_prevalence_combinations, e.g.:

import quapy.functional as F
n_prevpoints = 21
n_classes = 4
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)

In this example, n=1771. Note the last argument, n_repeats, which indicates the number of samples that will be generated for each valid combination (typical values are, e.g., 1 for a single sample, or 10 or higher for computing standard deviations or performing statistical significance tests).
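For intuition, the number of valid combinations admits a simple closed form: distributing the total prevalence mass among n_classes classes over a grid of n_prevpoints values is a stars-and-bars problem, yielding C(n_prevpoints + n_classes − 2, n_classes − 1) combinations (times n_repeats). The following sketch, not part of QuaPy, reproduces the figures used in this section:

from math import comb

def n_combinations(n_prevpoints, n_classes, n_repeats=1):
    # stars-and-bars count of the valid prevalence combinations on the grid
    return comb(n_prevpoints + n_classes - 2, n_classes - 1) * n_repeats

print(n_combinations(11, 3))  # 66, as in the 3-class example above
print(n_combinations(21, 4))  # 1771
print(n_combinations(30, 4))  # 4960, as in the budget example below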

One can instead work the other way around, i.e., set a maximum budget of evaluations and obtain the number of prevalence points that will generate a number of evaluations as close as possible to, but not higher than, the fixed budget. This can be achieved with the function quapy.functional.get_nprevpoints_approximation, e.g.:

budget = 5000
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes, n_repeats=1)
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
print(f'by setting n_prevpoints={n_prevpoints} the number of evaluations for {n_classes} classes will be {n}')

that will print:

by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960

The cost of the evaluation depends on the values of n_prevpoints, n_classes, and n_repeats. Since it can be cumbersome to control the overall cost of an experiment through the number of combinations generated by a particular setting of these arguments (particularly when n_classes>2), evaluation functions typically allow the user to specify an evaluation budget instead, i.e., a maximum number of samples to generate. When this argument is specified, n_prevpoints can be omitted, and the value leading to a number of evaluations as close as possible to the budget, without surpassing it, is set automatically.

The following script shows a full example in which a PACC model, relying on a logistic regression classifier, is tested on the kindle dataset by means of the artificial prevalence sampling protocol, using samples of size 500 and various evaluation metrics.

import quapy as qp
import quapy.functional as F
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('kindle')
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

training = dataset.training
test = dataset.test

lr = LogisticRegression()
pacc = qp.method.aggregative.PACC(lr)

pacc.fit(training)

df = qp.evaluation.artificial_sampling_report(
    pacc,  # the quantification method
    test,  # the test set on which the method will be evaluated
    sample_size=qp.environ['SAMPLE_SIZE'],  # indicates the size of samples to be drawn
    n_prevpoints=11,  # how many prevalence points will be extracted from the interval [0, 1] for each category
    n_repetitions=1,  # number of times each prevalence will be used to generate a test sample
    n_jobs=-1,  # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
    random_seed=42,  # setting a random seed allows to replicate the test samples across runs
    error_metrics=['mae', 'mrae', 'mkld'],  # specify the evaluation metrics
    verbose=True  # set to True to show progress information on standard output
)

The resulting report is a pandas dataframe that can be directly printed. Here, we set some pandas display options just to make the output clearer; note also that the estimated prevalences are shown as strings using the strprev function, which simply converts a prevalence vector into its string representation with a fixed decimal precision (3 by default):

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
pd.set_option("precision", 3)
df['estim-prev'] = df['estim-prev'].map(F.strprev)
print(df)

The output should look like:

     true-prev      estim-prev    mae    mrae       mkld
0   [0.0, 1.0]  [0.000, 1.000]  0.000   0.000  0.000e+00
1   [0.1, 0.9]  [0.091, 0.909]  0.009   0.048  4.426e-04
2   [0.2, 0.8]  [0.163, 0.837]  0.037   0.114  4.633e-03
3   [0.3, 0.7]  [0.283, 0.717]  0.017   0.041  7.383e-04
4   [0.4, 0.6]  [0.366, 0.634]  0.034   0.070  2.412e-03
5   [0.5, 0.5]  [0.459, 0.541]  0.041   0.082  3.387e-03
6   [0.6, 0.4]  [0.565, 0.435]  0.035   0.073  2.535e-03
7   [0.7, 0.3]  [0.654, 0.346]  0.046   0.108  4.701e-03
8   [0.8, 0.2]  [0.725, 0.275]  0.075   0.235  1.515e-02
9   [0.9, 0.1]  [0.858, 0.142]  0.042   0.229  7.740e-03
10  [1.0, 0.0]  [0.945, 0.055]  0.055  27.357  5.219e-02

One can get the averaged scores using standard pandas functions, e.g.:

print(df.mean())

will produce the following output:

true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
dtype: float64
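Since the report is a regular dataframe, any other pandas functionality applies directly; for instance, one might sort the samples by error to inspect the prevalence values on which the method struggles most (just an illustration):

# show the three samples with the largest absolute error
worst = df.sort_values('mae', ascending=False).head(3)
print(worst[['true-prev', 'estim-prev', 'mae']])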

Other evaluation functions include the following (a usage sketch follows the list):

  • artificial_sampling_eval: computes the evaluation for a given error metric and returns the average score instead of a dataframe.

  • artificial_sampling_prediction: returns two np.arrays containing the true prevalences and the estimated prevalences, respectively.
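As an illustration, the following sketch continues the example above; the parameter names are assumed to mirror those of artificial_sampling_report, so they may need to be adapted to the exact signatures documented for your QuaPy version:

# averaged MAE only (a single float); parameters assumed analogous to artificial_sampling_report
mae_score = qp.evaluation.artificial_sampling_eval(
    pacc, test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=11,
    n_repetitions=1,
    error_metric='mae'  # assumed: a single metric name instead of a list
)
print(f'MAE = {mae_score:.3f}')

# true and estimated prevalences, as two aligned np.arrays
true_prevs, estim_prevs = qp.evaluation.artificial_sampling_prediction(
    pacc, test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=11,
    n_repetitions=1
)
print(f'MAE = {qp.ae_.mae(true_prevs, estim_prevs):.3f}')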

See the documentation for further details.