# Evaluation
Quantification is an appealing tool in scenarios of dataset shift,
and particularly in scenarios of prior-probability shift.
That is, the interest in estimating the class prevalences arises
under the belief that those class prevalences might have changed
with respect to the ones observed during training.
In other words, if this change were assumed to be unlikely
(as in general machine learning scenarios governed by the iid assumption),
one could simply return the training prevalence
as a predictor of the test prevalence.
In brief, quantification requires dedicated evaluation protocols,
which are implemented in QuaPy and explained here.
## Error Measures
The module quapy.error implements the following error measures for quantification:
* _mae_: mean absolute error
* _mrae_: mean relative absolute error
* _mse_: mean squared error
* _mkld_: mean Kullback-Leibler Divergence
* _mnkld_: mean normalized Kullback-Leibler Divergence

Functions _ae_, _rae_, _se_, _kld_, and _nkld_ are also available;
these return the individual (i.e., per-sample) errors, without averaging over samples.

Some classification error measures are also available:
* _acce_: accuracy error (1-accuracy)
* _f1e_: F-1 score error (1-F1 score)

The error functions implement the following interface, e.g.:
```python
mae(true_prevs, prevs_hat)
```
in which the first argument is an ndarray containing the true
prevalences, and the second argument is another ndarray with
the estimates produced by some method.
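
These functions are vectorized, so they can also be applied to batches of prevalence vectors stacked row-wise (one row per test sample), in which case the unaveraged variants return one error value per sample. A minimal sketch illustrating the difference between _ae_ and _mae_ (the values in the comments are hand-computed, assuming _ae_ averages the absolute differences across classes for each sample):
```python
import numpy as np
import quapy as qp

# two test samples of a 3-class problem: one prevalence vector per row
true_prevs = np.asarray([[0.5, 0.3, 0.2],
                         [0.2, 0.2, 0.6]])
prevs_hat = np.asarray([[0.4, 0.3, 0.3],
                        [0.3, 0.2, 0.5]])

print(qp.error.ae(true_prevs, prevs_hat))   # absolute error of each sample (~0.067 for both)
print(qp.error.mae(true_prevs, prevs_hat))  # their average (~0.067)
```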
Some error functions, e.g., _mrae_, _mkld_, and _mnkld_, are
smoothed for numerical stability. In those cases, there is a
third argument, e.g.:
```python
def mrae(true_prevs, prevs_hat, eps=None): ...
```
indicating the value for the smoothing parameter epsilon.
Traditionally, this value is set to 1/(2T), with T the sample size.
One can either pass this value to the function each time, or set
QuaPy's environment variable _SAMPLE_SIZE_ once and omit this
argument thereafter (recommended);
e.g.:
```python
import numpy as np
import quapy as qp

qp.environ['SAMPLE_SIZE'] = 100  # once and for all
true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])
error = qp.error.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
```
will print:
```
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
```
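
Equivalently, the smoothing parameter can be passed explicitly, following the signature shown above (a minimal sketch reusing the arrays of the previous snippet; it assumes the environment-variable route applies the same 1/(2T) value internally):
```python
# explicit smoothing with eps = 1/(2T), where T=100 is the sample size used above
error = qp.error.mrae(true_prev, estim_prev, eps=1. / (2 * 100))
```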
Finally, it is possible to instantiate QuaPy's quantification
error functions from strings using, e.g.:
```python
error_function = qp.error.from_name('mse')
error = error_function(true_prev, estim_prev)
```
## Evaluation Protocols
QuaPy implements the so-called "artificial sampling protocol",
according to which a test set is used to generate samples of a
fixed size, at desired prevalences covering the full spectrum of
prevalence values. This protocol is called "artificial" in contrast
to the "natural prevalence sampling" protocol that,
despite introducing some variability during sampling, approximately
preserves the training class prevalence.
In the artificial sampling protocol, the user specifies the number
of (equally spaced) points to be generated within the interval [0,1].
For example, if n_prevpoints=11 then, for each class, the prevalences
[0., 0.1, 0.2, ..., 1.] will be used. This means that, for two classes,
the number of different prevalences will be 11 (since, once the prevalence
of one class is determined, the other one is constrained). For 3 classes,
the number of valid combinations can be obtained as 11 + 10 + ... + 1 = 66.
In general, the number of valid combinations that will be produced for a given
value of n_prevpoints can be consulted by invoking
quapy.functional.num_prevalence_combinations, e.g.:
```python
import quapy.functional as F
n_prevpoints = 21
n_classes = 4
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
```
In this example, n=1771. Note the last argument, n_repeats, which
indicates the number of samples that will be generated for each
valid combination (typical values are, e.g., 1 for a single sample,
or 10 or higher for computing standard deviations or performing
statistical significance tests).
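
As a sanity check (this closed form is a mathematical observation, not part of QuaPy's API), the count returned by num_prevalence_combinations coincides with the standard "stars and bars" formula for the number of prevalence vectors lying on the grid:
```python
from math import comb

n_prevpoints, n_classes, n_repeats = 21, 4, 1
# number of grid prevalence vectors summing to 1, times the samples drawn per combination:
# C(n_prevpoints + n_classes - 2, n_classes - 1) * n_repeats
n = comb(n_prevpoints + n_classes - 2, n_classes - 1) * n_repeats
print(n)  # 1771; the same formula gives 66 for n_prevpoints=11 and n_classes=3, as above
```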
One can also work the other way around, i.e., set a maximum budget
of evaluations and obtain the number of prevalence points that will
generate a number of evaluations close to, but not higher than, the
fixed budget. This can be achieved with the function
quapy.functional.get_nprevpoints_approximation, e.g.:
```python
budget = 5000
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes, n_repeats=1)
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
print(f'by setting n_prevpoints={n_prevpoints} the number of evaluations for {n_classes} classes will be {n}')
```
that will print:
```
by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960
```
The cost of evaluation thus depends on the values of _n_prevpoints_, _n_classes_,
and _n_repeats_. Since it can be cumbersome to control the overall
cost of an experiment through the number of combinations that a
particular setting of these arguments will generate (particularly
when _n_classes>2_), the evaluation functions
typically allow the user to specify an _evaluation budget_ instead, i.e., a maximum
number of samples to generate. By specifying this argument, one can omit
_n_prevpoints_; the value that yields a number of evaluations as close as
possible to the budget, without surpassing it, will be set automatically.
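
For instance, the budget-based utility shown earlier can be used to derive a suitable value of _n_prevpoints_ and pass it on to the evaluation function explicitly (a minimal sketch; depending on the QuaPy version, the evaluation functions may also accept the budget directly):
```python
# cap the experiment at roughly 2000 sample evaluations for a 4-class problem
budget = 2000
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes=4, n_repeats=1)
# n_prevpoints can now be passed to, e.g., qp.evaluation.artificial_sampling_report(...)
```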
The following script shows a full example in which a PACC model relying
on a logistic regression classifier is
tested on the _kindle_ dataset by means of the artificial prevalence
sampling protocol, on samples of size 500, in terms of various
evaluation metrics.
```python
import quapy as qp
import quapy.functional as F
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('kindle')
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

training = dataset.training
test = dataset.test

lr = LogisticRegression()
pacc = qp.method.aggregative.PACC(lr)
pacc.fit(training)

df = qp.evaluation.artificial_sampling_report(
    pacc,  # the quantification method
    test,  # the test set on which the method will be evaluated
    sample_size=qp.environ['SAMPLE_SIZE'],  # indicates the size of the samples to be drawn
    n_prevpoints=11,  # how many prevalence points will be extracted from the interval [0, 1] for each category
    n_repetitions=1,  # number of times each prevalence will be used to generate a test sample
    n_jobs=-1,  # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
    random_seed=42,  # setting a random seed allows replicating the test samples across runs
    error_metrics=['mae', 'mrae', 'mkld'],  # specify the evaluation metrics
    verbose=True  # set to True to show standard output messages
)
```
The resulting report is a pandas dataframe that can be directly printed.
Here, we set some pandas display options just to make the output clearer;
note also that the estimated prevalences are shown as strings using the
function strprev, which simply converts a prevalence vector into a
string representation with a fixed decimal precision (3 by default):
```python
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
pd.set_option("precision", 3)
df['estim-prev'] = df['estim-prev'].map(F.strprev)
print(df)
```
The output should look like:
```
true-prev estim-prev mae mrae mkld
0 [0.0, 1.0] [0.000, 1.000] 0.000 0.000 0.000e+00
1 [0.1, 0.9] [0.091, 0.909] 0.009 0.048 4.426e-04
2 [0.2, 0.8] [0.163, 0.837] 0.037 0.114 4.633e-03
3 [0.3, 0.7] [0.283, 0.717] 0.017 0.041 7.383e-04
4 [0.4, 0.6] [0.366, 0.634] 0.034 0.070 2.412e-03
5 [0.5, 0.5] [0.459, 0.541] 0.041 0.082 3.387e-03
6 [0.6, 0.4] [0.565, 0.435] 0.035 0.073 2.535e-03
7 [0.7, 0.3] [0.654, 0.346] 0.046 0.108 4.701e-03
8 [0.8, 0.2] [0.725, 0.275] 0.075 0.235 1.515e-02
9 [0.9, 0.1] [0.858, 0.142] 0.042 0.229 7.740e-03
10 [1.0, 0.0] [0.945, 0.055] 0.055 27.357 5.219e-02
```
One can get the averaged scores using standard pandas
functions, e.g.:
```python
print(df.mean())
```
will produce the following output:
```
true-prev 0.500
mae 0.035
mrae 2.578
mkld 0.009
dtype: float64
```
Other evaluation functions include:
* _artificial_sampling_eval_: computes the evaluation for a given
error metric, returning the average score instead of a dataframe.
* _artificial_sampling_prediction_: returns two np.ndarrays containing the
true prevalences and the estimated prevalences, respectively; see the sketch below.
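
For instance, _artificial_sampling_prediction_ makes it possible to compute any error measure a posteriori. A minimal sketch, assuming its sampling parameters mirror those of _artificial_sampling_report_ above (check the actual signature in your QuaPy version):
```python
# reuses the trained 'pacc' quantifier and the 'test' collection from the example above
true_prevs, estim_prevs = qp.evaluation.artificial_sampling_prediction(
    pacc,
    test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=11,
    n_repetitions=1,
    random_seed=42
)
print(qp.error.mae(true_prevs, estim_prevs))  # average MAE over all generated samples
```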
See the documentation for further details.