forked from moreo/QuaPy
232 lines
8.6 KiB
Plaintext
232 lines
8.6 KiB
Plaintext
# Evaluation
|
|
|
|
Quantification is an appealing tool in scenarios of dataset shift,
|
|
and particularly in scenarios of prior-probability shift.
|
|
That is, the interest in estimating the class prevalences arises
|
|
under the belief that those class prevalences might have changed
|
|
with respect to the ones observed during training.
|
|
In other words, one could simply return the training prevalence
|
|
as a predictor of the test prevalence if this change is assumed
|
|
to be unlikely (as is the case in general scenarios of
|
|
machine learning governed by the iid assumption).
|
|
In brief, quantification requires dedicated evaluation protocols,
|
|
which are implemented in QuaPy and explained here.
|
|
|
|
## Error Measures
|
|
|
|
The module quapy.error implements the following error measures for quantification:
|
|
* _mae_: mean absolute error
|
|
* _mrae_: mean relative absolute error
|
|
* _mse_: mean squared error
|
|
* _mkld_: mean Kullback-Leibler Divergence
|
|
* _mnkld_: mean normalized Kullback-Leibler Divergence
|
|
|
|
Functions _ae_, _rae_, _se_, _kld_, and _nkld_ are also available,
|
|
which return the individual errors (i.e., without averaging the whole).
|
|
|
|
Some errors of classification are also available:
|
|
* _acce_: accuracy error (1-accuracy)
|
|
* _f1e_: F-1 score error (1-F1 score)
|
|
|
|
The error functions implement the following interface, e.g.:
|
|
|
|
```python
|
|
mae(true_prevs, prevs_hat)
|
|
```
|
|
|
|
in which the first argument is a ndarray containing the true
|
|
prevalences, and the second argument is another ndarray with
|
|
the estimations produced by some method.
|
|
|
|
Some error functions, e.g., _mrae_, _mkld_, and _mnkld_, are
|
|
smoothed for numerical stability. In those cases, there is a
|
|
third argument, e.g.:
|
|
|
|
```python
|
|
def mrae(true_prevs, prevs_hat, eps=None): ...
|
|
```
|
|
|
|
indicating the value for the smoothing parameter epsilon.
|
|
Traditionally, this value is set to 1/(2T) in past literature,
|
|
with T the sampling size. One could either pass this value
|
|
to the function each time, or to set a QuaPy's environment
|
|
variable _SAMPLE_SIZE_ once, and ommit this argument
|
|
thereafter (recommended);
|
|
e.g.:
|
|
|
|
```python
|
|
qp.environ['SAMPLE_SIZE'] = 100 # once for all
|
|
true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes
|
|
estim_prev = np.asarray([0.1, 0.3, 0.6])
|
|
error = qp.ae_.mrae(true_prev, estim_prev)
|
|
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
|
|
```
|
|
|
|
will print:
|
|
```
|
|
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
|
|
```
|
|
|
|
Finally, it is possible to instantiate QuaPy's quantification
|
|
error functions from strings using, e.g.:
|
|
|
|
```python
|
|
error_function = qp.ae_.from_name('mse')
|
|
error = error_function(true_prev, estim_prev)
|
|
```
|
|
|
|
## Evaluation Protocols
|
|
|
|
QuaPy implements the so-called "artificial sampling protocol",
|
|
according to which a test set is used to generate samplings at
|
|
desired prevalences of fixed size and covering the full spectrum
|
|
of prevalences. This protocol is called "artificial" in contrast
|
|
to the "natural prevalence sampling" protocol that,
|
|
despite introducing some variability during sampling, approximately
|
|
preserves the training class prevalence.
|
|
|
|
In the artificial sampling procol, the user specifies the number
|
|
of (equally distant) points to be generated from the interval [0,1].
|
|
|
|
For example, if n_prevpoints=11 then, for each class, the prevalences
|
|
[0., 0.1, 0.2, ..., 1.] will be used. This means that, for two classes,
|
|
the number of different prevalences will be 11 (since, once the prevalence
|
|
of one class is determined, the other one is constrained). For 3 classes,
|
|
the number of valid combinations can be obtained as 11 + 10 + ... + 1 = 66.
|
|
In general, the number of valid combinations that will be produced for a given
|
|
value of n_prevpoints can be consulted by invoking
|
|
quapy.functional.num_prevalence_combinations, e.g.:
|
|
|
|
```python
|
|
import quapy.functional as F
|
|
n_prevpoints = 21
|
|
n_classes = 4
|
|
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
|
|
```
|
|
|
|
in this example, n=1771. Note the last argument, n_repeats, that
|
|
informs of the number of examples that will be generated for any
|
|
valid combination (typical values are, e.g., 1 for a single sample,
|
|
or 10 or higher for computing standard deviations of performing statistical
|
|
significance tests).
|
|
|
|
One can instead work the other way around, i.e., one could set a
|
|
maximum budged of evaluations and get the number of prevalence points that
|
|
will generate a number of evaluations close, but not higher, than
|
|
the fixed budget. This can be achieved with the function
|
|
quapy.functional.get_nprevpoints_approximation, e.g.:
|
|
|
|
```python
|
|
budget = 5000
|
|
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes, n_repeats=1)
|
|
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
|
|
print(f'by setting n_prevpoints={n_prevpoints} the number of evaluations for {n_classes} classes will be {n}')
|
|
```
|
|
that will print:
|
|
```
|
|
by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960
|
|
```
|
|
|
|
The cost of evaluation will depend on the values of _n_prevpoints_, _n_classes_,
|
|
and _n_repeats_. Since it might sometimes be cumbersome to control the overall
|
|
cost of an experiment having to do with the number of combinations that
|
|
will be generated for a particular setting of these arguments (particularly
|
|
when _n_classes>2_), evaluation functions
|
|
typically allow the user to rather specify an _evaluation budget_, i.e., a maximum
|
|
number of samplings to generate. By specifying this argument, one could avoid
|
|
specifying _n_prevpoints_, and the value for it that would lead to a closer
|
|
number of evaluation budget, without surpassing it, will be automatically set.
|
|
|
|
The following script shows a full example in which a PACC model relying
|
|
on a Logistic Regressor classifier is
|
|
tested on the _kindle_ dataset by means of the artificial prevalence
|
|
sampling protocol on samples of size 500, in terms of various
|
|
evaluation metrics.
|
|
|
|
````python
|
|
import quapy as qp
|
|
import quapy.functional as F
|
|
from sklearn.linear_model import LogisticRegression
|
|
|
|
qp.environ['SAMPLE_SIZE'] = 500
|
|
|
|
dataset = qp.datasets.fetch_reviews('kindle')
|
|
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
|
|
|
|
training = dataset.training
|
|
test = dataset.test
|
|
|
|
lr = LogisticRegression()
|
|
pacc = qp.method.aggregative.PACC(lr)
|
|
|
|
pacc.fit(training)
|
|
|
|
df = qp.evaluation.artificial_sampling_report(
|
|
pacc, # the quantification method
|
|
test, # the test set on which the method will be evaluated
|
|
sample_size=qp.environ['SAMPLE_SIZE'], #indicates the size of samples to be drawn
|
|
n_prevpoints=11, # how many prevalence points will be extracted from the interval [0, 1] for each category
|
|
n_repetitions=1, # number of times each prevalence will be used to generate a test sample
|
|
n_jobs=-1, # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
|
|
random_seed=42, # setting a random seed allows to replicate the test samples across runs
|
|
error_metrics=['mae', 'mrae', 'mkld'], # specify the evaluation metrics
|
|
verbose=True # set to True to show some standard-line outputs
|
|
)
|
|
````
|
|
|
|
The resulting report is a pandas' dataframe that can be directly printed.
|
|
Here, we set some display options from pandas just to make the output clearer;
|
|
note also that the estimated prevalences are shown as strings using the
|
|
function strprev function that simply converts a prevalence into a
|
|
string representing it, with a fixed decimal precision (default 3):
|
|
|
|
```python
|
|
import pandas as pd
|
|
pd.set_option('display.expand_frame_repr', False)
|
|
pd.set_option("precision", 3)
|
|
df['estim-prev'] = df['estim-prev'].map(F.strprev)
|
|
print(df)
|
|
```
|
|
|
|
The output should look like:
|
|
|
|
```
|
|
true-prev estim-prev mae mrae mkld
|
|
0 [0.0, 1.0] [0.000, 1.000] 0.000 0.000 0.000e+00
|
|
1 [0.1, 0.9] [0.091, 0.909] 0.009 0.048 4.426e-04
|
|
2 [0.2, 0.8] [0.163, 0.837] 0.037 0.114 4.633e-03
|
|
3 [0.3, 0.7] [0.283, 0.717] 0.017 0.041 7.383e-04
|
|
4 [0.4, 0.6] [0.366, 0.634] 0.034 0.070 2.412e-03
|
|
5 [0.5, 0.5] [0.459, 0.541] 0.041 0.082 3.387e-03
|
|
6 [0.6, 0.4] [0.565, 0.435] 0.035 0.073 2.535e-03
|
|
7 [0.7, 0.3] [0.654, 0.346] 0.046 0.108 4.701e-03
|
|
8 [0.8, 0.2] [0.725, 0.275] 0.075 0.235 1.515e-02
|
|
9 [0.9, 0.1] [0.858, 0.142] 0.042 0.229 7.740e-03
|
|
10 [1.0, 0.0] [0.945, 0.055] 0.055 27.357 5.219e-02
|
|
```
|
|
|
|
One can get the averaged scores using standard pandas'
|
|
functions, i.e.:
|
|
|
|
```python
|
|
print(df.mean())
|
|
```
|
|
|
|
will produce the following output:
|
|
|
|
```
|
|
true-prev 0.500
|
|
mae 0.035
|
|
mrae 2.578
|
|
mkld 0.009
|
|
dtype: float64
|
|
```
|
|
|
|
Other evaluation functions include:
|
|
|
|
* _artificial_sampling_eval_: that computes the evaluation for a
|
|
given evaluation metric, returning the average instead of a dataframe.
|
|
* _artificial_sampling_prediction_: that returns two np.arrays containing the
|
|
true prevalences and the estimated prevalences.
|
|
|
|
See the documentation for further details. |