# Evaluation

Quantification is an appealing tool in scenarios of dataset shift, and particularly in scenarios of prior-probability shift. That is, the interest in estimating the class prevalences arises under the belief that those class prevalences might have changed with respect to the ones observed during training. Conversely, if this change is assumed to be unlikely (as is the case in general machine learning scenarios governed by the iid assumption), one could simply return the training prevalence as a predictor of the test prevalence. In brief, quantification requires dedicated evaluation protocols, which are implemented in QuaPy and explained here.

## Error Measures

The module quapy.error implements the following error measures for quantification:
* _mae_: mean absolute error
* _mrae_: mean relative absolute error
* _mse_: mean squared error
* _mkld_: mean Kullback-Leibler Divergence
* _mnkld_: mean normalized Kullback-Leibler Divergence

Functions _ae_, _rae_, _se_, _kld_, and _nkld_ are also available; they return the individual errors (i.e., without averaging over the whole sample).

Some classification error measures are also available:
* _acce_: accuracy error (1-accuracy)
* _f1e_: F-1 score error (1-F1 score)

The error functions implement the following interface, e.g.:

```python
mae(true_prevs, prevs_hat)
```

in which the first argument is an ndarray containing the true prevalences, and the second argument is another ndarray with the estimates produced by some method.

Some error functions, e.g., _mrae_, _mkld_, and _mnkld_, are smoothed for numerical stability. In those cases, there is a third argument, e.g.:

```python
def mrae(true_prevs, prevs_hat, eps=None): ...
```

indicating the value of the smoothing parameter epsilon. This value is typically set to 1/(2T) in the literature, with T the sample size. One can either pass this value to the function in every call, or set QuaPy's environment variable _SAMPLE_SIZE_ once and omit this argument thereafter (recommended); e.g.:

```python
import numpy as np
import quapy as qp

qp.environ['SAMPLE_SIZE'] = 100  # once for all

true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])

error = qp.error.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
```

will print:

```
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
```

Finally, it is possible to instantiate QuaPy's quantification error functions from strings, e.g.:

```python
error_function = qp.error.from_name('mse')
error = error_function(true_prev, estim_prev)
```

## Evaluation Protocols

QuaPy implements the so-called "artificial sampling protocol", according to which the test set is used to generate samples of fixed size at desired prevalences, covering the full spectrum of prevalences. This protocol is called "artificial" in contrast to the "natural prevalence sampling" protocol that, despite introducing some variability during sampling, approximately preserves the training class prevalence.

In the artificial sampling protocol, the user specifies the number of (equally spaced) prevalence points to be taken from the interval [0,1]. For example, if n_prevpoints=11 then, for each class, the prevalences [0., 0.1, 0.2, ..., 1.] will be used. This means that, for two classes, the number of different prevalence vectors will be 11 (since, once the prevalence of one class is determined, the other one is constrained). For 3 classes, the number of valid combinations can be obtained as 11 + 10 + ... + 1 = 66.
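This count is a standard "stars and bars" number: choosing, for each class, a grid value so that all prevalences sum up to 1 amounts to distributing n_prevpoints-1 units among n_classes classes. The following stand-alone sketch (plain Python, not QuaPy code) verifies the figures given above:

```python
from math import comb

def n_valid_combinations(n_prevpoints, n_classes):
    # number of ways of distributing the total prevalence mass (which must sum to 1)
    # across n_classes classes using a grid of n_prevpoints equally spaced points in [0,1]
    return comb(n_prevpoints + n_classes - 2, n_classes - 1)

print(n_valid_combinations(11, 2))  # 11
print(n_valid_combinations(11, 3))  # 66
```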
In general, the number of valid combinations that will be produced for a given value of n_prevpoints can be consulted by invoking quapy.functional.num_prevalence_combinations, e.g.:

```python
import quapy.functional as F

n_prevpoints = 21
n_classes = 4
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
```

In this example, n=1771. Note the last argument, n_repeats, which indicates the number of samples that will be generated for each valid combination (typical values are, e.g., 1 for a single sample, or 10 or higher for computing standard deviations or performing statistical significance tests).

One can instead work the other way around, i.e., set a maximum budget of evaluations and obtain the number of prevalence points that generates a number of evaluations as close as possible to, but not higher than, the fixed budget. This can be achieved with the function quapy.functional.get_nprevpoints_approximation, e.g.:

```python
budget = 5000
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes, n_repeats=1)
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
print(f'by setting n_prevpoints={n_prevpoints} the number of evaluations for {n_classes} classes will be {n}')
```

that will print:

```
by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960
```

The cost of the evaluation depends on the values of _n_prevpoints_, _n_classes_, and _n_repeats_. Since it can be cumbersome to control the overall cost of an experiment through the number of combinations these arguments generate (particularly when _n_classes>2_), evaluation functions typically allow the user to instead specify an _evaluation budget_, i.e., a maximum number of samples to generate. When this argument is specified, _n_prevpoints_ can be omitted, and the value that yields a number of evaluations as close as possible to the budget, without surpassing it, is set automatically.

The following script shows a full example in which a PACC model based on a logistic regression classifier is tested on the _kindle_ dataset by means of the artificial prevalence sampling protocol, on samples of size 500, in terms of various evaluation metrics.

```python
import quapy as qp
import quapy.functional as F
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('kindle')
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

training = dataset.training
test = dataset.test

lr = LogisticRegression()
pacc = qp.method.aggregative.PACC(lr)

pacc.fit(training)

df = qp.evaluation.artificial_sampling_report(
    pacc,  # the quantification method
    test,  # the test set on which the method will be evaluated
    sample_size=qp.environ['SAMPLE_SIZE'],  # indicates the size of the samples to be drawn
    n_prevpoints=11,  # how many prevalence points will be extracted from the interval [0, 1] for each category
    n_repetitions=1,  # number of times each prevalence will be used to generate a test sample
    n_jobs=-1,  # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
    random_seed=42,  # setting a random seed allows replicating the test samples across runs
    error_metrics=['mae', 'mrae', 'mkld'],  # specify the evaluation metrics
    verbose=True  # set to True to display progress information on standard output
)
```

The resulting report is a pandas dataframe that can be printed directly.
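As a side note, if one prefers to bound the total number of generated samples rather than fixing _n_prevpoints_ by hand, the helper shown earlier can be used to derive _n_prevpoints_ from a budget, and the resulting value can then be passed to _artificial_sampling_report_ exactly as above. A minimal sketch, reusing the _training_ collection from the example (its _n_classes_ attribute is an assumption here):

```python
import quapy.functional as F

# derive the largest n_prevpoints whose number of generated samples stays within the budget;
# 'training' is the collection loaded in the example above (its n_classes attribute is assumed)
budget = 5000
n_prevpoints = F.get_nprevpoints_approximation(budget, training.n_classes, n_repeats=1)
print(f'n_prevpoints={n_prevpoints} keeps the evaluation within {budget} samples')
```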
Before printing the report, we set some pandas display options just to make the output clearer; note also that the estimated prevalences are shown as strings using the strprev function, which simply converts a prevalence vector into its string representation with a fixed decimal precision (3 by default):

```python
import pandas as pd

pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 3)
df['estim-prev'] = df['estim-prev'].map(F.strprev)
print(df)
```

The output should look like:

```
    true-prev      estim-prev    mae    mrae       mkld
0    [0.0, 1.0]  [0.000, 1.000]  0.000   0.000  0.000e+00
1    [0.1, 0.9]  [0.091, 0.909]  0.009   0.048  4.426e-04
2    [0.2, 0.8]  [0.163, 0.837]  0.037   0.114  4.633e-03
3    [0.3, 0.7]  [0.283, 0.717]  0.017   0.041  7.383e-04
4    [0.4, 0.6]  [0.366, 0.634]  0.034   0.070  2.412e-03
5    [0.5, 0.5]  [0.459, 0.541]  0.041   0.082  3.387e-03
6    [0.6, 0.4]  [0.565, 0.435]  0.035   0.073  2.535e-03
7    [0.7, 0.3]  [0.654, 0.346]  0.046   0.108  4.701e-03
8    [0.8, 0.2]  [0.725, 0.275]  0.075   0.235  1.515e-02
9    [0.9, 0.1]  [0.858, 0.142]  0.042   0.229  7.740e-03
10   [1.0, 0.0]  [0.945, 0.055]  0.055  27.357  5.219e-02
```

One can get the averaged scores using standard pandas functions, i.e.:

```python
print(df.mean())
```

will produce the following output:

```
true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
dtype: float64
```

Other evaluation functions include:

* _artificial_sampling_eval_: computes the evaluation for a given evaluation metric, returning the average score instead of a dataframe.
* _artificial_sampling_prediction_: returns two np.ndarrays containing the true prevalences and the estimated prevalences.

See the documentation for further details.
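The sketch below illustrates a possible use of _artificial_sampling_prediction_; it assumes this function shares the sampling arguments (sample_size, n_prevpoints, n_repetitions) of _artificial_sampling_report_ shown above, and it reuses the _pacc_ model and _test_ set fitted there (check the API documentation for the exact signature):

```python
# obtain the true and estimated prevalences for every generated sample
# (the keyword names below mirror artificial_sampling_report and are an assumption)
true_prevs, estim_prevs = qp.evaluation.artificial_sampling_prediction(
    pacc, test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=11,
    n_repetitions=1
)

# any of the error measures from quapy.error can then be applied to the raw predictions
print(f'MAE = {qp.error.mae(true_prevs, estim_prevs):.3f}')
```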