QuaPy/docs/build/html/_sources/Model-Selection.md.txt

# Model Selection

As a supervised machine learning task, quantification methods
can strongly depend on a good choice of model hyper-parameters.
The process whereby those hyper-parameters are chosen is
typically known as _Model Selection_, and typically consists of
testing different settings and picking the one that performed
best in a held-out validation set in terms of any given
evaluation measure.

## Targeting a Quantification-oriented loss

The task being optimized determines the evaluation protocol,
i.e., the criteria according to which the performance of
any given method for solving is to be assessed.
As a task on its own right, quantification should impose
its own model selection strategies, i.e., strategies
aimed at finding appropriate configurations
specifically designed for the task of quantification.

Quantification has long been regarded as an add-on of
classification, and thus the model selection strategies
customarily adopted in classification have simply been
applied to quantification (see the next section).
It has been argued in _Moreo, Alejandro, and Fabrizio Sebastiani.
"Re-Assessing the" Classify and Count" Quantification Method."
arXiv preprint arXiv:2011.02552 (2020)._
that specific model selection strategies should
be adopted for quantification. That is, model selection
strategies for quantification should target
quantification-oriented losses and be tested in a variety
of scenarios exhibiting different degrees of prior
probability shift.

The class
_qp.model_selection.GridSearchQ_
implements a grid-search exploration over the space of
hyper-parameter combinations that evaluates each
combination of hyper-parameters
by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in _qp.error_) and according to the
[_artificial sampling protocol_](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation).

The following is an example of model selection for quantification:

```python
import quapy as qp
from quapy.method.aggregative import PCC
from sklearn.linear_model import LogisticRegression
import numpy as np

# set a seed to replicate runs
np.random.seed(0)
qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('hp', tfidf=True, min_df=5)

# The model will be returned by the fit method of GridSearchQ.
# Model selection will be performed with a fixed budget of 1000 evaluations
# for each hyper-parameter combination. The error to optimize is the MAE for
# quantification, as evaluated on artificially drawn samples at prevalences
# covering the entire spectrum on a held-out portion (40%) of the training set.
model = qp.model_selection.GridSearchQ(
    model=PCC(LogisticRegression()),
    param_grid={'C': np.logspace(-4,5,10), 'class_weight': ['balanced', None]},
    sample_size=qp.environ['SAMPLE_SIZE'],
    eval_budget=1000,
    error='mae',
    refit=True,  # retrain on the whole labelled set
    val_split=0.4,
    verbose=True  # show information as the process goes on
).fit(dataset.training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# evaluation in terms of MAE
results = qp.evaluation.artificial_sampling_eval(
    model,
    dataset.test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=101,
    n_repetitions=10,
    error_metric='mae'
)

print(f'MAE={results:.5f}')
```

In this example, the system outputs:
```
[GridSearchQ]: starting optimization with n_jobs=1
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135
[GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866
[...]
[GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676
[GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982)
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5005.54it/s]
MAE=0.20342
```

The parameter _val_split_ can alternatively be used to indicate
a validation set (i.e., an instance of _LabelledCollection_) instead
of a proportion. This could be useful if one wants to have control
on the specific data split to be used across different model selection
experiments.

## Targeting a Classification-oriented loss

Optimizing a model for quantification could rather be
computationally costly.
In aggregative methods, one could alternatively try to optimize
the classifier's hyper-parameters for classification.
Although this is theoretically suboptimal, many articles in
quantification literature have opted for this strategy.

In QuaPy, this is achieved by simply instantiating the
classifier learner as a GridSearchCV from scikit-learn.
The following code illustrates how to do that:

```python
learner = GridSearchCV(
    LogisticRegression(),
    param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
    cv=5)
model = PCC(learner).fit(dataset.training)
print(f'model selection ended: best hyper-parameters={model.learner.best_params_}')
```

In this example, the system outputs:
```
model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s]
MAE=0.41734
```

Note that the MAE is worse than the one we obtained when optimizing
for quantification and, indeed, the hyper-parameters found optimal
largely differ between the two selection modalities. The
hyper-parameters C=10000 and class_weight=None have been found
to work well for the specific training prevalence of the HP dataset,
but these hyper-parameters turned out to be suboptimal when the
class prevalences of the test set differs (as is indeed tested
in scenarios of quantification).

This is, however, not always the case, and one could, in practice,
find examples
in which optimizing for classification ends up resulting in a better
quantifier than when optimizing for quantification.
Nonetheless, this is theoretically unlikely to happen.