# Model Selection

As a supervised machine learning task, quantification methods can strongly depend on a good choice of model hyper-parameters. The process whereby those hyper-parameters are chosen is typically known as _Model Selection_, and usually consists of testing different settings and picking the one that performs best on a held-out validation set in terms of a given evaluation measure.
## Targeting a Quantification-oriented loss

The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving it is to be assessed. As a task in its own right, quantification should impose its own model selection strategies, i.e., strategies aimed at finding appropriate configurations specifically designed for the task of quantification.

Quantification has long been regarded as an add-on of classification, and thus the model selection strategies customarily adopted in classification have simply been applied to quantification (see the next section). It has been argued in [Moreo, Alejandro, and Fabrizio Sebastiani. Re-Assessing the "Classify and Count" Quantification Method. ECIR 2021: Advances in Information Retrieval, pp. 75–91](https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6) that specific model selection strategies should be adopted for quantification. That is, model selection strategies for quantification should target quantification-oriented losses and be tested in a variety of scenarios exhibiting different degrees of prior probability shift.
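
To make the notion of prior probability shift concrete, the following minimal sketch (not part of the original example; it assumes the _sampling_ and _prevalence_ methods of _LabelledCollection_ as found in recent QuaPy versions) draws a test sample whose class distribution deviates markedly from that of the training set:

```python
import quapy as qp

# the IMDb training set is roughly balanced between the two classes
training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
print('training prevalence:', training.prevalence())

# draw a test sample of 100 documents in which the first class accounts for 90%
# of the instances; a quantifier fit on the (roughly balanced) training data is
# thus confronted with a markedly different class distribution
shifted_sample = test.sampling(100, 0.9)
print('sample prevalence:  ', shifted_sample.prevalence())
```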

The class _qp.model_selection.GridSearchQ_ implements a grid-search exploration over the space of hyper-parameter combinations that [evaluates](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) each combination of hyper-parameters by means of a given quantification-oriented error metric (e.g., any of the error functions implemented in _qp.error_) and according to a [sampling generation protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols).
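
For a quick idea of what these error functions compute, the following small sketch (an illustration only, assuming the _ae_ and _mae_ functions of _qp.error_ take true and estimated prevalence vectors as numpy arrays) evaluates the error between true and estimated prevalence values:

```python
import numpy as np
import quapy as qp

# true and estimated class prevalence values for one sample (binary case)
true_prev = np.asarray([0.8, 0.2])
estim_prev = np.asarray([0.65, 0.35])

# absolute error between the two prevalence vectors
print(qp.error.ae(true_prev, estim_prev))     # 0.15

# mean absolute error across a collection of samples (one row per sample)
true_prevs = np.asarray([[0.8, 0.2], [0.4, 0.6]])
estim_prevs = np.asarray([[0.65, 0.35], [0.5, 0.5]])
print(qp.error.mae(true_prevs, estim_prevs))  # 0.125
```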

The following is an example (also included in the examples folder) of model selection for quantification:

```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import DistributionMatching
from sklearn.linear_model import LogisticRegression
import numpy as np

"""
In this example, we show how to perform model selection on a DistributionMatching quantifier.
"""

model = DistributionMatching(LogisticRegression())

qp.environ['SAMPLE_SIZE'] = 100
qp.environ['N_JOBS'] = -1  # explore hyper-parameters in parallel

training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

# The model will be returned by the fit method of GridSearchQ.
# Every combination of hyper-parameters will be evaluated by confronting the
# quantifier thus configured against a series of samples generated by means
# of a sample generation protocol. For this example, we will use the
# artificial-prevalence protocol (APP), that generates samples with prevalence
# values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).
# We devote 30% of the dataset for this exploration.
training, validation = training.split_stratified(train_prop=0.7)
protocol = APP(validation)

# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
# (e.g., the number of bins in a DistributionMatching quantifier).
# Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"
# in order to let the quantifier know this hyper-parameter belongs to its underlying
# classifier.
param_grid = {
    'classifier__C': np.logspace(-3, 3, 7),
    'nbins': [8, 16, 32, 64],
}

model = qp.model_selection.GridSearchQ(
    model=model,
    param_grid=param_grid,
    protocol=protocol,
    error='mae',  # the error to optimize is the MAE (a quantification-oriented loss)
    refit=True,   # retrain on the whole labelled set once done
    verbose=True  # show information as the process goes on
).fit(training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# evaluation in terms of MAE
# we use the same evaluation protocol (APP) on the test set
mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')

print(f'MAE={mae_score:.5f}')
```

In this example, the system outputs:
```
[GridSearchQ]: starting model selection with self.n_jobs =-1
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 64} got mae score 0.04021 [took 1.1356s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 32} got mae score 0.04286 [took 1.2139s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 16} got mae score 0.04888 [took 1.2491s]
[GridSearchQ]: hyperparams={'classifier__C': 0.001, 'nbins': 8} got mae score 0.05163 [took 1.5372s]
[...]
[GridSearchQ]: hyperparams={'classifier__C': 1000.0, 'nbins': 32} got mae score 0.02445 [took 2.9056s]
[GridSearchQ]: optimization finished: best params {'classifier__C': 100.0, 'nbins': 32} (score=0.02234) [took 7.3114s]
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'classifier__C': 100.0, 'nbins': 32}
MAE=0.03102
```
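
Besides the protocol-based evaluation shown above, the selected model can also be used to produce prevalence estimates directly. The following continuation is a sketch (not part of the original example) that assumes the standard _quantify_ method of QuaPy quantifiers and the _instances_ and _prevalence_ accessors of _LabelledCollection_:

```python
# once the best configuration has been selected (and the model refitted), the
# quantifier can estimate class prevalence values for any set of unlabelled instances
estim_prevalence = model.quantify(test.instances)
true_prevalence = test.prevalence()

print(f'true test prevalence:      {true_prevalence}')
print(f'estimated test prevalence: {estim_prevalence}')
```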

The parameter _val_split_ can alternatively be used to indicate a validation set (i.e., an instance of _LabelledCollection_) instead of a proportion. This could be useful if one wants to have control over the specific data split to be used across different model selection experiments.
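
A minimal sketch of what this could look like (assuming a QuaPy version in which _GridSearchQ_ accepts _val_split_, as described above; the exact signature may differ across versions):

```python
# a sketch: reuse one explicitly created validation split across experiments
# (assumes GridSearchQ accepts a LabelledCollection via val_split, as described above)
train, val = training.split_stratified(train_prop=0.7)

model = qp.model_selection.GridSearchQ(
    model=DistributionMatching(LogisticRegression()),
    param_grid=param_grid,
    val_split=val,   # an explicit LabelledCollection instead of a proportion
    error='mae',
    refit=True
).fit(train)
```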

## Targeting a Classification-oriented loss

Optimizing a model for quantification can be rather computationally costly. In aggregative methods, one could alternatively try to optimize the classifier's hyper-parameters for classification. Although this is theoretically suboptimal, many articles in the quantification literature have opted for this strategy.

In QuaPy, this is achieved by simply instantiating the classifier learner as a GridSearchCV from scikit-learn. The following code illustrates how to do that:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from quapy.method.aggregative import DistributionMatching

# the classifier's hyper-parameters are optimized for classification (via cross-validation)
learner = GridSearchCV(
    LogisticRegression(),
    param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
    cv=5)
model = DistributionMatching(learner).fit(dataset.training)
```

However, this is conceptually flawed, since the model should be optimized for the task at hand (quantification), and not for a surrogate task (classification), i.e., the model should be requested to deliver low quantification errors, rather than low classification errors.