Model Selection

Like other supervised machine learning methods, quantification methods can depend strongly on a good choice of model hyper-parameters. The process by which those hyper-parameters are chosen is known as Model Selection, and typically consists of testing different settings and picking the one that performs best on a held-out validation set according to a given evaluation measure.
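
As a minimal sketch of this procedure, the snippet below enumerates every combination in a grid and keeps the one with the lowest validation error. The helper names (expand_grid, select_best) and the toy error function are hypothetical: they only stand in for fitting a model and scoring it on held-out data, and are not part of QuaPy.

```python
import math
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of hyper-parameter values."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def select_best(param_grid, validation_error):
    """Return (best_params, best_error); lower error is better."""
    best_params, best_error = None, math.inf
    for params in expand_grid(param_grid):
        error = validation_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# A grid with 10 values of C and 2 class-weight settings
param_grid = {
    'C': [10.0 ** e for e in range(-4, 6)],
    'class_weight': ['balanced', None],
}


def toy_validation_error(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    penalty = 0.0 if params['class_weight'] == 'balanced' else 0.2
    return 0.01 * abs(math.log10(params['C']) + 1) + penalty


n_combinations = len(list(expand_grid(param_grid)))
best_params, best_error = select_best(param_grid, toy_validation_error)
print(n_combinations, best_params, best_error)
```

The number of model fits grows with the product of the number of values per hyper-parameter (here 10 x 2 = 20), which is one reason grids are usually kept small.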

Targeting a Quantification-oriented loss

The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving it is to be assessed. As a task in its own right, quantification should impose its own model selection strategies, i.e., strategies aimed at finding configurations that are appropriate specifically for the task of quantification.

Quantification has long been regarded as an add-on to classification, and thus the model selection strategies customarily adopted in classification have simply been applied to quantification (see the next section). Moreo and Sebastiani argue that specific model selection strategies should be adopted for quantification (Moreo, Alejandro, and Fabrizio Sebastiani. “Re-Assessing the ‘Classify and Count’ Quantification Method.” arXiv preprint arXiv:2011.02552 (2020)). That is, model selection strategies for quantification should target quantification-oriented losses and be tested in a variety of scenarios exhibiting different degrees of prior probability shift.

The class qp.model_selection.GridSearchQ implements a grid-search exploration over the space of hyper-parameter combinations that evaluates each combination of hyper-parameters by means of a given quantification-oriented error metric (e.g., any of the error functions implemented in qp.error) and according to the artificial sampling protocol.
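
The idea of scoring one configuration on artificially drawn samples can be sketched as follows. This is a simplified, self-contained illustration and not QuaPy's implementation: the function names, the toy pool of scored items, and the threshold-based classify-and-count quantifier are all assumptions made for the example.

```python
import random


def draw_sample(positives, negatives, prevalence, sample_size, rng):
    """Draw a sample (with replacement) with the requested share of positives."""
    n_pos = round(prevalence * sample_size)
    return (rng.choices(positives, k=n_pos)
            + rng.choices(negatives, k=sample_size - n_pos))


def classify_and_count(sample, threshold):
    """Estimate the positive prevalence as the fraction of items predicted positive."""
    return sum(1 for score, _ in sample if score >= threshold) / len(sample)


def sampled_mae(positives, negatives, threshold, sample_size=100,
                n_prevpoints=101, n_repetitions=10, seed=0):
    """Average absolute prevalence error over samples spanning the prevalence range."""
    rng = random.Random(seed)
    errors = []
    for i in range(n_prevpoints):
        prevalence = i / (n_prevpoints - 1)
        for _ in range(n_repetitions):
            sample = draw_sample(positives, negatives, prevalence, sample_size, rng)
            true_prev = sum(label for _, label in sample) / len(sample)
            estimate = classify_and_count(sample, threshold)
            # In the binary case the error on the positive class equals the
            # error on the negative class, so this is also the per-class mean.
            errors.append(abs(estimate - true_prev))
    return sum(errors) / len(errors), len(errors)


# Toy validation pool of (score, label) pairs; positives tend to score higher
rng = random.Random(42)
positives = [(rng.gauss(1.0, 1.0), 1) for _ in range(500)]
negatives = [(rng.gauss(-1.0, 1.0), 0) for _ in range(500)]

# Two candidate "configurations" of the toy classifier, compared by sampled MAE
mae_centered, n_evals = sampled_mae(positives, negatives, threshold=0.0)
mae_lenient, _ = sampled_mae(positives, negatives, threshold=-3.0)
print(n_evals, round(mae_centered, 3), round(mae_lenient, 3))
```

With 101 prevalence points and 10 repetitions the loop performs 1010 evaluations. A grid search for quantification repeats this kind of evaluation once per hyper-parameter combination and keeps the combination with the lowest error.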

The following is an example of model selection for quantification:

import quapy as qp
from quapy.method.aggregative import PCC
from sklearn.linear_model import LogisticRegression
import numpy as np

# set a seed to replicate runs
np.random.seed(0)
qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('hp', tfidf=True, min_df=5)

# The model will be returned by the fit method of GridSearchQ.
# Model selection will be performed with a fixed budget of 1000 evaluations
# for each hyper-parameter combination. The error to optimize is the MAE for
# quantification, as evaluated on artificially drawn samples at prevalences 
# covering the entire spectrum on a held-out portion (40%) of the training set.
model = qp.model_selection.GridSearchQ(
    model=PCC(LogisticRegression()),
    param_grid={'C': np.logspace(-4,5,10), 'class_weight': ['balanced', None]},
    sample_size=qp.environ['SAMPLE_SIZE'],
    eval_budget=1000,
    error='mae',
    refit=True,  # retrain on the whole labelled set
    val_split=0.4,
    verbose=True  # show information as the process goes on
).fit(dataset.training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# evaluation in terms of MAE
results = qp.evaluation.artificial_sampling_eval(
    model,
    dataset.test,
    sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=101,
    n_repetitions=10,
    error_metric='mae'
)

print(f'MAE={results:.5f}')

In this example, the system outputs:

[GridSearchQ]: starting optimization with n_jobs=1
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135
[GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866
[...]
[GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676
[GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982)
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5005.54it/s]
MAE=0.20342

The parameter val_split can alternatively be set to a validation set (i.e., an instance of LabelledCollection) instead of a proportion. This is useful when one wants control over the specific data split used across different model selection experiments.

Targeting a Classification-oriented loss

Optimizing a model for quantification can be rather costly in computational terms. In aggregative methods, one could alternatively optimize the classifier’s hyper-parameters for classification. Although this is theoretically suboptimal, many articles in the quantification literature have opted for this strategy.

In QuaPy, this is achieved by simply instantiating the classifier learner as a GridSearchCV from scikit-learn. The following code illustrates how to do that:

from sklearn.model_selection import GridSearchCV

# reuses the imports and the dataset loaded in the previous example
learner = GridSearchCV(
    LogisticRegression(),
    param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
    cv=5)
model = PCC(learner).fit(dataset.training)
print(f'model selection ended: best hyper-parameters={model.learner.best_params_}')

In this example, once the model is evaluated with the same protocol as in the previous section, the system outputs:

model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s]
MAE=0.41734

Note that the MAE is worse than the one we obtained when optimizing for quantification and, indeed, the hyper-parameters found optimal differ substantially between the two selection modalities. The hyper-parameters C=10000 and class_weight=None work well for the specific training prevalence of the HP dataset, but they turn out to be suboptimal when the class prevalences of the test set differ from it (as is tested in quantification scenarios).

This is not always the case, however: in practice, one can find examples in which optimizing for classification results in a better quantifier than optimizing for quantification. Nonetheless, this is theoretically unlikely to happen.
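
The reason a classifier tuned for accuracy at the training prevalence can make a poor quantifier under prior probability shift can be illustrated with simple arithmetic. The sketch below assumes a plain classify-and-count quantifier whose true and false positive rates stay fixed as the prevalence changes; the two classifiers and the training prevalence are hypothetical and are not taken from the HP dataset.

```python
def accuracy(tpr, fpr, prevalence):
    """Classification accuracy of a binary classifier at a given positive prevalence."""
    return prevalence * tpr + (1 - prevalence) * (1 - fpr)


def cc_estimate(tpr, fpr, prevalence):
    """Expected fraction of items predicted positive (classify and count)."""
    return prevalence * tpr + (1 - prevalence) * fpr


def mean_abs_error(tpr, fpr, n_prevpoints=101):
    """Mean absolute prevalence error over an evenly spaced grid of prevalences."""
    grid = [i / (n_prevpoints - 1) for i in range(n_prevpoints)]
    return sum(abs(cc_estimate(tpr, fpr, p) - p) for p in grid) / n_prevpoints


train_prevalence = 0.9               # hypothetical, heavily imbalanced training set
biased = dict(tpr=1.00, fpr=0.80)    # leans towards the majority class
balanced = dict(tpr=0.85, fpr=0.15)  # symmetric error rates

acc_biased = accuracy(prevalence=train_prevalence, **biased)
acc_balanced = accuracy(prevalence=train_prevalence, **balanced)
mae_biased = mean_abs_error(**biased)
mae_balanced = mean_abs_error(**balanced)
print(f'accuracy: biased={acc_biased:.3f} balanced={acc_balanced:.3f}')
print(f'MAE:      biased={mae_biased:.3f} balanced={mae_balanced:.3f}')
```

Under these assumptions the biased classifier wins on accuracy at the training prevalence (0.92 against 0.85) but loses clearly on quantification error averaged over the prevalence range (about 0.40 against about 0.08), which mirrors the behaviour observed above.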