<!doctype html>
< html lang = "en" >
< head >
< meta charset = "utf-8" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" / > < meta name = "generator" content = "Docutils 0.19: https://docutils.sourceforge.io/" / >
< title > Model Selection — QuaPy 0.1.7 documentation< / title >
< link rel = "stylesheet" type = "text/css" href = "_static/pygments.css" / >
< link rel = "stylesheet" type = "text/css" href = "_static/bizstyle.css" / >
< script data-url_root = "./" id = "documentation_options" src = "_static/documentation_options.js" > < / script >
< script src = "_static/jquery.js" > < / script >
< script src = "_static/underscore.js" > < / script >
< script src = "_static/_sphinx_javascript_frameworks_compat.js" > < / script >
< script src = "_static/doctools.js" > < / script >
< script src = "_static/sphinx_highlight.js" > < / script >
< script src = "_static/bizstyle.js" > < / script >
< link rel = "index" title = "Index" href = "genindex.html" / >
< link rel = "search" title = "Search" href = "search.html" / >
< link rel = "next" title = "Plotting" href = "Plotting.html" / >
< link rel = "prev" title = "Quantification Methods" href = "Methods.html" / >
< meta name = "viewport" content = "width=device-width,initial-scale=1.0" / >
<!-- [if lt IE 9]>
< script src = "_static/css3-mediaqueries.js" > < / script >
<![endif]-->
< / head > < body >
< div class = "related" role = "navigation" aria-label = "related navigation" >
< h3 > Navigation< / h3 >
< ul >
< li class = "right" style = "margin-right: 10px" >
< a href = "genindex.html" title = "General Index"
accesskey="I">index< / a > < / li >
< li class = "right" >
< a href = "py-modindex.html" title = "Python Module Index"
>modules< / a > |< / li >
< li class = "right" >
< a href = "Plotting.html" title = "Plotting"
accesskey="N">next< / a > |< / li >
< li class = "right" >
< a href = "Methods.html" title = "Quantification Methods"
accesskey="P">previous< / a > |< / li >
< li class = "nav-item nav-item-0" > < a href = "index.html" > QuaPy 0.1.7 documentation< / a > » < / li >
< li class = "nav-item nav-item-this" > < a href = "" > Model Selection< / a > < / li >
< / ul >
< / div >
< div class = "document" >
< div class = "documentwrapper" >
< div class = "bodywrapper" >
< div class = "body" role = "main" >
< section id = "model-selection" >
< h1 > Model Selection< a class = "headerlink" href = "#model-selection" title = "Permalink to this heading" > ¶< / a > < / h1 >
< p > Like any supervised machine learning method, quantification methods
can depend strongly on a good choice of model hyper-parameters.
The process whereby those hyper-parameters are chosen is
known as < em > Model Selection< / em > , and typically consists of
testing different settings and picking the one that performs
best on a held-out validation set in terms of a given
evaluation measure.< / p >
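< p > The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not QuaPy code: the names < em > grid_search< / em > , < em > fit< / em > and < em > score< / em > are hypothetical, and the toy "model" is just a decision threshold on a one-dimensional feature.< / p >

```python
from itertools import product


def grid_search(param_grid, fit, score, train, val):
    """Try every combination in param_grid; return (best_params, best_score).

    Lower scores are better (the score is treated as an error).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit(train, **params)   # train on the training portion only
        current = score(model, val)    # evaluate on the held-out portion
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy data: (feature, label) pairs.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
val = [(0.2, 0), (0.45, 0), (0.55, 1), (0.8, 1)]


def fit(train, threshold):
    # Nothing is learned in this toy example; the hyper-parameter is the model.
    return threshold


def score(model, data):
    # Classification error rate of the rule "predict 1 if x >= threshold".
    errors = sum((x >= model) != bool(y) for x, y in data)
    return errors / len(data)


best_params, best_score = grid_search(
    {"threshold": [0.3, 0.5, 0.7]}, fit, score, train, val
)
print(best_params, best_score)
```

< p > The evaluation measure is passed in as a function, so the same loop could, in principle, target a classification error or a quantification error.< / p >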
< section id = "targeting-a-quantification-oriented-loss" >
< h2 > Targeting a Quantification-oriented loss< a class = "headerlink" href = "#targeting-a-quantification-oriented-loss" title = "Permalink to this heading" > ¶< / a > < / h2 >
< p > The task being optimized determines the evaluation protocol,
i.e., the criteria according to which the performance of
any given method for solving it is to be assessed.
As a task in its own right, quantification should impose
its own model selection strategies, i.e., strategies
aimed at finding appropriate configurations
specifically designed for the task of quantification.< / p >
< p > Quantification has long been regarded as an add-on of
classification, and thus the model selection strategies
customarily adopted in classification have simply been
applied to quantification (see the next section).
It has been argued in < em > Moreo, Alejandro, and Fabrizio Sebastiani.
“Re-Assessing the ‘Classify and Count’ Quantification Method.”
arXiv preprint arXiv:2011.02552 (2020).< / em >
that specific model selection strategies should
be adopted for quantification. That is, model selection
strategies for quantification should target
quantification-oriented losses and be tested in a variety
of scenarios exhibiting different degrees of prior
probability shift.< / p >
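< p > As an illustration of what a quantification-oriented loss measures, the following sketch computes the mean absolute error (MAE) between a true and an estimated class-prevalence vector. The helper names are hypothetical and the label lists are made up for the example; QuaPy ships its own error functions in < em > qp.error< / em > .< / p >

```python
def prevalence(labels, n_classes):
    """Fraction of items belonging to each class 0..n_classes-1."""
    return [sum(1 for y in labels if y == c) / len(labels) for c in range(n_classes)]


def mae(true_prev, estim_prev):
    """Mean absolute error between two class-prevalence vectors."""
    if len(true_prev) != len(estim_prev):
        raise ValueError("prevalence vectors must have the same length")
    return sum(abs(t - e) for t, e in zip(true_prev, estim_prev)) / len(true_prev)


# Made-up sample: 6 items of class 0 and 4 of class 1 ...
true_labels = [0] * 6 + [1] * 4
# ... for which some classifier predicted class 0 eight times.
predicted_labels = [0] * 8 + [1] * 2

true_prev = prevalence(true_labels, n_classes=2)
estim_prev = prevalence(predicted_labels, n_classes=2)
error = mae(true_prev, estim_prev)
print(f"true={true_prev} estimated={estim_prev} MAE={error:.3f}")
```

< p > Unlike accuracy, this error is computed per sample rather than per item, which is why it needs to be averaged over many samples with different prevalence values.< / p >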
< p > The class
< em > qp.model_selection.GridSearchQ< / em >
implements a grid-search exploration over the space of
hyper-parameter combinations that evaluates each
combination of hyper-parameters
by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in < em > qp.error< / em > ) and according to the
< a class = "reference external" href = "https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation" > < em > artificial sampling protocol< / em > < / a > .< / p >
< p > The following is an example of model selection for quantification:< / p >
< div class = "highlight-python notranslate" > < div class = "highlight" > < pre > < span > < / span > < span class = "kn" > import< / span > < span class = "nn" > quapy< / span > < span class = "k" > as< / span > < span class = "nn" > qp< / span >
< span class = "kn" > from< / span > < span class = "nn" > quapy.method.aggregative< / span > < span class = "kn" > import< / span > < span class = "n" > PCC< / span >
< span class = "kn" > from< / span > < span class = "nn" > sklearn.linear_model< / span > < span class = "kn" > import< / span > < span class = "n" > LogisticRegression< / span >
< span class = "kn" > import< / span > < span class = "nn" > numpy< / span > < span class = "k" > as< / span > < span class = "nn" > np< / span >
< span class = "c1" > # set a seed to replicate runs< / span >
< span class = "n" > np< / span > < span class = "o" > .< / span > < span class = "n" > random< / span > < span class = "o" > .< / span > < span class = "n" > seed< / span > < span class = "p" > (< / span > < span class = "mi" > 0< / span > < span class = "p" > )< / span >
< span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > environ< / span > < span class = "p" > [< / span > < span class = "s1" >'SAMPLE_SIZE'< / span > < span class = "p" > ]< / span > < span class = "o" > =< / span > < span class = "mi" > 500< / span >
< span class = "n" > dataset< / span > < span class = "o" > =< / span > < span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > datasets< / span > < span class = "o" > .< / span > < span class = "n" > fetch_reviews< / span > < span class = "p" > (< / span > < span class = "s1" >'hp'< / span > < span class = "p" > ,< / span > < span class = "n" > tfidf< / span > < span class = "o" > =< / span > < span class = "kc" > True< / span > < span class = "p" > ,< / span > < span class = "n" > min_df< / span > < span class = "o" > =< / span > < span class = "mi" > 5< / span > < span class = "p" > )< / span >
< span class = "c1" > # The model will be returned by the fit method of GridSearchQ.< / span >
< span class = "c1" > # Model selection will be performed with a fixed budget of 1000 evaluations< / span >
< span class = "c1" > # for each hyper-parameter combination. The error to optimize is the MAE for< / span >
< span class = "c1" > # quantification, as evaluated on artificially drawn samples at prevalences < / span >
< span class = "c1" > # covering the entire spectrum on a held-out portion (40%) of the training set.< / span >
< span class = "n" > model< / span > < span class = "o" > =< / span > < span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > model_selection< / span > < span class = "o" > .< / span > < span class = "n" > GridSearchQ< / span > < span class = "p" > (< / span >
< span class = "n" > model< / span > < span class = "o" > =< / span > < span class = "n" > PCC< / span > < span class = "p" > (< / span > < span class = "n" > LogisticRegression< / span > < span class = "p" > ()),< / span >
< span class = "n" > param_grid< / span > < span class = "o" > =< / span > < span class = "p" > {< / span > < span class = "s1" >'C'< / span > < span class = "p" > :< / span > < span class = "n" > np< / span > < span class = "o" > .< / span > < span class = "n" > logspace< / span > < span class = "p" > (< / span > < span class = "o" > -< / span > < span class = "mi" > 4< / span > < span class = "p" > ,< / span > < span class = "mi" > 5< / span > < span class = "p" > ,< / span > < span class = "mi" > 10< / span > < span class = "p" > ),< / span > < span class = "s1" >'class_weight'< / span > < span class = "p" > :< / span > < span class = "p" > [< / span > < span class = "s1" >'balanced'< / span > < span class = "p" > ,< / span > < span class = "kc" > None< / span > < span class = "p" > ]},< / span >
< span class = "n" > sample_size< / span > < span class = "o" > =< / span > < span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > environ< / span > < span class = "p" > [< / span > < span class = "s1" >'SAMPLE_SIZE'< / span > < span class = "p" > ],< / span >
< span class = "n" > eval_budget< / span > < span class = "o" > =< / span > < span class = "mi" > 1000< / span > < span class = "p" > ,< / span >
< span class = "n" > error< / span > < span class = "o" > =< / span > < span class = "s1" >'mae'< / span > < span class = "p" > ,< / span >
< span class = "n" > refit< / span > < span class = "o" > =< / span > < span class = "kc" > True< / span > < span class = "p" > ,< / span > < span class = "c1" > # retrain on the whole labelled set< / span >
< span class = "n" > val_split< / span > < span class = "o" > =< / span > < span class = "mf" > 0.4< / span > < span class = "p" > ,< / span >
< span class = "n" > verbose< / span > < span class = "o" > =< / span > < span class = "kc" > True< / span > < span class = "c1" > # show information as the process goes on< / span >
< span class = "p" > )< / span > < span class = "o" > .< / span > < span class = "n" > fit< / span > < span class = "p" > (< / span > < span class = "n" > dataset< / span > < span class = "o" > .< / span > < span class = "n" > training< / span > < span class = "p" > )< / span >
< span class = "nb" > print< / span > < span class = "p" > (< / span > < span class = "sa" >f< / span >< span class = "s1" >'model selection ended: best hyper-parameters=< / span >< span class = "si" >{< / span >< span class = "n" >model< / span >< span class = "o" >.< / span >< span class = "n" >best_params_< / span >< span class = "si" >}< / span >< span class = "s1" >'< / span > < span class = "p" > )< / span >
< span class = "n" > model< / span > < span class = "o" > =< / span > < span class = "n" > model< / span > < span class = "o" > .< / span > < span class = "n" > best_model_< / span >
< span class = "c1" > # evaluation in terms of MAE< / span >
< span class = "n" > results< / span > < span class = "o" > =< / span > < span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > evaluation< / span > < span class = "o" > .< / span > < span class = "n" > artificial_sampling_eval< / span > < span class = "p" > (< / span >
< span class = "n" > model< / span > < span class = "p" > ,< / span >
< span class = "n" > dataset< / span > < span class = "o" > .< / span > < span class = "n" > test< / span > < span class = "p" > ,< / span >
< span class = "n" > sample_size< / span > < span class = "o" > =< / span > < span class = "n" > qp< / span > < span class = "o" > .< / span > < span class = "n" > environ< / span > < span class = "p" > [< / span > < span class = "s1" >'SAMPLE_SIZE'< / span > < span class = "p" > ],< / span >
< span class = "n" > n_prevpoints< / span > < span class = "o" > =< / span > < span class = "mi" > 101< / span > < span class = "p" > ,< / span >
< span class = "n" > n_repetitions< / span > < span class = "o" > =< / span > < span class = "mi" > 10< / span > < span class = "p" > ,< / span >
< span class = "n" > error_metric< / span > < span class = "o" > =< / span > < span class = "s1" >'mae'< / span >
< span class = "p" > )< / span >
< span class = "nb" > print< / span > < span class = "p" > (< / span > < span class = "sa" >f< / span >< span class = "s1" >'MAE=< / span >< span class = "si" >{< / span >< span class = "n" >results< / span >< span class = "si" >:< / span >< span class = "s1" >.5f< / span >< span class = "si" >}< / span >< span class = "s1" >'< / span > < span class = "p" > )< / span >
< / pre > < / div >
< / div >
< p > In this example, the system outputs:< / p >
< div class = "highlight-default notranslate" > < div class = "highlight" > < pre > < span > < / span > [GridSearchQ]: starting optimization with n_jobs=1
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135
[GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866
[...]
[GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676
[GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982)
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5005.54it/s]
MAE=0.20342
< / pre > < / div >
< / div >
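< p > The figure of 1010 evaluations reported in the output above is consistent with a grid of 101 prevalence values, each repeated 10 times, in a binary problem. The following is a minimal sketch of how such a sampling protocol could be organised; it is not QuaPy's implementation. The function names are hypothetical, and drawing with replacement is an assumption made here for simplicity.< / p >

```python
import random


def prevalence_grid(n_prevpoints):
    """Evenly spaced positive-class prevalence values covering [0, 1]."""
    return [i / (n_prevpoints - 1) for i in range(n_prevpoints)]


def draw_sample(pos_items, neg_items, prev, sample_size, rng):
    """Draw sample_size items so that a fraction `prev` of them is positive.

    Items are drawn with replacement (an assumption of this sketch).
    """
    n_pos = round(prev * sample_size)
    n_neg = sample_size - n_pos
    return rng.choices(pos_items, k=n_pos) + rng.choices(neg_items, k=n_neg)


rng = random.Random(0)  # fixed seed so that runs can be replicated
pos_items = [("pos", i) for i in range(50)]
neg_items = [("neg", i) for i in range(50)]

# 101 prevalence points x 10 repetitions, samples of 500 items each.
samples = [
    draw_sample(pos_items, neg_items, prev, sample_size=500, rng=rng)
    for prev in prevalence_grid(101)
    for _ in range(10)
]
print(f"{len(samples)} samples generated")
```

< p > A quantifier would then be asked to estimate the prevalence of every sample, and the errors would be averaged across all of them.< / p >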
< p > The parameter < em > val_split< / em > can alternatively be used to indicate
a validation set (i.e., an instance of < em > LabelledCollection< / em > ) instead
of a proportion. This can be useful if one wants to have control
over the specific data split to be used across different model selection
experiments.< / p >
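< p > The two ways of specifying < em > val_split< / em > can be pictured with the following sketch, in which plain lists stand in for < em > LabelledCollection< / em > instances. The helper < em > resolve_validation< / em > is hypothetical and does not reproduce QuaPy's internals; in particular, the split here is a simple shuffle, not a stratified one, and using the whole training set when an explicit validation set is given is an assumption.< / p >

```python
import random


def resolve_validation(training, val_split, seed=0):
    """Return (train_part, validation_part).

    If val_split is a float in (0, 1), that proportion of `training` is held
    out. Otherwise val_split is taken to be an explicit validation set and
    `training` is used in full.
    """
    if isinstance(val_split, float):
        if not 0.0 < val_split < 1.0:
            raise ValueError("a proportion must lie strictly between 0 and 1")
        items = list(training)
        random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
        n_val = round(len(items) * val_split)
        return items[n_val:], items[:n_val]
    return list(training), list(val_split)


training = list(range(10))

# Proportion: 40% of the training items are held out for validation.
train_part, val_part = resolve_validation(training, 0.4)

# Explicit set: the same validation items are reused in every experiment.
fixed_validation = [100, 101, 102]
train_full, val_fixed = resolve_validation(training, fixed_validation)
print(len(train_part), len(val_part), len(train_full), len(val_fixed))
```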
< / section >
< section id = "targeting-a-classification-oriented-loss" >
< h2 > Targeting a Classification-oriented loss< a class = "headerlink" href = "#targeting-a-classification-oriented-loss" title = "Permalink to this heading" > ¶< / a > < / h2 >
< p > Optimizing a model for quantification can be rather
computationally costly.
In aggregative methods, one could alternatively try to optimize
the classifier’s hyper-parameters for classification.
Although this is theoretically suboptimal, many articles in
the quantification literature have opted for this strategy.< / p >
< p > In QuaPy, this is achieved by simply instantiating the
classifier learner as a GridSearchCV from scikit-learn.
The following code illustrates how to do that:< / p >
< div class = "highlight-python notranslate" > < div class = "highlight" > < pre > < span > < / span > < span class = "kn" > from< / span > < span class = "nn" > sklearn.model_selection< / span > < span class = "kn" > import< / span > < span class = "n" > GridSearchCV< / span >
< span class = "n" > learner< / span > < span class = "o" > =< / span > < span class = "n" > GridSearchCV< / span > < span class = "p" > (< / span >
< span class = "n" > LogisticRegression< / span > < span class = "p" > (),< / span >
< span class = "n" > param_grid< / span > < span class = "o" > =< / span > < span class = "p" > {< / span > < span class = "s1" >'C'< / span > < span class = "p" > :< / span > < span class = "n" > np< / span > < span class = "o" > .< / span > < span class = "n" > logspace< / span > < span class = "p" > (< / span > < span class = "o" > -< / span > < span class = "mi" > 4< / span > < span class = "p" > ,< / span > < span class = "mi" > 5< / span > < span class = "p" > ,< / span > < span class = "mi" > 10< / span > < span class = "p" > ),< / span > < span class = "s1" >'class_weight'< / span > < span class = "p" > :< / span > < span class = "p" > [< / span > < span class = "s1" >'balanced'< / span > < span class = "p" > ,< / span > < span class = "kc" > None< / span > < span class = "p" > ]},< / span >
< span class = "n" > cv< / span > < span class = "o" > =< / span > < span class = "mi" > 5< / span > < span class = "p" > )< / span >
< span class = "n" > model< / span > < span class = "o" > =< / span > < span class = "n" > PCC< / span > < span class = "p" > (< / span > < span class = "n" > learner< / span > < span class = "p" > )< / span > < span class = "o" > .< / span > < span class = "n" > fit< / span > < span class = "p" > (< / span > < span class = "n" > dataset< / span > < span class = "o" > .< / span > < span class = "n" > training< / span > < span class = "p" > )< / span >
< span class = "nb" > print< / span > < span class = "p" > (< / span > < span class = "sa" >f< / span >< span class = "s1" >'model selection ended: best hyper-parameters=< / span >< span class = "si" >{< / span >< span class = "n" >model< / span >< span class = "o" >.< / span >< span class = "n" >learner< / span >< span class = "o" >.< / span >< span class = "n" >best_params_< / span >< span class = "si" >}< / span >< span class = "s1" >'< / span > < span class = "p" > )< / span >
< / pre > < / div >
< / div >
< p > In this example, the system outputs:< / p >
< div class = "highlight-default notranslate" > < div class = "highlight" > < pre > < span > < / span > model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5379.55it/s]
MAE=0.41734
< / pre > < / div >
< / div >
< p > Note that the MAE is worse than the one we obtained when optimizing
for quantification and, indeed, the hyper-parameters found optimal
differ considerably between the two selection modalities. The
hyper-parameters C=10000 and class_weight=None have been found
to work well for the specific training prevalence of the HP dataset,
but they turn out to be suboptimal when the
class prevalences of the test set differ (as is indeed tested
in quantification scenarios).< / p >
< p > This is, however, not always the case, and one could, in practice,
find examples
in which optimizing for classification yields a better
quantifier than optimizing for quantification.
Nonetheless, this is theoretically unlikely to happen.< / p >
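< p > The reason why the two selection criteria can disagree can be illustrated with a small worked example. It assumes two hypothetical classifiers, A and B, described only by made-up true-positive and false-positive rates, and uses plain classify-and-count (not PCC) so that the expected estimate can be written in closed form. The numbers are illustrative and are not taken from the experiments above.< / p >

```python
def accuracy(tpr, fpr, prev):
    """Expected accuracy when a fraction `prev` of the items is positive."""
    return prev * tpr + (1 - prev) * (1 - fpr)


def cc_estimate(tpr, fpr, prev):
    """Expected positive prevalence reported by classify-and-count."""
    return prev * tpr + (1 - prev) * fpr


def mae_over_grid(tpr, fpr, n_points=101):
    """Mean absolute quantification error over prevalences 0, 0.01, ..., 1.

    In the binary case the absolute error is the same for both classes,
    so the error on the positive class equals the MAE.
    """
    grid = [i / (n_points - 1) for i in range(n_points)]
    return sum(abs(cc_estimate(tpr, fpr, p) - p) for p in grid) / len(grid)


# Hypothetical classifiers: name -> (true-positive rate, false-positive rate).
candidates = {"A": (0.95, 0.40), "B": (0.80, 0.20)}
train_prev = 0.9  # assumed prevalence of the positive class at training time

best_for_classification = max(
    candidates, key=lambda name: accuracy(*candidates[name], train_prev)
)
best_for_quantification = min(
    candidates, key=lambda name: mae_over_grid(*candidates[name])
)
print(best_for_classification, best_for_quantification)
```

< p > Under these assumptions, A is the more accurate classifier at the training prevalence, whereas B yields the lower error once the prevalence is allowed to vary over the whole range.< / p >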
< / section >
< / section >
< div class = "clearer" > < / div >
< / div >
< / div >
< / div >
< div class = "sphinxsidebar" role = "navigation" aria-label = "main navigation" >
< div class = "sphinxsidebarwrapper" >
< div >
< h3 > < a href = "index.html" > Table of Contents< / a > < / h3 >
< ul >
< li > < a class = "reference internal" href = "#" > Model Selection< / a > < ul >
< li > < a class = "reference internal" href = "#targeting-a-quantification-oriented-loss" > Targeting a Quantification-oriented loss< / a > < / li >
< li > < a class = "reference internal" href = "#targeting-a-classification-oriented-loss" > Targeting a Classification-oriented loss< / a > < / li >
< / ul >
< / li >
< / ul >
< / div >
< div >
< h4 > Previous topic< / h4 >
< p class = "topless" > < a href = "Methods.html"
title="previous chapter">Quantification Methods< / a > < / p >
< / div >
< div >
< h4 > Next topic< / h4 >
< p class = "topless" > < a href = "Plotting.html"
title="next chapter">Plotting< / a > < / p >
< / div >
< div role = "note" aria-label = "source link" >
< h3 > This Page< / h3 >
< ul class = "this-page-menu" >
< li > < a href = "_sources/Model-Selection.md.txt"
rel="nofollow">Show Source< / a > < / li >
< / ul >
< / div >
< div id = "searchbox" style = "display: none" role = "search" >
< h3 id = "searchlabel" > Quick search< / h3 >
< div class = "searchformwrapper" >
< form class = "search" action = "search.html" method = "get" >
< input type = "text" name = "q" aria-labelledby = "searchlabel" autocomplete = "off" autocorrect = "off" autocapitalize = "off" spellcheck = "false" / >
< input type = "submit" value = "Go" / >
< / form >
< / div >
< / div >
< script > document . getElementById ( 'searchbox' ) . style . display = "block" < / script >
< / div >
< / div >
< div class = "clearer" > < / div >
< / div >
< div class = "related" role = "navigation" aria-label = "related navigation" >
< h3 > Navigation< / h3 >
< ul >
< li class = "right" style = "margin-right: 10px" >
< a href = "genindex.html" title = "General Index"
>index< / a > < / li >
< li class = "right" >
< a href = "py-modindex.html" title = "Python Module Index"
>modules< / a > |< / li >
< li class = "right" >
< a href = "Plotting.html" title = "Plotting"
>next< / a > |< / li >
< li class = "right" >
< a href = "Methods.html" title = "Quantification Methods"
>previous< / a > |< / li >
< li class = "nav-item nav-item-0" > < a href = "index.html" > QuaPy 0.1.7 documentation< / a > » < / li >
< li class = "nav-item nav-item-this" > < a href = "" > Model Selection< / a > < / li >
< / ul >
< / div >
< div class = "footer" role = "contentinfo" >
© Copyright 2021, Alejandro Moreo.
Created using < a href = "https://www.sphinx-doc.org/" > Sphinx< / a > 5.3.0.
< / div >
< / body >
< / html >