# Model Selection
As a supervised machine learning task, quantification methods can strongly depend on a good choice of model hyper-parameters. The process whereby those hyper-parameters are chosen is known as *Model Selection*, and typically consists of testing different configurations and picking the one that performs best on a held-out validation set according to a given evaluation measure.
<div class="section" id="targeting-a-quantification-oriented-loss">
|
||
<h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this headline">¶</a></h2>
|
||
The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving it is to be assessed. As a task in its own right, quantification should impose its own model selection strategies, i.e., strategies aimed at finding configurations specifically suited to the task of quantification.
Quantification has long been regarded as an add-on of classification, and thus the model selection strategies customarily adopted in classification have simply been applied to quantification (see the next section). It has been argued in *Moreo, Alejandro, and Fabrizio Sebastiani. "Re-Assessing the 'Classify and Count' Quantification Method." arXiv preprint arXiv:2011.02552 (2020)* that specific model selection strategies should be adopted for quantification. That is, model selection strategies for quantification should target quantification-oriented losses and be tested in a variety of scenarios exhibiting different degrees of prior probability shift.
The class *qp.model_selection.GridSearchQ* implements a grid-search exploration over the space of hyper-parameter combinations that evaluates each combination of hyper-parameters by means of a given quantification-oriented error metric (e.g., any of the error functions implemented in *qp.error*) and according to the [*artificial sampling protocol*](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation).
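As a quick illustration of what such a quantification-oriented error metric computes, the following minimal sketch measures the mean absolute error between a true and an estimated class distribution using *qp.error.mae*; the two prevalence vectors are made up for illustration:

```python
import numpy as np
import quapy as qp

# hypothetical true and estimated class prevalence vectors for a 3-class problem
true_prev = np.asarray([0.20, 0.30, 0.50])
estim_prev = np.asarray([0.35, 0.25, 0.40])

# mean absolute error between the two prevalence vectors: mean(|p_i - p̂_i|)
print(qp.error.mae(true_prev, estim_prev))  # ≈ 0.1
```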
The following is an example of model selection for quantification:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span>
|
||
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
|
||
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
|
||
|
||
<span class="c1"># set a seed to replicate runs</span>
|
||
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
|
||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>
|
||
|
||
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'hp'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||
|
||
<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
|
||
<span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span>
|
||
<span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span>
|
||
<span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span>
|
||
<span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span>
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
|
||
<span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span>
|
||
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
|
||
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span>
|
||
<span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
|
||
<span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span>
|
||
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set</span>
|
||
<span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
|
||
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span>
|
||
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||
|
||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>
|
||
|
||
<span class="c1"># evaluation in terms of MAE</span>
|
||
<span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span>
|
||
<span class="n">model</span><span class="p">,</span>
|
||
<span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span>
|
||
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span>
|
||
<span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span>
|
||
<span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
|
||
<span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span>
|
||
<span class="p">)</span>
|
||
|
||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||
</pre></div>
|
||
</div>
|
||
In this example, the system outputs:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1
|
||
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987
|
||
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135
|
||
[GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866
|
||
[...]
|
||
[GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676
|
||
[GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982)
|
||
[GridSearchQ]: refitting on the whole development set
|
||
model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'}
|
||
1010 evaluations will be performed for each combination of hyper-parameters
|
||
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5005.54it/s]
|
||
MAE=0.20342
|
||
</pre></div>
|
||
</div>
|
||
The parameter *val_split* can alternatively be used to indicate a validation set (i.e., an instance of *LabelledCollection*) instead of a proportion. This could be useful if one wants to have control over the specific data split used across different model selection experiments.
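A minimal sketch of this usage, assuming the *split_stratified* method of *LabelledCollection* is used to produce the fixed split, might look as follows:

```python
# keep a fixed 60/40 stratified split of the training data, so that every
# model selection experiment is validated on exactly the same samples
train, val = dataset.training.split_stratified(train_prop=0.6)

model = qp.model_selection.GridSearchQ(
    model=PCC(LogisticRegression()),
    param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
    sample_size=qp.environ['SAMPLE_SIZE'],
    eval_budget=1000,
    error='mae',
    refit=True,
    val_split=val,  # a LabelledCollection instead of a proportion
    verbose=True
).fit(train)
```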
## Targeting a Classification-oriented loss
Optimizing a model for quantification can be rather costly computationally. In aggregative methods, one could alternatively try to optimize the classifier's hyper-parameters for classification. Although this is theoretically suboptimal, many articles in the quantification literature have opted for this strategy.
In QuaPy, this is achieved by simply instantiating the classifier learner as a GridSearchCV from scikit-learn. The following code illustrates how to do that:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span>
|
||
<span class="n">LogisticRegression</span><span class="p">(),</span>
|
||
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
|
||
<span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||
</pre></div>
|
||
</div>
|
||
In this example, the system outputs:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
|
||
1010 evaluations will be performed for each combination of hyper-parameters
|
||
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s]
|
||
MAE=0.41734
|
||
</pre></div>
|
||
</div>
|
||
Note that the MAE is worse than the one we obtained when optimizing for quantification and, indeed, the hyper-parameters found optimal largely differ between the two selection modalities. The hyper-parameters C=10000 and class_weight=None were found to work well for the specific training prevalence of the HP dataset, but they turned out to be suboptimal when the class prevalences of the test set differ (as is indeed the case in quantification scenarios).
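One way to see this effect directly is to draw a test sample at a prevalence far from the training one and compare the two quantifiers on it. The following sketch assumes the *sampling* method of *LabelledCollection*, and uses *quant_opt* and *clf_opt* as hypothetical names for the quantification-optimized and classification-optimized models obtained above:

```python
# draw a test sample of 500 documents at prevalence (0.1, 0.9), i.e.,
# far from the natural training prevalence (a strong prior probability shift)
shifted_sample = dataset.test.sampling(500, 0.1)

true_prev = shifted_sample.prevalence()
# quant_opt and clf_opt are hypothetical names for the two models fitted above
for name, quantifier in [('quantification-optimized', quant_opt),
                         ('classification-optimized', clf_opt)]:
    estim_prev = quantifier.quantify(shifted_sample.instances)
    print(f'{name}: AE={qp.error.ae(true_prev, estim_prev):.4f}')
```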
This is, however, not always the case, and one could in practice find examples in which optimizing for classification ends up producing a better quantifier than optimizing for quantification. Nonetheless, this is theoretically unlikely to happen.