QuaPy/docs/build/html/Model-Selection.html



<!doctype html>

<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />

    <title>Model Selection &#8212; QuaPy 0.1.7 documentation</title>
    <link rel="stylesheet" type="text/css" href="_static/pygments.css" />
    <link rel="stylesheet" type="text/css" href="_static/bizstyle.css" />
    
    <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/_sphinx_javascript_frameworks_compat.js"></script>
    <script src="_static/doctools.js"></script>
    <script src="_static/sphinx_highlight.js"></script>
    <script src="_static/bizstyle.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="next" title="Plotting" href="Plotting.html" />
    <link rel="prev" title="Quantification Methods" href="Methods.html" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0" />
    <!--[if lt IE 9]>
    <script src="_static/css3-mediaqueries.js"></script>
    <![endif]-->
  </head><body>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="Plotting.html" title="Plotting"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="Methods.html" title="Quantification Methods"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
        <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <section id="model-selection">
<h1>Model Selection<a class="headerlink" href="#model-selection" title="Permalink to this heading">¶</a></h1>
<p>As a supervised machine learning task, quantification methods
can strongly depend on a good choice of model hyper-parameters.
The process whereby those hyper-parameters are chosen is
typically known as <em>Model Selection</em>, and typically consists of
testing different settings and picking the one that performed
best in a held-out validation set in terms of any given
evaluation measure.</p>
<section id="targeting-a-quantification-oriented-loss">
<h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this heading">¶</a></h2>
<p>The task being optimized determines the evaluation protocol,
i.e., the criteria according to which the performance of
any given method for solving is to be assessed.
As a task on its own right, quantification should impose
its own model selection strategies, i.e., strategies
aimed at finding appropriate configurations
specifically designed for the task of quantification.</p>
<p>Quantification has long been regarded as an add-on of
classification, and thus the model selection strategies
customarily adopted in classification have simply been
applied to quantification (see the next section).
It has been argued in <em>Moreo, Alejandro, and Fabrizio Sebastiani.
“Re-Assessing the” Classify and Count” Quantification Method.”
arXiv preprint arXiv:2011.02552 (2020).</em>
that specific model selection strategies should
be adopted for quantification. That is, model selection
strategies for quantification should target
quantification-oriented losses and be tested in a variety
of scenarios exhibiting different degrees of prior
probability shift.</p>
<p>The class
<em>qp.model_selection.GridSearchQ</em>
implements a grid-search exploration over the space of
hyper-parameter combinations that evaluates each<br />
combination of hyper-parameters
by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in <em>qp.error</em>) and according to the
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation"><em>artificial sampling protocol</em></a>.</p>
<p>The following is an example of model selection for quantification:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="c1"># set a seed to replicate runs</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>

<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;hp&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
<span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span>
<span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span>
<span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span>
<span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span>
    <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
    <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
    <span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
    <span class="n">error</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span>
    <span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>  <span class="c1"># retrain on the whole labelled set</span>
    <span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
    <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span>  <span class="c1"># show information as the process goes on</span>
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>

<span class="c1"># evaluation in terms of MAE</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span>
    <span class="n">model</span><span class="p">,</span>
    <span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span>
    <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
    <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span>
    <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">error_metric</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span>
<span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>In this example, the system outputs:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24987
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: None} got mae score 0.48135
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24866
[...]
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 100000.0, &#39;class_weight&#39;: None} got mae score 0.43676
[GridSearchQ]: optimization finished: best params {&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;} (score=0.19982)
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5005.54it/s]
MAE=0.20342
</pre></div>
</div>
<p>The parameter <em>val_split</em> can alternatively be used to indicate
a validation set (i.e., an instance of <em>LabelledCollection</em>) instead
of a proportion. This could be useful if one wants to have control
on the specific data split to be used across different model selection
experiments.</p>
</section>
<section id="targeting-a-classification-oriented-loss">
<h2>Targeting a Classification-oriented loss<a class="headerlink" href="#targeting-a-classification-oriented-loss" title="Permalink to this heading">¶</a></h2>
<p>Optimizing a model for quantification could rather be
computationally costly.
In aggregative methods, one could alternatively try to optimize
the classifier’s hyper-parameters for classification.
Although this is theoretically suboptimal, many articles in
quantification literature have opted for this strategy.</p>
<p>In QuaPy, this is achieved by simply instantiating the
classifier learner as a GridSearchCV from scikit-learn.
The following code illustrates how to do that:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span>
    <span class="n">LogisticRegression</span><span class="p">(),</span>
    <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
    <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>In this example, the system outputs:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>model selection ended: best hyper-parameters={&#39;C&#39;: 10000.0, &#39;class_weight&#39;: None}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5379.55it/s]
MAE=0.41734
</pre></div>
</div>
<p>Note that the MAE is worse than the one we obtained when optimizing
for quantification and, indeed, the hyper-parameters found optimal
largely differ between the two selection modalities. The
hyper-parameters C=10000 and class_weight=None have been found
to work well for the specific training prevalence of the HP dataset,
but these hyper-parameters turned out to be suboptimal when the
class prevalences of the test set differs (as is indeed tested
in scenarios of quantification).</p>
<p>This is, however, not always the case, and one could, in practice,
find examples
in which optimizing for classification ends up resulting in a better
quantifier than when optimizing for quantification.
Nonetheless, this is theoretically unlikely to happen.</p>
</section>
</section>


            <div class="clearer"></div>
          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <div>
    <h3><a href="index.html">Table of Contents</a></h3>
    <ul>
<li><a class="reference internal" href="#">Model Selection</a><ul>
<li><a class="reference internal" href="#targeting-a-quantification-oriented-loss">Targeting a Quantification-oriented loss</a></li>
<li><a class="reference internal" href="#targeting-a-classification-oriented-loss">Targeting a Classification-oriented loss</a></li>
</ul>
</li>
</ul>

  </div>
  <div>
    <h4>Previous topic</h4>
    <p class="topless"><a href="Methods.html"
                          title="previous chapter">Quantification Methods</a></p>
  </div>
  <div>
    <h4>Next topic</h4>
    <p class="topless"><a href="Plotting.html"
                          title="next chapter">Plotting</a></p>
  </div>
  <div role="note" aria-label="source link">
    <h3>This Page</h3>
    <ul class="this-page-menu">
      <li><a href="_sources/Model-Selection.md.txt"
            rel="nofollow">Show Source</a></li>
    </ul>
   </div>
<div id="searchbox" style="display: none" role="search">
  <h3 id="searchlabel">Quick search</h3>
    <div class="searchformwrapper">
    <form class="search" action="search.html" method="get">
      <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
      <input type="submit" value="Go" />
    </form>
    </div>
</div>
<script>document.getElementById('searchbox').style.display = "block"</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             >index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="Plotting.html" title="Plotting"
             >next</a> |</li>
        <li class="right" >
          <a href="Methods.html" title="Quantification Methods"
             >previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
        <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> 
      </ul>
    </div>
    <div class="footer" role="contentinfo">
        &#169; Copyright 2021, Alejandro Moreo.
      Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 5.3.0.
    </div>
  </body>
</html>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
 								<!doctype html>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								<html lang="en">
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								  <head>
 								    <meta charset="utf-8" />
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />
 								    <title>Model Selection &#8212; QuaPy 0.1.7 documentation</title>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								    <link rel="stylesheet" type="text/css" href="_static/pygments.css" />
 								    <link rel="stylesheet" type="text/css" href="_static/bizstyle.css" />
 								    <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
 								    <script src="_static/jquery.js"></script>
 								    <script src="_static/underscore.js"></script>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								    <script src="_static/_sphinx_javascript_frameworks_compat.js"></script>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								    <script src="_static/doctools.js"></script>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								    <script src="_static/sphinx_highlight.js"></script>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								    <script src="_static/bizstyle.js"></script>
 								    <link rel="index" title="Index" href="genindex.html" />
 								    <link rel="search" title="Search" href="search.html" />
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								    <link rel="next" title="Plotting" href="Plotting.html" />
 								    <link rel="prev" title="Quantification Methods" href="Methods.html" />
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								    <meta name="viewport" content="width=device-width,initial-scale=1.0" />
 								    <!--[if lt IE 9]>
 								    <script src="_static/css3-mediaqueries.js"></script>
 								    <![endif]-->
 								  </head><body>
 								    <div class="related" role="navigation" aria-label="related navigation">
 								      <h3>Navigation</h3>
 								      <ul>
 								        <li class="right" style="margin-right: 10px">
 								          <a href="genindex.html" title="General Index"
 								             accesskey="I">index</a></li>
 								        <li class="right" >
 								          <a href="py-modindex.html" title="Python Module Index"
 								             >modules</a> |</li>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								        <li class="right" >
 								          <a href="Plotting.html" title="Plotting"
 								             accesskey="N">next</a> |</li>
 								        <li class="right" >
 								          <a href="Methods.html" title="Quantification Methods"
 								             accesskey="P">previous</a> |</li>
 								        <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								        <li class="nav-item nav-item-this"><a href="">Model Selection</a></li>
 								      </ul>
 								    </div>
 								    <div class="document">
 								      <div class="documentwrapper">
 								        <div class="bodywrapper">
 								          <div class="body" role="main">
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								  <section id="model-selection">
 								<h1>Model Selection<a class="headerlink" href="#model-selection" title="Permalink to this heading">¶</a></h1>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								<p>As a supervised machine learning task, quantification methods
 								can strongly depend on a good choice of model hyper-parameters.
 								The process whereby those hyper-parameters are chosen is
 								typically known as <em>Model Selection</em>, and typically consists of
 								testing different settings and picking the one that performed
 								best in a held-out validation set in terms of any given
 								evaluation measure.</p>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								<section id="targeting-a-quantification-oriented-loss">
 								<h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this heading">¶</a></h2>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								<p>The task being optimized determines the evaluation protocol,
 								i.e., the criteria according to which the performance of
 								any given method for solving is to be assessed.
 								As a task on its own right, quantification should impose
 								its own model selection strategies, i.e., strategies
 								aimed at finding appropriate configurations
 								specifically designed for the task of quantification.</p>
 								<p>Quantification has long been regarded as an add-on of
 								classification, and thus the model selection strategies
 								customarily adopted in classification have simply been
 								applied to quantification (see the next section).
 								It has been argued in <em>Moreo, Alejandro, and Fabrizio Sebastiani.
 								“Re-Assessing the” Classify and Count” Quantification Method.”
 								arXiv preprint arXiv:2011.02552 (2020).</em>
 								that specific model selection strategies should
 								be adopted for quantification. That is, model selection
 								strategies for quantification should target
 								quantification-oriented losses and be tested in a variety
 								of scenarios exhibiting different degrees of prior
 								probability shift.</p>
 								<p>The class
 								<em>qp.model_selection.GridSearchQ</em>
 								implements a grid-search exploration over the space of
 								hyper-parameter combinations that evaluates each<br />
 								combination of hyper-parameters
 								by means of a given quantification-oriented
 								error metric (e.g., any of the error functions implemented
 								in <em>qp.error</em>) and according to the
 								<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation"><em>artificial sampling protocol</em></a>.</p>
 								<p>The following is an example of model selection for quantification:</p>
 								<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
 								<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span>
 								<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
 								<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
 								<span class="c1"># set a seed to replicate runs</span>
 								<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
 								<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>
 								<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;hp&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
 								<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
 								<span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span>
 								<span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span>
 								<span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span>
 								<span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span>
 								<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
 								    <span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span>
 								    <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
 								    <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
 								    <span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
 								    <span class="n">error</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span>
 								    <span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>  <span class="c1"># retrain on the whole labelled set</span>
 								    <span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
 								    <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span>  <span class="c1"># show information as the process goes on</span>
 								<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
 								<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
 								<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>
 								<span class="c1"># evaluation in terms of MAE</span>
 								<span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span>
 								    <span class="n">model</span><span class="p">,</span>
 								    <span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span>
 								    <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
 								    <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span>
 								    <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
 								    <span class="n">error_metric</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span>
 								<span class="p">)</span>
 								<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
 								</pre></div>
 								</div>
 								<p>In this example, the system outputs:</p>
 								<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1
 								[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24987
 								[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: None} got mae score 0.48135
 								[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24866
 								[...]
 								[GridSearchQ]: checking hyperparams={&#39;C&#39;: 100000.0, &#39;class_weight&#39;: None} got mae score 0.43676
 								[GridSearchQ]: optimization finished: best params {&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;} (score=0.19982)
 								[GridSearchQ]: refitting on the whole development set
 								model selection ended: best hyper-parameters={&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;}
 evaluations will be performed for each combination of hyper-parameters
 								[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5005.54it/s]
 								MAE=0.20342
 								</pre></div>
 								</div>
 								<p>The parameter <em>val_split</em> can alternatively be used to indicate
 								a validation set (i.e., an instance of <em>LabelledCollection</em>) instead
 								of a proportion. This could be useful if one wants to have control
 								on the specific data split to be used across different model selection
 								experiments.</p>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								</section>
 								<section id="targeting-a-classification-oriented-loss">
 								<h2>Targeting a Classification-oriented loss<a class="headerlink" href="#targeting-a-classification-oriented-loss" title="Permalink to this heading">¶</a></h2>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								<p>Optimizing a model for quantification could rather be
 								computationally costly.
 								In aggregative methods, one could alternatively try to optimize
 								the classifier’s hyper-parameters for classification.
 								Although this is theoretically suboptimal, many articles in
 								quantification literature have opted for this strategy.</p>
 								<p>In QuaPy, this is achieved by simply instantiating the
 								classifier learner as a GridSearchCV from scikit-learn.
 								The following code illustrates how to do that:</p>
 								<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span>
 								    <span class="n">LogisticRegression</span><span class="p">(),</span>
 								    <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
 								    <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
 								<span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
 								<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
 								</pre></div>
 								</div>
 								<p>In this example, the system outputs:</p>
 								<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>model selection ended: best hyper-parameters={&#39;C&#39;: 10000.0, &#39;class_weight&#39;: None}
 evaluations will be performed for each combination of hyper-parameters
 								[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5379.55it/s]
 								MAE=0.41734
 								</pre></div>
 								</div>
 								<p>Note that the MAE is worse than the one we obtained when optimizing
 								for quantification and, indeed, the hyper-parameters found optimal
 								largely differ between the two selection modalities. The
 								hyper-parameters C=10000 and class_weight=None have been found
 								to work well for the specific training prevalence of the HP dataset,
 								but these hyper-parameters turned out to be suboptimal when the
 								class prevalences of the test set differs (as is indeed tested
 								in scenarios of quantification).</p>
 								<p>This is, however, not always the case, and one could, in practice,
 								find examples
 								in which optimizing for classification ends up resulting in a better
 								quantifier than when optimizing for quantification.
 								Nonetheless, this is theoretically unlikely to happen.</p>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								</section>
 								</section>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
 								            <div class="clearer"></div>
 								          </div>
 								        </div>
 								      </div>
 								      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
 								        <div class="sphinxsidebarwrapper">
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								  <div>
 								    <h3><a href="index.html">Table of Contents</a></h3>
 								    <ul>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								<li><a class="reference internal" href="#">Model Selection</a><ul>
 								<li><a class="reference internal" href="#targeting-a-quantification-oriented-loss">Targeting a Quantification-oriented loss</a></li>
 								<li><a class="reference internal" href="#targeting-a-classification-oriented-loss">Targeting a Classification-oriented loss</a></li>
 								</ul>
 								</li>
 								</ul>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								  </div>
 								  <div>
 								    <h4>Previous topic</h4>
 								    <p class="topless"><a href="Methods.html"
 								                          title="previous chapter">Quantification Methods</a></p>
 								  </div>
 								  <div>
 								    <h4>Next topic</h4>
 								    <p class="topless"><a href="Plotting.html"
 								                          title="next chapter">Plotting</a></p>
 								  </div>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								  <div role="note" aria-label="source link">
 								    <h3>This Page</h3>
 								    <ul class="this-page-menu">
 								      <li><a href="_sources/Model-Selection.md.txt"
 								            rel="nofollow">Show Source</a></li>
 								    </ul>
 								   </div>
 								<div id="searchbox" style="display: none" role="search">
 								  <h3 id="searchlabel">Quick search</h3>
 								    <div class="searchformwrapper">
 								    <form class="search" action="search.html" method="get">
 								      <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
 								      <input type="submit" value="Go" />
 								    </form>
 								    </div>
 								</div>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								<script>document.getElementById('searchbox').style.display = "block"</script>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								        </div>
 								      </div>
 								      <div class="clearer"></div>
 								    </div>
 								    <div class="related" role="navigation" aria-label="related navigation">
 								      <h3>Navigation</h3>
 								      <ul>
 								        <li class="right" style="margin-right: 10px">
 								          <a href="genindex.html" title="General Index"
 								             >index</a></li>
 								        <li class="right" >
 								          <a href="py-modindex.html" title="Python Module Index"
 								             >modules</a> |</li>
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								        <li class="right" >
 								          <a href="Plotting.html" title="Plotting"
 								             >next</a> |</li>
 								        <li class="right" >
 								          <a href="Methods.html" title="Quantification Methods"
 								             >previous</a> |</li>
 								        <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								        <li class="nav-item nav-item-this"><a href="">Model Selection</a></li>
 								      </ul>
 								    </div>
 								    <div class="footer" role="contentinfo">
 								        &#169; Copyright 2021, Alejandro Moreo.
-												adding documentation and adding one new example

											
										
										
											2023-02-08 19:06:53 +01:00
+								      Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 5.3.0.
-												doc with sphinx

											
										
										
											2021-11-09 15:50:53 +01:00
+								    </div>
 								  </body>
 								</html>