forked from moreo/QuaPy
preparing to merge

This commit is contained in:
parent 25a829996e
commit 49fc486c53

README.md | 31
@@ -13,6 +13,7 @@ for facilitating the analysis and interpretation of the experimental results.

### Last updates:

* Version 0.1.7 is released! Major changes can be consulted [here](quapy/FCHANGE_LOG.txt).
* A detailed documentation is now available [here](https://hlt-isti.github.io/QuaPy/)
* The developer API documentation is available [here](https://hlt-isti.github.io/QuaPy/build/html/modules.html)
@@ -59,13 +60,14 @@ See the [Wiki](https://github.com/HLT-ISTI/QuaPy/wiki) for detailed examples.

## Features

* Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization,
-quantification methods based on structured output learning, HDy, QuaNet, and quantification ensembles).
-* Versatile functionality for performing evaluation based on artificial sampling protocols.
+quantification methods based on structured output learning, HDy, QuaNet, quantification ensembles, among others).
+* Versatile functionality for performing evaluation based on sampling generation protocols (e.g., APP, NPP, etc.).
* Implementation of most commonly used evaluation metrics (e.g., AE, RAE, SE, KLD, NKLD, etc.).
* Datasets frequently used in quantification (textual and numeric), including:
    * 32 UCI Machine Learning datasets.
    * 11 Twitter quantification-by-sentiment datasets.
    * 3 product reviews quantification-by-sentiment datasets.
    * 4 tasks from the LeQua competition (_new in v0.1.7!_)
* Native support for binary and single-label multiclass quantification scenarios.
* Model selection functionality that minimizes quantification-oriented loss functions.
* Visualization tools for analysing the experimental results.
@@ -80,29 +82,6 @@ quantification methods based on structured output learning, HDy, QuaNet, and qua

* pandas, xlrd
* matplotlib

## SVM-perf with quantification-oriented losses

In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD),
SVM(AE), or SVM(RAE), you have to first download the
[svmperf](http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
package, apply the patch
[svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch), and compile the sources.
The script [prepare_svmperf.sh](prepare_svmperf.sh) does the whole job. Simply run:

```
./prepare_svmperf.sh
```

The resulting directory [svm_perf_quantification](./svm_perf_quantification) contains the
patched version of _svmperf_ with quantification-oriented losses.

The [svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch) is an extension of the patch made available by
[Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0)
that allows SVMperf to optimize for
the _Q_ measure as proposed by [Barranquero et al. 2015](https://www.sciencedirect.com/science/article/abs/pii/S003132031400291X)
and for the _KLD_ and _NKLD_ measures as proposed by [Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0).
This patch extends the above one by also allowing SVMperf to optimize for
_AE_ and _RAE_.

## Documentation
@@ -113,6 +92,8 @@ are provided:

* [Datasets](https://github.com/HLT-ISTI/QuaPy/wiki/Datasets)
* [Evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)
* [Protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)
* [Methods](https://github.com/HLT-ISTI/QuaPy/wiki/Methods)
* [SVMperf](https://github.com/HLT-ISTI/QuaPy/wiki/ExplicitLossMinimization)
* [Model Selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
* [Plotting](https://github.com/HLT-ISTI/QuaPy/wiki/Plotting)
@@ -86,7 +86,7 @@ Take a look at the following code:

```
sample = data.sampling(sample_size, *prev)

print('instances:', sample.instances)
-print('labels:', sample.labels)
+print('labels:', sample.classes)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```
@@ -99,13 +99,13 @@ third argument, e.g.:

Traditionally, this value is set to 1/(2T) in past literature,
with T the sampling size. One could either pass this value
to the function each time, or set QuaPy's environment
-variable *SAMPLE_SIZE* once, and ommit this argument
+variable *SAMPLE_SIZE* once, and omit this argument
thereafter (recommended);
e.g.:

```
qp.environ['SAMPLE_SIZE'] = 100  # once and for all
true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])
-error = qp.ae_.mrae(true_prev, estim_prev)
+error = qp.error.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
```
@@ -115,148 +115,93 @@ e.g.:

Finally, it is possible to instantiate QuaPy's quantification
error functions from strings using, e.g.:

```
-error_function = qp.ae_.from_name('mse')
+error_function = qp.error.from_name('mse')
error = error_function(true_prev, estim_prev)
```

## Evaluation Protocols
QuaPy implements the so-called "artificial sampling protocol",
according to which a test set is used to generate samplings at
desired prevalences of fixed size and covering the full spectrum
of prevalences. This protocol is called "artificial" in contrast
to the "natural prevalence sampling" protocol that,
despite introducing some variability during sampling, approximately
preserves the training class prevalence.

In the artificial sampling protocol, the user specifies the number
of (equally distant) points to be generated from the interval [0,1].

For example, if n_prevpoints=11 then, for each class, the prevalences
[0., 0.1, 0.2, …, 1.] will be used. This means that, for two classes,
the number of different prevalences will be 11 (since, once the prevalence
of one class is determined, the other one is constrained). For 3 classes,
the number of valid combinations can be obtained as 11 + 10 + … + 1 = 66.
In general, the number of valid combinations that will be produced for a given
value of n_prevpoints can be consulted by invoking
quapy.functional.num_prevalence_combinations, e.g.:

```
import quapy.functional as F
n_prevpoints = 21
n_classes = 4
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
```
An *evaluation protocol* is an evaluation procedure that uses
one specific *sample generation protocol* to generate many
samples, typically characterized by widely varying amounts of
*shift* with respect to the original distribution, that are then
used to evaluate the performance of a (trained) quantifier.
These protocols are explained in more detail in a dedicated [entry
in the wiki](Protocols.html). For the time being, let us assume we have already
chosen and instantiated one specific such protocol, that we here
simply call *prot*. Let us also assume our model is called
*quantifier* and that our evaluation measure of choice is
*mae*. The evaluation comes down to:

```
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
print(f'MAE = {mae:.4f}')
```
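To make this concrete, the following is a rough sketch of how *prot* and *quantifier* could be instantiated. The use of *qp.protocol.APP* and its constructor arguments (*sample_size*, *n_prevalences*, *repeats*) are assumptions to be checked against the Protocols entry; the rest mirrors the examples shown later in this page:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 100

# load a textual dataset and convert it into tfidf vectors (as in the full example below)
dataset = qp.datasets.fetch_reviews('kindle')
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

# train a PACC quantifier on the training collection
quantifier = qp.method.aggregative.PACC(LogisticRegression())
quantifier.fit(dataset.training)

# an APP protocol draws samples of the test collection at widely varying prevalence values
# (assumed constructor arguments; see the Protocols entry for the actual signature)
prot = qp.protocol.APP(dataset.test, sample_size=qp.environ['SAMPLE_SIZE'], n_prevalences=21, repeats=10)

mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
print(f'MAE = {mae:.4f}')
```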
In this example, n=1771. Note the last argument, n_repeats, that
informs of the number of examples that will be generated for any
valid combination (typical values are, e.g., 1 for a single sample,
or 10 or higher for computing standard deviations or performing statistical
significance tests).

One can instead work the other way around, i.e., one could set a
maximum budget of evaluations and get the number of prevalence points that
will generate a number of evaluations close to, but not higher than,
the fixed budget. This can be achieved with the function
quapy.functional.get_nprevpoints_approximation, e.g.:

```
budget = 5000
n_prevpoints = F.get_nprevpoints_approximation(budget, n_classes, n_repeats=1)
n = F.num_prevalence_combinations(n_prevpoints, n_classes, n_repeats=1)
print(f'by setting n_prevpoints={n_prevpoints} the number of evaluations for {n_classes} classes will be {n}')
```

It is often desirable to evaluate our system using more than one
single evaluation measure. In this case, it is convenient to generate
a *report*. A report in QuaPy is a dataframe accounting for all the
true prevalence values with their corresponding prevalence values
as estimated by the quantifier, along with the error each has given
rise to:

```
report = qp.evaluation.evaluation_report(quantifier, protocol=prot, error_metrics=['mae', 'mrae', 'mkld'])
```

that will print:

```
by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960
```
The cost of evaluation will depend on the values of *n_prevpoints*, *n_classes*,
and *n_repeats*. Since it might sometimes be cumbersome to control the overall
cost of an experiment having to do with the number of combinations that
will be generated for a particular setting of these arguments (particularly
when *n_classes>2*), evaluation functions
typically allow the user to rather specify an *evaluation budget*, i.e., a maximum
number of samplings to generate. By specifying this argument, one can avoid
specifying *n_prevpoints*: the value for it that leads to a number of evaluations
as close as possible to the budget, without surpassing it, will be automatically set.

The following script shows a full example in which a PACC model relying
on a Logistic Regression classifier is
tested on the *kindle* dataset by means of the artificial prevalence
sampling protocol on samples of size 500, in terms of various
evaluation metrics.

```
import quapy as qp
import quapy.functional as F
from sklearn.linear_model import LogisticRegression
```

From a pandas' dataframe, it is straightforward to visualize all the results,
and compute the averaged values, e.g.:

```
pd.set_option('display.expand_frame_repr', False)
report['estim-prev'] = report['estim-prev'].map(F.strprev)
print(report)
```

```
qp.environ['SAMPLE_SIZE'] = 500

dataset = qp.datasets.fetch_reviews('kindle')
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

training = dataset.training
test = dataset.test

lr = LogisticRegression()
pacc = qp.method.aggregative.PACC(lr)

pacc.fit(training)

df = qp.evaluation.artificial_sampling_report(
    pacc,  # the quantification method
    test,  # the test set on which the method will be evaluated
    sample_size=qp.environ['SAMPLE_SIZE'],  # indicates the size of the samples to be drawn
    n_prevpoints=11,  # how many prevalence points will be extracted from the interval [0, 1] for each category
    n_repetitions=1,  # number of times each prevalence will be used to generate a test sample
    n_jobs=-1,  # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
    random_seed=42,  # setting a random seed allows replicating the test samples across runs
    error_metrics=['mae', 'mrae', 'mkld'],  # specify the evaluation metrics
    verbose=True  # set to True to show some standard output lines
)
print('Averaged values:')
print(report.mean())
```
The resulting report is a pandas' dataframe that can be directly printed.
Here, we set some display options from pandas just to make the output clearer;
note also that the estimated prevalences are shown as strings using the
strprev function, which simply converts a prevalence into a
string representing it, with a fixed decimal precision (default 3):

```
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
pd.set_option("precision", 3)
df['estim-prev'] = df['estim-prev'].map(F.strprev)
print(df)
```

The output should look like:

```
     true-prev      estim-prev    mae    mrae       mkld
0   [0.0, 1.0]  [0.000, 1.000]  0.000   0.000  0.000e+00
1   [0.1, 0.9]  [0.091, 0.909]  0.009   0.048  4.426e-04
2   [0.2, 0.8]  [0.163, 0.837]  0.037   0.114  4.633e-03
3   [0.3, 0.7]  [0.283, 0.717]  0.017   0.041  7.383e-04
4   [0.4, 0.6]  [0.366, 0.634]  0.034   0.070  2.412e-03
5   [0.5, 0.5]  [0.459, 0.541]  0.041   0.082  3.387e-03
6   [0.6, 0.4]  [0.565, 0.435]  0.035   0.073  2.535e-03
7   [0.7, 0.3]  [0.654, 0.346]  0.046   0.108  4.701e-03
8   [0.8, 0.2]  [0.725, 0.275]  0.075   0.235  1.515e-02
9   [0.9, 0.1]  [0.858, 0.142]  0.042   0.229  7.740e-03
10  [1.0, 0.0]  [0.945, 0.055]  0.055  27.357  5.219e-02
```
One can get the averaged scores using standard pandas'
functions, i.e.:

```
print(df.mean())
```

will produce the following output:

```
true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
```

This will produce an output like:

```
           true-prev      estim-prev       mae      mrae      mkld
0     [0.308, 0.692]  [0.314, 0.686]  0.005649  0.013182  0.000074
1     [0.896, 0.104]  [0.909, 0.091]  0.013145  0.069323  0.000985
2     [0.848, 0.152]  [0.809, 0.191]  0.039063  0.149806  0.005175
3     [0.016, 0.984]  [0.033, 0.967]  0.017236  0.487529  0.005298
4     [0.728, 0.272]  [0.751, 0.249]  0.022769  0.057146  0.001350
...              ...             ...       ...       ...       ...
4995    [0.72, 0.28]  [0.698, 0.302]  0.021752  0.053631  0.001133
4996  [0.868, 0.132]  [0.888, 0.112]  0.020490  0.088230  0.001985
4997  [0.292, 0.708]  [0.298, 0.702]  0.006149  0.014788  0.000090
4998    [0.24, 0.76]  [0.220, 0.780]  0.019950  0.054309  0.001127
4999  [0.948, 0.052]  [0.965, 0.035]  0.016941  0.165776  0.003538

[5000 rows x 5 columns]
Averaged values:
mae     0.023588
mrae    0.108779
mkld    0.003631
dtype: float64

Process finished with exit code 0
```
Other evaluation functions include:

* *artificial_sampling_eval*: computes the evaluation for a
given evaluation metric, returning the average instead of a dataframe.
* *artificial_sampling_prediction*: returns two np.arrays containing the
true prevalences and the estimated prevalences.

See the documentation for further details.

Alternatively, we can simply generate all the predictions by:

```
true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
```
All the evaluation functions implement specific optimizations for speeding up
the evaluation of aggregative quantifiers (i.e., of instances of *AggregativeQuantifier*).
The optimization comes down to generating classification predictions (either crisp or soft)
only once for the entire test set, and then applying the sampling procedure to the
predictions, instead of generating samples of instances and then computing the
classification predictions every time. This is only possible when the protocol
is an instance of *OnLabelledCollectionProtocol*. The optimization is only
carried out when the number of classification predictions thus generated would be
smaller than the number of predictions required for the entire protocol; e.g.,
if the original dataset contains 1M instances, but the protocol would generate
at most 20 samples of 100 instances, then it would be preferable to postpone the
classification for each sample. This behaviour is indicated by setting
*aggr_speedup="auto"*. Conversely, when indicating *aggr_speedup="force"* QuaPy will
precompute all the predictions irrespective of the number of instances and number of samples.
Finally, this can be deactivated by setting *aggr_speedup=False*. Note that this optimization
is not only applied for the final evaluation, but also for the internal evaluations carried
out during *model selection*. Since these are typically many, the heuristic can help reduce
execution time a lot.
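A minimal sketch of how this behaviour can be controlled, assuming the same *quantifier* and *prot* used in the previous snippets (here *aggr_speedup* is simply passed as a keyword argument of the evaluation function):

```python
# let QuaPy decide whether precomputing all classifier predictions pays off (default heuristic)
mae_auto = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='auto')

# always precompute the classification predictions for the whole collection
mae_force = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='force')

# deactivate the optimization altogether
mae_plain = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup=False)
```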
@@ -68,12 +68,6 @@ and implement some abstract methods:

```
    @abstractmethod
    def quantify(self, instances): ...

    @abstractmethod
    def set_params(self, **parameters): ...

    @abstractmethod
    def get_params(self, deep=True): ...
```

The meaning of those functions should be familiar to those
@@ -85,10 +79,10 @@ scikit-learn's structure has not been adopted as is in QuaPy responds

the fact that scikit-learn's *predict* function is expected to return
one output for each input element –e.g., a predicted label for each
instance in a sample– while in quantification the output for a sample
-is one single array of class prevalences), while functions *set_params*
-and *get_params* allow a
-[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
-to automate the process of hyperparameter search.
+is one single array of class prevalences).
+Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
+to simplify the use of *set_params* and *get_params* used by the
+[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection).
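As an illustration of this interface, a deliberately naive custom quantifier might be sketched as follows. This is not an actual QuaPy method, only an example of the structure described above, and it assumes v0.1.7, where *set_params* and *get_params* are inherited from *BaseEstimator*:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from quapy.method.base import BaseQuantifier


class NaiveCC(BaseQuantifier):
    """Toy classify-and-count quantifier, written only to illustrate the interface."""

    def __init__(self, classifier=None):
        self.classifier = classifier if classifier is not None else LogisticRegression()

    def fit(self, data):
        # data is a LabelledCollection; train the underlying classifier on it
        self.classifier.fit(*data.Xy)
        self.classes_ = data.classes_
        return self

    def quantify(self, instances):
        # return one single array of class prevalences for the whole sample
        predictions = self.classifier.predict(instances)
        counts = np.array([(predictions == c).sum() for c in self.classes_], dtype=float)
        return counts / counts.sum()
```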
<section id="aggregative-methods">
|
||||
<h2>Aggregative Methods<a class="headerlink" href="#aggregative-methods" title="Permalink to this heading">¶</a></h2>
|
||||
<p>All quantification methods are implemented as part of the
|
||||
|
@ -106,12 +100,12 @@ The methods that any <em>aggregative</em> quantifier must implement are:</p>
|
|||
individual predictions of a classifier. Indeed, a default implementation
|
||||
of <em>BaseQuantifier.quantify</em> is already provided, which looks like:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span> <span class="k">def</span> <span class="nf">quantify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">instances</span><span class="p">):</span>
|
||||
<span class="n">classif_predictions</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">preclassify</span><span class="p">(</span><span class="n">instances</span><span class="p">)</span>
|
||||
<span class="n">classif_predictions</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">classify</span><span class="p">(</span><span class="n">instances</span><span class="p">)</span>
|
||||
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">classif_predictions</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Aggregative quantifiers are expected to maintain a classifier (which is
|
||||
accessed through the <em>@property</em> <em>learner</em>). This classifier is
|
||||
accessed through the <em>@property</em> <em>classifier</em>). This classifier is
|
||||
given as input to the quantifier, and can be already fit
|
||||
on external data (in which case, the <em>fit_learner</em> argument should
|
||||
be set to False), or be fit by the quantifier’s fit (default).</p>
|
||||
|
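A small sketch of both options follows; whether the keyword is still called *fit_learner* in the installed version is an assumption worth checking against the API documentation, and synthetic data is used only to keep the example self-contained:

```python
import numpy as np
import quapy as qp
from sklearn.linear_model import LogisticRegression

# synthetic binary training collection, only for illustration
training = qp.data.LabelledCollection(np.random.rand(1000, 50), np.random.randint(0, 2, size=1000))

# (1) default: the quantifier fits the classifier on the training collection
pacc = qp.method.aggregative.PACC(LogisticRegression())
pacc.fit(training)

# (2) reuse a classifier that was already fit on external data
pretrained = LogisticRegression().fit(np.random.rand(500, 50), np.random.randint(0, 2, size=500))
pacc = qp.method.aggregative.PACC(pretrained)
pacc.fit(training, fit_learner=False)  # assumed keyword: do not refit the classifier

print(pacc.classifier)  # the underlying classifier, exposed as a property
```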
@@ -121,12 +115,8 @@ aggregative methods, that should inherit from the abstract class

The particularity of *probabilistic* aggregative methods (w.r.t.
non-probabilistic ones) is that the default quantifier is defined
in terms of the posterior probabilities returned by a probabilistic
-classifier, and not by the crisp decisions of a hard classifier; i.e.:
-
-    def quantify(self, instances):
-        classif_posteriors = self.posterior_probabilities(instances)
-        return self.aggregate(classif_posteriors)
-
+classifier, and not by the crisp decisions of a hard classifier.
+In any case, the interface *classify(instances)* remains unchanged.

One advantage of *aggregative* methods (either probabilistic or not)
is that the evaluation according to any sampling procedure (e.g.,
the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation))
@@ -153,9 +143,7 @@ with a SVM as the classifier:

```
import quapy.functional as F
from sklearn.svm import LinearSVC

-dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
-training = dataset.training
-test = dataset.test
+training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test

# instantiate a classifier learner, in this case a SVM
svm = LinearSVC()
```
@@ -199,7 +187,7 @@ e.g.:

```
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
-print('classifier:', model.learner)
+print('classifier:', model.classifier)
```

In this case, QuaPy will print:
@@ -244,13 +232,21 @@ experiments we have carried out.

```
estim_prevalence = model.quantify(dataset.test.instances)
```

*New in v0.1.7*: EMQ now accepts two new parameters in the construction method, namely
*exact_train_prev*, which allows using the true training prevalence as the departing
prevalence estimation (default behaviour), or instead an approximation of it as
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
(by setting *exact_train_prev=False*).
The other parameter is *recalib*, which allows indicating a calibration method, among those
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
including Bias-Corrected Temperature Scaling, Vector Scaling, etc.
See the API documentation for further details.
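An indicative sketch of these two parameters, reusing the *dataset* of the example above (the string accepted by *recalib* for Bias-Corrected Temperature Scaling is assumed to be 'bcts'; check the API documentation for the admitted values):

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# start from an approximated training prevalence (Alexandari et al., 2020) and
# recalibrate the posterior probabilities with Bias-Corrected Temperature Scaling
model = qp.method.aggregative.EMQ(
    LogisticRegression(),
    exact_train_prev=False,
    recalib='bcts'  # assumed identifier for Bias-Corrected Temperature Scaling
)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```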
### Hellinger Distance y (HDy)

The method HDy is described in:

-*Implementation of the method based on the Hellinger Distance y (HDy) proposed by
-González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution
-estimation based on the Hellinger distance. Information Sciences, 218:146–164.*
+Implementation of the method based on the Hellinger Distance y (HDy) proposed by
+[González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution
+estimation based on the Hellinger distance. Information Sciences, 218:146–164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)

It is implemented in *qp.method.aggregative.HDy* (also accessible
through the alias *qp.method.aggregative.HellingerDistanceY*).
This method works with a probabilistic classifier (hard classifiers
@@ -277,30 +273,48 @@ provided in QuaPy accepts only binary datasets.

```
estim_prevalence = model.quantify(dataset.test.instances)
```

*New in v0.1.7:* QuaPy now provides an implementation of the generalized
"Distribution Matching" approaches for multiclass, inspired by the framework
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
a variant of HDy for multiclass quantification as follows:

```
multiclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
```

*New in v0.1.7:* QuaPy now provides an implementation of the "DyS"
framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
(thanks to *Pablo González* for the contributions!).
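A rough sketch of how these two methods might be instantiated; the class names *DyS* and *SMM* under *qp.method.aggregative*, and their default parameters, are assumptions to be checked against the API documentation:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dys = qp.method.aggregative.DyS(LogisticRegression())   # assumed class name
smm = qp.method.aggregative.SMM(LogisticRegression())   # assumed class name

dys.fit(dataset.training)
estim_prevalence = dys.quantify(dataset.test.instances)
```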
### Threshold Optimization methods

*New in v0.1.7:* QuaPy now implements Forman's threshold optimization methods;
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
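An indicative sketch, assuming these are exposed as classes with the same names under *qp.method.aggregative* (to be checked against the API documentation):

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# Median Sweep, one of Forman's threshold optimization variants (assumed class name)
ms = qp.method.aggregative.MS(LogisticRegression())
ms.fit(dataset.training)
estim_prevalence = ms.quantify(dataset.test.instances)
```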
</section>
|
||||
<section id="explicit-loss-minimization">
|
||||
<h3>Explicit Loss Minimization<a class="headerlink" href="#explicit-loss-minimization" title="Permalink to this heading">¶</a></h3>
|
||||
<p>The Explicit Loss Minimization (ELM) represents a family of methods
|
||||
based on structured output learning, i.e., quantifiers relying on
|
||||
classifiers that have been optimized targeting a
|
||||
quantification-oriented evaluation measure.
The original methods are implemented in QuaPy as classify & count (CC)
quantifiers that use Joachim’s <a class="reference external" href="https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html">SVMperf</a>
as the underlying classifier, properly set to optimize for the desired loss.</p>
|
||||
<p>In QuaPy, this can be achieved by calling the following functions:</p>
|
||||
<ul class="simple">
|
||||
<li><p><em>newSVMQ</em>: returns the quantification method called SVM(Q), which optimizes for the metric <em>Q</em> defined
in <a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S003132031400291X"><em>Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591–604.</em></a></p></li>
<li><p><em>newSVMKLD</em> and <em>newSVMNKLD</em>: return the quantification methods called SVM(KLD) and SVM(nKLD), standing for
Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in <a class="reference external" href="https://dl.acm.org/doi/abs/10.1145/2700406"><em>Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27.</em></a></p></li>
<li><p><em>newSVMAE</em> and <em>newSVMRAE</em>: return the quantification methods called SVM(AE) and SVM(RAE), which optimize for the (Mean) Absolute Error and the
(Mean) Relative Absolute Error, respectively, as first used by
<a class="reference external" href="https://arxiv.org/abs/2011.02552"><em>Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23.</em></a></p></li>
|
||||
</ul>
|
||||
<p>The last two methods (SVM(AE) and SVM(RAE)) have been implemented in
|
||||
QuaPy in order to make available ELM variants for what nowadays
|
||||
are considered the most well-behaved evaluation metrics in quantification.</p>
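<p>As a sketch (assuming these factory functions return ready-to-use quantifiers with the same <em>fit</em>/<em>quantify</em> API used throughout this page, and that they locate the patched SVMperf through the <em>SVMPERF_HOME</em> environment entry, as in the one-vs-all example below), SVM(Q) could be used as follows:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import quapy as qp
from quapy.method.aggregative import newSVMQ

# tell QuaPy where the patched SVMperf binaries live (see the note below on compiling them)
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

# hypothetical usage sketch on a binary reviews dataset
dataset = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)

model = newSVMQ(C=1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
</pre></div></div>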
|
||||
<p>In order to make these models work, you would need to run the script
|
||||
|
@ -330,11 +344,15 @@ currently supports only binary classification.
|
|||
ELM variants (any binary quantifier in general) can be extended
|
||||
to operate in single-label scenarios trivially by adopting a
|
||||
“one-vs-all” strategy (as, e.g., in
|
||||
<a class="reference external" href="https://link.springer.com/article/10.1007/s13278-016-0327-z"><em>Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1–22</em></a>).
In QuaPy this is possible by using the <em>OneVsAll</em> class.</p>
|
||||
<p>There are two ways for instantiating this class, <em>OneVsAllGeneric</em> that works for
|
||||
any quantifier, and <em>OneVsAllAggregative</em> that is optimized for aggregative quantifiers.
|
||||
In general, you can simply use the <em>getOneVsAll</em> function and QuaPy will choose
|
||||
the more convenient of the two.</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">SVMQ</span><span class="p">,</span> <span class="n">OneVsAll</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">SVMQ</span>
|
||||
|
||||
<span class="c1"># load a single-label dataset (this one contains 3 classes)</span>
|
||||
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_twitter</span><span class="p">(</span><span class="s1">'hcr'</span><span class="p">,</span> <span class="n">pickle</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
|
||||
|
@ -342,11 +360,13 @@ In QuaPy this is possible by using the <em>OneVsAll</em> class:</p>
|
|||
<span class="c1"># let qp know where svmperf is</span>
|
||||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SVMPERF_HOME'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'../svm_perf_quantification'</span>
|
||||
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">OneVsAll</span><span class="p">(</span><span class="n">SVMQ</span><span class="p">(),</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># run them on parallel</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">getOneVsAll</span><span class="p">(</span><span class="n">SVMQ</span><span class="p">(),</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># run them on parallel</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="n">estim_prevalence</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">quantify</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Check the examples <em><span class="xref myst">explicit_loss_minimization.py</span></em>
|
||||
and <span class="xref myst">one_vs_all.py</span> for more details.</p>
|
||||
</section>
|
||||
</section>
|
||||
<section id="meta-models">
|
||||
|
@ -360,12 +380,12 @@ groups).
|
|||
<h3>Ensembles<a class="headerlink" href="#ensembles" title="Permalink to this heading">¶</a></h3>
|
||||
<p>QuaPy implements (some of) the variants proposed in:</p>
|
||||
<ul class="simple">
|
||||
<li><p><a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1566253516300628"><em>Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100.</em></a></p></li>
<li><p><a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1566253517303652"><em>Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15.</em></a></p></li>
|
||||
</ul>
|
||||
<p>The following code shows how to instantiate an Ensemble of 30 <em>Adjusted Classify & Count</em> (ACC)
|
||||
quantifiers operating with a <em>Logistic Regressor</em> (LR) as the base classifier, and using the
|
||||
|
@ -398,10 +418,10 @@ wiki if you want to optimize the hyperparameters of ensemble for classification
|
|||
<section id="the-quanet-neural-network">
|
||||
<h3>The QuaNet neural network<a class="headerlink" href="#the-quanet-neural-network" title="Permalink to this heading">¶</a></h3>
|
||||
<p>QuaPy offers an implementation of QuaNet, a deep learning model presented in:</p>
|
||||
<p><a class="reference external" href="https://dl.acm.org/doi/abs/10.1145/3269206.3269287"><em>Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778).</em></a></p>
|
||||
<p>This model requires <em>torch</em> to be installed.
|
||||
QuaNet also requires a classifier that can provide embedded representations
|
||||
of the inputs.
|
||||
|
@ -423,7 +443,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
|
|||
<span class="n">learner</span> <span class="o">=</span> <span class="n">NeuralClassifierTrainer</span><span class="p">(</span><span class="n">cnn</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s1">'cuda'</span><span class="p">)</span>
|
||||
|
||||
<span class="c1"># train QuaNet</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">QuaNet</span><span class="p">(</span><span class="n">learner</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span> <span class="n">device</span><span class="o">=</span><span class="s1">'cuda'</span><span class="p">)</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">QuaNet</span><span class="p">(</span><span class="n">learner</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s1">'cuda'</span><span class="p">)</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="n">estim_prevalence</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">quantify</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
|
@ -447,6 +467,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
|
|||
<li><a class="reference internal" href="#the-classify-count-variants">The Classify & Count variants</a></li>
|
||||
<li><a class="reference internal" href="#expectation-maximization-emq">Expectation Maximization (EMQ)</a></li>
|
||||
<li><a class="reference internal" href="#hellinger-distance-y-hdy">Hellinger Distance y (HDy)</a></li>
|
||||
<li><a class="reference internal" href="#threshold-optimization-methods">Threshold Optimization methods</a></li>
|
||||
<li><a class="reference internal" href="#explicit-loss-minimization">Explicit Loss Minimization</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
|
@ -462,8 +483,8 @@ In the following example, we show an instantiation of QuaNet that instead uses C
|
|||
</div>
|
||||
<div>
|
||||
<h4>Previous topic</h4>
|
||||
<p class="topless"><a href="Evaluation.html"
|
||||
title="previous chapter">Evaluation</a></p>
|
||||
<p class="topless"><a href="Protocols.html"
|
||||
title="previous chapter">Protocols</a></p>
|
||||
</div>
|
||||
<div>
|
||||
<h4>Next topic</h4>
|
||||
|
@ -504,7 +525,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
|
|||
<a href="Model-Selection.html" title="Model Selection"
|
||||
>next</a> |</li>
|
||||
<li class="right" >
|
||||
<a href="Evaluation.html" title="Evaluation"
|
||||
<a href="Protocols.html" title="Protocols"
|
||||
>previous</a> |</li>
|
||||
<li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> »</li>
|
||||
<li class="nav-item nav-item-this"><a href="">Quantification Methods</a></li>
|
||||
|
|
|
@ -74,81 +74,91 @@ specifically designed for the task of quantification.</p>
|
|||
classification, and thus the model selection strategies
|
||||
customarily adopted in classification have simply been
|
||||
applied to quantification (see the next section).
|
||||
It has been argued in <a class="reference external" href="https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6">Moreo, Alejandro, and Fabrizio Sebastiani.
Re-Assessing the “Classify and Count” Quantification Method.
ECIR 2021: Advances in Information Retrieval pp 75–91.</a>
|
||||
that specific model selection strategies should
|
||||
be adopted for quantification. That is, model selection
|
||||
strategies for quantification should target
|
||||
quantification-oriented losses and be tested in a variety
|
||||
of scenarios exhibiting different degrees of prior
|
||||
probability shift.</p>
|
||||
<p>The class <em>qp.model_selection.GridSearchQ</em> implements a grid-search exploration over the space of
hyper-parameter combinations that <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluates</a>
each combination of hyper-parameters by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in <em>qp.error</em>) and according to a
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">sampling generation protocol</a>.</p>
<p>The following is an example (also included in the examples folder) of model selection for quantification:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.protocol</span> <span class="kn">import</span> <span class="n">APP</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">DistributionMatching</span>
|
||||
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
|
||||
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
|
||||
|
||||
<span class="c1"># set a seed to replicate runs</span>
|
||||
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
|
||||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>
|
||||
<span class="sd">"""</span>
|
||||
<span class="sd">In this example, we show how to perform model selection on a DistributionMatching quantifier.</span>
|
||||
<span class="sd">"""</span>
|
||||
|
||||
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'hp'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">())</span>
|
||||
|
||||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span>
|
||||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'N_JOBS'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># explore hyper-parameters in parallel</span>
|
||||
|
||||
<span class="n">training</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
|
||||
|
||||
<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
|
||||
<span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span>
|
||||
<span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span>
|
||||
<span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span>
|
||||
<span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span>
|
||||
<span class="c1"># Every combination of hyper-parameters will be evaluated by confronting the</span>
|
||||
<span class="c1"># quantifier thus configured against a series of samples generated by means</span>
|
||||
<span class="c1"># of a sample generation protocol. For this example, we will use the</span>
|
||||
<span class="c1"># artificial-prevalence protocol (APP), that generates samples with prevalence</span>
|
||||
<span class="c1"># values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).</span>
|
||||
<span class="c1"># We devote 30% of the dataset for this exploration.</span>
|
||||
<span class="n">training</span><span class="p">,</span> <span class="n">validation</span> <span class="o">=</span> <span class="n">training</span><span class="o">.</span><span class="n">split_stratified</span><span class="p">(</span><span class="n">train_prop</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
|
||||
<span class="n">protocol</span> <span class="o">=</span> <span class="n">APP</span><span class="p">(</span><span class="n">validation</span><span class="p">)</span>
|
||||
|
||||
<span class="c1"># We will explore a classification-dependent hyper-parameter (e.g., the 'C'</span>
|
||||
<span class="c1"># hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter</span>
|
||||
<span class="c1"># (e.g., the number of bins in a DistributionMatching quantifier.</span>
|
||||
<span class="c1"># Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"</span>
|
||||
<span class="c1"># in order to let the quantifier know this hyper-parameter belongs to its underlying</span>
|
||||
<span class="c1"># classifier.</span>
|
||||
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
|
||||
<span class="s1">'classifier__C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">),</span>
|
||||
<span class="s1">'nbins'</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
|
||||
<span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span>
|
||||
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
|
||||
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span>
|
||||
<span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
|
||||
<span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span>
|
||||
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set</span>
|
||||
<span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
|
||||
<span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
|
||||
<span class="n">param_grid</span><span class="o">=</span><span class="n">param_grid</span><span class="p">,</span>
|
||||
<span class="n">protocol</span><span class="o">=</span><span class="n">protocol</span><span class="p">,</span>
|
||||
<span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span> <span class="c1"># the error to optimize is the MAE (a quantification-oriented loss)</span>
|
||||
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set once done</span>
|
||||
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span>
|
||||
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
|
||||
|
||||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>
|
||||
|
||||
<span class="c1"># evaluation in terms of MAE</span>
|
||||
<span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span>
|
||||
<span class="n">model</span><span class="p">,</span>
|
||||
<span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span>
|
||||
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span>
|
||||
<span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span>
|
||||
<span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
|
||||
<span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span>
|
||||
<span class="p">)</span>
|
||||
<span class="c1"># we use the same evaluation protocol (APP) on the test set</span>
|
||||
<span class="n">mae_score</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">),</span> <span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">)</span>
|
||||
|
||||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">mae_score</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>In this example, the system outputs:</p>
|
||||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1
|
||||
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987
|
||||
[GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135
|
||||
[GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866
|
||||
[...]
|
||||
[GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676
|
||||
[GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982)
|
||||
[GridSearchQ]: refitting on the whole development set
|
||||
model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'}
|
||||
1010 evaluations will be performed for each combination of hyper-parameters
|
||||
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5005.54it/s]
|
||||
MAE=0.20342
|
||||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">starting</span> <span class="n">model</span> <span class="n">selection</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_jobs</span> <span class="o">=-</span><span class="mi">1</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">64</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04021</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.1356</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04286</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2139</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">16</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04888</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2491</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.001</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">8</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.05163</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.5372</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="o">...</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">1000.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.02445</span> <span class="p">[</span><span class="n">took</span> <span class="mf">2.9056</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">optimization</span> <span class="n">finished</span><span class="p">:</span> <span class="n">best</span> <span class="n">params</span> <span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="mf">0.02234</span><span class="p">)</span> <span class="p">[</span><span class="n">took</span> <span class="mf">7.3114</span><span class="n">s</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">refitting</span> <span class="n">on</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">development</span> <span class="nb">set</span>
|
||||
<span class="n">model</span> <span class="n">selection</span> <span class="n">ended</span><span class="p">:</span> <span class="n">best</span> <span class="n">hyper</span><span class="o">-</span><span class="n">parameters</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span>
|
||||
<span class="n">MAE</span><span class="o">=</span><span class="mf">0.03102</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The parameter <em>val_split</em> can alternatively be used to indicate
|
||||
|
@ -172,30 +182,13 @@ The following code illustrates how to do that:</p>
|
|||
<span class="n">LogisticRegression</span><span class="p">(),</span>
|
||||
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
|
||||
<span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>In this example, the system outputs:</p>
|
||||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
|
||||
1010 evaluations will be performed for each combination of hyper-parameters
|
||||
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s]
|
||||
MAE=0.41734
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Note that the MAE is worse than the one we obtained when optimizing
|
||||
for quantification and, indeed, the hyper-parameters found optimal
|
||||
largely differ between the two selection modalities. The
|
||||
hyper-parameters C=10000 and class_weight=None have been found
|
||||
to work well for the specific training prevalence of the HP dataset,
|
||||
but these hyper-parameters turned out to be suboptimal when the
|
||||
class prevalences of the test set differ (as is indeed tested
|
||||
in scenarios of quantification).</p>
|
||||
<p>This is, however, not always the case, and one could, in practice,
|
||||
find examples
|
||||
in which optimizing for classification ends up resulting in a better
|
||||
quantifier than when optimizing for quantification.
|
||||
Nonetheless, this is theoretically unlikely to happen.</p>
|
||||
<p>However, this is conceptually flawed, since the model should be
|
||||
optimized for the task at hand (quantification), and not for a surrogate task (classification),
|
||||
i.e., the model should be requested to deliver low quantification errors, rather
|
||||
than low classification errors.</p>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
|
|
|
@ -94,7 +94,7 @@ quantification methods across different scenarios showcasing
|
|||
the accuracy of the quantifier in predicting class prevalences
|
||||
for a wide range of prior distributions. This can easily be
|
||||
achieved by means of the
|
||||
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">artificial sampling protocol</a>
|
||||
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">artificial sampling protocol</a>
|
||||
that is implemented in QuaPy.</p>
|
||||
<p>The following code shows how to perform one simple experiment
|
||||
in which the 4 <em>CC-variants</em>, all equipped with a linear SVM, are
|
||||
|
@ -103,6 +103,7 @@ tested across the entire spectrum of class priors (taking 21 splits
|
|||
of the interval [0,1], i.e., using prevalence steps of 0.05, and
|
||||
generating 100 random samples at each prevalence).</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||||
<span class="kn">from</span> <span class="nn">protocol</span> <span class="kn">import</span> <span class="n">APP</span>
|
||||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">CC</span><span class="p">,</span> <span class="n">ACC</span><span class="p">,</span> <span class="n">PCC</span><span class="p">,</span> <span class="n">PACC</span>
|
||||
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
|
||||
|
||||
|
@ -111,28 +112,26 @@ generating 100 random samples at each prevalence).</p>
|
|||
<span class="k">def</span> <span class="nf">gen_data</span><span class="p">():</span>
|
||||
|
||||
<span class="k">def</span> <span class="nf">base_classifier</span><span class="p">():</span>
|
||||
<span class="k">return</span> <span class="n">LinearSVC</span><span class="p">()</span>
|
||||
<span class="k">return</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">class_weight</span><span class="o">=</span><span class="s1">'balanced'</span><span class="p">)</span>
|
||||
|
||||
<span class="k">def</span> <span class="nf">models</span><span class="p">():</span>
|
||||
<span class="k">yield</span> <span class="n">CC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="n">ACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="n">PCC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="n">PACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="s1">'CC'</span><span class="p">,</span> <span class="n">CC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="s1">'ACC'</span><span class="p">,</span> <span class="n">ACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="s1">'PCC'</span><span class="p">,</span> <span class="n">PCC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
<span class="k">yield</span> <span class="s1">'PACC'</span><span class="p">,</span> <span class="n">PACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
|
||||
|
||||
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'kindle'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||||
<span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'kindle'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
|
||||
|
||||
<span class="n">method_names</span><span class="p">,</span> <span class="n">true_prevs</span><span class="p">,</span> <span class="n">estim_prevs</span><span class="p">,</span> <span class="n">tr_prevs</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
|
||||
|
||||
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">():</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_prediction</span><span class="p">(</span>
|
||||
<span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">test</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span> <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">21</span>
|
||||
<span class="p">)</span>
|
||||
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">():</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
|
||||
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">prediction</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="n">repeats</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
|
||||
|
||||
<span class="n">method_names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="vm">__class__</span><span class="o">.</span><span class="vm">__name__</span><span class="p">)</span>
|
||||
<span class="n">method_names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">method_name</span><span class="p">)</span>
|
||||
<span class="n">true_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">true_prev</span><span class="p">)</span>
|
||||
<span class="n">estim_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">estim_prev</span><span class="p">)</span>
|
||||
<span class="n">tr_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="o">.</span><span class="n">prevalence</span><span class="p">())</span>
|
||||
<span class="n">tr_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">prevalence</span><span class="p">())</span>
|
||||
|
||||
<span class="k">return</span> <span class="n">method_names</span><span class="p">,</span> <span class="n">true_prevs</span><span class="p">,</span> <span class="n">estim_prevs</span><span class="p">,</span> <span class="n">tr_prevs</span>
|
||||
|
||||
|
@ -199,21 +198,19 @@ IMDb dataset, and generate the bias plot again.
|
|||
This example can be run by rewritting the <em>gen_data()</em> function
|
||||
like this:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">gen_data</span><span class="p">():</span>
|
||||
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||||
|
||||
<span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
|
||||
<span class="n">model</span> <span class="o">=</span> <span class="n">CC</span><span class="p">(</span><span class="n">LinearSVC</span><span class="p">())</span>
|
||||
|
||||
<span class="n">method_data</span> <span class="o">=</span> <span class="p">[]</span>
|
||||
<span class="k">for</span> <span class="n">training_prevalence</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">,</span> <span class="mi">9</span><span class="p">):</span>
|
||||
<span class="n">training_size</span> <span class="o">=</span> <span class="mi">5000</span>
|
||||
<span class="c1"># since the problem is binary, it suffices to specify the negative prevalence (the positive is constrained)</span>
|
||||
<span class="n">training</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">training_size</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">training_prevalence</span><span class="p">)</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
|
||||
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_prediction</span><span class="p">(</span>
|
||||
<span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">],</span> <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">21</span>
|
||||
<span class="p">)</span>
|
||||
<span class="c1"># method names can contain Latex syntax</span>
|
||||
<span class="n">method_name</span> <span class="o">=</span> <span class="s1">'CC$_{'</span> <span class="o">+</span> <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="mi">100</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">training_prevalence</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'\%}$'</span>
|
||||
<span class="n">method_data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">method_name</span><span class="p">,</span> <span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">,</span> <span class="n">training</span><span class="o">.</span><span class="n">prevalence</span><span class="p">()))</span>
|
||||
<span class="c1"># since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained</span>
|
||||
<span class="n">train_sample</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">training_size</span><span class="p">,</span> <span class="mi">1</span><span class="o">-</span><span class="n">training_prevalence</span><span class="p">)</span>
|
||||
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_sample</span><span class="p">)</span>
|
||||
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">prediction</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="n">repeats</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
|
||||
<span class="n">method_name</span> <span class="o">=</span> <span class="s1">'CC$_{'</span><span class="o">+</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="mi">100</span><span class="o">*</span><span class="n">training_prevalence</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'\%}$'</span>
|
||||
<span class="n">method_data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">method_name</span><span class="p">,</span> <span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">,</span> <span class="n">train_sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">()))</span>
|
||||
|
||||
<span class="k">return</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">method_data</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
|
|
|
@ -31,13 +31,14 @@ Output the class prevalences (showing 2 digit precision):
|
|||
```
|
||||
|
||||
One can easily produce new samples at desired class prevalences:
|
||||
|
||||
```python
|
||||
sample_size = 10
|
||||
prev = [0.4, 0.1, 0.5]
|
||||
sample = data.sampling(sample_size, *prev)
|
||||
|
||||
print('instances:', sample.instances)
|
||||
print('labels:', sample.classes)
|
||||
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
|
||||
```
|
||||
|
||||
|
|
|
@ -50,7 +50,7 @@ indicating the value for the smoothing parameter epsilon.
|
|||
Traditionally, this value is set to 1/(2T) in past literature,
|
||||
with T the sampling size. One could either pass this value
|
||||
to the function each time, or set QuaPy's environment
variable _SAMPLE_SIZE_ once, and omit this argument
|
||||
thereafter (recommended);
|
||||
e.g.:
|
||||
|
||||
|
@ -58,7 +58,7 @@ e.g.:
|
|||
qp.environ['SAMPLE_SIZE'] = 100 # once for all
|
||||
true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes
|
||||
estim_prev = np.asarray([0.1, 0.3, 0.6])
|
||||
error = qp.error.mrae(true_prev, estim_prev)
|
||||
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
|
||||
```
|
||||
|
||||
|
@ -71,162 +71,99 @@ Finally, it is possible to instantiate QuaPy's quantification
|
|||
error functions from strings using, e.g.:
|
||||
|
||||
```python
|
||||
error_function = qp.error.from_name('mse')
|
||||
error = error_function(true_prev, estim_prev)
|
||||
```
|
||||
|
||||
## Evaluation Protocols
|
||||
|
||||
An _evaluation protocol_ is an evaluation procedure that uses
one specific _sample generation protocol_ to generate many
samples, typically characterized by widely varying amounts of
_shift_ with respect to the original distribution, that are then
used to evaluate the performance of a (trained) quantifier.
These protocols are explained in more detail in a dedicated [entry
in the wiki](Protocols.md). For the time being, let us assume we have already
chosen and instantiated one specific such protocol, which we here
simply call _prot_. Let us also assume our model is called
_quantifier_ and that our evaluation measure of choice is
_mae_. The evaluation comes down to:

```python
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
print(f'MAE = {mae:.4f}')
```
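As a concrete sketch of what _prot_ could be (using the artificial-prevalence protocol seen elsewhere in this documentation; _test_ is assumed to be a held-out _LabelledCollection_):

```python
from quapy.protocol import APP

# a protocol that generates test samples (of size qp.environ['SAMPLE_SIZE'] unless
# otherwise specified) at prevalence values covering the whole spectrum,
# with 100 samples per prevalence combination
prot = APP(test, repeats=100, random_state=0)
```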
|
||||
|
||||
It is often desirable to evaluate our system using more than one
evaluation measure. In this case, it is convenient to generate
a _report_. A report in QuaPy is a dataframe accounting for all the
true prevalence values with their corresponding prevalence values
as estimated by the quantifier, along with the error each prediction
gives rise to.

```python
report = qp.evaluation.evaluation_report(quantifier, protocol=prot, error_metrics=['mae', 'mrae', 'mkld'])
```
|
||||
|
||||
The cost of evaluation will depend on the values of _n_prevpoints_, _n_classes_,
|
||||
and _n_repeats_. Since it might sometimes be cumbersome to control the overall
|
||||
cost of an experiment having to do with the number of combinations that
|
||||
will be generated for a particular setting of these arguments (particularly
|
||||
when _n_classes>2_), evaluation functions
|
||||
typically allow the user to rather specify an _evaluation budget_, i.e., a maximum
|
||||
number of samplings to generate. By specifying this argument, one could avoid
|
||||
specifying _n_prevpoints_; the value that leads to a number of evaluations
as close as possible to the budget, without surpassing it, will be set automatically.
|
||||
|
||||
The following script shows a full example in which a PACC model relying
|
||||
on a Logistic Regressor classifier is
|
||||
tested on the _kindle_ dataset by means of the artificial prevalence
|
||||
sampling protocol on samples of size 500, in terms of various
|
||||
evaluation metrics.
|
||||
|
||||
````python
|
||||
import quapy as qp
|
||||
import quapy.functional as F
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
qp.environ['SAMPLE_SIZE'] = 500
|
||||
|
||||
dataset = qp.datasets.fetch_reviews('kindle')
|
||||
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
|
||||
|
||||
training = dataset.training
|
||||
test = dataset.test
|
||||
|
||||
lr = LogisticRegression()
|
||||
pacc = qp.method.aggregative.PACC(lr)
|
||||
|
||||
pacc.fit(training)
|
||||
|
||||
df = qp.evaluation.artificial_sampling_report(
|
||||
pacc, # the quantification method
|
||||
test, # the test set on which the method will be evaluated
|
||||
sample_size=qp.environ['SAMPLE_SIZE'], #indicates the size of samples to be drawn
|
||||
n_prevpoints=11, # how many prevalence points will be extracted from the interval [0, 1] for each category
|
||||
n_repetitions=1, # number of times each prevalence will be used to generate a test sample
|
||||
n_jobs=-1, # indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)
|
||||
random_seed=42, # setting a random seed allows to replicate the test samples across runs
|
||||
error_metrics=['mae', 'mrae', 'mkld'], # specify the evaluation metrics
|
||||
verbose=True # set to True to show some standard-line outputs
|
||||
)
|
||||
````
|
||||
|
||||
The resulting report is a pandas' dataframe that can be directly printed.
|
||||
Here, we set some display options from pandas just to make the output clearer;
|
||||
note also that the estimated prevalences are shown as strings, using the
_strprev_ function, which simply converts a prevalence into a
string representation with a fixed decimal precision (default 3):
|
||||
From a pandas' dataframe, it is straightforward to visualize all the results,
|
||||
and compute the averaged values, e.g.:
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
pd.set_option('display.expand_frame_repr', False)
|
||||
pd.set_option("precision", 3)
|
||||
df['estim-prev'] = df['estim-prev'].map(F.strprev)
|
||||
print(df)
|
||||
report['estim-prev'] = report['estim-prev'].map(F.strprev)
|
||||
print(report)
|
||||
|
||||
print('Averaged values:')
|
||||
print(report.mean())
|
||||
```
|
||||
|
||||
This will produce an output like:
|
||||
|
||||
```
|
||||
true-prev estim-prev mae mrae mkld
|
||||
0 [0.0, 1.0] [0.000, 1.000] 0.000 0.000 0.000e+00
|
||||
1 [0.1, 0.9] [0.091, 0.909] 0.009 0.048 4.426e-04
|
||||
2 [0.2, 0.8] [0.163, 0.837] 0.037 0.114 4.633e-03
|
||||
3 [0.3, 0.7] [0.283, 0.717] 0.017 0.041 7.383e-04
|
||||
4 [0.4, 0.6] [0.366, 0.634] 0.034 0.070 2.412e-03
|
||||
5 [0.5, 0.5] [0.459, 0.541] 0.041 0.082 3.387e-03
|
||||
6 [0.6, 0.4] [0.565, 0.435] 0.035 0.073 2.535e-03
|
||||
7 [0.7, 0.3] [0.654, 0.346] 0.046 0.108 4.701e-03
|
||||
8 [0.8, 0.2] [0.725, 0.275] 0.075 0.235 1.515e-02
|
||||
9 [0.9, 0.1] [0.858, 0.142] 0.042 0.229 7.740e-03
|
||||
10 [1.0, 0.0] [0.945, 0.055] 0.055 27.357 5.219e-02
|
||||
true-prev estim-prev mae mrae mkld
|
||||
0 [0.308, 0.692] [0.314, 0.686] 0.005649 0.013182 0.000074
|
||||
1 [0.896, 0.104] [0.909, 0.091] 0.013145 0.069323 0.000985
|
||||
2 [0.848, 0.152] [0.809, 0.191] 0.039063 0.149806 0.005175
|
||||
3 [0.016, 0.984] [0.033, 0.967] 0.017236 0.487529 0.005298
|
||||
4 [0.728, 0.272] [0.751, 0.249] 0.022769 0.057146 0.001350
|
||||
... ... ... ... ... ...
|
||||
4995 [0.72, 0.28] [0.698, 0.302] 0.021752 0.053631 0.001133
|
||||
4996 [0.868, 0.132] [0.888, 0.112] 0.020490 0.088230 0.001985
|
||||
4997 [0.292, 0.708] [0.298, 0.702] 0.006149 0.014788 0.000090
|
||||
4998 [0.24, 0.76] [0.220, 0.780] 0.019950 0.054309 0.001127
|
||||
4999 [0.948, 0.052] [0.965, 0.035] 0.016941 0.165776 0.003538
|
||||
|
||||
[5000 rows x 5 columns]
|
||||
Averaged values:
|
||||
mae 0.023588
|
||||
mrae 0.108779
|
||||
mkld 0.003631
|
||||
dtype: float64
|
||||
|
||||
Process finished with exit code 0
|
||||
```
|
||||
|
||||
One can get the averaged scores using standard pandas'
functions, i.e.:

```python
print(df.mean())
```

will produce the following output:

```
true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
dtype: float64
```

Alternatively, we can simply generate all the predictions by:

```python
true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
```
|
||||
|
||||
Other evaluation functions include:
|
||||
|
||||
* _artificial_sampling_eval_: that computes the evaluation for a
|
||||
given evaluation metric, returning the average instead of a dataframe.
|
||||
* _artificial_sampling_prediction_: that returns two np.arrays containing the
|
||||
true prevalences and the estimated prevalences.
|
||||
|
||||
See the documentation for further details.
|
||||
All the evaluation functions implement specific optimizations for speeding-up
|
||||
the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_).
|
||||
The optimization comes down to generating classification predictions (either crisp or soft)
|
||||
only once for the entire test set, and then applying the sampling procedure to the
|
||||
predictions, instead of generating samples of instances and then computing the
|
||||
classification predictions every time. This is only possible when the protocol
|
||||
is an instance of _OnLabelledCollectionProtocol_. The optimization is only
|
||||
carried out when the number of classification predictions thus generated would be
|
||||
smaller than the number of predictions required for the entire protocol; e.g.,
|
||||
if the original dataset contains 1M instances, but the protocol is such that it would
|
||||
at most generate 20 samples of 100 instances, then it would be preferable to postpone the
|
||||
classification for each sample. This behaviour is indicated by setting
|
||||
_aggr_speedup="auto"_. Conversely, when indicating _aggr_speedup="force"_ QuaPy will
|
||||
precompute all the predictions irrespectively of the number of instances and number of samples.
|
||||
Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization
|
||||
is not only applied for the final evaluation, but also for the internal evaluations carried
|
||||
out during _model selection_. Since these are typically many, the heuristic can help reduce the
|
||||
execution time a lot.
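As a sketch of how this behaviour can be controlled explicitly (the _aggr_speedup_ argument appears in the signature of _qp.evaluation.evaluate_; the remaining names are carried over from the snippets above):

```python
# heuristic speed-up (default): precompute classifier predictions only when convenient
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='auto')

# always precompute the predictions for the whole collection, or disable the optimization entirely
mae_forced = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='force')
mae_nospeedup = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup=False)
```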
|
|
@ -16,12 +16,6 @@ and implement some abstract methods:
|
|||
|
||||
@abstractmethod
|
||||
def quantify(self, instances): ...
|
||||
|
||||
@abstractmethod
|
||||
def set_params(self, **parameters): ...
|
||||
|
||||
@abstractmethod
|
||||
def get_params(self, deep=True): ...
|
||||
```
|
||||
The meaning of these functions should be familiar to those
accustomed to working with scikit-learn, since the class structure of QuaPy
|
||||
|
@ -32,10 +26,10 @@ scikit-learn' structure has not been adopted _as is_ in QuaPy responds to
|
|||
the fact that scikit-learn's _predict_ function is expected to return
|
||||
one output for each input element --e.g., a predicted label for each
|
||||
instance in a sample-- while in quantification the output for a sample
|
||||
is one single array of class prevalences).
|
||||
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
|
||||
to simplify the use of _set_params_ and _get_params_ used in
|
||||
[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection).
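A small sketch of what this inheritance buys us in practice (the _classifier___ prefix convention for nested hyper-parameters is the one used later in the Model Selection entry; the exact parameter names should be checked against the API):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import PACC

pacc = PACC(LogisticRegression())
print(pacc.get_params())              # inspect the current hyper-parameters
pacc.set_params(classifier__C=10.0)   # set a hyper-parameter of the wrapped classifier
```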
|
||||
|
||||
## Aggregative Methods
|
||||
|
||||
|
@ -58,11 +52,11 @@ of _BaseQuantifier.quantify_ is already provided, which looks like:
|
|||
|
||||
```python
|
||||
def quantify(self, instances):
|
||||
classif_predictions = self.classify(instances)
|
||||
return self.aggregate(classif_predictions)
|
||||
```
|
||||
Aggregative quantifiers are expected to maintain a classifier (which is
|
||||
accessed through the _@property_ _classifier_). This classifier is
|
||||
given as input to the quantifier, and can be already fit
|
||||
on external data (in which case, the _fit_learner_ argument should
|
||||
be set to False), or be fit by the quantifier's fit (default).
|
||||
|
@ -73,13 +67,8 @@ _AggregativeProbabilisticQuantifier(AggregativeQuantifier)_.
|
|||
The particularity of _probabilistic_ aggregative methods (w.r.t.
|
||||
non-probabilistic ones), is that the default quantifier is defined
|
||||
in terms of the posterior probabilities returned by a probabilistic
|
||||
classifier, and not by the crisp decisions of a hard classifier.
|
||||
In any case, the interface _classify(instances)_ remains unchanged.
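A hedged illustration of the difference, reusing CC (crisp) and PCC (probabilistic) and the _training_/_test_ names from the snippets below; what _classify_ returns in each case should be double-checked against the API docs:

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import CC, PCC

crisp = CC(LogisticRegression()).fit(training)
soft = PCC(LogisticRegression()).fit(training)

print(crisp.classify(test.instances)[:5])  # crisp label predictions (one label per instance)
print(soft.classify(test.instances)[:5])   # posterior probabilities (one distribution per instance)
```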
|
||||
|
||||
One advantage of _aggregative_ methods (either probabilistic or not)
|
||||
is that the evaluation according to any sampling procedure (e.g.,
|
||||
|
@ -110,9 +99,7 @@ import quapy as qp
|
|||
import quapy.functional as F
|
||||
from sklearn.svm import LinearSVC
|
||||
|
||||
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
|
||||
training = dataset.training
|
||||
test = dataset.test
|
||||
training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
|
||||
|
||||
# instantiate a classifier learner, in this case a SVM
|
||||
svm = LinearSVC()
|
||||
|
@ -156,11 +143,12 @@ model.fit(training, val_split=5)
|
|||
```
|
||||
|
||||
The following code illustrates the case in which PCC is used:
|
||||
|
||||
```python
|
||||
model = qp.method.aggregative.PCC(svm)
|
||||
model.fit(training)
|
||||
estim_prevalence = model.quantify(test.instances)
|
||||
print('classifier:', model.classifier)
|
||||
```
|
||||
In this case, QuaPy will print:
|
||||
```
|
||||
|
@ -211,14 +199,22 @@ model.fit(dataset.training)
|
|||
estim_prevalence = model.quantify(dataset.test.instances)
|
||||
```
|
||||
|
||||
_New in v0.1.7_: EMQ now accepts two new parameters in the construction method, namely
|
||||
_exact_train_prev_ which allows to use the true training prevalence as the departing
|
||||
prevalence estimation (default behaviour), or instead an approximation of it as
|
||||
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
|
||||
(by setting _exact_train_prev=False_).
|
||||
The other parameter is _recalib_ which allows to indicate a calibration method, among those
|
||||
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
|
||||
including the Bias-Corrected Temperature Scaling, Vector Scaling, etc.
|
||||
See the API documentation for further details.
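A minimal sketch of these options (the value _'bcts'_ for Bias-Corrected Temperature Scaling is an assumption; see the API documentation for the accepted values of _recalib_):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import EMQ

# EMQ departing from an approximation of the training prevalence (Alexandari et al., 2020)
# and recalibrating the classifier posteriors via Bias-Corrected Temperature Scaling
model = EMQ(LogisticRegression(), exact_train_prev=False, recalib='bcts')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```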
|
||||
|
||||
|
||||
### Hellinger Distance y (HDy)
|
||||
|
||||
The method HDy is described in:
|
||||
|
||||
Implementation of the method based on the Hellinger Distance y (HDy) proposed by
[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146–164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
|
||||
|
||||
It is implemented in _qp.method.aggregative.HDy_ (also accessible
|
||||
through the alias _qp.method.aggregative.HellingerDistanceY_).
|
||||
|
@ -249,30 +245,51 @@ model.fit(dataset.training)
|
|||
estim_prevalence = model.quantify(dataset.test.instances)
|
||||
```
|
||||
|
||||
_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
|
||||
"Distribution Matching" approaches for multiclass, inspired by the framework
|
||||
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
|
||||
a variant of HDy for multiclass quantification as follows:
|
||||
|
||||
```python
|
||||
multiclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
|
||||
```
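Once instantiated, this quantifier can be used like any other aggregative method (a sketch, reusing the _dataset_ object from the snippets above):

```python
multiclassHDy.fit(dataset.training)
estim_prevalence = multiclassHDy.quantify(dataset.test.instances)
```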
|
||||
|
||||
_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
|
||||
framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
|
||||
and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
|
||||
(thanks to _Pablo González_ for the contributions!)
|
||||
|
||||
### Threshold Optimization methods
|
||||
|
||||
_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
|
||||
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
|
||||
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
|
||||
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
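A usage sketch, assuming these methods are exposed in _qp.method.aggregative_ (the genindex lists _MedianSweep_ and _MedianSweep2_ there; the constructor arguments follow the pattern of the other aggregative methods):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import MedianSweep2

model = MedianSweep2(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```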
|
||||
|
||||
### Explicit Loss Minimization
|
||||
|
||||
Explicit Loss Minimization (ELM) represents a family of methods
|
||||
based on structured output learning, i.e., quantifiers relying on
|
||||
classifiers that have been optimized targeting a
|
||||
quantification-oriented evaluation measure.
|
||||
The original methods are implemented in QuaPy as classify & count (CC)
|
||||
quantifiers that use Joachim's [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
|
||||
as the underlying classifier, properly set to optimize for the desired loss.
|
||||
|
||||
In QuaPy, this can be achieved by calling the functions:

* _newSVMQ_: returns the quantification method called SVM(Q), which optimizes for the metric _Q_ defined in [_Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based on reliable classifiers. Pattern Recognition, 48(2):591–604._](https://www.sciencedirect.com/science/article/pii/S003132031400291X)
* _newSVMKLD_ and _newSVMNKLD_: return the quantification methods called SVM(KLD) and SVM(nKLD), standing for Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in [_Esuli, A. and Sebastiani, F. (2015). Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27._](https://dl.acm.org/doi/abs/10.1145/2700406)
* _newSVMAE_ and _newSVMRAE_: return the quantification methods called SVM(AE) and SVM(RAE), which optimize for the (Mean) Absolute Error and the (Mean) Relative Absolute Error, respectively, as first used by [_Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23._](https://arxiv.org/abs/2011.02552)

The last two methods (SVM(AE) and SVM(RAE)) have been implemented in
|
||||
QuaPy in order to make available ELM variants for what nowadays
|
||||
are considered the most well-behaved evaluation metrics in quantification.
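A hedged usage sketch (constructor arguments are omitted; the patched SVMperf is assumed to be available and its folder indicated through _qp.environ['SVMPERF_HOME']_, as done in the example further below):

```python
import quapy as qp
from quapy.method.aggregative import newSVMAE

qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'  # path to the patched SVMperf

model = newSVMAE()
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```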
|
||||
|
||||
|
@ -306,13 +323,18 @@ currently supports only binary classification.
|
|||
ELM variants (any binary quantifier in general) can be extended
|
||||
to operate in single-label scenarios trivially by adopting a
|
||||
"one-vs-all" strategy (as, e.g., in
|
||||
[_Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1–22_](https://link.springer.com/article/10.1007/s13278-016-0327-z)).
In QuaPy this is possible by using the _OneVsAll_ class.
|
||||
|
||||
There are two ways for instantiating this class, _OneVsAllGeneric_ that works for
|
||||
any quantifier, and _OneVsAllAggregative_ that is optimized for aggregative quantifiers.
|
||||
In general, you can simply use the _getOneVsAll_ function and QuaPy will choose
|
||||
the more convenient of the two.
|
||||
|
||||
```python
|
||||
import quapy as qp
|
||||
from quapy.method.aggregative import SVMQ
|
||||
|
||||
# load a single-label dataset (this one contains 3 classes)
|
||||
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
|
||||
|
@ -320,11 +342,14 @@ dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
|
|||
# let qp know where svmperf is
|
||||
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'
|
||||
|
||||
model = getOneVsAll(SVMQ(), n_jobs=-1)  # run them in parallel
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.quantify(dataset.test.instances)
|
||||
```
|
||||
|
||||
Check the examples _[explicit_loss_minimization.py](..%2Fexamples%2Fexplicit_loss_minimization.py)_
|
||||
and [one_vs_all.py](..%2Fexamples%2Fone_vs_all.py) for more details.
|
||||
|
||||
## Meta Models
|
||||
|
||||
By _meta_ models we mean quantification methods that are defined on top of other
|
||||
|
@ -337,12 +362,12 @@ _Meta_ models are implemented in the _qp.method.meta_ module.
|
|||
|
||||
QuaPy implements (some of) the variants proposed in:
|
||||
|
||||
* [_Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100._](https://www.sciencedirect.com/science/article/pii/S1566253516300628)
* [_Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15._](https://www.sciencedirect.com/science/article/pii/S1566253517303652)
|
||||
|
||||
The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC)
|
||||
quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the
|
||||
|
@ -378,10 +403,10 @@ wiki if you want to optimize the hyperparameters of ensemble for classification
|
|||
|
||||
QuaPy offers an implementation of QuaNet, a deep learning model presented in:
|
||||
|
||||
[_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778)._](https://dl.acm.org/doi/abs/10.1145/3269206.3269287)
|
||||
|
||||
This model requires _torch_ to be installed.
|
||||
QuaNet also requires a classifier that can provide embedded representations
|
||||
|
@ -406,7 +431,8 @@ cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
|
|||
learner = NeuralClassifierTrainer(cnn, device='cuda')
|
||||
|
||||
# train QuaNet
|
||||
model = QuaNet(learner, device='cuda')
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.quantify(dataset.test.instances)
|
||||
```
|
||||
|
||||
|
|
|
@ -22,9 +22,9 @@ Quantification has long been regarded as an add-on of
|
|||
classification, and thus the model selection strategies
|
||||
customarily adopted in classification have simply been
|
||||
applied to quantification (see the next section).
|
||||
It has been argued in [Moreo, Alejandro, and Fabrizio Sebastiani.
Re-Assessing the "Classify and Count" Quantification Method.
ECIR 2021: Advances in Information Retrieval pp 75–91.](https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6)
|
||||
that specific model selection strategies should
|
||||
be adopted for quantification. That is, model selection
|
||||
strategies for quantification should target
|
||||
|
@ -32,76 +32,86 @@ quantification-oriented losses and be tested in a variety
|
|||
of scenarios exhibiting different degrees of prior
|
||||
probability shift.
|
||||
|
||||
The class _qp.model_selection.GridSearchQ_ implements a grid-search exploration over the space of
hyper-parameter combinations that [evaluates](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)
each combination of hyper-parameters by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in _qp.error_) and according to a
[sampling generation protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols).
|
||||
|
||||
The following is an example (also included in the examples folder) of model selection for quantification:
|
||||
|
||||
```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import DistributionMatching
from sklearn.linear_model import LogisticRegression
import numpy as np

"""
In this example, we show how to perform model selection on a DistributionMatching quantifier.
"""

model = DistributionMatching(LogisticRegression())

qp.environ['SAMPLE_SIZE'] = 100
qp.environ['N_JOBS'] = -1  # explore hyper-parameters in parallel

training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

# The model will be returned by the fit method of GridSearchQ.
# Every combination of hyper-parameters will be evaluated by confronting the
# quantifier thus configured against a series of samples generated by means
# of a sample generation protocol. For this example, we will use the
# artificial-prevalence protocol (APP), that generates samples with prevalence
# values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).
# We devote 30% of the dataset for this exploration.
training, validation = training.split_stratified(train_prop=0.7)
protocol = APP(validation)

# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
# (e.g., the number of bins in a DistributionMatching quantifier).
# Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"
# in order to let the quantifier know this hyper-parameter belongs to its underlying
# classifier.
param_grid = {
    'classifier__C': np.logspace(-3, 3, 7),
    'nbins': [8, 16, 32, 64],
}

model = qp.model_selection.GridSearchQ(
    model=model,
    param_grid=param_grid,
    protocol=protocol,
    error='mae',   # the error to optimize is the MAE (a quantification-oriented loss)
    refit=True,    # retrain on the whole labelled set once done
    verbose=True   # show information as the process goes on
).fit(training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# we use the same evaluation protocol (APP) on the test set
mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')

print(f'MAE={mae_score:.5f}')
```
|
||||
|
||||
In this example, the system outputs:
|
||||
```
|
||||
[GridSearchQ]: starting model selection with self.n_jobs =-1
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 64} got mae score 0.04021 [took 1.1356s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 32} got mae score 0.04286 [took 1.2139s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 16} got mae score 0.04888 [took 1.2491s]
[GridSearchQ]: hyperparams={'classifier__C': 0.001, 'nbins': 8} got mae score 0.05163 [took 1.5372s]
[...]
[GridSearchQ]: hyperparams={'classifier__C': 1000.0, 'nbins': 32} got mae score 0.02445 [took 2.9056s]
[GridSearchQ]: optimization finished: best params {'classifier__C': 100.0, 'nbins': 32} (score=0.02234) [took 7.3114s]
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'classifier__C': 100.0, 'nbins': 32}
MAE=0.03102
|
||||
```
|
||||
|
||||
The parameter _val_split_ can alternatively be used to indicate
|
||||
|
@ -121,39 +131,20 @@ quantification literature have opted for this strategy.
|
|||
|
||||
In QuaPy, this is achieved by simply instantiating the
|
||||
classifier learner as a GridSearchCV from scikit-learn.
|
||||
The following code illustrates how to do that:
|
||||
|
||||
```python
|
||||
learner = GridSearchCV(
|
||||
LogisticRegression(),
|
||||
param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
|
||||
cv=5)
|
||||
model = DistributionMatching(learner).fit(dataset.training)
|
||||
```
|
||||
|
||||
In this example, the system outputs:
|
||||
```
|
||||
model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None}
|
||||
1010 evaluations will be performed for each combination of hyper-parameters
|
||||
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s]
|
||||
MAE=0.41734
|
||||
```
|
||||
|
||||
Note that the MAE is worse than the one we obtained when optimizing
|
||||
for quantification and, indeed, the hyper-parameters found optimal
|
||||
largely differ between the two selection modalities. The
|
||||
hyper-parameters C=10000 and class_weight=None have been found
|
||||
to work well for the specific training prevalence of the HP dataset,
|
||||
but these hyper-parameters turned out to be suboptimal when the
|
||||
class prevalences of the test set differs (as is indeed tested
|
||||
in scenarios of quantification).
|
||||
|
||||
This is, however, not always the case, and one could, in practice,
|
||||
find examples
|
||||
in which optimizing for classification ends up resulting in a better
|
||||
quantifier than when optimizing for quantification.
|
||||
Nonetheless, this is theoretically unlikely to happen.
|
||||
However, this is conceptually flawed, since the model should be
|
||||
optimized for the task at hand (quantification), and not for a surrogate task (classification),
|
||||
i.e., the model should be requested to deliver low quantification errors, rather
|
||||
than low classification errors.
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -43,7 +43,7 @@ quantification methods across different scenarios showcasing
|
|||
the accuracy of the quantifier in predicting class prevalences
|
||||
for a wide range of prior distributions. This can easily be
|
||||
achieved by means of the
|
||||
[artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)
|
||||
that is implemented in QuaPy.
|
||||
|
||||
The following code shows how to perform one simple experiment
|
||||
|
@ -55,6 +55,7 @@ generating 100 random samples at each prevalence).
|
|||
|
||||
```python
|
||||
import quapy as qp
|
||||
from quapy.protocol import APP
|
||||
from quapy.method.aggregative import CC, ACC, PCC, PACC
|
||||
from sklearn.svm import LinearSVC
|
||||
|
||||
|
@ -63,28 +64,26 @@ qp.environ['SAMPLE_SIZE'] = 500
|
|||
def gen_data():

    def base_classifier():
        return LinearSVC(class_weight='balanced')

    def models():
        yield 'CC', CC(base_classifier())
        yield 'ACC', ACC(base_classifier())
        yield 'PCC', PCC(base_classifier())
        yield 'PACC', PACC(base_classifier())

    train, test = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5).train_test

    method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []

    for method_name, model in models():
        model.fit(train)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))

        method_names.append(method_name)
        true_prevs.append(true_prev)
        estim_prevs.append(estim_prev)
        tr_prevs.append(train.prevalence())

    return method_names, true_prevs, estim_prevs, tr_prevs
|
||||
|
||||
|
@ -163,21 +162,19 @@ like this:
|
|||
|
||||
```python
|
||||
def gen_data():

    train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
    model = CC(LinearSVC())

    method_data = []
    for training_prevalence in np.linspace(0.1, 0.9, 9):
        training_size = 5000
        # since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained
        train_sample = train.sampling(training_size, 1-training_prevalence)
        model.fit(train_sample)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
        # method names can contain Latex syntax
        method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\%}$'
        method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))

    return zip(*method_data)
|
||||
```
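The tuples returned by _gen_data()_ can then be fed to the plotting functions in _qp.plot_; a hedged sketch using _error_by_drift_ (the keyword arguments shown are assumptions to be checked against the API):

```python
method_names, true_prevs, estim_prevs, tr_prevs = gen_data()

# plot the quantification error as a function of the amount of prior probability shift
# between each test sample and the training sample
qp.plot.error_by_drift(method_names, true_prevs, estim_prevs, tr_prevs,
                       error_name='ae', savepath='./plots/err_by_drift.png')
```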
|
||||
|
|
|
@ -64,6 +64,7 @@ Features
|
|||
* 32 UCI Machine Learning datasets.
|
||||
* 11 Twitter Sentiment datasets.
|
||||
* 3 Reviews Sentiment datasets.
|
||||
* 4 tasks from LeQua competition (_new in v0.1.7!_)
|
||||
* Native support for binary and single-label scenarios of quantification.
|
||||
* Model selection functionality targeting quantification-oriented losses.
|
||||
* Visualization tools for analysing results.
|
||||
|
@ -75,6 +76,7 @@ Features
|
|||
Installation
|
||||
Datasets
|
||||
Evaluation
|
||||
Protocols
|
||||
Methods
|
||||
Model-Selection
|
||||
Plotting
|
||||
|
|
|
@ -56,6 +56,7 @@
|
|||
| <a href="#G"><strong>G</strong></a>
|
||||
| <a href="#H"><strong>H</strong></a>
|
||||
| <a href="#I"><strong>I</strong></a>
|
||||
| <a href="#J"><strong>J</strong></a>
|
||||
| <a href="#K"><strong>K</strong></a>
|
||||
| <a href="#L"><strong>L</strong></a>
|
||||
| <a href="#M"><strong>M</strong></a>
|
||||
|
@ -131,6 +132,8 @@
|
|||
<li><a href="quapy.method.html#quapy.method.aggregative.AggregativeQuantifier">AggregativeQuantifier (class in quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.APP">APP (class in quapy.protocol)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.ArtificialPrevalenceProtocol">ArtificialPrevalenceProtocol (in module quapy.protocol)</a>
|
||||
</li>
|
||||
<li><a href="quapy.classification.html#quapy.classification.neural.TorchDataset.asDataloader">asDataloader() (quapy.classification.neural.TorchDataset method)</a>
|
||||
</li>
|
||||
|
@ -284,10 +287,10 @@
|
|||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.EMQ">EMQ (class in quapy.method.aggregative)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.method.html#quapy.method.meta.Ensemble">Ensemble (class in quapy.method.meta)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.method.html#quapy.method.meta.ensembleFactory">ensembleFactory() (in module quapy.method.meta)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.meta.EPACC">EPACC() (in module quapy.method.meta)</a>
|
||||
|
@ -297,6 +300,8 @@
|
|||
<li><a href="quapy.html#quapy.plot.error_by_drift">error_by_drift() (in module quapy.plot)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.evaluation.evaluate">evaluate() (in module quapy.evaluation)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.evaluation.evaluate_on_samples">evaluate_on_samples() (in module quapy.evaluation)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.evaluation.evaluation_report">evaluation_report() (in module quapy.evaluation)</a>
|
||||
</li>
|
||||
|
@ -459,6 +464,16 @@
|
|||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.data.html#quapy.data.preprocessing.IndexTransformer">IndexTransformer (class in quapy.data.preprocessing)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.IterateProtocol">IterateProtocol (class in quapy.protocol)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
</tr></table>
|
||||
|
||||
<h2 id="J">J</h2>
|
||||
<table style="width: 100%" class="indextable genindextable"><tr>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.join">join() (quapy.data.base.LabelledCollection class method)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
</tr></table>
|
||||
|
@ -521,8 +536,6 @@
|
|||
<li><a href="quapy.method.html#quapy.method.aggregative.MedianSweep">MedianSweep (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.MedianSweep2">MedianSweep2 (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.mix">mix() (quapy.data.base.LabelledCollection class method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.error.mkld">mkld() (in module quapy.error)</a>
|
||||
</li>
|
||||
|
@ -603,6 +616,8 @@
|
|||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.n_classes">(quapy.data.base.LabelledCollection property)</a>
|
||||
</li>
|
||||
</ul></li>
|
||||
<li><a href="quapy.html#quapy.protocol.NaturalPrevalenceProtocol">NaturalPrevalenceProtocol (in module quapy.protocol)</a>
|
||||
</li>
|
||||
<li><a href="quapy.classification.html#quapy.classification.calibration.NBVSCalibration">NBVSCalibration (class in quapy.classification.calibration)</a>
|
||||
</li>
|
||||
<li><a href="quapy.classification.html#quapy.classification.neural.NeuralClassifierTrainer">NeuralClassifierTrainer (class in quapy.classification.neural)</a>
|
||||
|
@ -610,11 +625,11 @@
|
|||
<li><a href="quapy.method.html#quapy.method.aggregative.newELM">newELM() (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.base.newOneVsAll">newOneVsAll() (in module quapy.method.base)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMAE">newSVMAE() (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMAE">newSVMAE() (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMKLD">newSVMKLD() (in module quapy.method.aggregative)</a>
|
||||
</li>
|
||||
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMQ">newSVMQ() (in module quapy.method.aggregative)</a>
|
||||
|
@ -914,6 +929,8 @@
|
|||
<li><a href="quapy.classification.html#quapy.classification.calibration.RecalibratedProbabilisticClassifier">RecalibratedProbabilisticClassifier (class in quapy.classification.calibration)</a>
|
||||
</li>
|
||||
<li><a href="quapy.classification.html#quapy.classification.calibration.RecalibratedProbabilisticClassifierBase">RecalibratedProbabilisticClassifierBase (class in quapy.classification.calibration)</a>
|
||||
</li>
|
||||
<li><a href="quapy.data.html#quapy.data.base.Dataset.reduce">reduce() (quapy.data.base.Dataset method)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
|
@ -942,7 +959,7 @@
|
|||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.NPP.sample">(quapy.protocol.NPP method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.USimplexPP.sample">(quapy.protocol.USimplexPP method)</a>
|
||||
<li><a href="quapy.html#quapy.protocol.UPP.sample">(quapy.protocol.UPP method)</a>
|
||||
</li>
|
||||
</ul></li>
|
||||
<li><a href="quapy.html#quapy.protocol.AbstractStochasticSeededProtocol.samples_parameters">samples_parameters() (quapy.protocol.AbstractStochasticSeededProtocol method)</a>
|
||||
|
@ -954,7 +971,7 @@
|
|||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.NPP.samples_parameters">(quapy.protocol.NPP method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.USimplexPP.samples_parameters">(quapy.protocol.USimplexPP method)</a>
|
||||
<li><a href="quapy.html#quapy.protocol.UPP.samples_parameters">(quapy.protocol.UPP method)</a>
|
||||
</li>
|
||||
</ul></li>
|
||||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.sampling">sampling() (quapy.data.base.LabelledCollection method)</a>
|
||||
|
@ -1033,10 +1050,12 @@
|
|||
<li><a href="quapy.html#quapy.protocol.APP.total">(quapy.protocol.APP method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.DomainMixer.total">(quapy.protocol.DomainMixer method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.IterateProtocol.total">(quapy.protocol.IterateProtocol method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.NPP.total">(quapy.protocol.NPP method)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.USimplexPP.total">(quapy.protocol.USimplexPP method)</a>
|
||||
<li><a href="quapy.html#quapy.protocol.UPP.total">(quapy.protocol.UPP method)</a>
|
||||
</li>
|
||||
</ul></li>
|
||||
</ul></td>
|
||||
|
@ -1073,13 +1092,15 @@
|
|||
</li>
|
||||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.uniform_sampling">uniform_sampling() (quapy.data.base.LabelledCollection method)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.uniform_sampling_index">uniform_sampling_index() (quapy.data.base.LabelledCollection method)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
<td style="width: 33%; vertical-align: top;"><ul>
|
||||
<li><a href="quapy.html#quapy.functional.uniform_simplex_sampling">uniform_simplex_sampling() (in module quapy.functional)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.USimplexPP">USimplexPP (class in quapy.protocol)</a>
|
||||
<li><a href="quapy.html#quapy.protocol.UniformPrevalenceProtocol">UniformPrevalenceProtocol (in module quapy.protocol)</a>
|
||||
</li>
|
||||
<li><a href="quapy.html#quapy.protocol.UPP">UPP (class in quapy.protocol)</a>
|
||||
</li>
|
||||
</ul></td>
|
||||
</tr></table>
|
||||
|
|
|
@ -102,6 +102,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
|
|||
<li><p>32 UCI Machine Learning datasets.</p></li>
|
||||
<li><p>11 Twitter Sentiment datasets.</p></li>
|
||||
<li><p>3 Reviews Sentiment datasets.</p></li>
|
||||
<li><p>4 tasks from LeQua competition (_new in v0.1.7!_)</p></li>
|
||||
</ul>
|
||||
</dd>
|
||||
</dl>
|
||||
|
@ -130,6 +131,13 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
|
|||
<li class="toctree-l2"><a class="reference internal" href="Evaluation.html#evaluation-protocols">Evaluation Protocols</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="Protocols.html">Protocols</a><ul>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#artificial-prevalence-protocol">Artificial-Prevalence Protocol</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#sampling-from-the-unit-simplex-the-uniform-prevalence-protocol-upp">Sampling from the unit-simplex, the Uniform-Prevalence Protocol (UPP)</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#natural-prevalence-protocol">Natural-Prevalence Protocol</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#other-protocols">Other protocols</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="Methods.html">Quantification Methods</a><ul>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Methods.html#aggregative-methods">Aggregative Methods</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Methods.html#meta-models">Meta Models</a></li>
|
||||
|
|
Binary file not shown.
|
@ -170,6 +170,23 @@ See <a class="reference internal" href="#quapy.data.base.LabelledCollection.load
|
|||
</dl>
|
||||
</dd></dl>
|
||||
|
||||
<dl class="py method">
|
||||
<dt class="sig sig-object py" id="quapy.data.base.Dataset.reduce">
|
||||
<span class="sig-name descname"><span class="pre">reduce</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">n_train</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">n_test</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.Dataset.reduce" title="Permalink to this definition">¶</a></dt>
|
||||
<dd><p>Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.</p>
|
||||
<dl class="field-list simple">
|
||||
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
|
||||
<dd class="field-odd"><ul class="simple">
|
||||
<li><p><strong>n_train</strong> – number of training documents to keep (default 100)</p></li>
|
||||
<li><p><strong>n_test</strong> – number of test documents to keep (default 100)</p></li>
|
||||
</ul>
|
||||
</dd>
|
||||
<dt class="field-even">Returns<span class="colon">:</span></dt>
|
||||
<dd class="field-even"><p>self</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</dd></dl>
|
||||
|
||||
<dl class="py method">
|
||||
<dt class="sig sig-object py" id="quapy.data.base.Dataset.stats">
|
||||
<span class="sig-name descname"><span class="pre">stats</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">show</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">True</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.Dataset.stats" title="Permalink to this definition">¶</a></dt>
|
||||
|
@ -297,6 +314,20 @@ as listed by <cite>self.classes_</cite></p>
|
|||
</dl>
|
||||
</dd></dl>
|
||||
|
||||
<dl class="py method">
|
||||
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.join">
|
||||
<em class="property"><span class="pre">classmethod</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">join</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">*</span></span><span class="n"><span class="pre">args</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Iterable</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a><span class="p"><span class="pre">]</span></span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.LabelledCollection.join" title="Permalink to this definition">¶</a></dt>
|
||||
<dd><p>Returns a new <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> as the union of the collections given in input.</p>
|
||||
<dl class="field-list simple">
|
||||
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
|
||||
<dd class="field-odd"><p><strong>args</strong> – instances of <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a></p>
|
||||
</dd>
|
||||
<dt class="field-even">Returns<span class="colon">:</span></dt>
|
||||
<dd class="field-even"><p>a <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> representing the union of both collections</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</dd></dl>
|
||||
|
||||
<dl class="py method">
|
||||
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.kFCV">
|
||||
<span class="sig-name descname"><span class="pre">kFCV</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">nfolds</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">5</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">nrepeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.LabelledCollection.kFCV" title="Permalink to this definition">¶</a></dt>
|
||||
|
@ -338,23 +369,6 @@ these arguments are used to call <cite>loader_func(path, **loader_kwargs)</cite>
classmethod mix(a: LabelledCollection, b: LabelledCollection)
    Returns a new LabelledCollection as the union of this collection with another collection.
    :param a: instance of LabelledCollection
    :param b: instance of LabelledCollection
    :return: a LabelledCollection representing the union of both collections

property n_classes
@@ -481,18 +481,117 @@ will be taken from the environment variable SAMPLE_SIZE (which has

quapy.evaluation
----------------

quapy.evaluation.evaluate(model: BaseQuantifier, protocol: AbstractProtocol, error_metric: Union[str, Callable], aggr_speedup: Union[str, bool] = 'auto', verbose=False)
    Evaluates a quantification model according to a specific sample generation protocol and in terms of one
    evaluation metric (error).
    :param model: a quantifier, instance of quapy.method.base.BaseQuantifier
    :param protocol: quapy.protocol.AbstractProtocol; if this object is also an instance of
        quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the
        protocol in charge of generating the samples on which the model is evaluated.
    :param error_metric: a string representing the name of an error function in qp.error (e.g., 'mae'),
        or a callable implementing the error function itself.
    :param aggr_speedup: whether or not to apply the speed-up. Set to "force" to apply it even if the number of
        instances in the original collection on which the protocol acts is larger than the number of instances
        in the samples to be generated. Set to True or "auto" (default) to let QuaPy decide whether it is
        convenient or not. Set to False to deactivate it.
    :param verbose: boolean, whether or not to print information to stdout
    :return: if the error metric is not averaged (e.g., 'ae', 'rae'), an array of shape (n_samples,) with
        the error score for each sample; if the error metric is averaged (e.g., 'mae', 'mrae'), a single float
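For instance, assuming a fitted quantifier `q` and a held-out test LabelledCollection `test` (both hypothetical here), a sketch of both return modes looks as follows:

```
import quapy as qp
from quapy.protocol import APP

qp.environ['SAMPLE_SIZE'] = 100
prot = APP(test)                                                      # sample generation protocol on the test set
mae = qp.evaluation.evaluate(q, protocol=prot, error_metric='mae')   # averaged metric: a single float
aes = qp.evaluation.evaluate(q, protocol=prot, error_metric='ae')    # non-averaged metric: one score per sample
```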
quapy.evaluation.evaluate_on_samples(model: BaseQuantifier, samples: Iterable[LabelledCollection], error_metric: Union[str, Callable], verbose=False)
    Evaluates a quantification model on a given set of samples and in terms of one evaluation metric (error).
    :param model: a quantifier, instance of quapy.method.base.BaseQuantifier
    :param samples: a list of samples on which the quantifier is to be evaluated
    :param error_metric: a string representing the name of an error function in qp.error (e.g., 'mae'),
        or a callable implementing the error function itself.
    :param verbose: boolean, whether or not to print information to stdout
    :return: if the error metric is not averaged (e.g., 'ae', 'rae'), an array of shape (n_samples,) with
        the error score for each sample; if the error metric is averaged (e.g., 'mae', 'mrae'), a single float
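A sketch of its use on a handful of pre-drawn samples; `q` and `test` are hypothetical as above, and `LabelledCollection.sampling(size, *prevalences)` is assumed to be available, as in QuaPy's sampling API:

```
import quapy as qp

# draw a few samples at chosen prevalence values (assumed sampling API)
samples = [test.sampling(100, 0.2, 0.8),
           test.sampling(100, 0.5, 0.5),
           test.sampling(100, 0.8, 0.2)]
mae = qp.evaluation.evaluate_on_samples(q, samples, error_metric='mae')
```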
quapy.evaluation.evaluation_report(model: BaseQuantifier, protocol: AbstractProtocol, error_metrics: Iterable[Union[str, Callable]] = 'mae', aggr_speedup: Union[str, bool] = 'auto', verbose=False)
    Generates a report (a pandas DataFrame) containing information on the evaluation of the model according
    to a specific protocol and in terms of one or more evaluation metrics (errors).
    :param model: a quantifier, instance of quapy.method.base.BaseQuantifier
    :param protocol: quapy.protocol.AbstractProtocol; if this object is also an instance of
        quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the
        protocol in charge of generating the samples on which the model is evaluated.
    :param error_metrics: a string, or list of strings, representing the name(s) of an error function in qp.error
        (e.g., 'mae', the default value), or a callable function, or a list of callable functions, implementing
        the error function itself.
    :param aggr_speedup: whether or not to apply the speed-up. Set to "force" to apply it even if the number of
        instances in the original collection on which the protocol acts is larger than the number of instances
        in the samples to be generated. Set to True or "auto" (default) to let QuaPy decide whether it is
        convenient or not. Set to False to deactivate it.
    :param verbose: boolean, whether or not to print information to stdout
    :return: a pandas DataFrame containing the columns 'true-prev' (the true prevalence of each sample),
        'estim-prev' (the prevalence estimated by the model for each sample), and as many columns as error
        metrics have been indicated, each displaying the score in terms of that metric for every sample.
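An illustrative sketch (again with a hypothetical fitted quantifier `q` and test collection `test`); the per-sample report can be summarized directly with pandas:

```
import quapy as qp
from quapy.protocol import APP

report = qp.evaluation.evaluation_report(q, protocol=APP(test), error_metrics=['mae', 'mrae'])
print(report.head())                    # per-sample true-prev, estim-prev, mae, mrae
print(report[['mae', 'mrae']].mean())   # average scores over all samples
```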
quapy.evaluation.prediction(model: BaseQuantifier, protocol: AbstractProtocol, aggr_speedup: Union[str, bool] = 'auto', verbose=False)
    Uses a quantification model to generate predictions for the samples generated via a specific protocol.
    This function is central to all evaluation processes, and is endowed with an optimization to speed up the
    prediction for protocols that generate samples from a large collection. The optimization applies to aggregative
    quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification
    predictions once and for all, and then generating samples over the classification predictions (instead of over
    the raw instances), so that the classifier is never called again. This behaviour is obtained by setting
    aggr_speedup to 'auto' or True, and is only carried out if the overall process is convenient in terms of
    computation (e.g., if the number of classification predictions needed for the original collection exceeds the
    number of classification predictions needed for all samples, then the optimization is not undertaken).
    :param model: a quantifier, instance of quapy.method.base.BaseQuantifier
    :param protocol: quapy.protocol.AbstractProtocol; if this object is also an instance of
        quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol
        in charge of generating the samples for which the model has to issue class prevalence predictions.
    :param aggr_speedup: whether or not to apply the speed-up. Set to "force" to apply it even if the number of
        instances in the original collection on which the protocol acts is larger than the number of instances
        in the samples to be generated. Set to True or "auto" (default) to let QuaPy decide whether it is
        convenient or not. Set to False to deactivate it.
    :param verbose: boolean, whether or not to print information to stdout
    :return: a tuple (true_prevs, estim_prevs) in which each element is an array of shape
        (n_samples, n_classes) containing the true, or predicted, prevalence values for each sample
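A minimal sketch of how this pairs with an error function (hypothetical `q` and `test` as above):

```
import quapy as qp
from quapy.protocol import APP

true_prevs, estim_prevs = qp.evaluation.prediction(q, protocol=APP(test), aggr_speedup='auto')
print(true_prevs.shape, estim_prevs.shape)      # both (n_samples, n_classes)
print(qp.error.mae(true_prevs, estim_prevs))    # mean absolute error over all samples
```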
quapy.protocol
--------------

@@ -624,7 +723,21 @@ the sequence will be consistent every time the protocol is called.

AbstractStochasticSeededProtocol.collator(sample, *args)
    The collator prepares the sample to accommodate the desired output format before returning the output.
    This collator simply returns the sample as it is. Classes inheriting from this abstract class can
    implement their custom collators.
    :param sample: the sample to be returned
    :param args: additional arguments
    :return: the sample adhering to a desired output format (in this case, the sample is returned as it is)
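As a sketch of how a subclass might customize this step (assuming, as the UPP entry below suggests, that concrete protocols such as APP extend AbstractStochasticSeededProtocol), a hypothetical subclass could intercept each generated sample and then defer to the default collator:

```
from quapy.protocol import APP

class VerboseAPP(APP):
    # hypothetical subclass: report each sample as it is generated, then return it unchanged
    def collator(self, sample, *args):
        print('producing one more sample')
        return super().collator(sample, *args)
```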
property random_state

@@ -658,6 +771,12 @@ the sequence will be consistent every time the protocol is called.

quapy.protocol.ArtificialPrevalenceProtocol
    alias of APP

class quapy.protocol.DomainMixer(domainA: LabelledCollection, domainB: LabelledCollection, sample_size, repeats=1, prevalence=None, mixture_points=11, random_state=0, return_type='sample_prev')

@@ -720,6 +839,29 @@ will be the same every time the protocol is called)
class quapy.protocol.IterateProtocol(samples: [LabelledCollection])
    Bases: AbstractProtocol

    A very simple protocol which simply iterates over a list of previously generated samples.
    :param samples: a list of quapy.data.base.LabelledCollection

    total()
        Returns the number of samples in this protocol.
        :return: int
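For example, to re-use samples that were drawn beforehand (hypothetical `q` and a pre-drawn `samples` list, as in the evaluate_on_samples sketch above):

```
import quapy as qp
from quapy.protocol import IterateProtocol

prot = IterateProtocol(samples)    # wraps a fixed list of LabelledCollection samples
print(prot.total())                # number of wrapped samples
mae = qp.evaluation.evaluate(q, protocol=prot, error_metric='mae')
```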
class quapy.protocol.NPP(data: LabelledCollection, sample_size=None, repeats=100, random_state=0, return_type='sample_prev')

@@ -778,6 +920,12 @@ to “labelled_collection” to get instead instances of LabelledCollection

quapy.protocol.NaturalPrevalenceProtocol
    alias of NPP

class quapy.protocol.OnLabelledCollectionProtocol

@@ -785,7 +933,7 @@ to “labelled_collection” to get instead instances of LabelledCollection

    Protocols that generate samples from a qp.data.LabelledCollection object.

    RETURN_TYPES = ['sample_prev', 'labelled_collection', 'index']
@@ -841,8 +989,8 @@ with shape (n_instances,) when the classifier is a hard one, or with

class quapy.protocol.UPP(data: LabelledCollection, sample_size=None, repeats=100, random_state=0, return_type='sample_prev')   (formerly USimplexPP)
    Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol

    A variant of APP that, instead of using a grid of equidistant prevalence values,
    relies on the Kraemer algorithm for sampling the unit (k-1)-simplex uniformly at random, with

@@ -865,8 +1013,8 @@ to “labelled_collection” to get instead instances of LabelledCollection
    UPP.sample(index)
        Realizes the sample given the index of the instances.
        Parameters:

@@ -879,19 +1027,19 @@ to “labelled_collection” to get instead instances of LabelledCollection

    UPP.samples_parameters()
        Return all the necessary parameters to replicate the samples according to the UPP protocol.
        :return: a list of indexes that realize the UPP sampling
    UPP.total()
        Returns the number of samples that will be generated (equal to "repeats").
        Returns:

@@ -902,6 +1050,12 @@ to “labelled_collection” to get instead instances of LabelledCollection

quapy.protocol.UniformPrevalenceProtocol
    alias of UPP
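A short sketch of UPP in an evaluation loop (hypothetical fitted quantifier `q` and test collection `test`, as in the earlier sketches):

```
import quapy as qp
from quapy.protocol import UPP

qp.environ['SAMPLE_SIZE'] = 100
prot = UPP(test, repeats=100, random_state=0)   # prevalence vectors drawn uniformly from the simplex
report = qp.evaluation.evaluation_report(q, protocol=prot, error_metrics=['mae'])
print(report['mae'].mean())
```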
quapy.functional
----------------

@@ -1175,9 +1329,9 @@ protocol for quantification.

    :param model: (BaseQuantifier) the quantifier to optimize
    :param param_grid: a dictionary with keys the parameter names and values the list of values to explore
    :param protocol: a sample generation protocol, an instance of quapy.protocol.AbstractProtocol
    :param error: an error function (callable) or a string indicating the name of an error function (valid ones
        are those in quapy.error.QUANTIFICATION_ERROR)
    :param refit: whether or not to refit the model on the whole labelled collection (training+validation) with
        the best chosen hyperparameter combination. Ignored if protocol='gen'
    :param timeout: establishes a timer (in seconds) for each of the hyperparameter configurations being tested.
@@ -1718,7 +1718,7 @@ registered hooks while the latter silently ignores them.

class quapy.method.neural.QuaNetTrainer(classifier, sample_size=None, n_epochs=100, tr_iter_per_poch=500, va_iter_per_poch=100, lr=0.001, lstm_hidden_size=64, lstm_nlayers=1, ff_layers=[1024, 512], bidirectional=True, qdrop_p=0.5, patience=10, checkpointdir='../checkpoint', checkpointname=None, device='cuda')
    Bases: BaseQuantifier

    Implementation of QuaNet, a neural network for quantification. This implementation uses PyTorch and can
    take advantage of GPU

@@ -1751,7 +1751,8 @@ for speeding-up the training phase.

    :param classifier: an object implementing fit (i.e., that can be trained on labelled data),
        predict_proba (i.e., that can generate posterior probabilities of unlabelled examples) and
        transform (i.e., that can generate embedded representations of the unlabelled instances).
    :param sample_size: integer, the sample size; default is None, meaning that the sample size should be
        taken from qp.environ["SAMPLE_SIZE"]
    :param n_epochs: integer, maximum number of training epochs
    :param tr_iter_per_poch: integer, number of training iterations before considering an epoch complete
    :param va_iter_per_poch: integer, number of validation iterations to perform after each epoch
|
||||
|
|
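
For context, a minimal training sketch follows (editorial, not part of this commit). It assumes QuaPy's neural classification helpers NeuralClassifierTrainer and CNNnet, which expose the fit, predict_proba and transform methods that QuaNetTrainer requires, and the 'kindle' reviews dataset; exact names outside this page are assumptions, not taken from the diff.

```
import quapy as qp
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
from quapy.method.neural import QuaNetTrainer

qp.environ['SAMPLE_SIZE'] = 100  # used because sample_size=None defaults to the environment value

# a textual dataset indexed as sequences of token ids, as required by the CNN classifier (assumed API)
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

cnn = NeuralClassifierTrainer(CNNnet(dataset.vocabulary_size, dataset.n_classes), device='cuda')
quanet = QuaNetTrainer(cnn, checkpointdir='./checkpoint', device='cuda')
quanet.fit(dataset.training)
estim_prevalence = quanet.quantify(dataset.test.instances)
print(estim_prevalence)
```
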
File diff suppressed because one or more lines are too long
@ -0,0 +1,57 @@

import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import DistributionMatching
from sklearn.linear_model import LogisticRegression
import numpy as np

"""
In this example, we show how to perform model selection on a DistributionMatching quantifier.
"""

model = DistributionMatching(LogisticRegression())

qp.environ['SAMPLE_SIZE'] = 100
qp.environ['N_JOBS'] = -1

training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

# The model will be returned by the fit method of GridSearchQ.
# Every combination of hyper-parameters will be evaluated by confronting the
# quantifier thus configured against a series of samples generated by means
# of a sample generation protocol. For this example, we will use the
# artificial-prevalence protocol (APP), which generates samples with prevalence
# values taken from a grid spanning the entire range (e.g., [0, 0.1, 0.2, ..., 1]).
# We devote 30% of the dataset to this exploration.
training, validation = training.split_stratified(train_prop=0.7)
protocol = APP(validation)

# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
# (e.g., the number of bins in a DistributionMatching quantifier).
# Classifier-dependent hyper-parameters have to be marked with the prefix "classifier__"
# in order to let the quantifier know this hyper-parameter belongs to its underlying
# classifier.
param_grid = {
    'classifier__C': np.logspace(-3, 3, 7),
    'nbins': [8, 16, 32, 64],
}

model = qp.model_selection.GridSearchQ(
    model=model,
    param_grid=param_grid,
    protocol=protocol,
    error='mae',  # the error to optimize is the MAE (a quantification-oriented loss)
    refit=True,   # retrain on the whole labelled set once done
    verbose=True  # show information as the process goes on
).fit(training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# evaluation in terms of MAE
# we use the same evaluation protocol (APP) on the test set
mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')

print(f'MAE={mae_score:.5f}')
@ -1,22 +1,16 @@

Change Log 0.1.7
---------------------
----------------

- Protocols are now abstracted as instances of AbstractProtocol. There is a new class extending AbstractProtocol called
  AbstractStochasticSeededProtocol, which implements a seeding policy to allow replicating the series of samplings.
  There are some examples of protocols: APP, NPP, UPP, DomainMixer (experimental).
  The idea is to start the sampling by simply calling the __call__ method.
  The idea is to start the sample generation by simply calling the __call__ method.
  This change has a great impact on the framework, since many functions in qp.evaluation, qp.model_selection,
  and sampling functions in LabelledCollection relied on the old functions. E.g., the functionality of
  qp.evaluation.artificial_prevalence_report or qp.evaluation.natural_prevalence_report is now obtained by means of
  qp.evaluation.report, which takes a protocol as an argument. I have not maintained compatibility with the old
  interfaces because I did not really like them. Check the wiki guide and the examples for more details
  (a short usage sketch is given at the end of this hunk).

  check guides

  check examples

- ACC, PACC, Forman's threshold variants have been parallelized.

- Exploration of hyperparameters in model selection can now be run in parallel (there was an n_jobs argument in
  QuaPy 0.1.6 but only the evaluation part for one specific hyperparameter was run in parallel).
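
The following sketch is editorial (not part of the change log) and illustrates the protocol-based workflow described above; it assumes that APP yields each sample as an (instances, prevalence) pair and that qp.evaluation.evaluation_report accepts a protocol plus a list of error metrics, as suggested by the docstring diffs further below.

```
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 100

training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
model = PACC(LogisticRegression()).fit(training)

# a protocol is now a generator of samples: invoking __call__ yields them one by one
# (assumption: each sample is yielded as an (instances, prevalence) pair)
protocol = APP(test)
for instances, prevalence in protocol():
    pass  # each sample comes with its ground-truth prevalence

# the old artificial/natural_prevalence_report functions are replaced by a single
# report function that receives the protocol as an argument
report = qp.evaluation.evaluation_report(model, protocol=protocol, error_metrics=['mae', 'mrae'])
print(report)
```
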
@ -26,17 +20,19 @@ Change Log 0.1.7

  procedure. The user can now specify "force", "auto", True or False, in order to actively decide whether to apply it
  or not.

- n_jobs is now taken from the environment if set to None

- examples directory created!

- cross_val_predict (for quantification) added to model_selection: would be nice to allow the user to specify a
  test protocol maybe, or None for bypassing it?

- DyS, Topsoe distance and binary search (thanks to Pablo González)

- Multi-thread reproducibility via seeding (thanks to Pablo González)

- n_jobs is now taken from the environment if set to None

- ACC, PACC, Forman's threshold variants have been parallelized.

- cross_val_predict (for quantification) added to model_selection: would be nice to allow the user to specify a
  test protocol maybe, or None for bypassing it?

- Bugfix: adding two labelled collections (with +) now checks for consistency in the classes

- newer versions of numpy raise a warning when accessing types (e.g., np.float). I have replaced all such instances
@ -1,4 +1,6 @@

import itertools
from functools import cached_property
from typing import Iterable

import numpy as np
from scipy.sparse import issparse
@ -129,11 +131,23 @@ class LabelledCollection:

        # <= size * prevs[i]) examples are drawn from class i, there could be a remainder number of instances to take
        # to satisfy the size constraint. The remainder is distributed along the classes with probability = prevs.
        # (This aims at avoiding the remainder to be placed in a class for which the prevalence requested is 0.)
        n_requests = {class_: int(size * prevs[i]) for i, class_ in enumerate(self.classes_)}
        n_requests = {class_: round(size * prevs[i]) for i, class_ in enumerate(self.classes_)}
        remainder = size - sum(n_requests.values())
        with temp_seed(random_state):
            for rand_class in np.random.choice(self.classes_, size=remainder, p=prevs):
                n_requests[rand_class] += 1
            # due to rounding, the remainder can be 0, >0, or <0
            if remainder > 0:
                # when the remainder is >0 we randomly add 1 to the requests for each class;
                # more prevalent classes are more likely to be taken, in order to minimize the impact on the final prevalence
                for rand_class in np.random.choice(self.classes_, size=remainder, p=prevs):
                    n_requests[rand_class] += 1
            elif remainder < 0:
                # when the remainder is <0 we randomly remove 1 from the requests, unless the request is 0 for a chosen
                # class; we repeat until remainder == 0
                while remainder != 0:
                    rand_class = np.random.choice(self.classes_, p=prevs)
                    if n_requests[rand_class] > 0:
                        n_requests[rand_class] -= 1
                        remainder += 1

        indexes_sample = []
        for class_, n_requested in n_requests.items():
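
The following standalone sketch (editorial, not part of the commit) isolates the remainder-handling policy shown above: per-class requests are rounded, and the rounding remainder, which can be positive or negative, is redistributed at random with probability proportional to the requested prevalence values.

```
import numpy as np

def distribute_requests(size, prevs, classes, random_state=None):
    rng = np.random.default_rng(random_state)
    n_requests = {c: round(size * p) for c, p in zip(classes, prevs)}
    remainder = size - sum(n_requests.values())
    if remainder > 0:
        # add one instance to randomly chosen classes; more prevalent classes are more likely
        for c in rng.choice(classes, size=remainder, p=prevs):
            n_requests[c] += 1
    elif remainder < 0:
        # remove one instance at a time from randomly chosen classes with a non-empty request
        while remainder != 0:
            c = rng.choice(classes, p=prevs)
            if n_requests[c] > 0:
                n_requests[c] -= 1
                remainder += 1
    return n_requests

# e.g., 10 instances at uniform prevalence over 3 classes: rounding gives 3+3+3=9,
# so one extra instance is assigned to a randomly chosen class
print(distribute_requests(10, [1/3, 1/3, 1/3], [0, 1, 2], random_state=0))
```
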
@ -266,31 +280,47 @@ class LabelledCollection:

        if not all(np.sort(self.classes_) == np.sort(other.classes_)):
            raise NotImplementedError(f'unsupported operation for collections on different classes; '
                                      f'expected {self.classes_}, found {other.classes_}')
        return LabelledCollection.mix(self, other)
        return LabelledCollection.join(self, other)

    @classmethod
    def mix(cls, a: 'LabelledCollection', b: 'LabelledCollection'):
    def join(cls, *args: Iterable['LabelledCollection']):
        """
        Returns a new :class:`LabelledCollection` as the union of this collection with another collection.
        Returns a new :class:`LabelledCollection` as the union of the collections given in input.

        :param a: instance of :class:`LabelledCollection`
        :param b: instance of :class:`LabelledCollection`
        :param args: instances of :class:`LabelledCollection`
        :return: a :class:`LabelledCollection` representing the union of the given collections
        """
        if a is None: return b
        if b is None: return a
        elif issparse(a.instances) and issparse(b.instances):
            join_instances = vstack([a.instances, b.instances])
        elif isinstance(a.instances, list) and isinstance(b.instances, list):
            join_instances = a.instances + b.instances
        elif isinstance(a.instances, np.ndarray) and isinstance(b.instances, np.ndarray):
            join_instances = np.concatenate([a.instances, b.instances])

        args = [lc for lc in args if lc is not None]
        assert len(args) > 0, 'empty list is not allowed for join'

        assert all([isinstance(lc, LabelledCollection) for lc in args]), \
            'only instances of LabelledCollection allowed'

        first_instances = args[0].instances
        first_type = type(first_instances)
        assert all([type(lc.instances) == first_type for lc in args[1:]]), \
            'not all the collections are of instances of the same type'

        if issparse(first_instances) or isinstance(first_instances, np.ndarray):
            first_ndim = first_instances.ndim
            assert all([lc.instances.ndim == first_ndim for lc in args[1:]]), \
                'not all the ndarrays are of the same dimension'
            if first_ndim > 1:
                first_shape = first_instances.shape[1:]
                assert all([lc.instances.shape[1:] == first_shape for lc in args[1:]]), \
                    'not all the ndarrays are of the same shape'
            if issparse(first_instances):
                instances = vstack([lc.instances for lc in args])
            else:
                instances = np.concatenate([lc.instances for lc in args])
        elif isinstance(first_instances, list):
            instances = list(itertools.chain.from_iterable(lc.instances for lc in args))
        else:
            raise NotImplementedError('unsupported operation for collection types')
        labels = np.concatenate([a.labels, b.labels])
        classes = np.unique(np.concatenate([a.classes_, b.classes_])).sort()
        return LabelledCollection(join_instances, labels, classes=classes)

        labels = np.concatenate([lc.labels for lc in args])
        classes = np.sort(np.unique(labels))
        return LabelledCollection(instances, labels, classes=classes)

    @property
    def Xy(self):
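
A short editorial usage sketch of the new classmethod (hypothetical data; the classes of the result are those observed in the joined labels):

```
import numpy as np
import quapy as qp

a = qp.data.LabelledCollection(np.arange(50), np.random.randint(0, 2, 50))
b = qp.data.LabelledCollection(np.arange(30), np.random.randint(0, 3, 30))

joined = qp.data.LabelledCollection.join(a, b)
print(len(joined))       # 80
print(joined.classes_)   # typically [0 1 2]
```
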
@ -16,7 +16,7 @@ def prediction(

    Uses a quantification model to generate predictions for the samples generated via a specific protocol.
    This function is central to all evaluation processes, and is endowed with an optimization to speed-up the
    prediction of protocols that generate samples from a large collection. The optimization applies to aggregative
    quantifiers only, and to OnLabelledCollection protocols, and comes down to generating the classification
    quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification
    predictions once and for all, and then generating samples over the classification predictions (instead of over
    the raw instances), so that the classifier prediction is never called again. This behaviour is obtained by
    setting `aggr_speedup` to 'auto' or True, and is only carried out if the overall process is convenient in terms

@ -25,7 +25,7 @@ def prediction(

    :param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
    :param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
        :class:`quapy.protocol.OnLabelledCollection`, then the aggregation speed-up can be run. This is the protocol
        :class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the protocol
        in charge of generating the samples for which the model has to issue class prevalence predictions.
    :param aggr_speedup: whether or not to apply the speed-up. Set to "force" for applying it even if the number of
        instances in the original collection on which the protocol acts is larger than the number of instances

@ -90,7 +90,7 @@ def evaluation_report(model: BaseQuantifier,

    :param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
    :param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
        :class:`quapy.protocol.OnLabelledCollection`, then the aggregation speed-up can be run. This is the protocol
        :class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the protocol
        in charge of generating the samples in which the model is evaluated.
    :param error_metrics: a string, or list of strings, representing the name(s) of an error function in `qp.error`
        (e.g., 'mae', the default value), or a callable function, or a list of callable functions, implementing

@ -141,8 +141,8 @@ def evaluate(

    :param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
    :param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
        :class:`quapy.protocol.OnLabelledCollection`, then the aggregation speed-up can be run. This is the protocol
        in charge of generating the samples in which the model is evaluated.
        :class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the
        protocol in charge of generating the samples in which the model is evaluated.
    :param error_metric: a string representing the name(s) of an error function in `qp.error`
        (e.g., 'mae'), or a callable function implementing the error function itself.
    :param aggr_speedup: whether or not to apply the speed-up. Set to "force" for applying it even if the number of
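
An editorial usage sketch of the speed-up switch documented in these docstrings, assuming a PACC quantifier trained on the imdb reviews dataset as in the example script earlier:

```
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 100
training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
model = PACC(LogisticRegression()).fit(training)

# 'auto' applies the speed-up only when it is estimated to be convenient; 'force' always applies it;
# the speed-up requires an aggregative quantifier and an OnLabelledCollectionProtocol such as APP
mae = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae', aggr_speedup='auto')
print(f'MAE={mae:.5f}')
```
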
@ -23,9 +23,9 @@ class GridSearchQ(BaseQuantifier):

    :param model: the quantifier to optimize
    :type model: BaseQuantifier
    :param param_grid: a dictionary with keys the parameter names and values the list of values to explore
    :param protocol:
    :param protocol: a sample generation protocol, an instance of :class:`quapy.protocol.AbstractProtocol`
    :param error: an error function (callable) or a string indicating the name of an error function (valid ones
        are those in qp.error.QUANTIFICATION_ERROR)
        are those in :class:`quapy.error.QUANTIFICATION_ERROR`)
    :param refit: whether or not to refit the model on the whole labelled collection (training+validation) with
        the best chosen hyperparameter combination. Ignored if protocol='gen'
    :param timeout: establishes a timer (in seconds) for each of the hyperparameters configurations being tested.
@ -51,9 +51,10 @@ def binary_diagonal(method_names, true_prevs, estim_prevs, pos_class=1, title=No

    table = {method_name: [true_prev, estim_prev] for method_name, true_prev, estim_prev in order}
    order = [(method_name, *table[method_name]) for method_name in method_order]

    cm = plt.get_cmap('tab20')
    NUM_COLORS = len(method_names)
    ax.set_prop_cycle(color=[cm(1. * i / NUM_COLORS) for i in range(NUM_COLORS)])
    if NUM_COLORS > 10:
        cm = plt.get_cmap('tab20')
        ax.set_prop_cycle(color=[cm(1. * i / NUM_COLORS) for i in range(NUM_COLORS)])
    for method, true_prev, estim_prev in order:
        true_prev = true_prev[:, pos_class]
        estim_prev = estim_prev[:, pos_class]

@ -76,13 +77,12 @@ def binary_diagonal(method_names, true_prevs, estim_prevs, pos_class=1, title=No

    ax.set_xlim(0, 1)

    if legend:
        ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
        # box = ax.get_position()
        # ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
        # ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
        # ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
        ax.legend(loc='lower center',
                  bbox_to_anchor=(1, -0.5),
                  ncol=(len(method_names)+1)//2)
        # ax.legend(loc='lower center',
        #           bbox_to_anchor=(1, -0.5),
        #           ncol=(len(method_names)+1)//2)

    _save_or_show(savepath)
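
An editorial sketch of how the diagonal plot might be produced for two methods; it assumes that qp.evaluation.prediction returns the per-sample true and estimated prevalence arrays for a given protocol, and that binary_diagonal is exposed as qp.plot.binary_diagonal with the signature shown above.

```
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import CC, PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 100
training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

method_names, true_prevs, estim_prevs = [], [], []
for name, method in [('CC', CC(LogisticRegression())), ('PACC', PACC(LogisticRegression()))]:
    method.fit(training)
    true_prev, estim_prev = qp.evaluation.prediction(method, APP(test))
    method_names.append(name)
    true_prevs.append(true_prev)    # shape (n_samples, n_classes), one entry per method
    estim_prevs.append(estim_prev)

qp.plot.binary_diagonal(method_names, true_prevs, estim_prevs, pos_class=1, savepath='diagonal.png')
```
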
@ -127,6 +127,15 @@ class AbstractStochasticSeededProtocol(AbstractProtocol):

            yield self.collator(self.sample(params))

    def collator(self, sample, *args):
        """
        The collator prepares the sample to accommodate the desired output format before returning the output.
        This collator simply returns the sample as it is. Classes inheriting from this abstract class can
        implement their custom collators.

        :param sample: the sample to be returned
        :param args: additional arguments
        :return: the sample adhering to a desired output format (in this case, the sample is returned as it is)
        """
        return sample
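
A standalone sketch (editorial, plain Python rather than QuaPy code) of the pattern implemented above: __call__ materializes each sample and pipes it through collator, which subclasses may override to change the output format.

```
class MiniProtocol:
    """Illustrative stand-in for the abstract protocol: yields samples through a collator hook."""
    def __init__(self, samples):
        self.samples = samples

    def sample(self, params):
        return params  # here each 'params' already is the sample

    def collator(self, sample, *args):
        return sample  # default: return the sample as it is

    def __call__(self):
        for params in self.samples:
            yield self.collator(self.sample(params))

class PairedProtocol(MiniProtocol):
    def collator(self, sample, *args):
        # custom output format: return the sample together with its size
        return sample, len(sample)

for out in PairedProtocol([[1, 2, 3], [4, 5]])():
    print(out)
```
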
@ -1,5 +1,7 @@

import unittest
import numpy as np
from scipy.sparse import csr_matrix

import quapy as qp
@ -16,6 +18,51 @@ class LabelCollectionTestCase(unittest.TestCase):

        self.assertEqual(np.allclose(check_prev, data.prevalence()), True)
        self.assertEqual(len(tr+te), len(data))

    def test_join(self):
        x = np.arange(50)
        y = np.random.randint(2, 5, 50)
        data1 = qp.data.LabelledCollection(x, y)

        x = np.arange(200)
        y = np.random.randint(0, 3, 200)
        data2 = qp.data.LabelledCollection(x, y)

        x = np.arange(100)
        y = np.random.randint(0, 6, 100)
        data3 = qp.data.LabelledCollection(x, y)

        combined = qp.data.LabelledCollection.join(data1, data2, data3)
        self.assertEqual(len(combined), len(data1)+len(data2)+len(data3))
        self.assertEqual(all(combined.classes_ == np.arange(6)), True)

        x = np.random.rand(10, 3)
        y = np.random.randint(0, 1, 10)
        data4 = qp.data.LabelledCollection(x, y)
        with self.assertRaises(Exception):
            combined = qp.data.LabelledCollection.join(data1, data2, data3, data4)

        x = np.random.rand(20, 3)
        y = np.random.randint(0, 1, 20)
        data5 = qp.data.LabelledCollection(x, y)
        combined = qp.data.LabelledCollection.join(data4, data5)
        self.assertEqual(len(combined), len(data4)+len(data5))

        x = np.random.rand(10, 4)
        y = np.random.randint(0, 1, 10)
        data6 = qp.data.LabelledCollection(x, y)
        with self.assertRaises(Exception):
            combined = qp.data.LabelledCollection.join(data4, data5, data6)

        data4.instances = csr_matrix(data4.instances)
        with self.assertRaises(Exception):
            combined = qp.data.LabelledCollection.join(data4, data5)
        data5.instances = csr_matrix(data5.instances)
        combined = qp.data.LabelledCollection.join(data4, data5)
        self.assertEqual(len(combined), len(data4) + len(data5))

        # data2.instances = csr_matrix()


if __name__ == '__main__':
    unittest.main()