
preparing to merge

Alejandro Moreo Fernandez 2023-02-14 17:00:50 +01:00
parent 25a829996e
commit 49fc486c53
27 changed files with 927 additions and 699 deletions

View File

@@ -13,6 +13,7 @@ for facilitating the analysis and interpretation of the experimental results.
### Last updates:
* Version 0.1.7 is released! major changes can be consulted [here](quapy/FCHANGE_LOG.txt).
* A detailed documentation is now available [here](https://hlt-isti.github.io/QuaPy/)
* The developer API documentation is available [here](https://hlt-isti.github.io/QuaPy/build/html/modules.html)
@@ -59,13 +60,14 @@ See the [Wiki](https://github.com/HLT-ISTI/QuaPy/wiki) for detailed examples.
## Features
* Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization,
quantification methods based on structured output learning, HDy, QuaNet, and quantification ensembles).
* Versatile functionality for performing evaluation based on artificial sampling protocols.
quantification methods based on structured output learning, HDy, QuaNet, quantification ensembles, among others).
* Versatile functionality for performing evaluation based on sampling generation protocols (e.g., APP, NPP, etc.).
* Implementation of most commonly used evaluation metrics (e.g., AE, RAE, SE, KLD, NKLD, etc.).
* Datasets frequently used in quantification (textual and numeric), including:
* 32 UCI Machine Learning datasets.
* 11 Twitter quantification-by-sentiment datasets.
* 3 product reviews quantification-by-sentiment datasets.
* 4 tasks from LeQua competition (_new in v0.1.7!_)
* Native support for binary and single-label multiclass quantification scenarios.
* Model selection functionality that minimizes quantification-oriented loss functions.
* Visualization tools for analysing the experimental results.
@@ -80,29 +82,6 @@ quantification methods based on structured output learning, HDy, QuaNet, and qua
* pandas, xlrd
* matplotlib
## SVM-perf with quantification-oriented losses
In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD),
SVM(AE), or SVM(RAE), you have to first download the
[svmperf](http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
package, apply the patch
[svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch), and compile the sources.
The script [prepare_svmperf.sh](prepare_svmperf.sh) takes care of all of this. Simply run:
```
./prepare_svmperf.sh
```
The resulting directory [svm_perf_quantification](./svm_perf_quantification) contains the
patched version of _svmperf_ with quantification-oriented losses.
The [svm-perf-quantification-ext.patch](./svm-perf-quantification-ext.patch) is an extension of the patch made available by
[Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0)
that allows SVMperf to optimize for
the _Q_ measure as proposed by [Barranquero et al. 2015](https://www.sciencedirect.com/science/article/abs/pii/S003132031400291X)
and for the _KLD_ and _NKLD_ measures as proposed by [Esuli et al. 2015](https://dl.acm.org/doi/abs/10.1145/2700406?casa_token=8D2fHsGCVn0AAAAA:ZfThYOvrzWxMGfZYlQW_y8Cagg-o_l6X_PcF09mdETQ4Tu7jK98mxFbGSXp9ZSO14JkUIYuDGFG0).
This patch extends the above one by also allowing SVMperf to optimize for
_AE_ and _RAE_.
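Once compiled, QuaPy must be pointed to the resulting directory before any of these methods is used. A minimal sketch (the environment variable and the *SVMQ* wrapper appear later in this documentation; the path below is a placeholder):

```python
import quapy as qp
from quapy.method.aggregative import SVMQ

# tell QuaPy where the patched SVMperf binaries live (placeholder path)
qp.environ['SVMPERF_HOME'] = './svm_perf_quantification'

# an ELM quantifier optimizing Joachims' SVMperf for the Q measure
model = SVMQ()
```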
## Documentation
@@ -113,6 +92,8 @@ are provided:
* [Datasets](https://github.com/HLT-ISTI/QuaPy/wiki/Datasets)
* [Evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)
* [Protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)
* [Methods](https://github.com/HLT-ISTI/QuaPy/wiki/Methods)
* [SVMperf](https://github.com/HLT-ISTI/QuaPy/wiki/ExplicitLossMinimization)
* [Model Selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
* [Plotting](https://github.com/HLT-ISTI/QuaPy/wiki/Plotting)

View File

@@ -86,7 +86,7 @@ Take a look at the following code:

```python
sample = data.sampling(sample_size, *prev)
print('instances:', sample.instances)
print('labels:', sample.labels)
print('labels:', sample.classes)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```

View File

@@ -20,7 +20,7 @@
<script src="_static/bizstyle.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Quantification Methods" href="Methods.html" />
<link rel="next" title="Protocols" href="Protocols.html" />
<link rel="prev" title="Datasets" href="Datasets.html" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<!--[if lt IE 9]>
@@ -37,7 +37,7 @@
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="Methods.html" title="Quantification Methods"
<a href="Protocols.html" title="Protocols"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="Datasets.html" title="Datasets"
@@ -99,13 +99,13 @@ third argument, e.g.:
Traditionally, this value is set to 1/(2T) in past literature,
with T the sampling size. One could either pass this value
to the function each time, or set QuaPy's environment
variable *SAMPLE_SIZE* once, and ommit this argument
variable *SAMPLE_SIZE* once, and omit this argument
thereafter (recommended);
e.g.:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1"># once for all</span>
<span class="n">true_prev</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">([</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">])</span> <span class="c1"># let&#39;s assume 3 classes</span>
<span class="n">estim_prev</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">([</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">])</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">ae_</span><span class="o">.</span><span class="n">mrae</span><span class="p">(</span><span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">)</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">error</span><span class="o">.</span><span class="n">mrae</span><span class="p">(</span><span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;mrae(</span><span class="si">{</span><span class="n">true_prev</span><span class="si">}</span><span class="s1">, </span><span class="si">{</span><span class="n">estim_prev</span><span class="si">}</span><span class="s1">) = </span><span class="si">{</span><span class="n">error</span><span class="si">:</span><span class="s1">.3f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
</div>
@@ -115,148 +115,93 @@ e.g.:

Finally, it is possible to instantiate QuaPy's quantification
error functions from strings using, e.g.:

```python
error_function = qp.ae_.from_name('mse')
error_function = qp.error.from_name('mse')
error = error_function(true_prev, estim_prev)
```
<section id="evaluation-protocols">
<h2>Evaluation Protocols<a class="headerlink" href="#evaluation-protocols" title="Permalink to this heading"></a></h2>
QuaPy implements the so-called “artificial sampling protocol”,
according to which a test set is used to generate samplings at
desired prevalences of fixed size and covering the full spectrum
of prevalences. This protocol is called “artificial” in contrast
to the “natural prevalence sampling” protocol that,
despite introducing some variability during sampling, approximately
preserves the training class prevalence.

In the artificial sampling protocol, the user specifies the number
of (equally spaced) points to be generated from the interval [0,1].
<p>For example, if n_prevpoints=11 then, for each class, the prevalences
[0., 0.1, 0.2, …, 1.] will be used. This means that, for two classes,
the number of different prevalences will be 11 (since, once the prevalence
of one class is determined, the other one is constrained). For 3 classes,
the number of valid combinations can be obtained as 11 + 10 + … + 1 = 66.
In general, the number of valid combinations that will be produced for a given
value of n_prevpoints can be consulted by invoking
quapy.functional.num_prevalence_combinations, e.g.:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy.functional</span> <span class="k">as</span> <span class="nn">F</span>
<span class="n">n_prevpoints</span> <span class="o">=</span> <span class="mi">21</span>
<span class="n">n_classes</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">num_prevalence_combinations</span><span class="p">(</span><span class="n">n_prevpoints</span><span class="p">,</span> <span class="n">n_classes</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
An *evaluation protocol* is an evaluation procedure that uses
one specific *sample generation protocol* to generate many
samples, typically characterized by widely varying amounts of
*shift* with respect to the original distribution, that are then
used to evaluate the performance of a (trained) quantifier.
These protocols are explained in more detail in a dedicated [entry
in the wiki](Protocols.html). For the time being, let us assume we have already
chosen and instantiated one specific such protocol, that we here
simply call *prot*. Let us also assume our model is called
*quantifier* and that our evaluation measure of choice is
*mae*. The evaluation comes down to:

```python
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
print(f'MAE = {mae:.4f}')
```
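For completeness, a hedged sketch of how *prot* could be instantiated (assuming the APP protocol is exposed as *qp.protocol.APP* and accepts a test collection plus a sample size; see the Protocols entry for the actual signature):

```python
import quapy as qp

# hedged sketch: APP generates samples covering the full prevalence spectrum;
# the parameter names are assumptions to be checked against the API docs
prot = qp.protocol.APP(test, sample_size=100)
```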
In this example, n=1771. Note the last argument, n_repeats, which
indicates the number of samples that will be generated for any
valid combination (typical values are, e.g., 1 for a single sample,
or 10 or higher for computing standard deviations or performing statistical
significance tests).
One can instead work the other way around, i.e., one could set a
maximum budget of evaluations and get the number of prevalence points that
will generate a number of evaluations close to, but not higher than,
the fixed budget. This can be achieved with the function
quapy.functional.get_nprevpoints_approximation, e.g.:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">budget</span> <span class="o">=</span> <span class="mi">5000</span>
<span class="n">n_prevpoints</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">get_nprevpoints_approximation</span><span class="p">(</span><span class="n">budget</span><span class="p">,</span> <span class="n">n_classes</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">num_prevalence_combinations</span><span class="p">(</span><span class="n">n_prevpoints</span><span class="p">,</span> <span class="n">n_classes</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;by setting n_prevpoints=</span><span class="si">{</span><span class="n">n_prevpoints</span><span class="si">}</span><span class="s1"> the number of evaluations for </span><span class="si">{</span><span class="n">n_classes</span><span class="si">}</span><span class="s1"> classes will be </span><span class="si">{</span><span class="n">n</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
It is often desirable to evaluate our system using more than one
evaluation measure. In this case, it is convenient to generate
a *report*. A report in QuaPy is a dataframe accounting for all the
true prevalence values with their corresponding prevalence values
as estimated by the quantifier, along with the error each has given
rise to.
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">report</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">evaluation_report</span><span class="p">(</span><span class="n">quantifier</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">prot</span><span class="p">,</span> <span class="n">error_metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span> <span class="s1">&#39;mrae&#39;</span><span class="p">,</span> <span class="s1">&#39;mkld&#39;</span><span class="p">])</span>
</pre></div>
</div>
that will print:

```
by setting n_prevpoints=30 the number of evaluations for 4 classes will be 4960
```
The cost of evaluation will depend on the values of *n_prevpoints*, *n_classes*,
and *n_repeats*. Since it might sometimes be cumbersome to control the overall
cost of an experiment having to do with the number of combinations that
will be generated for a particular setting of these arguments (particularly
when *n_classes>2*), evaluation functions
typically allow the user to rather specify an *evaluation budget*, i.e., a maximum
number of samplings to generate. By specifying this argument, one can avoid
specifying *n_prevpoints*, and the value that yields a number of evaluations
closest to the budget, without surpassing it, will be set automatically.
The following script shows a full example in which a PACC model relying
on a Logistic Regressor classifier is
tested on the *kindle* dataset by means of the artificial prevalence
sampling protocol on samples of size 500, in terms of various
evaluation metrics.
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="kn">import</span> <span class="nn">quapy.functional</span> <span class="k">as</span> <span class="nn">F</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<p>From a pandas dataframe, it is straightforward to visualize all the results,
and compute the averaged values, e.g.:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">&#39;display.expand_frame_repr&#39;</span><span class="p">,</span> <span class="kc">False</span><span class="p">)</span>
<span class="n">report</span><span class="p">[</span><span class="s1">&#39;estim-prev&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">report</span><span class="p">[</span><span class="s1">&#39;estim-prev&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">report</span><span class="p">)</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;kindle&#39;</span><span class="p">)</span>
<span class="n">qp</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">preprocessing</span><span class="o">.</span><span class="n">text2tfidf</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">training</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">test</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span>
<span class="n">pacc</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">method</span><span class="o">.</span><span class="n">aggregative</span><span class="o">.</span><span class="n">PACC</span><span class="p">(</span><span class="n">lr</span><span class="p">)</span>
<span class="n">pacc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_report</span><span class="p">(</span>
<span class="n">pacc</span><span class="p">,</span> <span class="c1"># the quantification method</span>
<span class="n">test</span><span class="p">,</span> <span class="c1"># the test set on which the method will be evaluated</span>
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span> <span class="c1">#indicates the size of samples to be drawn</span>
<span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="c1"># how many prevalence points will be extracted from the interval [0, 1] for each category</span>
<span class="n">n_repetitions</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># number of times each prevalence will be used to generate a test sample</span>
<span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># indicates the number of parallel workers (-1 indicates, as in sklearn, all CPUs)</span>
<span class="n">random_seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="c1"># setting a random seed allows to replicate the test samples across runs</span>
<span class="n">error_metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span> <span class="s1">&#39;mrae&#39;</span><span class="p">,</span> <span class="s1">&#39;mkld&#39;</span><span class="p">],</span> <span class="c1"># specify the evaluation metrics</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># set to True to show some standard-line outputs</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Averaged values:&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">report</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</pre></div>
</div>
The resulting report is a pandas dataframe that can be directly printed.
Here, we set some display options from pandas just to make the output clearer;
note also that the estimated prevalences are shown as strings using the
strprev function, which simply converts a prevalence into a
string representing it, with a fixed decimal precision (default 3):
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">&#39;display.expand_frame_repr&#39;</span><span class="p">,</span> <span class="kc">False</span><span class="p">)</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s2">&quot;precision&quot;</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">&#39;estim-prev&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;estim-prev&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
</div>
The output should look like:

```
     true-prev      estim-prev    mae    mrae       mkld
0   [0.0, 1.0]  [0.000, 1.000]  0.000   0.000  0.000e+00
1   [0.1, 0.9]  [0.091, 0.909]  0.009   0.048  4.426e-04
2   [0.2, 0.8]  [0.163, 0.837]  0.037   0.114  4.633e-03
3   [0.3, 0.7]  [0.283, 0.717]  0.017   0.041  7.383e-04
4   [0.4, 0.6]  [0.366, 0.634]  0.034   0.070  2.412e-03
5   [0.5, 0.5]  [0.459, 0.541]  0.041   0.082  3.387e-03
6   [0.6, 0.4]  [0.565, 0.435]  0.035   0.073  2.535e-03
7   [0.7, 0.3]  [0.654, 0.346]  0.046   0.108  4.701e-03
8   [0.8, 0.2]  [0.725, 0.275]  0.075   0.235  1.515e-02
9   [0.9, 0.1]  [0.858, 0.142]  0.042   0.229  7.740e-03
10  [1.0, 0.0]  [0.945, 0.055]  0.055  27.357  5.219e-02
```
One can get the averaged scores using standard pandas
functions, i.e.:

```python
print(df.mean())
```

will produce the following output:

```
true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
```
This will produce an output like:

```
           true-prev      estim-prev       mae      mrae      mkld
0     [0.308, 0.692]  [0.314, 0.686]  0.005649  0.013182  0.000074
1     [0.896, 0.104]  [0.909, 0.091]  0.013145  0.069323  0.000985
2     [0.848, 0.152]  [0.809, 0.191]  0.039063  0.149806  0.005175
3     [0.016, 0.984]  [0.033, 0.967]  0.017236  0.487529  0.005298
4     [0.728, 0.272]  [0.751, 0.249]  0.022769  0.057146  0.001350
...              ...             ...       ...       ...       ...
4995    [0.72, 0.28]  [0.698, 0.302]  0.021752  0.053631  0.001133
4996  [0.868, 0.132]  [0.888, 0.112]  0.020490  0.088230  0.001985
4997  [0.292, 0.708]  [0.298, 0.702]  0.006149  0.014788  0.000090
4998    [0.24, 0.76]  [0.220, 0.780]  0.019950  0.054309  0.001127
4999  [0.948, 0.052]  [0.965, 0.035]  0.016941  0.165776  0.003538

[5000 rows x 5 columns]
Averaged values:
mae     0.023588
mrae    0.108779
mkld    0.003631
dtype: float64

Process finished with exit code 0
```
Other evaluation functions include:

* *artificial_sampling_eval*: computes the evaluation for a
given evaluation metric, returning the average instead of a dataframe.
* *artificial_sampling_prediction*: returns two np.arrays containing the
true prevalences and the estimated prevalences.

See the documentation for further details.
Alternatively, we can simply generate all the predictions by:

```python
true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
```
All the evaluation functions implement specific optimizations for speeding up
the evaluation of aggregative quantifiers (i.e., of instances of *AggregativeQuantifier*).
The optimization comes down to generating classification predictions (either crisp or soft)
only once for the entire test set, and then applying the sampling procedure to the
predictions, instead of generating samples of instances and then computing the
classification predictions every time. This is only possible when the protocol
is an instance of *OnLabelledCollectionProtocol*. The optimization is only
carried out when the number of classification predictions thus generated would be
smaller than the number of predictions required for the entire protocol; e.g.,
if the original dataset contains 1M instances, but the protocol is such that it would
at most generate 20 samples of 100 instances, then it would be preferable to postpone the
classification for each sample. This behaviour is indicated by setting
*aggr_speedup="auto"*. Conversely, when indicating *aggr_speedup="force"* QuaPy will
precompute all the predictions irrespectively of the number of instances and number of samples.
Finally, this can be deactivated by setting *aggr_speedup=False*. Note that this optimization
is not only applied for the final evaluation, but also for the internal evaluations carried
out during *model selection*. Since these are typically many, the heuristic can help reduce the
execution time a lot.
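As a sketch of how this looks in code (the *aggr_speedup* values are those listed above; the call mirrors the *evaluate* example earlier on this page):

```python
# 'auto' lets QuaPy apply the speedup heuristic described above;
# 'force' always precomputes the predictions; False disables the optimization
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='auto')
```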
@@ -285,8 +230,8 @@ true prevalences and the estimated prevalences.
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="Methods.html"
title="next chapter">Quantification Methods</a></p>
<p class="topless"><a href="Protocols.html"
title="next chapter">Protocols</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
@@ -319,7 +264,7 @@ true prevalences and the estimated prevalences.
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="Methods.html" title="Quantification Methods"
<a href="Protocols.html" title="Protocols"
>next</a> |</li>
<li class="right" >
<a href="Datasets.html" title="Datasets"

View File

@@ -21,7 +21,7 @@
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Model Selection" href="Model-Selection.html" />
<link rel="prev" title="Evaluation" href="Evaluation.html" />
<link rel="prev" title="Protocols" href="Protocols.html" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<!--[if lt IE 9]>
<script src="_static/css3-mediaqueries.js"></script>
@@ -40,7 +40,7 @@
<a href="Model-Selection.html" title="Model Selection"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="Evaluation.html" title="Evaluation"
<a href="Protocols.html" title="Protocols"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Quantification Methods</a></li>
@@ -68,12 +68,6 @@ and implement some abstract methods:

```python
@abstractmethod
def quantify(self, instances): ...

@abstractmethod
def set_params(self, **parameters): ...

@abstractmethod
def get_params(self, deep=True): ...
```
The meaning of those functions should be familiar to those

@@ -85,10 +79,10 @@ scikit-learn structure has not been adopted *as is* in QuaPy responds

the fact that scikit-learn's *predict* function is expected to return
one output for each input element (e.g., a predicted label for each
instance in a sample), while in quantification the output for a sample
is one single array of class prevalences), while functions *set_params*
and *get_params* allow a
[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
to automate the process of hyperparameter search.
is one single array of class prevalences).
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
to simplify the use of *set_params* and *get_params* used by the
[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection).
<section id="aggregative-methods">
<h2>Aggregative Methods<a class="headerlink" href="#aggregative-methods" title="Permalink to this heading"></a></h2>
<p>All quantification methods are implemented as part of the
@@ -106,12 +100,12 @@ The methods that any *aggregative* quantifier must implement are:

individual predictions of a classifier. Indeed, a default implementation
of *BaseQuantifier.quantify* is already provided, which looks like:

```python
def quantify(self, instances):
    classif_predictions = self.preclassify(instances)
    classif_predictions = self.classify(instances)
    return self.aggregate(classif_predictions)
```
Aggregative quantifiers are expected to maintain a classifier (which is
accessed through the *@property* *learner*). This classifier is
accessed through the *@property* *classifier*). This classifier is
given as input to the quantifier, and can be already fit
on external data (in which case, the *fit_learner* argument should
be set to False), or be fit by the quantifier's fit method (default).
@@ -121,12 +115,8 @@ aggregative methods, that should inherit from the abstract class

The particularity of *probabilistic* aggregative methods (w.r.t.
non-probabilistic ones) is that the default quantifier is defined
in terms of the posterior probabilities returned by a probabilistic
classifier, and not by the crisp decisions of a hard classifier; i.e.:

```python
def quantify(self, instances):
    classif_posteriors = self.posterior_probabilities(instances)
    return self.aggregate(classif_posteriors)
```

classifier, and not by the crisp decisions of a hard classifier.
In any case, the interface *classify(instances)* remains unchanged.
One advantage of *aggregative* methods (either probabilistic or not)
is that the evaluation according to any sampling procedure (e.g.,
the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation))
@@ -153,9 +143,7 @@ with an SVM as the classifier:

```python
import quapy.functional as F
from sklearn.svm import LinearSVC

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
training = dataset.training
test = dataset.test
training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
# instantiate a classifier learner, in this case an SVM
svm = LinearSVC()
```
@@ -199,7 +187,7 @@ e.g.:

```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.learner)
print('classifier:', model.classifier)
```
In this case, QuaPy will print:
@@ -244,13 +232,21 @@ experiments we have carried out.

```python
estim_prevalence = model.quantify(dataset.test.instances)
```
*New in v0.1.7*: EMQ now accepts two new parameters in the construction method, namely
*exact_train_prev*, which makes it possible to use the true training prevalence as the departing
prevalence estimation (default behaviour), or instead an approximation of it as
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
(by setting *exact_train_prev=False*).
The other parameter is *recalib*, which indicates a calibration method, among those
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
including Bias-Corrected Temperature Scaling, Vector Scaling, etc.
See the API documentation for further details.
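For instance, a hedged sketch using the parameters just described (the string 'bcts' is assumed to select Bias-Corrected Temperature Scaling; check the API documentation for the accepted values, and note that *dataset* refers to the collection loaded earlier):

```python
from sklearn.linear_model import LogisticRegression

model = qp.method.aggregative.EMQ(
    LogisticRegression(),
    exact_train_prev=False,  # use Alexandari et al.'s approximation of the training prevalence
    recalib='bcts'           # assumed identifier for Bias-Corrected Temperature Scaling
)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```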
### Hellinger Distance y (HDy)

The method HDy is described in:

*Implementation of the method based on the Hellinger Distance y (HDy) proposed by
González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146-164.*

Implementation of the method based on the Hellinger Distance y (HDy) proposed by
[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146-164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)

It is implemented in *qp.method.aggregative.HDy* (also accessible
through the alias *qp.method.aggregative.HellingerDistanceY*).
This method works with a probabilistic classifier (hard classifiers
@@ -277,30 +273,48 @@ provided in QuaPy accepts only binary datasets.

```python
estim_prevalence = model.quantify(dataset.test.instances)
```
*New in v0.1.7:* QuaPy now provides an implementation of the generalized
“Distribution Matching” approaches for multiclass, inspired by the framework
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
a variant of HDy for multiclass quantification as follows:

```python
multiclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
```
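As with any other quantifier, the instance can then be trained and used following the same pattern as the examples above:

```python
multiclassHDy.fit(dataset.training)
estim_prevalence = multiclassHDy.quantify(dataset.test.instances)
```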
*New in v0.1.7:* QuaPy now provides an implementation of the “DyS”
framework proposed by [Maletzke et al. (2019)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
and the “SMM” method proposed by [Hassan et al. (2020)](https://ieeexplore.ieee.org/document/9260028)
(thanks to *Pablo González* for the contributions!)
<section id="threshold-optimization-methods">
<h3>Threshold Optimization methods<a class="headerlink" href="#threshold-optimization-methods" title="Permalink to this heading"></a></h3>
<p><em>New in v0.1.7:</em> QuaPy now implements Formans threshold optimization methods;
see, e.g., <a class="reference external" href="https://dl.acm.org/doi/abs/10.1145/1150402.1150423">(Forman 2006)</a>
and <a class="reference external" href="https://link.springer.com/article/10.1007/s10618-008-0097-y">(Forman 2008)</a>.
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.</p>
</section>
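A minimal sketch, assuming these methods are exposed under *qp.method.aggregative* with the names listed above (check the API documentation for the exact class names and signatures):

```python
from sklearn.linear_model import LogisticRegression

# hedged sketch: MS is assumed to be the Median Sweep class; T50, MAX, X,
# and MS2 would be instantiated analogously
model = qp.method.aggregative.MS(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```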
<section id="explicit-loss-minimization">
<h3>Explicit Loss Minimization<a class="headerlink" href="#explicit-loss-minimization" title="Permalink to this heading"></a></h3>
<p>The Explicit Loss Minimization (ELM) represent a family of methods
based on structured output learning, i.e., quantifiers relying on
classifiers that have been optimized targeting a
quantification-oriented evaluation measure.</p>
<p>In QuaPy, the following methods, all relying on Joachims
<a class="reference external" href="https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html">SVMperf</a>
implementation, are available in <em>qp.method.aggregative</em>:</p>
quantification-oriented evaluation measure.
The original methods are implemented in QuaPy as classify &amp; count (CC)
quantifiers that use Joachims <a class="reference external" href="https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html">SVMperf</a>
as the underlying classifier, properly set to optimize for the desired loss.</p>
<p>In QuaPy, this can be more achieved by calling the functions:</p>
* SVMQ (SVM-Q) is a quantification method optimizing the metric *Q* defined
in *Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591-604.*
* SVMKLD (SVM for Kullback-Leibler Divergence) proposed in *Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27.*
* *newSVMQ*: returns the quantification method called SVM(Q), which optimizes for the metric *Q* defined
in [*Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591-604.*](https://www.sciencedirect.com/science/article/pii/S003132031400291X)
* *newSVMKLD* and *newSVMNKLD*: return the quantification methods called SVM(KLD) and SVM(nKLD), standing for
Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in [*Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27.*](https://dl.acm.org/doi/abs/10.1145/2700406)
* SVMNKLD (SVM for Normalized Kullback-Leibler Divergence) proposed in *Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27.*
* SVMAE (SVM for Mean Absolute Error)
* SVMRAE (SVM for Mean Relative Absolute Error)
* *newSVMAE* and *newSVMRAE*: return quantification methods called SVM(AE) and SVM(RAE) that optimize for the (Mean) Absolute Error and for the
(Mean) Relative Absolute Error, as first used by
[*Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23.*](https://arxiv.org/abs/2011.02552)
<p>The last two methods (SVM(AE) and SVM(RAE)) have been implemented in
QuaPy in order to make available ELM variants for what nowadays
are considered the most well-behaved evaluation metrics in quantification.</p>
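<p>A minimal usage sketch (assuming the default arguments of these factory functions
suffice, and that SVMperf has been prepared as described next):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
import quapy as qp
from quapy.method.aggregative import newSVMKLD

# point QuaPy to the compiled SVMperf package (see below)
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

dataset = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)

model = newSVMKLD()  # a CC quantifier whose classifier optimizes for KLD
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
</pre></div>
</div>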
<p>In order to make these models work, you would need to run the script
@ -330,11 +344,15 @@ currently supports only binary classification.
ELM variants (any binary quantifier in general) can be extended
to operate in single-label scenarios trivially by adopting a
“one-vs-all” strategy (as, e.g., in
<a class="reference external" href="https://link.springer.com/article/10.1007/s13278-016-0327-z"><em>Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1-22</em></a>).
In QuaPy this is possible by using the <em>OneVsAll</em> class.</p>
<p>There are two ways of instantiating this class: <em>OneVsAllGeneric</em>, which works for
any quantifier, and <em>OneVsAllAggregative</em>, which is optimized for aggregative quantifiers.
In general, you can simply use the <em>getOneVsAll</em> function and QuaPy will choose
the more convenient of the two.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">SVMQ</span><span class="p">,</span> <span class="n">OneVsAll</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">SVMQ</span>
<span class="c1"># load a single-label dataset (this one contains 3 classes)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_twitter</span><span class="p">(</span><span class="s1">&#39;hcr&#39;</span><span class="p">,</span> <span class="n">pickle</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
@ -342,11 +360,13 @@ In QuaPy this is possible by using the <em>OneVsAll</em> class:</p>
<span class="c1"># let qp know where svmperf is</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SVMPERF_HOME&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;../svm_perf_quantification&#39;</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">OneVsAll</span><span class="p">(</span><span class="n">SVMQ</span><span class="p">(),</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># run them on parallel</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">getOneVsAll</span><span class="p">(</span><span class="n">SVMQ</span><span class="p">(),</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># run them on parallel</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="n">estim_prevalence</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">quantify</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
</pre></div>
</div>
<p>Check the examples <em><span class="xref myst">explicit_loss_minimization.py</span></em>
and <span class="xref myst">one_vs_all.py</span> for more details.</p>
</section>
</section>
<section id="meta-models">
@ -360,12 +380,12 @@ groups).
<h3>Ensembles<a class="headerlink" href="#ensembles" title="Permalink to this heading"></a></h3>
<p>QuaPy implements (some of) the variants proposed in:</p>
<ul class="simple">
<li><p><a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1566253516300628"><em>Pérez-Gállego, P., Quevedo, J. R., &amp; del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100.</em></a></p></li>
<li><p><a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1566253517303652"><em>Pérez-Gállego, P., Castano, A., Quevedo, J. R., &amp; del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15.</em></a></p></li>
</ul>
<p>The following code shows how to instantiate an Ensemble of 30 <em>Adjusted Classify &amp; Count</em> (ACC)
quantifiers operating with a <em>Logistic Regressor</em> (LR) as the base classifier, and using the
@ -398,10 +418,10 @@ wiki if you want to optimize the hyperparameters of ensemble for classification
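<p>A minimal sketch of such an instantiation (assuming the <em>Ensemble</em> class of
<em>qp.method.meta</em> with the <em>size</em>, <em>policy</em>, and <em>n_jobs</em> arguments,
and an illustrative choice of dataset):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
import quapy as qp
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_UCIDataset('haberman')

# an ensemble of 30 ACC quantifiers, aggregated by averaging (the 'ave' policy)
model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
</pre></div>
</div>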
<section id="the-quanet-neural-network">
<h3>The QuaNet neural network<a class="headerlink" href="#the-quanet-neural-network" title="Permalink to this heading"></a></h3>
<p>QuaPy offers an implementation of QuaNet, a deep learning model presented in:</p>
<p><a class="reference external" href="https://dl.acm.org/doi/abs/10.1145/3269206.3269287"><em>Esuli, A., Moreo, A., &amp; Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778).</em></a></p>
<p>This model requires <em>torch</em> to be installed.
QuaNet also requires a classifier that can provide embedded representations
of the inputs.
@ -423,7 +443,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
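# (sketch of the setup elided above, assuming the kindle dataset and the
# CNN-based classifier from quapy.classification.neural)
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

qp.environ['SAMPLE_SIZE'] = 100

# load the kindle dataset as text, and convert words to numerical indexes
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)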
<span class="n">learner</span> <span class="o">=</span> <span class="n">NeuralClassifierTrainer</span><span class="p">(</span><span class="n">cnn</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s1">&#39;cuda&#39;</span><span class="p">)</span>
<span class="c1"># train QuaNet</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">QuaNet</span><span class="p">(</span><span class="n">learner</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span> <span class="n">device</span><span class="o">=</span><span class="s1">&#39;cuda&#39;</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">QuaNet</span><span class="p">(</span><span class="n">learner</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s1">&#39;cuda&#39;</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="n">estim_prevalence</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">quantify</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
</pre></div>
@ -447,6 +467,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
<li><a class="reference internal" href="#the-classify-count-variants">The Classify &amp; Count variants</a></li>
<li><a class="reference internal" href="#expectation-maximization-emq">Expectation Maximization (EMQ)</a></li>
<li><a class="reference internal" href="#hellinger-distance-y-hdy">Hellinger Distance y (HDy)</a></li>
<li><a class="reference internal" href="#threshold-optimization-methods">Threshold Optimization methods</a></li>
<li><a class="reference internal" href="#explicit-loss-minimization">Explicit Loss Minimization</a></li>
</ul>
</li>
@ -462,8 +483,8 @@ In the following example, we show an instantiation of QuaNet that instead uses C
</div>
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="Evaluation.html"
title="previous chapter">Evaluation</a></p>
<p class="topless"><a href="Protocols.html"
title="previous chapter">Protocols</a></p>
</div>
<div>
<h4>Next topic</h4>
@ -504,7 +525,7 @@ In the following example, we show an instantiation of QuaNet that instead uses C
<a href="Model-Selection.html" title="Model Selection"
>next</a> |</li>
<li class="right" >
<a href="Evaluation.html" title="Evaluation"
<a href="Protocols.html" title="Protocols"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Quantification Methods</a></li>

@ -74,81 +74,91 @@ specifically designed for the task of quantification.</p>
classification, and thus the model selection strategies
customarily adopted in classification have simply been
applied to quantification (see the next section).
It has been argued in <a class="reference external" href="https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6">Moreo, Alejandro, and Fabrizio Sebastiani.
“Re-Assessing the ‘Classify and Count’ Quantification Method.”
ECIR 2021: Advances in Information Retrieval, pp. 75-91.</a>
that specific model selection strategies should
be adopted for quantification. That is, model selection
strategies for quantification should target
quantification-oriented losses and be tested in a variety
of scenarios exhibiting different degrees of prior
probability shift.</p>
<p>The class <em>qp.model_selection.GridSearchQ</em> implements a grid-search exploration over the space of
hyper-parameter combinations that <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluates</a>
each combination of hyper-parameters by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in <em>qp.error</em>) and according to a
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">sampling generation protocol</a>.</p>
<p>The following is an example (also included in the examples folder) of model selection for quantification:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span>
<span class="kn">from</span> <span class="nn">quapy.protocol</span> <span class="kn">import</span> <span class="n">APP</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">DistributionMatching</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># set a seed to replicate runs</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd">In this example, we show how to perform model selection on a DistributionMatching quantifier.</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;hp&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">())</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;N_JOBS&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># explore hyper-parameters in parallel</span>
<span class="n">training</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;imdb&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
<span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span>
<span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span>
<span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span>
<span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span>
<span class="c1"># Every combination of hyper-parameters will be evaluated by confronting the</span>
<span class="c1"># quantifier thus configured against a series of samples generated by means</span>
<span class="c1"># of a sample generation protocol. For this example, we will use the</span>
<span class="c1"># artificial-prevalence protocol (APP), that generates samples with prevalence</span>
<span class="c1"># values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).</span>
<span class="c1"># We devote 30% of the dataset for this exploration.</span>
<span class="n">training</span><span class="p">,</span> <span class="n">validation</span> <span class="o">=</span> <span class="n">training</span><span class="o">.</span><span class="n">split_stratified</span><span class="p">(</span><span class="n">train_prop</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">protocol</span> <span class="o">=</span> <span class="n">APP</span><span class="p">(</span><span class="n">validation</span><span class="p">)</span>
<span class="c1"># We will explore a classification-dependent hyper-parameter (e.g., the &#39;C&#39;</span>
<span class="c1"># hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter</span>
<span class="c1"># (e.g., the number of bins in a DistributionMatching quantifier.</span>
<span class="c1"># Classifier-dependent hyper-parameters have to be marked with a prefix &quot;classifier__&quot;</span>
<span class="c1"># in order to let the quantifier know this hyper-parameter belongs to its underlying</span>
<span class="c1"># classifier.</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">),</span>
<span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span>
<span class="p">}</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span>
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
<span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
<span class="n">error</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span>
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set</span>
<span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
<span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
<span class="n">param_grid</span><span class="o">=</span><span class="n">param_grid</span><span class="p">,</span>
<span class="n">protocol</span><span class="o">=</span><span class="n">protocol</span><span class="p">,</span>
<span class="n">error</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span><span class="p">,</span> <span class="c1"># the error to optimize is the MAE (a quantification-oriented loss)</span>
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set once done</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span>
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>
<span class="c1"># evaluation in terms of MAE</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span>
<span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span>
<span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span>
<span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">error_metric</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span>
<span class="p">)</span>
<span class="c1"># we use the same evaluation protocol (APP) on the test set</span>
<span class="n">mae_score</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">),</span> <span class="n">error_metric</span><span class="o">=</span><span class="s1">&#39;mae&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;MAE=</span><span class="si">{</span><span class="n">mae_score</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>In this example, the system outputs:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24987
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.0001, &#39;class_weight&#39;: None} got mae score 0.48135
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 0.001, &#39;class_weight&#39;: &#39;balanced&#39;} got mae score 0.24866
[...]
[GridSearchQ]: checking hyperparams={&#39;C&#39;: 100000.0, &#39;class_weight&#39;: None} got mae score 0.43676
[GridSearchQ]: optimization finished: best params {&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;} (score=0.19982)
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={&#39;C&#39;: 0.1, &#39;class_weight&#39;: &#39;balanced&#39;}
1010 evaluations will be performed for each combination of hyper-parameters
[artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00&lt;00:00, 5005.54it/s]
MAE=0.20342
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">starting</span> <span class="n">model</span> <span class="n">selection</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_jobs</span> <span class="o">=-</span><span class="mi">1</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">64</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04021</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.1356</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04286</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2139</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">16</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04888</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2491</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">0.001</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">8</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.05163</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.5372</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="o">...</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">1000.0</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.02445</span> <span class="p">[</span><span class="n">took</span> <span class="mf">2.9056</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">optimization</span> <span class="n">finished</span><span class="p">:</span> <span class="n">best</span> <span class="n">params</span> <span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="mf">0.02234</span><span class="p">)</span> <span class="p">[</span><span class="n">took</span> <span class="mf">7.3114</span><span class="n">s</span><span class="p">]</span>
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">refitting</span> <span class="n">on</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">development</span> <span class="nb">set</span>
<span class="n">model</span> <span class="n">selection</span> <span class="n">ended</span><span class="p">:</span> <span class="n">best</span> <span class="n">hyper</span><span class="o">-</span><span class="n">parameters</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;classifier__C&#39;</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">&#39;nbins&#39;</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span>
<span class="n">MAE</span><span class="o">=</span><span class="mf">0.03102</span>
</pre></div>
</div>
<p>The parameter <em>val_split</em> can alternatively be used to indicate
@ -172,30 +182,13 @@ The following code illustrates how to do that:</p>
<span class="n">LogisticRegression</span><span class="p">(),</span>
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;C&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">&#39;class_weight&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&#39;balanced&#39;</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
<span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
</pre></div>
</div>
<p>However, this is conceptually flawed, since the model should be
optimized for the task at hand (quantification), and not for a surrogate task (classification),
i.e., the model should be requested to deliver low quantification errors, rather
than low classification errors.</p>
</section>
</section>

@ -94,7 +94,7 @@ quantification methods across different scenarios showcasing
the accuracy of the quantifier in predicting class prevalences
for a wide range of prior distributions. This can easily be
achieved by means of the
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">artificial sampling protocol</a>
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">artificial sampling protocol</a>
that is implemented in QuaPy.</p>
<p>The following code shows how to perform one simple experiment
in which the 4 <em>CC-variants</em>, all equipped with a linear SVM, are
@ -103,6 +103,7 @@ tested across the entire spectrum of class priors (taking 21 splits
of the interval [0,1], i.e., using prevalence steps of 0.05, and
generating 100 random samples at each prevalence).</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="kn">from</span> <span class="nn">protocol</span> <span class="kn">import</span> <span class="n">APP</span>
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">CC</span><span class="p">,</span> <span class="n">ACC</span><span class="p">,</span> <span class="n">PCC</span><span class="p">,</span> <span class="n">PACC</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
@ -111,28 +112,26 @@ generating 100 random samples at each prevalence).</p>
<span class="k">def</span> <span class="nf">gen_data</span><span class="p">():</span>
<span class="k">def</span> <span class="nf">base_classifier</span><span class="p">():</span>
<span class="k">return</span> <span class="n">LinearSVC</span><span class="p">()</span>
<span class="k">return</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">class_weight</span><span class="o">=</span><span class="s1">&#39;balanced&#39;</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">models</span><span class="p">():</span>
<span class="k">yield</span> <span class="n">CC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="n">ACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="n">PCC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="n">PACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="s1">&#39;CC&#39;</span><span class="p">,</span> <span class="n">CC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="s1">&#39;ACC&#39;</span><span class="p">,</span> <span class="n">ACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="s1">&#39;PCC&#39;</span><span class="p">,</span> <span class="n">PCC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="k">yield</span> <span class="s1">&#39;PACC&#39;</span><span class="p">,</span> <span class="n">PACC</span><span class="p">(</span><span class="n">base_classifier</span><span class="p">())</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;kindle&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;kindle&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
<span class="n">method_names</span><span class="p">,</span> <span class="n">true_prevs</span><span class="p">,</span> <span class="n">estim_prevs</span><span class="p">,</span> <span class="n">tr_prevs</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">():</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_prediction</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">test</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span> <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">21</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">():</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">prediction</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="n">repeats</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="n">method_names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="vm">__class__</span><span class="o">.</span><span class="vm">__name__</span><span class="p">)</span>
<span class="n">method_names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">method_name</span><span class="p">)</span>
<span class="n">true_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">true_prev</span><span class="p">)</span>
<span class="n">estim_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">estim_prev</span><span class="p">)</span>
<span class="n">tr_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="o">.</span><span class="n">prevalence</span><span class="p">())</span>
<span class="n">tr_prevs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">prevalence</span><span class="p">())</span>
<span class="k">return</span> <span class="n">method_names</span><span class="p">,</span> <span class="n">true_prevs</span><span class="p">,</span> <span class="n">estim_prevs</span><span class="p">,</span> <span class="n">tr_prevs</span>
@ -199,21 +198,19 @@ IMDb dataset, and generate the bias plot again.
This example can be run by rewriting the <em>gen_data()</em> function
like this:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">gen_data</span><span class="p">():</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;imdb&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">&#39;imdb&#39;</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">CC</span><span class="p">(</span><span class="n">LinearSVC</span><span class="p">())</span>
<span class="n">method_data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">training_prevalence</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">,</span> <span class="mi">9</span><span class="p">):</span>
<span class="n">training_size</span> <span class="o">=</span> <span class="mi">5000</span>
<span class="c1"># since the problem is binary, it suffices to specify the negative prevalence (the positive is constrained)</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">training</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">training_size</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">training_prevalence</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_prediction</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;SAMPLE_SIZE&#39;</span><span class="p">],</span> <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">21</span>
<span class="p">)</span>
<span class="c1"># method names can contain Latex syntax</span>
<span class="n">method_name</span> <span class="o">=</span> <span class="s1">&#39;CC$_{&#39;</span> <span class="o">+</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="mi">100</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">training_prevalence</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span> <span class="o">+</span> <span class="s1">&#39;\%}$&#39;</span>
<span class="n">method_data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">method_name</span><span class="p">,</span> <span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">,</span> <span class="n">training</span><span class="o">.</span><span class="n">prevalence</span><span class="p">()))</span>
<span class="c1"># since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained</span>
<span class="n">train_sample</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">training_size</span><span class="p">,</span> <span class="mi">1</span><span class="o">-</span><span class="n">training_prevalence</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_sample</span><span class="p">)</span>
<span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">prediction</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="n">repeats</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="n">method_name</span> <span class="o">=</span> <span class="s1">&#39;CC$_{&#39;</span><span class="o">+</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="mi">100</span><span class="o">*</span><span class="n">training_prevalence</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span> <span class="o">+</span> <span class="s1">&#39;\%}$&#39;</span>
<span class="n">method_data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">method_name</span><span class="p">,</span> <span class="n">true_prev</span><span class="p">,</span> <span class="n">estim_prev</span><span class="p">,</span> <span class="n">train_sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">()))</span>
<span class="k">return</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">method_data</span><span class="p">)</span>
</pre></div>
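<p>A minimal sketch of rendering the bias plot from these data (assuming the
<em>binary_bias_global</em> function of <em>qp.plot</em>, and an illustrative output path):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
method_names, true_prevs, estim_prevs, tr_prevs = gen_data()
qp.plot.binary_bias_global(method_names, true_prevs, estim_prevs, savepath='./plots/bin_bias_cc.png')
</pre></div>
</div>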

@ -31,13 +31,14 @@ Output the class prevalences (showing 2 digit precision):
```
One can easily produce new samples at desired class prevalences:
```python
sample_size = 10
prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)
print('instances:', sample.instances)
print('labels:', sample.classes)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```
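To draw the exact same sample at a later time, one can generate and store the
sampling index instead (a sketch, assuming the _sampling_index_ and
_sampling_from_index_ methods of labelled collections):
```python
index = data.sampling_index(sample_size, *prev)
same_sample = data.sampling_from_index(index)
```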

@ -50,7 +50,7 @@ indicating the value for the smoothing parameter epsilon.
Traditionally, this value is set to 1/(2T) in past literature,
with T the sampling size. One could either pass this value
to the function each time, or set QuaPy's environment
variable _SAMPLE_SIZE_ once, and omit this argument
thereafter (recommended);
e.g.:
@ -58,7 +58,7 @@ e.g.:
qp.environ['SAMPLE_SIZE'] = 100 # once for all
true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])
error = qp.error.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
```
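The same computation can be carried out by passing the smoothing parameter
explicitly (a sketch, assuming these error functions accept an _eps_ keyword):
```python
error = qp.error.mrae(true_prev, estim_prev, eps=1./(2.*100))  # equivalent to setting qp.environ['SAMPLE_SIZE'] = 100
```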
@ -71,162 +71,99 @@ Finally, it is possible to instantiate QuaPy's quantification
error functions from strings using, e.g.:
```python
error_function = qp.error.from_name('mse')
error = error_function(true_prev, estim_prev)
```
## Evaluation Protocols
An _evaluation protocol_ is an evaluation procedure that uses
one specific _sample generation protocol_ to generate many
samples, typically characterized by widely varying amounts of
_shift_ with respect to the original distribution, that are then
used to evaluate the performance of a (trained) quantifier.
These protocols are explained in more detail in a dedicated [entry
in the wiki](Protocols.md). For the time being, let us assume we have already
chosen and instantiated one specific such protocol, that we here
simply call _prot_. Let us also assume our model is called
_quantifier_ and that our evaluation measure of choice is
_mae_. The evaluation comes down to:
```python
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
print(f'MAE = {mae:.4f}')
```
It is often desirable to evaluate our system using more than one
single evaluation measure. In this case, it is convenient to generate
a _report_. A report in QuaPy is a dataframe accounting for all the
true prevalence values with their corresponding prevalence values
as estimated by the quantifier, along with the error each has given
rise to.
```python
report = qp.evaluation.evaluation_report(quantifier, protocol=prot, error_metrics=['mae', 'mrae', 'mkld'])
```
From a pandas dataframe, it is straightforward to visualize all the results,
and to compute averaged values, e.g.:

```python
import pandas as pd

pd.set_option('display.expand_frame_repr', False)
pd.set_option("precision", 3)
df['estim-prev'] = df['estim-prev'].map(F.strprev)
print(df)
```

The output should look like:

```
     true-prev      estim-prev    mae    mrae       mkld
0   [0.0, 1.0]  [0.000, 1.000]  0.000   0.000  0.000e+00
1   [0.1, 0.9]  [0.091, 0.909]  0.009   0.048  4.426e-04
2   [0.2, 0.8]  [0.163, 0.837]  0.037   0.114  4.633e-03
3   [0.3, 0.7]  [0.283, 0.717]  0.017   0.041  7.383e-04
4   [0.4, 0.6]  [0.366, 0.634]  0.034   0.070  2.412e-03
5   [0.5, 0.5]  [0.459, 0.541]  0.041   0.082  3.387e-03
6   [0.6, 0.4]  [0.565, 0.435]  0.035   0.073  2.535e-03
7   [0.7, 0.3]  [0.654, 0.346]  0.046   0.108  4.701e-03
8   [0.8, 0.2]  [0.725, 0.275]  0.075   0.235  1.515e-02
9   [0.9, 0.1]  [0.858, 0.142]  0.042   0.229  7.740e-03
10  [1.0, 0.0]  [0.945, 0.055]  0.055  27.357  5.219e-02
```

One can get the averaged scores using standard pandas functions, i.e.:

```python
print(df.mean())
```

which will produce the following output:

```
true-prev    0.500
mae          0.035
mrae         2.578
mkld         0.009
dtype: float64
```

A report generated by _evaluation_report_ can be manipulated in exactly the
same way. For example, the following code prints such a report (here, one
obtained with a protocol that generated 5000 test samples) along with its
averaged values:

```python
report['estim-prev'] = report['estim-prev'].map(F.strprev)
print(report)

print('Averaged values:')
print(report.mean())
```

producing an output like:

```
           true-prev      estim-prev       mae      mrae      mkld
0     [0.308, 0.692]  [0.314, 0.686]  0.005649  0.013182  0.000074
1     [0.896, 0.104]  [0.909, 0.091]  0.013145  0.069323  0.000985
2     [0.848, 0.152]  [0.809, 0.191]  0.039063  0.149806  0.005175
3     [0.016, 0.984]  [0.033, 0.967]  0.017236  0.487529  0.005298
4     [0.728, 0.272]  [0.751, 0.249]  0.022769  0.057146  0.001350
...              ...             ...       ...       ...       ...
4995    [0.72, 0.28]  [0.698, 0.302]  0.021752  0.053631  0.001133
4996  [0.868, 0.132]  [0.888, 0.112]  0.020490  0.088230  0.001985
4997  [0.292, 0.708]  [0.298, 0.702]  0.006149  0.014788  0.000090
4998    [0.24, 0.76]  [0.220, 0.780]  0.019950  0.054309  0.001127
4999  [0.948, 0.052]  [0.965, 0.035]  0.016941  0.165776  0.003538

[5000 rows x 5 columns]

Averaged values:
mae     0.023588
mrae    0.108779
mkld    0.003631
dtype: float64
```

Alternatively, we can simply generate all the predictions by:

```python
true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
```
Other evaluation functions include:

* _artificial_sampling_eval_: computes the evaluation for a
given evaluation metric, returning the average instead of a dataframe.
* _artificial_sampling_prediction_: returns two np.arrays containing the
true prevalences and the estimated prevalences.
See the documentation for further details.
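A hedged sketch of both helpers, reusing the _pacc_ quantifier and the _test_ set from the example above (the exact keyword arguments may differ slightly across versions; check the API documentation):

```python
# average MAE over all the generated samples (a single float)
mae = qp.evaluation.artificial_sampling_eval(
    pacc, test, sample_size=qp.environ['SAMPLE_SIZE'],
    n_prevpoints=11, error_metric='mae')

# the true and the estimated prevalences, one row per generated sample
true_prevs, estim_prevs = qp.evaluation.artificial_sampling_prediction(
    pacc, test, sample_size=qp.environ['SAMPLE_SIZE'], n_prevpoints=11)
```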
All the evaluation functions implement specific optimizations for speeding up
the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_).
The optimization comes down to generating classification predictions (either crisp or soft)
only once for the entire test set, and then applying the sampling procedure to the
predictions, instead of generating samples of instances and then computing the
classification predictions every time. This is only possible when the protocol
is an instance of _OnLabelledCollectionProtocol_. The optimization is only
carried out when the number of classification predictions thus generated would be
smaller than the number of predictions required for the entire protocol; e.g.,
if the original dataset contains 1M instances, but the protocol is such that it would
at most generate 20 samples of 100 instances, then it would be preferable to postpone the
classification for each sample. This behaviour is indicated by setting
_aggr_speedup="auto"_. Conversely, when indicating _aggr_speedup="force"_ QuaPy will
precompute all the predictions irrespective of the number of instances and number of samples.
Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization
is not only applied for the final evaluation, but also for the internal evaluations carried
out during _model selection_. Since these are typically many, the heuristic can help reduce the
execution time a lot.
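For example, the following sketch (reusing _pacc_ and _test_ from the example above) relies on the _evaluate_ function documented in the API, and lets QuaPy decide whether the speed-up is convenient:

```python
from quapy.protocol import APP

# APP acts on a LabelledCollection, so the aggregative speed-up can kick in
mae = qp.evaluation.evaluate(pacc, protocol=APP(test), error_metric='mae', aggr_speedup='auto')
```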
@ -16,12 +16,6 @@ and implement some abstract methods:
@abstractmethod
def quantify(self, instances): ...
```
The meaning of those functions should be familiar to those
used to working with scikit-learn, since the class structure of QuaPy
@ -32,10 +26,10 @@ scikit-learn' structure has not been adopted _as is_ in QuaPy responds to
the fact that scikit-learn's _predict_ function is expected to return
one output for each input element --e.g., a predicted label for each
instance in a sample-- while in quantification the output for a sample
is one single array of class prevalences).
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
to simplify the use of _set_params_ and _get_params_ used in
[model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection).
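A minimal sketch of what this inheritance provides (the "classifier__" prefix for routing hyper-parameters to the underlying classifier is the convention used in model selection; the parameter values below are merely illustrative):

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

pcc = qp.method.aggregative.PCC(LogisticRegression())
print(pcc.get_params())           # inherited from BaseEstimator; includes the classifier's parameters
pcc.set_params(classifier__C=10)  # the "classifier__" prefix targets the underlying classifier
```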
## Aggregative Methods
@ -58,11 +52,11 @@ of _BaseQuantifier.quantify_ is already provided, which looks like:
```python
def quantify(self, instances):
classif_predictions = self.classify(instances)
return self.aggregate(classif_predictions)
```
Aggregative quantifiers are expected to maintain a classifier (which is
accessed through the _@property_ _classifier_). This classifier is
given as input to the quantifier, and can be already fit
on external data (in which case, the _fit_learner_ argument should
be set to False), or be fit by the quantifier's fit (default).
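A sketch of both usages (here, _my_classifier_ and _my_pretrained_classifier_ are hypothetical scikit-learn classifiers, the latter already fit on external data, and _training_ is a labelled collection):

```python
import quapy as qp

# default: the quantifier's fit also trains the underlying classifier
model = qp.method.aggregative.ACC(my_classifier)
model.fit(training)

# alternative: reuse an already-trained classifier
model = qp.method.aggregative.ACC(my_pretrained_classifier)
model.fit(training, fit_learner=False)
```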
@ -73,13 +67,8 @@ _AggregativeProbabilisticQuantifier(AggregativeQuantifier)_.
The particularity of _probabilistic_ aggregative methods (w.r.t.
non-probabilistic ones), is that the default quantifier is defined
in terms of the posterior probabilities returned by a probabilistic
classifier, and not by the crisp decisions of a hard classifier.
In any case, the interface _classify(instances)_ remains unchanged.
One advantage of _aggregative_ methods (either probabilistic or not)
is that the evaluation according to any sampling procedure (e.g.,
@ -110,9 +99,7 @@ import quapy as qp
import quapy.functional as F
from sklearn.svm import LinearSVC
training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
# instantiate a classifier learner, in this case a SVM
svm = LinearSVC()
@ -156,11 +143,12 @@ model.fit(training, val_split=5)
```
The following code illustrates the case in which PCC is used:
```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.classifier)
```
In this case, QuaPy will print:
```
@ -211,14 +199,22 @@ model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
_New in v0.1.7_: EMQ now accepts two new parameters in the construction method. The first is
_exact_train_prev_, which allows one to use the true training prevalence as the departing
prevalence estimation (default behaviour), or instead an approximation of it as
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
(by setting _exact_train_prev=False_).
The other parameter is _recalib_, which allows one to indicate a calibration method, among those
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
including the Bias-Corrected Temperature Scaling, Vector Scaling, etc.
See the API documentation for further details.
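A sketch of these options (the string 'bcts', standing for Bias-Corrected Temperature Scaling, is an assumption about the accepted names; consult the API documentation for the exact values):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import EMQ

# depart from an approximated training prevalence and recalibrate the posteriors
model = EMQ(LogisticRegression(), exact_train_prev=False, recalib='bcts')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```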
### Hellinger Distance y (HDy)
The method HDy is described in:

[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146-164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
It is implemented in _qp.method.aggregative.HDy_ (also accessible
through the alias _qp.method.aggregative.HellingerDistanceY_).
@ -249,30 +245,51 @@ model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
"Distribution Matching" approaches for multiclass, inspired by the framework
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
a variant of HDy for multiclass quantification as follows:
```python
multiclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
```
_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
(thanks to _Pablo González_ for the contributions!)
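A sketch of how these could be instantiated (the class names _DyS_ and _SMM_, and the 'topsoe' divergence, are assumptions based on the respective papers; see the API documentation for the exact signatures):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import DyS, SMM

dys = DyS(LogisticRegression(), divergence='topsoe')
smm = SMM(LogisticRegression())
```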
### Threshold Optimization methods
_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
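These are binary-only quantifiers. A usage sketch, mirroring the examples above (the class name MS2 is consistent with the MedianSweep2 alias listed in the API; the other methods work analogously):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import MS2

model = MS2(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```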
### Explicit Loss Minimization
Explicit Loss Minimization (ELM) represents a family of methods
based on structured output learning, i.e., quantifiers relying on
classifiers that have been optimized targeting a
quantification-oriented evaluation measure.
The original methods are implemented in QuaPy as classify & count (CC)
quantifiers that use Joachim's [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
as the underlying classifier, properly set to optimize for the desired loss.
In QuaPy, this can be achieved by calling the functions:
* _newSVMQ_: returns the quantification method called SVM(Q), which optimizes for the metric _Q_ defined
in [_Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591-604._](https://www.sciencedirect.com/science/article/pii/S003132031400291X)
* _newSVMKLD_ and _newSVMNKLD_: return the quantification methods called SVM(KLD) and SVM(nKLD), standing for
Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in [_Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27._](https://dl.acm.org/doi/abs/10.1145/2700406)
* _newSVMAE_ and _newSVMRAE_: return the quantification methods called SVM(AE) and SVM(RAE), which optimize for the (Mean) Absolute Error and the
(Mean) Relative Absolute Error, respectively, as first used by
[_Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23._](https://arxiv.org/abs/2011.02552)
The last two methods (SVM(AE) and SVM(RAE)) have been implemented in
QuaPy in order to make available ELM variants for what nowadays
are considered the most well-behaved evaluation metrics in quantification.
@ -306,13 +323,18 @@ currently supports only binary classification.
ELM variants (any binary quantifier in general) can be extended
to operate in single-label scenarios trivially by adopting a
"one-vs-all" strategy (as, e.g., in
[_Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1-22._](https://link.springer.com/article/10.1007/s13278-016-0327-z)).
In QuaPy this is possible by using the _OneVsAll_ class.
There are two ways for instantiating this class, _OneVsAllGeneric_ that works for
any quantifier, and _OneVsAllAggregative_ that is optimized for aggregative quantifiers.
In general, you can simply use the _getOneVsAll_ function and QuaPy will choose
the more convenient of the two.
```python
import quapy as qp
from quapy.method.aggregative import SVMQ
# load a single-label dataset (this one contains 3 classes)
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
@ -320,11 +342,14 @@ dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
# let qp know where svmperf is
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'
model = getOneVsAll(SVMQ(), n_jobs=-1)  # run them in parallel
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
Check the examples [explicit_loss_minimization.py](../examples/explicit_loss_minimization.py)
and [one_vs_all.py](../examples/one_vs_all.py) for more details.
## Meta Models
By _meta_ models we mean quantification methods that are defined on top of other
@ -337,12 +362,12 @@ _Meta_ models are implemented in the _qp.method.meta_ module.
QuaPy implements (some of) the variants proposed in:
* [_Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100._](https://www.sciencedirect.com/science/article/pii/S1566253516300628)
* [_Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15._](https://www.sciencedirect.com/science/article/pii/S1566253517303652)
The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC)
quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the
@ -378,10 +403,10 @@ wiki if you want to optimize the hyperparameters of ensemble for classification
QuaPy offers an implementation of QuaNet, a deep learning model presented in:
[_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778)._](https://dl.acm.org/doi/abs/10.1145/3269206.3269287)
This model requires _torch_ to be installed.
QuaNet also requires a classifier that can provide embedded representations
@ -406,7 +431,8 @@ cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(cnn, device='cuda')
# train QuaNet
model = QuaNet(learner, device='cuda')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
@ -22,9 +22,9 @@ Quantification has long been regarded as an add-on of
classification, and thus the model selection strategies
customarily adopted in classification have simply been
applied to quantification (see the next section).
It has been argued in [Moreo, Alejandro, and Fabrizio Sebastiani.
Re-Assessing the "Classify and Count" Quantification Method.
ECIR 2021: Advances in Information Retrieval, pp. 75-91.](https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6)
that specific model selection strategies should
be adopted for quantification. That is, model selection
strategies for quantification should target
@ -32,76 +32,86 @@ quantification-oriented losses and be tested in a variety
of scenarios exhibiting different degrees of prior
probability shift.
The class _qp.model_selection.GridSearchQ_ implements a grid-search exploration over the space of
hyper-parameter combinations that [evaluates](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)
each combination of hyper-parameters by means of a given quantification-oriented
error metric (e.g., any of the error functions implemented
in _qp.error_) and according to a
[sampling generation protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols).
The following is an example (also included in the examples folder) of model selection for quantification:
```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import DistributionMatching
from sklearn.linear_model import LogisticRegression
import numpy as np

"""
In this example, we show how to perform model selection on a DistributionMatching quantifier.
"""

# set a seed to replicate runs
np.random.seed(0)

qp.environ['SAMPLE_SIZE'] = 100
qp.environ['N_JOBS'] = -1  # explore hyper-parameters in parallel

model = DistributionMatching(LogisticRegression())

training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

# The model will be returned by the fit method of GridSearchQ.
# Every combination of hyper-parameters will be evaluated by confronting the
# quantifier thus configured against a series of samples generated by means
# of a sample generation protocol. For this example, we will use the
# artificial-prevalence protocol (APP), which generates samples with prevalence
# values covering the entire range from a grid (e.g., [0, 0.1, 0.2, ..., 1]).
# We devote 30% of the dataset to this exploration.
training, validation = training.split_stratified(train_prop=0.7)
protocol = APP(validation)

# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
# (e.g., the number of bins in a DistributionMatching quantifier).
# Classifier-dependent hyper-parameters have to be marked with the prefix "classifier__"
# in order to let the quantifier know this hyper-parameter belongs to its underlying
# classifier.
param_grid = {
    'classifier__C': np.logspace(-3, 3, 7),
    'nbins': [8, 16, 32, 64],
}

model = qp.model_selection.GridSearchQ(
    model=model,
    param_grid=param_grid,
    protocol=protocol,
    error='mae',  # the error to optimize is the MAE (a quantification-oriented loss)
    refit=True,   # retrain on the whole labelled set once done
    verbose=True  # show information as the process goes on
).fit(training)

print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_

# we use the same evaluation protocol (APP) on the test set
mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')

print(f'MAE={mae_score:.5f}')
```
In this example, the system outputs:
```
[GridSearchQ]: starting model selection with self.n_jobs =-1
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 64} got mae score 0.04021 [took 1.1356s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 32} got mae score 0.04286 [took 1.2139s]
[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 16} got mae score 0.04888 [took 1.2491s]
[GridSearchQ]: hyperparams={'classifier__C': 0.001, 'nbins': 8} got mae score 0.05163 [took 1.5372s]
[...]
[GridSearchQ]: hyperparams={'classifier__C': 1000.0, 'nbins': 32} got mae score 0.02445 [took 2.9056s]
[GridSearchQ]: optimization finished: best params {'classifier__C': 100.0, 'nbins': 32} (score=0.02234) [took 7.3114s]
[GridSearchQ]: refitting on the whole development set
model selection ended: best hyper-parameters={'classifier__C': 100.0, 'nbins': 32}
MAE=0.03102
```
The parameter _val_split_ can alternatively be used to indicate
@ -128,32 +138,13 @@ learner = GridSearchCV(
LogisticRegression(),
param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
cv=5)
model = DistributionMatching(learner).fit(dataset.training)
```
Note that, in this case, the model selection is performed in terms of
classification accuracy. The hyper-parameters found optimal in this way
typically differ largely from those found when optimizing for quantification:
hyper-parameters that work well for the specific training prevalence tend to
become suboptimal when the class prevalences of the test samples differ (as is
indeed the case in scenarios of quantification).
However, this is conceptually flawed, since the model should be
optimized for the task at hand (quantification), and not for a surrogate task (classification),
i.e., the model should be requested to deliver low quantification errors, rather
than low classification errors.
@ -43,7 +43,7 @@ quantification methods across different scenarios showcasing
the accuracy of the quantifier in predicting class prevalences
for a wide range of prior distributions. This can easily be
achieved by means of the
[artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)
that is implemented in QuaPy.
The following code shows how to perform one simple experiment
@ -55,6 +55,7 @@ generating 100 random samples at each prevalence).
```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import CC, ACC, PCC, PACC
from sklearn.svm import LinearSVC

qp.environ['SAMPLE_SIZE'] = 500


def gen_data():

    def base_classifier():
        return LinearSVC(class_weight='balanced')

    def models():
        yield 'CC', CC(base_classifier())
        yield 'ACC', ACC(base_classifier())
        yield 'PCC', PCC(base_classifier())
        yield 'PACC', PACC(base_classifier())

    train, test = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5).train_test

    method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []

    for method_name, model in models():
        model.fit(train)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))

        method_names.append(method_name)
        true_prevs.append(true_prev)
        estim_prevs.append(estim_prev)
        tr_prevs.append(train.prevalence())

    return method_names, true_prevs, estim_prevs, tr_prevs
```
@ -163,21 +162,19 @@ like this:
```python
def gen_data():

    train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

    model = CC(LinearSVC())

    method_data = []
    for training_prevalence in np.linspace(0.1, 0.9, 9):
        training_size = 5000
        # since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained
        train_sample = train.sampling(training_size, 1-training_prevalence)
        model.fit(train_sample)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
        # method names can contain Latex syntax
        method_name = 'CC$_{' + f'{int(100 * training_prevalence)}' + '\%}$'
        method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))

    return zip(*method_data)
```
@ -64,6 +64,7 @@ Features
* 32 UCI Machine Learning datasets.
* 11 Twitter Sentiment datasets.
* 3 Reviews Sentiment datasets.
* 4 tasks from LeQua competition (_new in v0.1.7!_)
* Native support for binary and single-label scenarios of quantification.
* Model selection functionality targeting quantification-oriented losses.
* Visualization tools for analysing results.
@ -75,6 +76,7 @@ Features
Installation
Datasets
Evaluation
Protocols
Methods
Model-Selection
Plotting
@ -56,6 +56,7 @@
| <a href="#G"><strong>G</strong></a>
| <a href="#H"><strong>H</strong></a>
| <a href="#I"><strong>I</strong></a>
| <a href="#J"><strong>J</strong></a>
| <a href="#K"><strong>K</strong></a>
| <a href="#L"><strong>L</strong></a>
| <a href="#M"><strong>M</strong></a>
@ -131,6 +132,8 @@
<li><a href="quapy.method.html#quapy.method.aggregative.AggregativeQuantifier">AggregativeQuantifier (class in quapy.method.aggregative)</a>
</li>
<li><a href="quapy.html#quapy.protocol.APP">APP (class in quapy.protocol)</a>
</li>
<li><a href="quapy.html#quapy.protocol.ArtificialPrevalenceProtocol">ArtificialPrevalenceProtocol (in module quapy.protocol)</a>
</li>
<li><a href="quapy.classification.html#quapy.classification.neural.TorchDataset.asDataloader">asDataloader() (quapy.classification.neural.TorchDataset method)</a>
</li>
@ -284,10 +287,10 @@
</li>
<li><a href="quapy.method.html#quapy.method.aggregative.EMQ">EMQ (class in quapy.method.aggregative)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.method.html#quapy.method.meta.Ensemble">Ensemble (class in quapy.method.meta)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.method.html#quapy.method.meta.ensembleFactory">ensembleFactory() (in module quapy.method.meta)</a>
</li>
<li><a href="quapy.method.html#quapy.method.meta.EPACC">EPACC() (in module quapy.method.meta)</a>
@ -297,6 +300,8 @@
<li><a href="quapy.html#quapy.plot.error_by_drift">error_by_drift() (in module quapy.plot)</a>
</li>
<li><a href="quapy.html#quapy.evaluation.evaluate">evaluate() (in module quapy.evaluation)</a>
</li>
<li><a href="quapy.html#quapy.evaluation.evaluate_on_samples">evaluate_on_samples() (in module quapy.evaluation)</a>
</li>
<li><a href="quapy.html#quapy.evaluation.evaluation_report">evaluation_report() (in module quapy.evaluation)</a>
</li>
@ -459,6 +464,16 @@
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.data.html#quapy.data.preprocessing.IndexTransformer">IndexTransformer (class in quapy.data.preprocessing)</a>
</li>
<li><a href="quapy.html#quapy.protocol.IterateProtocol">IterateProtocol (class in quapy.protocol)</a>
</li>
</ul></td>
</tr></table>
<h2 id="J">J</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.join">join() (quapy.data.base.LabelledCollection class method)</a>
</li>
</ul></td>
</tr></table>
@ -521,8 +536,6 @@
<li><a href="quapy.method.html#quapy.method.aggregative.MedianSweep">MedianSweep (in module quapy.method.aggregative)</a>
</li>
<li><a href="quapy.method.html#quapy.method.aggregative.MedianSweep2">MedianSweep2 (in module quapy.method.aggregative)</a>
</li>
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.mix">mix() (quapy.data.base.LabelledCollection class method)</a>
</li>
<li><a href="quapy.html#quapy.error.mkld">mkld() (in module quapy.error)</a>
</li>
@ -603,6 +616,8 @@
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.n_classes">(quapy.data.base.LabelledCollection property)</a>
</li>
</ul></li>
<li><a href="quapy.html#quapy.protocol.NaturalPrevalenceProtocol">NaturalPrevalenceProtocol (in module quapy.protocol)</a>
</li>
<li><a href="quapy.classification.html#quapy.classification.calibration.NBVSCalibration">NBVSCalibration (class in quapy.classification.calibration)</a>
</li>
<li><a href="quapy.classification.html#quapy.classification.neural.NeuralClassifierTrainer">NeuralClassifierTrainer (class in quapy.classification.neural)</a>
@ -610,11 +625,11 @@
<li><a href="quapy.method.html#quapy.method.aggregative.newELM">newELM() (in module quapy.method.aggregative)</a>
</li>
<li><a href="quapy.method.html#quapy.method.base.newOneVsAll">newOneVsAll() (in module quapy.method.base)</a>
</li>
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMAE">newSVMAE() (in module quapy.method.aggregative)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMAE">newSVMAE() (in module quapy.method.aggregative)</a>
</li>
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMKLD">newSVMKLD() (in module quapy.method.aggregative)</a>
</li>
<li><a href="quapy.method.html#quapy.method.aggregative.newSVMQ">newSVMQ() (in module quapy.method.aggregative)</a>
@ -914,6 +929,8 @@
<li><a href="quapy.classification.html#quapy.classification.calibration.RecalibratedProbabilisticClassifier">RecalibratedProbabilisticClassifier (class in quapy.classification.calibration)</a>
</li>
<li><a href="quapy.classification.html#quapy.classification.calibration.RecalibratedProbabilisticClassifierBase">RecalibratedProbabilisticClassifierBase (class in quapy.classification.calibration)</a>
</li>
<li><a href="quapy.data.html#quapy.data.base.Dataset.reduce">reduce() (quapy.data.base.Dataset method)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
@ -942,7 +959,7 @@
</li>
<li><a href="quapy.html#quapy.protocol.NPP.sample">(quapy.protocol.NPP method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.USimplexPP.sample">(quapy.protocol.USimplexPP method)</a>
<li><a href="quapy.html#quapy.protocol.UPP.sample">(quapy.protocol.UPP method)</a>
</li>
</ul></li>
<li><a href="quapy.html#quapy.protocol.AbstractStochasticSeededProtocol.samples_parameters">samples_parameters() (quapy.protocol.AbstractStochasticSeededProtocol method)</a>
@ -954,7 +971,7 @@
</li>
<li><a href="quapy.html#quapy.protocol.NPP.samples_parameters">(quapy.protocol.NPP method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.USimplexPP.samples_parameters">(quapy.protocol.USimplexPP method)</a>
<li><a href="quapy.html#quapy.protocol.UPP.samples_parameters">(quapy.protocol.UPP method)</a>
</li>
</ul></li>
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.sampling">sampling() (quapy.data.base.LabelledCollection method)</a>
@ -1033,10 +1050,12 @@
<li><a href="quapy.html#quapy.protocol.APP.total">(quapy.protocol.APP method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.DomainMixer.total">(quapy.protocol.DomainMixer method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.IterateProtocol.total">(quapy.protocol.IterateProtocol method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.NPP.total">(quapy.protocol.NPP method)</a>
</li>
<li><a href="quapy.html#quapy.protocol.USimplexPP.total">(quapy.protocol.USimplexPP method)</a>
<li><a href="quapy.html#quapy.protocol.UPP.total">(quapy.protocol.UPP method)</a>
</li>
</ul></li>
</ul></td>
@ -1073,13 +1092,15 @@
</li>
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.uniform_sampling">uniform_sampling() (quapy.data.base.LabelledCollection method)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.data.html#quapy.data.base.LabelledCollection.uniform_sampling_index">uniform_sampling_index() (quapy.data.base.LabelledCollection method)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="quapy.html#quapy.functional.uniform_simplex_sampling">uniform_simplex_sampling() (in module quapy.functional)</a>
</li>
<li><a href="quapy.html#quapy.protocol.USimplexPP">USimplexPP (class in quapy.protocol)</a>
<li><a href="quapy.html#quapy.protocol.UniformPrevalenceProtocol">UniformPrevalenceProtocol (in module quapy.protocol)</a>
</li>
<li><a href="quapy.html#quapy.protocol.UPP">UPP (class in quapy.protocol)</a>
</li>
</ul></td>
</tr></table>

View File

@ -102,6 +102,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
<li><p>32 UCI Machine Learning datasets.</p></li>
<li><p>11 Twitter Sentiment datasets.</p></li>
<li><p>3 Reviews Sentiment datasets.</p></li>
<li><p>4 tasks from LeQua competition (_new in v0.1.7!_)</p></li>
</ul>
</dd>
</dl>
@ -130,6 +131,13 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
<li class="toctree-l2"><a class="reference internal" href="Evaluation.html#evaluation-protocols">Evaluation Protocols</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Protocols.html">Protocols</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#artificial-prevalence-protocol">Artificial-Prevalence Protocol</a></li>
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#sampling-from-the-unit-simplex-the-uniform-prevalence-protocol-upp">Sampling from the unit-simplex, the Uniform-Prevalence Protocol (UPP)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#natural-prevalence-protocol">Natural-Prevalence Protocol</a></li>
<li class="toctree-l2"><a class="reference internal" href="Protocols.html#other-protocols">Other protocols</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Methods.html">Quantification Methods</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Methods.html#aggregative-methods">Aggregative Methods</a></li>
<li class="toctree-l2"><a class="reference internal" href="Methods.html#meta-models">Meta Models</a></li>
@ -170,6 +170,23 @@ See <a class="reference internal" href="#quapy.data.base.LabelledCollection.load
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.data.base.Dataset.reduce">
<span class="sig-name descname"><span class="pre">reduce</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">n_train</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">n_test</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.Dataset.reduce" title="Permalink to this definition"></a></dt>
<dd><p>Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>n_train</strong> number of training documents to keep (default 100)</p></li>
<li><p><strong>n_test</strong> number of test documents to keep (default 100)</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>self</p>
</dd>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.data.base.Dataset.stats">
<span class="sig-name descname"><span class="pre">stats</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">show</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">True</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.Dataset.stats" title="Permalink to this definition"></a></dt>
@ -297,6 +314,20 @@ as listed by <cite>self.classes_</cite></p>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.join">
<em class="property"><span class="pre">classmethod</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">join</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">*</span></span><span class="n"><span class="pre">args</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Iterable</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a><span class="p"><span class="pre">]</span></span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.LabelledCollection.join" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> as the union of the collections given in input.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>args</strong> instances of <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a></p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>a <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> representing the union of both collections</p>
</dd>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.kFCV">
<span class="sig-name descname"><span class="pre">kFCV</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">nfolds</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">5</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">nrepeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.LabelledCollection.kFCV" title="Permalink to this definition"></a></dt>
@ -338,23 +369,6 @@ these arguments are used to call <cite>loader_func(path, **loader_kwargs)</cite>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.mix">
<em class="property"><span class="pre">classmethod</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">mix</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">a</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">b</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.data.base.LabelledCollection.mix" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> as the union of this collection with another collection.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>a</strong> instance of <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a></p></li>
<li><p><strong>b</strong> instance of <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a></p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>a <a class="reference internal" href="#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">LabelledCollection</span></code></a> representing the union of both collections</p>
</dd>
</dl>
</dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="quapy.data.base.LabelledCollection.n_classes">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">n_classes</span></span><a class="headerlink" href="#quapy.data.base.LabelledCollection.n_classes" title="Permalink to this definition"></a></dt>
@ -481,18 +481,117 @@ will be taken from the environment variable <cite>SAMPLE_SIZE</cite> (which has
<span id="quapy-evaluation"></span><h2>quapy.evaluation<a class="headerlink" href="#module-quapy.evaluation" title="Permalink to this heading"></a></h2>
<dl class="py function">
<dt class="sig sig-object py" id="quapy.evaluation.evaluate">
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">evaluate</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">error_metric</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Callable</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.evaluate" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">evaluate</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">error_metric</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Callable</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">bool</span><span class="p"><span class="pre">]</span></span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.evaluate" title="Permalink to this definition"></a></dt>
<dd><p>Evaluates a quantification model according to a specific sample generation protocol and in terms of one
evaluation metric (error).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>model</strong> a quantifier, instance of <a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.method.base.BaseQuantifier</span></code></a></p></li>
<li><p><strong>protocol</strong> <a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.AbstractProtocol</span></code></a>; if this object is also instance of
<a class="reference internal" href="#quapy.protocol.OnLabelledCollectionProtocol" title="quapy.protocol.OnLabelledCollectionProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.OnLabelledCollectionProtocol</span></code></a>, then the aggregation speed-up can be run. This is the
protocol in charge of generating the samples in which the model is evaluated.</p></li>
<li><p><strong>error_metric</strong> a string representing the name(s) of an error function in <cite>qp.error</cite>
(e.g., mae), or a callable function implementing the error function itself.</p></li>
<li><p><strong>aggr_speedup</strong> whether or not to apply the speed-up. Set to “force” for applying it even if the number of
instances in the original collection on which the protocol acts is larger than the number of instances
in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is
convenient or not. Set to False to deactivate.</p></li>
<li><p><strong>verbose</strong> boolean, show or not information in stdout</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>if the error metric is not averaged (e.g., ae, rae), returns an array of shape <cite>(n_samples,)</cite> with
the error scores for each sample; if the error metric is averaged (e.g., mae, mrae) then returns
a single float</p>
</dd>
</dl>
</dd></dl>
<dl class="py function">
<dt class="sig sig-object py" id="quapy.evaluation.evaluate_on_samples">
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">evaluate_on_samples</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">samples</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Iterable</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">error_metric</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Callable</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.evaluate_on_samples" title="Permalink to this definition"></a></dt>
<dd><p>Evaluates a quantification model on a given set of samples and in terms of one evaluation metric (error).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>model</strong> a quantifier, instance of <a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.method.base.BaseQuantifier</span></code></a></p></li>
<li><p><strong>samples</strong> a list of samples on which the quantifier is to be evaluated</p></li>
<li><p><strong>error_metric</strong> a string representing the name(s) of an error function in <cite>qp.error</cite>
(e.g., mae), or a callable function implementing the error function itself.</p></li>
<li><p><strong>verbose</strong> boolean, show or not information in stdout</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>if the error metric is not averaged (e.g., ae, rae), returns an array of shape <cite>(n_samples,)</cite> with
the error scores for each sample; if the error metric is averaged (e.g., mae, mrae) then returns
a single float</p>
</dd>
</dl>
</dd></dl>
<dl class="py function">
<dt class="sig sig-object py" id="quapy.evaluation.evaluation_report">
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">evaluation_report</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">error_metrics</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Iterable</span><span class="p"><span class="pre">[</span></span><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Callable</span><span class="p"><span class="pre">]</span></span><span class="p"><span class="pre">]</span></span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'mae'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.evaluation_report" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">evaluation_report</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">error_metrics</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Iterable</span><span class="p"><span class="pre">[</span></span><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Callable</span><span class="p"><span class="pre">]</span></span><span class="p"><span class="pre">]</span></span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'mae'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">bool</span><span class="p"><span class="pre">]</span></span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.evaluation_report" title="Permalink to this definition"></a></dt>
<dd><p>Generates a report (a pandas DataFrame) containing information on the evaluation of the model according
to a specific protocol and in terms of one or more evaluation metrics (errors).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>model</strong> a quantifier, instance of <a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.method.base.BaseQuantifier</span></code></a></p></li>
<li><p><strong>protocol</strong> <a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.AbstractProtocol</span></code></a>; if this object is also instance of
<a class="reference internal" href="#quapy.protocol.OnLabelledCollectionProtocol" title="quapy.protocol.OnLabelledCollectionProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.OnLabelledCollectionProtocol</span></code></a>, then the aggregation speed-up can be run. This is the protocol
in charge of generating the samples in which the model is evaluated.</p></li>
<li><p><strong>error_metrics</strong> a string, or list of strings, representing the name(s) of an error function in <cite>qp.error</cite>
(e.g., mae, the default value), or a callable function, or a list of callable functions, implementing
the error function itself.</p></li>
<li><p><strong>aggr_speedup</strong> whether or not to apply the speed-up. Set to “force” for applying it even if the number of
instances in the original collection on which the protocol acts is larger than the number of instances
in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is
convenient or not. Set to False to deactivate.</p></li>
<li><p><strong>verbose</strong> boolean, whether to display information in the standard output</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>a pandas DataFrame containing the columns true-prev (the true prevalence of each sample),
estim-prev (the prevalence estimated by the model for each sample), and as many columns as error metrics
have been indicated, each displaying the score in terms of that metric for every sample.</p>
</dd>
</dl>
</dd></dl>
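<p>For illustration, a sketch of generating and inspecting a report (<cite>model</cite> and <cite>test</cite> are assumptions of the example, not part of the API):</p>
import quapy as qp
from quapy.protocol import APP

qp.environ['SAMPLE_SIZE'] = 100
report = qp.evaluation.evaluation_report(model, protocol=APP(test), error_metrics=['mae', 'mrae'])
# one row per generated sample: true-prev, estim-prev, plus one column per metric
print(report.head())
print(report[['mae', 'mrae']].mean())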
<dl class="py function">
<dt class="sig sig-object py" id="quapy.evaluation.prediction">
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">prediction</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.prediction" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<span class="sig-prename descclassname"><span class="pre">quapy.evaluation.</span></span><span class="sig-name descname"><span class="pre">prediction</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><span class="pre">BaseQuantifier</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">protocol</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><span class="pre">AbstractProtocol</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">aggr_speedup</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Union</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">bool</span><span class="p"><span class="pre">]</span></span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'auto'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">verbose</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.evaluation.prediction" title="Permalink to this definition"></a></dt>
<dd><p>Uses a quantification model to generate predictions for the samples generated via a specific protocol.
This function is central to all evaluation processes, and is endowed with an optimization to speed up the
prediction of protocols that generate samples from a large collection. The optimization applies to aggregative
quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification
predictions once and for all, and then generating samples over the classification predictions (instead of over
the raw instances), so that the classifier is never invoked again. This behaviour is obtained by
setting <cite>aggr_speedup</cite> to auto or True, and is only carried out if the overall process is convenient in terms
of computations (e.g., if the number of classification predictions needed for the original collection exceeds the
number of classification predictions needed for all samples, then the optimization is not undertaken).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>model</strong> a quantifier, instance of <a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.method.base.BaseQuantifier</span></code></a></p></li>
<li><p><strong>protocol</strong> <a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.AbstractProtocol</span></code></a>; if this object is also instance of
<a class="reference internal" href="#quapy.protocol.OnLabelledCollectionProtocol" title="quapy.protocol.OnLabelledCollectionProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.OnLabelledCollectionProtocol</span></code></a>, then the aggregation speed-up can be run. This is the protocol
in charge of generating the samples for which the model has to issue class prevalence predictions.</p></li>
<li><p><strong>aggr_speedup</strong> whether or not to apply the speed-up. Set to “force” for applying it even if the number of
instances in the original collection on which the protocol acts is larger than the number of instances
in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is
convenient or not. Set to False to deactivate.</p></li>
<li><p><strong>verbose</strong> boolean, whether to display information in the standard output</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>a tuple <cite>(true_prevs, estim_prevs)</cite> in which each element in the tuple is an array of shape
<cite>(n_samples, n_classes)</cite> containing the true, or predicted, prevalence values for each sample</p>
</dd>
</dl>
</dd></dl>
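<p>A sketch under the same assumptions as above (<cite>model</cite> and <cite>test</cite> exist, and SAMPLE_SIZE has been set):</p>
import numpy as np
import quapy as qp
from quapy.protocol import APP

true_prevs, estim_prevs = qp.evaluation.prediction(model, protocol=APP(test), aggr_speedup='auto')
# both arrays have shape (n_samples, n_classes); e.g., the absolute error per sample:
ae_per_sample = np.abs(true_prevs - estim_prevs).mean(axis=1)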
</section>
<section id="quapy-protocol">
@ -624,7 +723,21 @@ the sequence will be consistent every time the protocol is called.</p>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.protocol.AbstractStochasticSeededProtocol.collator">
<span class="sig-name descname"><span class="pre">collator</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">sample</span></span></em>, <em class="sig-param"><span class="o"><span class="pre">*</span></span><span class="n"><span class="pre">args</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.AbstractStochasticSeededProtocol.collator" title="Permalink to this definition"></a></dt>
<dd><p>The collator prepares the sample to accommodate the desired output format before returning the output.
This collator simply returns the sample as it is. Classes inheriting from this abstract class can
implement their custom collators.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>sample</strong> the sample to be returned</p></li>
<li><p><strong>args</strong> additional arguments</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>the sample adhering to a desired output format (in this case, the sample is returned as it is)</p>
</dd>
</dl>
</dd></dl>
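<p>As an illustration, a minimal sketch of a subclass with a custom collator (the class name and its sampling logic are invented for the example; the base <cite>__init__</cite> is assumed to accept a <cite>random_state</cite>, consistently with the property documented below):</p>
import numpy as np
from quapy.protocol import AbstractStochasticSeededProtocol

class RandomSubsetProtocol(AbstractStochasticSeededProtocol):
    # hypothetical protocol: draws `repeats` random subsets of a data matrix X
    def __init__(self, X, sample_size, repeats=10, random_state=0):
        super().__init__(random_state)
        self.X = np.asarray(X)
        self.sample_size = sample_size
        self.repeats = repeats

    def samples_parameters(self):
        # one index vector per sample, so that the series of samplings can be replicated
        return [np.random.choice(len(self.X), self.sample_size, replace=False)
                for _ in range(self.repeats)]

    def sample(self, index):
        return self.X[index]

    def collator(self, sample, *args):
        # custom output format: a plain Python list instead of an ndarray
        return sample.tolist()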
<dl class="py property">
<dt class="sig sig-object py" id="quapy.protocol.AbstractStochasticSeededProtocol.random_state">
@ -658,6 +771,12 @@ the sequence will be consistent every time the protocol is called.</p>
</dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="quapy.protocol.ArtificialPrevalenceProtocol">
<span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">ArtificialPrevalenceProtocol</span></span><a class="headerlink" href="#quapy.protocol.ArtificialPrevalenceProtocol" title="Permalink to this definition"></a></dt>
<dd><p>alias of <a class="reference internal" href="#quapy.protocol.APP" title="quapy.protocol.APP"><code class="xref py py-class docutils literal notranslate"><span class="pre">APP</span></code></a></p>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="quapy.protocol.DomainMixer">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">DomainMixer</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">domainA</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">domainB</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">repeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prevalence</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">mixture_points</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">11</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">return_type</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'sample_prev'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.DomainMixer" title="Permalink to this definition"></a></dt>
@ -720,6 +839,29 @@ will be the same every time the protocol is called)</p></li>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="quapy.protocol.IterateProtocol">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">IterateProtocol</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="pre">samples:</span> <span class="pre">[&lt;class</span> <span class="pre">'quapy.data.base.LabelledCollection'&gt;]</span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.IterateProtocol" title="Permalink to this definition"></a></dt>
<dd><p>Bases: <a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">AbstractProtocol</span></code></a></p>
<p>A simple protocol that iterates over a list of previously generated samples</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>samples</strong> a list of <a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.data.base.LabelledCollection</span></code></a></p>
</dd>
</dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.protocol.IterateProtocol.total">
<span class="sig-name descname"><span class="pre">total</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.IterateProtocol.total" title="Permalink to this definition"></a></dt>
<dd><p>Returns the number of samples in this protocol</p>
<dl class="field-list simple">
<dt class="field-odd">Returns<span class="colon">:</span></dt>
<dd class="field-odd"><p>int</p>
</dd>
</dl>
</dd></dl>
</dd></dl>
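<p>For instance (a sketch; <cite>model</cite> and <cite>test</cite> are assumed to exist):</p>
import quapy as qp
from quapy.protocol import IterateProtocol

# pre-draw three binary samples at fixed prevalence values
samples = [test.sampling(100, p, 1 - p) for p in [0.2, 0.5, 0.8]]
protocol = IterateProtocol(samples)
mae = qp.evaluation.evaluate(model, protocol=protocol, error_metric='mae')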
<dl class="py class">
<dt class="sig sig-object py" id="quapy.protocol.NPP">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">NPP</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">data</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">repeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">return_type</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'sample_prev'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.NPP" title="Permalink to this definition"></a></dt>
@ -778,6 +920,12 @@ to “labelled_collection” to get instead instances of LabelledCollection</p><
</dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="quapy.protocol.NaturalPrevalenceProtocol">
<span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">NaturalPrevalenceProtocol</span></span><a class="headerlink" href="#quapy.protocol.NaturalPrevalenceProtocol" title="Permalink to this definition"></a></dt>
<dd><p>alias of <a class="reference internal" href="#quapy.protocol.NPP" title="quapy.protocol.NPP"><code class="xref py py-class docutils literal notranslate"><span class="pre">NPP</span></code></a></p>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="quapy.protocol.OnLabelledCollectionProtocol">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">OnLabelledCollectionProtocol</span></span><a class="headerlink" href="#quapy.protocol.OnLabelledCollectionProtocol" title="Permalink to this definition"></a></dt>
@ -785,7 +933,7 @@ to “labelled_collection” to get instead instances of LabelledCollection</p><
<p>Protocols that generate samples from a <code class="xref py py-class docutils literal notranslate"><span class="pre">qp.data.LabelledCollection</span></code> object.</p>
<dl class="py attribute">
<dt class="sig sig-object py" id="quapy.protocol.OnLabelledCollectionProtocol.RETURN_TYPES">
<span class="sig-name descname"><span class="pre">RETURN_TYPES</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">['sample_prev',</span> <span class="pre">'labelled_collection']</span></em><a class="headerlink" href="#quapy.protocol.OnLabelledCollectionProtocol.RETURN_TYPES" title="Permalink to this definition"></a></dt>
<span class="sig-name descname"><span class="pre">RETURN_TYPES</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">['sample_prev',</span> <span class="pre">'labelled_collection',</span> <span class="pre">'index']</span></em><a class="headerlink" href="#quapy.protocol.OnLabelledCollectionProtocol.RETURN_TYPES" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
@ -841,8 +989,8 @@ with shape <cite>(n_instances,)</cite> when the classifier is a hard one, or wit
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="quapy.protocol.USimplexPP">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">USimplexPP</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">data</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">repeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">return_type</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'sample_prev'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.USimplexPP" title="Permalink to this definition"></a></dt>
<dt class="sig sig-object py" id="quapy.protocol.UPP">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">UPP</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">data</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="quapy.data.html#quapy.data.base.LabelledCollection" title="quapy.data.base.LabelledCollection"><span class="pre">LabelledCollection</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">repeats</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">random_state</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">return_type</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'sample_prev'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.UPP" title="Permalink to this definition"></a></dt>
<dd><p>Bases: <a class="reference internal" href="#quapy.protocol.AbstractStochasticSeededProtocol" title="quapy.protocol.AbstractStochasticSeededProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">AbstractStochasticSeededProtocol</span></code></a>, <a class="reference internal" href="#quapy.protocol.OnLabelledCollectionProtocol" title="quapy.protocol.OnLabelledCollectionProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">OnLabelledCollectionProtocol</span></code></a></p>
<p>A variant of <a class="reference internal" href="#quapy.protocol.APP" title="quapy.protocol.APP"><code class="xref py py-class docutils literal notranslate"><span class="pre">APP</span></code></a> that, instead of using a grid of equidistant prevalence values,
relies on the Kraemer algorithm for sampling the unit (k-1)-simplex uniformly at random, with
@ -865,8 +1013,8 @@ to “labelled_collection” to get instead instances of LabelledCollection</p><
</dd>
</dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.protocol.USimplexPP.sample">
<span class="sig-name descname"><span class="pre">sample</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">index</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.USimplexPP.sample" title="Permalink to this definition"></a></dt>
<dt class="sig sig-object py" id="quapy.protocol.UPP.sample">
<span class="sig-name descname"><span class="pre">sample</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">index</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.UPP.sample" title="Permalink to this definition"></a></dt>
<dd><p>Realizes the sample given the index of the instances.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
@ -879,19 +1027,19 @@ to “labelled_collection” to get instead instances of LabelledCollection</p><
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.protocol.USimplexPP.samples_parameters">
<span class="sig-name descname"><span class="pre">samples_parameters</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.USimplexPP.samples_parameters" title="Permalink to this definition"></a></dt>
<dd><p>Return all the necessary parameters to replicate the samples as according to the USimplexPP protocol.</p>
<dt class="sig sig-object py" id="quapy.protocol.UPP.samples_parameters">
<span class="sig-name descname"><span class="pre">samples_parameters</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.UPP.samples_parameters" title="Permalink to this definition"></a></dt>
<dd><p>Return all the necessary parameters to replicate the samples according to the UPP protocol.</p>
<dl class="field-list simple">
<dt class="field-odd">Returns<span class="colon">:</span></dt>
<dd class="field-odd"><p>a list of indexes that realize the USimplexPP sampling</p>
<dd class="field-odd"><p>a list of indexes that realize the UPP sampling</p>
</dd>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="quapy.protocol.USimplexPP.total">
<span class="sig-name descname"><span class="pre">total</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.USimplexPP.total" title="Permalink to this definition"></a></dt>
<dt class="sig sig-object py" id="quapy.protocol.UPP.total">
<span class="sig-name descname"><span class="pre">total</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#quapy.protocol.UPP.total" title="Permalink to this definition"></a></dt>
<dd><p>Returns the number of samples that will be generated (equal to “repeats”)</p>
<dl class="field-list simple">
<dt class="field-odd">Returns<span class="colon">:</span></dt>
@ -902,6 +1050,12 @@ to “labelled_collection” to get instead instances of LabelledCollection</p><
</dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="quapy.protocol.UniformPrevalenceProtocol">
<span class="sig-prename descclassname"><span class="pre">quapy.protocol.</span></span><span class="sig-name descname"><span class="pre">UniformPrevalenceProtocol</span></span><a class="headerlink" href="#quapy.protocol.UniformPrevalenceProtocol" title="Permalink to this definition"></a></dt>
<dd><p>alias of <a class="reference internal" href="#quapy.protocol.UPP" title="quapy.protocol.UPP"><code class="xref py py-class docutils literal notranslate"><span class="pre">UPP</span></code></a></p>
</dd></dl>
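<p>A usage sketch of UPP (illustrative; <cite>test</cite> is an assumed LabelledCollection):</p>
from quapy.protocol import UPP

prot = UPP(test, sample_size=100, repeats=50, random_state=0)
for instances, prev in prot():
    # with the default return_type, each item is an (instances, prevalence) pair;
    # `prev` is drawn (approximately) uniformly at random from the simplex
    print(prev)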
</section>
<section id="module-quapy.functional">
<span id="quapy-functional"></span><h2>quapy.functional<a class="headerlink" href="#module-quapy.functional" title="Permalink to this heading"></a></h2>
@ -1175,9 +1329,9 @@ protocol for quantification.</p>
<dd class="field-odd"><ul class="simple">
<li><p><strong>model</strong> (<a class="reference internal" href="quapy.method.html#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><em>BaseQuantifier</em></a>) the quantifier to optimize</p></li>
<li><p><strong>param_grid</strong> a dictionary with keys the parameter names and values the list of values to explore</p></li>
<li><p><strong>protocol</strong> a sample generation protocol, an instance of <a class="reference internal" href="#quapy.protocol.AbstractProtocol" title="quapy.protocol.AbstractProtocol"><code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.protocol.AbstractProtocol</span></code></a></p></li>
<li><p><strong>error</strong> an error function (callable) or a string indicating the name of an error function (valid ones
are those in <code class="xref py py-class docutils literal notranslate"><span class="pre">quapy.error.QUANTIFICATION_ERROR</span></code></p></li>
<li><p><strong>refit</strong> whether or not to refit the model on the whole labelled collection (training+validation) with
the best chosen hyperparameter combination. Ignored if protocol=gen</p></li>
<li><p><strong>timeout</strong> establishes a timer (in seconds) for each of the hyperparameters configurations being tested.


@ -1718,7 +1718,7 @@ registered hooks while the latter silently ignores them.</p>
<dl class="py class">
<dt class="sig sig-object py" id="quapy.method.neural.QuaNetTrainer">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.method.neural.</span></span><span class="sig-name descname"><span class="pre">QuaNetTrainer</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">classifier</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">n_epochs</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tr_iter_per_poch</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">500</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">va_iter_per_poch</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lr</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0.001</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lstm_hidden_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">64</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lstm_nlayers</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">ff_layers</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">[1024,</span> <span class="pre">512]</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bidirectional</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">True</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">qdrop_p</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0.5</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">patience</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">10</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">checkpointdir</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'../checkpoint'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">checkpointname</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">device</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'cuda'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.method.neural.QuaNetTrainer" title="Permalink to this definition"></a></dt>
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">quapy.method.neural.</span></span><span class="sig-name descname"><span class="pre">QuaNetTrainer</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">classifier</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sample_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">n_epochs</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tr_iter_per_poch</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">500</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">va_iter_per_poch</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">100</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lr</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0.001</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lstm_hidden_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">64</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lstm_nlayers</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">ff_layers</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">[1024,</span> <span class="pre">512]</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bidirectional</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">True</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">qdrop_p</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">0.5</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">patience</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">10</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">checkpointdir</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'../checkpoint'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">checkpointname</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">device</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'cuda'</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#quapy.method.neural.QuaNetTrainer" title="Permalink to this definition"></a></dt>
<dd><p>Bases: <a class="reference internal" href="#quapy.method.base.BaseQuantifier" title="quapy.method.base.BaseQuantifier"><code class="xref py py-class docutils literal notranslate"><span class="pre">BaseQuantifier</span></code></a></p>
<p>Implementation of <a class="reference external" href="https://dl.acm.org/doi/abs/10.1145/3269206.3269287">QuaNet</a>, a neural network for
quantification. This implementation uses <a class="reference external" href="https://pytorch.org/">PyTorch</a> and can take advantage of GPU
@ -1751,7 +1751,8 @@ for speeding-up the training phase.</p>
<li><p><strong>classifier</strong> an object implementing <cite>fit</cite> (i.e., that can be trained on labelled data),
<cite>predict_proba</cite> (i.e., that can generate posterior probabilities of unlabelled examples) and
<cite>transform</cite> (i.e., that can generate embedded representations of the unlabelled instances).</p></li>
<li><p><strong>sample_size</strong> integer, the sample size; default is None, meaning that the sample size should be
taken from qp.environ[“SAMPLE_SIZE”]</p></li>
<li><p><strong>n_epochs</strong> integer, maximum number of training epochs</p></li>
<li><p><strong>tr_iter_per_poch</strong> integer, number of training iterations before considering an epoch complete</p></li>
<li><p><strong>va_iter_per_poch</strong> integer, number of validation iterations to perform after each epoch</p></li>



@ -0,0 +1,57 @@
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import DistributionMatching
from sklearn.linear_model import LogisticRegression
import numpy as np
"""
In this example, we show how to perform model selection on a DistributionMatching quantifier.
"""
model = DistributionMatching(LogisticRegression())
qp.environ['SAMPLE_SIZE'] = 100
qp.environ['N_JOBS'] = -1
training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
# The model will be returned by the fit method of GridSearchQ.
# Every combination of hyper-parameters will be evaluated by testing the
# quantifier, configured accordingly, on a series of samples generated by means
# of a sample generation protocol. For this example, we will use the
# artificial-prevalence protocol (APP), which generates samples with prevalence
# values spanning the entire range on a grid (e.g., [0, 0.1, 0.2, ..., 1]).
# We devote 30% of the dataset to this exploration.
training, validation = training.split_stratified(train_prop=0.7)
protocol = APP(validation)
# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
# (e.g., the number of bins in a DistributionMatching quantifier).
# Classifier-dependent hyper-parameters have to be marked with the prefix "classifier__"
# in order to let the quantifier know this hyper-parameter belongs to its underlying
# classifier.
param_grid = {
'classifier__C': np.logspace(-3,3,7),
'nbins': [8, 16, 32, 64],
}
model = qp.model_selection.GridSearchQ(
model=model,
param_grid=param_grid,
protocol=protocol,
error='mae', # the error to optimize is the MAE (a quantification-oriented loss)
refit=True, # retrain on the whole labelled set once done
verbose=True # show information as the process goes on
).fit(training)
print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_
# evaluation in terms of MAE
# we use the same evaluation protocol (APP) on the test set
mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')
print(f'MAE={mae_score:.5f}')


@ -1,22 +1,16 @@
Change Log 0.1.7
----------------
- Protocols are now abstracted as instances of AbstractProtocol. There is a new class extending AbstractProtocol called
AbstractStochasticSeededProtocol, which implements a seeding policy that allows replicating the series of samplings.
There are some examples of protocols, APP, NPP, UPP, DomainMixer (experimental).
The idea is to start the sample generation by simply calling the __call__ method.
This change has a great impact on the framework, since many functions in qp.evaluation, qp.model_selection,
and sampling functions in LabelledCollection relied on the old functions. E.g., the functionality of
qp.evaluation.artificial_prevalence_report or qp.evaluation.natural_prevalence_report is now obtained by means of
qp.evaluation.report, which takes a protocol as an argument. I have not maintained compatibility with the old
interfaces because I did not really like them. Check the wiki guide and the examples for more details.
- ACC, PACC, Forman's threshold variants have been parallelized.
- Exploration of hyperparameters in Model selection can now be run in parallel (there was a n_jobs argument in
QuaPy 0.1.6 but only the evaluation part for one specific hyperparameter was run in parallel).
@ -26,17 +20,19 @@ Change Log 0.1.7
procedure. The user can now specify "force", "auto", True, or False, in order to actively decide whether to apply it
or not.
- n_jobs is now taken from the environment if set to None
- examples directory created!
- cross_val_predict (for quantification) added to model_selection: it would be nice to allow the user to specify a
test protocol, or None for bypassing it?
- DyS, Topsoe distance and binary search (thanks to Pablo González)
- Multi-thread reproducibility via seeding (thanks to Pablo González)
- Bugfix: adding two labelled collections (with +) now checks for consistency in the classes
- newer versions of numpy raise a warning when accessing types (e.g., np.float). I have replaced all such instances


@ -1,4 +1,6 @@
import itertools
from functools import cached_property
from typing import Iterable
import numpy as np
from scipy.sparse import issparse
@ -129,11 +131,23 @@ class LabelledCollection:
# <= size * prevs[i]) examples are drawn from class i, there could be a remainder number of instances to take
# to satisfy the size constraint. The remainder is distributed across the classes with probability = prevs.
# (This aims at avoiding the remainder being placed in a class for which the requested prevalence is 0.)
n_requests = {class_: round(size * prevs[i]) for i, class_ in enumerate(self.classes_)}
remainder = size - sum(n_requests.values())
with temp_seed(random_state):
# due to rounding, the remainder can be 0, >0, or <0
if remainder > 0:
# when the remainder is >0 we add 1 to the request of a randomly chosen class (once per unit of remainder);
# more prevalent classes are more likely to be chosen, which minimizes the impact on the final prevalence
for rand_class in np.random.choice(self.classes_, size=remainder, p=prevs):
n_requests[rand_class] += 1
elif remainder < 0:
# when the remainder is <0 we randomly remove 1 from the requests, unless the request is 0 for a chosen
# class; we repeat until remainder==0
while remainder!=0:
rand_class = np.random.choice(self.classes_, p=prevs)
if n_requests[rand_class] > 0:
n_requests[rand_class] -= 1
remainder += 1
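# A quick numeric illustration of the remainder logic above (values chosen for the example):
#   size=10, prevs=[0.25, 0.25, 0.5] -> round() gives requests {0: 2, 1: 2, 2: 5}
#   (Python rounds half to even, so round(2.5)==2); sum=9, remainder=+1, and one
#   extra instance is assigned to a class drawn with probability prevs.
#   size=10, prevs=[0.15, 0.15, 0.7] -> round() gives {0: 2, 1: 2, 2: 7} (round(1.5)==2);
#   sum=11, remainder=-1, and one non-zero request is decremented at random.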
indexes_sample = []
for class_, n_requested in n_requests.items():
@ -266,31 +280,47 @@ class LabelledCollection:
if not all(np.sort(self.classes_)==np.sort(other.classes_)):
raise NotImplementedError(f'unsupported operation for collections on different classes; '
f'expected {self.classes_}, found {other.classes_}')
return LabelledCollection.join(self, other)
@classmethod
def join(cls, *args: Iterable['LabelledCollection']):
"""
Returns a new :class:`LabelledCollection` as the union of the collections given as input.
:param args: instances of :class:`LabelledCollection`
:return: a :class:`LabelledCollection` representing the union of the given collections
"""
args = [lc for lc in args if lc is not None]
assert len(args) > 0, 'empty list is not allowed for join'
assert all([isinstance(lc, LabelledCollection) for lc in args]), \
'only instances of LabelledCollection allowed'
first_instances = args[0].instances
first_type = type(first_instances)
assert all([type(lc.instances)==first_type for lc in args[1:]]), \
'not all the collections are of instances of the same type'
if issparse(first_instances) or isinstance(first_instances, np.ndarray):
first_ndim = first_instances.ndim
assert all([lc.instances.ndim == first_ndim for lc in args[1:]]), \
'not all the ndarrays are of the same dimension'
if first_ndim > 1:
first_shape = first_instances.shape[1:]
assert all([lc.instances.shape[1:] == first_shape for lc in args[1:]]), \
'not all the ndarrays are of the same shape'
if issparse(first_instances):
instances = vstack([lc.instances for lc in args])
else:
instances = np.concatenate([lc.instances for lc in args])
elif isinstance(first_instances, list):
instances = list(itertools.chain.from_iterable(lc.instances for lc in args))  # flatten the per-collection lists
else:
raise NotImplementedError('unsupported operation for collection types')
labels = np.concatenate([lc.labels for lc in args])
classes = np.unique(labels)  # already sorted; note that ndarray.sort() sorts in place and returns None
return LabelledCollection(instances, labels, classes=classes)
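# Usage sketch (train_a, train_b, train_c are hypothetical LabelledCollection objects):
#   combined = LabelledCollection.join(train_a, train_b, train_c)
#   assert len(combined) == len(train_a) + len(train_b) + len(train_c)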
@property
def Xy(self):


@ -16,7 +16,7 @@ def prediction(
Uses a quantification model to generate predictions for the samples generated via a specific protocol.
This function is central to all evaluation processes, and is endowed with an optimization to speed-up the
prediction of protocols that generate samples from a large collection. The optimization applies to aggregative
quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification
predictions once and for all, and then generating samples over the classification predictions (instead of over
the raw instances), so that the classifier prediction is never called again. This behaviour is obtained by
setting `aggr_speedup` to 'auto' or True, and is only carried out if the overall process is convenient in terms
@ -25,7 +25,7 @@ def prediction(
:param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
:param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
:class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the protocol
in charge of generating the samples for which the model has to issue class prevalence predictions.
:param aggr_speedup: whether or not to apply the speed-up. Set to "force" for applying it even if the number of
instances in the original collection on which the protocol acts is larger than the number of instances
@ -90,7 +90,7 @@ def evaluation_report(model: BaseQuantifier,
:param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
:param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
:class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the protocol
in charge of generating the samples in which the model is evaluated.
:param error_metrics: a string, or list of strings, representing the name(s) of an error function in `qp.error`
(e.g., 'mae', the default value), or a callable function, or a list of callable functions, implementing
@ -141,8 +141,8 @@ def evaluate(
:param model: a quantifier, instance of :class:`quapy.method.base.BaseQuantifier`
:param protocol: :class:`quapy.protocol.AbstractProtocol`; if this object is also instance of
:class:`quapy.protocol.OnLabelledCollectionProtocol`, then the aggregation speed-up can be run. This is the
protocol in charge of generating the samples in which the model is evaluated.
:param error_metric: a string representing the name(s) of an error function in `qp.error`
(e.g., 'mae'), or a callable function implementing the error function itself.
:param aggr_speedup: whether or not to apply the speed-up. Set to "force" for applying it even if the number of


@ -23,9 +23,9 @@ class GridSearchQ(BaseQuantifier):
:param model: the quantifier to optimize
:type model: BaseQuantifier
:param param_grid: a dictionary with keys the parameter names and values the list of values to explore
:param protocol: a sample generation protocol, an instance of :class:`quapy.protocol.AbstractProtocol`
:param error: an error function (callable) or a string indicating the name of an error function (valid ones
are those in :class:`quapy.error.QUANTIFICATION_ERROR`
:param refit: whether or not to refit the model on the whole labelled collection (training+validation) with
the best chosen hyperparameter combination. Ignored if protocol='gen'
:param timeout: establishes a timer (in seconds) for each of the hyperparameters configurations being tested.


@ -51,9 +51,10 @@ def binary_diagonal(method_names, true_prevs, estim_prevs, pos_class=1, title=No
table = {method_name:[true_prev, estim_prev] for method_name, true_prev, estim_prev in order}
order = [(method_name, *table[method_name]) for method_name in method_order]
NUM_COLORS = len(method_names)
if NUM_COLORS>10:
cm = plt.get_cmap('tab20')
ax.set_prop_cycle(color=[cm(1. * i / NUM_COLORS) for i in range(NUM_COLORS)])
for method, true_prev, estim_prev in order:
true_prev = true_prev[:,pos_class]
estim_prev = estim_prev[:,pos_class]
@ -76,13 +77,12 @@ def binary_diagonal(method_names, true_prevs, estim_prevs, pos_class=1, title=No
ax.set_xlim(0, 1)
if legend:
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
# alternative layouts, kept for reference:
# box = ax.get_position()
# ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# ax.legend(loc='lower center',
#           bbox_to_anchor=(1, -0.5),
#           ncol=(len(method_names)+1)//2)
_save_or_show(savepath)


@ -127,6 +127,15 @@ class AbstractStochasticSeededProtocol(AbstractProtocol):
yield self.collator(self.sample(params))
def collator(self, sample, *args):
"""
The collator prepares the sample to accommodate the desired output format before returning the output.
This collator simply returns the sample as it is. Classes inheriting from this abstract class can
implement their custom collators.
:param sample: the sample to be returned
:param args: additional arguments
:return: the sample adhering to a desired output format (in this case, the sample is returned as it is)
"""
return sample


@ -1,5 +1,7 @@
import unittest
import numpy as np
from scipy.sparse import csr_matrix
import quapy as qp
@ -16,6 +18,51 @@ class LabelCollectionTestCase(unittest.TestCase):
self.assertEqual(np.allclose(check_prev, data.prevalence()), True)
self.assertEqual(len(tr+te), len(data))
def test_join(self):
x = np.arange(50)
y = np.random.randint(2, 5, 50)
data1 = qp.data.LabelledCollection(x, y)
x = np.arange(200)
y = np.random.randint(0, 3, 200)
data2 = qp.data.LabelledCollection(x, y)
x = np.arange(100)
y = np.random.randint(0, 6, 100)
data3 = qp.data.LabelledCollection(x, y)
combined = qp.data.LabelledCollection.join(data1, data2, data3)
self.assertEqual(len(combined), len(data1)+len(data2)+len(data3))
self.assertEqual(all(combined.classes_ == np.arange(6)), True)
x = np.random.rand(10, 3)
y = np.random.randint(0, 1, 10)
data4 = qp.data.LabelledCollection(x, y)
with self.assertRaises(Exception):
combined = qp.data.LabelledCollection.join(data1, data2, data3, data4)
x = np.random.rand(20, 3)
y = np.random.randint(0, 1, 20)
data5 = qp.data.LabelledCollection(x, y)
combined = qp.data.LabelledCollection.join(data4, data5)
self.assertEqual(len(combined), len(data4)+len(data5))
x = np.random.rand(10, 4)
y = np.random.randint(0, 1, 10)
data6 = qp.data.LabelledCollection(x, y)
with self.assertRaises(Exception):
combined = qp.data.LabelledCollection.join(data4, data5, data6)
data4.instances = csr_matrix(data4.instances)
with self.assertRaises(Exception):
combined = qp.data.LabelledCollection.join(data4, data5)
data5.instances = csr_matrix(data5.instances)
combined = qp.data.LabelledCollection.join(data4, data5)
self.assertEqual(len(combined), len(data4) + len(data5))
if __name__ == '__main__':
unittest.main()