adding documentation

This commit is contained in:
Alejandro Moreo Fernandez 2023-02-14 18:04:13 +01:00
parent 49fc486c53
commit 9aa53db6ef
5 changed files with 135 additions and 57 deletions

View File

@@ -80,13 +80,13 @@ Take a look at the following code:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.17</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">]</span>
</pre></div>
</div>
<p>One can easily produce new samples at desired class prevalences:</p>
<p>One can easily produce new samples at desired class prevalence values:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">sample_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">prev</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
<span class="n">sample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;instances:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">classes</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;prevalence:&#39;</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></div>
</div>
@@ -109,29 +109,10 @@ the indexes, that can then be used to generate the sample:</p>
<span class="o">...</span>
</pre></div>
</div>
<p>QuaPy also implements the artificial sampling protocol that produces (via a
Python generator) a series of <em>LabelledCollection</em> objects with equidistant
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">artificial_sampling_generator</span><span class="p">(</span><span class="n">sample_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevalences</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></div>
</div>
<p>produces one sample for each (valid) combination of prevalence values resulting from
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
that is:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
<span class="o">...</span>
<span class="p">[</span><span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
</pre></div>
</div>
<p>See the <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">Evaluation wiki</a> for
further details on how to use the artificial sampling protocol to properly
evaluate a quantification method.</p>
<p>However, generating samples for evaluation purposes is tackled in QuaPy
by means of the evaluation protocols (see the dedicated entries in the Wiki
for <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluation</a> and
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">protocols</a>).</p>
<section id="reviews-datasets">
<h2>Reviews Datasets<a class="headerlink" href="#reviews-datasets" title="Permalink to this heading"></a></h2>
<p>Three datasets of reviews about Kindle devices, Harry Potter's series, and
@@ -636,6 +617,78 @@ time the dataset is invoked.</p></li>
</ul>
</section>
</section>
<section id="lequa-datasets">
<h2>LeQua Datasets<a class="headerlink" href="#lequa-datasets" title="Permalink to this heading"></a></h2>
<p>QuaPy also provides the datasets used for the LeQua competition.
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
raw documents instead.
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
are multiclass quantification problems consisting of estimating the class prevalence
values of 28 different merchandise products.</p>
<p>Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training) and two generation protocols (for validation and test samples), as follows:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">training</span><span class="p">,</span> <span class="n">val_generator</span><span class="p">,</span> <span class="n">test_generator</span> <span class="o">=</span> <span class="n">fetch_lequa2022</span><span class="p">(</span><span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
</pre></div>
</div>
<p>See the script <code class="docutils literal notranslate"><span class="pre">lequa2022_experiments.py</span></code> in the examples folder for further details on how to
carry out experiments using these datasets.</p>
<p>The datasets are downloaded only once, and stored for fast reuse.</p>
<p>Some statistics are summarized below:</p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Dataset</p></th>
<th class="head text-center"><p>classes</p></th>
<th class="head text-center"><p>train size</p></th>
<th class="head text-center"><p>validation samples</p></th>
<th class="head text-center"><p>test samples</p></th>
<th class="head text-center"><p>docs by sample</p></th>
<th class="head text-center"><p>type</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>T1A</p></td>
<td class="text-center"><p>2</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>250</p></td>
<td class="text-center"><p>vector</p></td>
</tr>
<tr class="row-odd"><td><p>T1B</p></td>
<td class="text-center"><p>28</p></td>
<td class="text-center"><p>20000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>vector</p></td>
</tr>
<tr class="row-even"><td><p>T2A</p></td>
<td class="text-center"><p>2</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>250</p></td>
<td class="text-center"><p>text</p></td>
</tr>
<tr class="row-odd"><td><p>T2B</p></td>
<td class="text-center"><p>28</p></td>
<td class="text-center"><p>20000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>text</p></td>
</tr>
</tbody>
</table>
<p>For further details on the datasets, we refer to the original
<a class="reference external" href="https://ceur-ws.org/Vol-3180/paper-146.pdf">paper</a>:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">Esuli</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Moreo</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Sebastiani</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="p">,</span> <span class="o">&amp;</span> <span class="n">Sperduti</span><span class="p">,</span> <span class="n">G</span><span class="o">.</span> <span class="p">(</span><span class="mi">2022</span><span class="p">)</span><span class="o">.</span>
<span class="n">A</span> <span class="n">Detailed</span> <span class="n">Overview</span> <span class="n">of</span> <span class="n">LeQua</span><span class="o">@</span> <span class="n">CLEF</span> <span class="mi">2022</span><span class="p">:</span> <span class="n">Learning</span> <span class="n">to</span> <span class="n">Quantify</span><span class="o">.</span>
</pre></div>
</div>
</section>
<section id="adding-custom-datasets">
<h2>Adding Custom Datasets<a class="headerlink" href="#adding-custom-datasets" title="Permalink to this heading"></a></h2>
<p>QuaPy provides data loaders for simple formats dealing with
@@ -667,12 +720,15 @@ all classes to be present in the collection).</p>
paths, in order to create a training and test pair of <em>LabelledCollection</em>,
e.g.:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="n">train_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/train.dat&#39;</span>
<span class="n">test_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/test.dat&#39;</span>
<span class="k">def</span> <span class="nf">my_custom_loader</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
<span class="o">...</span>
<span class="k">return</span> <span class="n">instances</span><span class="p">,</span> <span class="n">labels</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">test_path</span><span class="p">,</span> <span class="n">my_custom_loader</span><span class="p">)</span>
</pre></div>
</div>
@@ -707,6 +763,7 @@ that the column values have zero mean and unit variance).</p></li>
<li><a class="reference internal" href="#issues">Issues:</a></li>
</ul>
</li>
<li><a class="reference internal" href="#lequa-datasets">LeQua Datasets</a></li>
<li><a class="reference internal" href="#adding-custom-datasets">Adding Custom Datasets</a><ul>
<li><a class="reference internal" href="#data-processing">Data Processing</a></li>
</ul>

View File

@@ -30,7 +30,7 @@ Output the class prevalences (showing 2 digit precision):
[0.17, 0.50, 0.33]
```
One can easily produce new samples at desired class prevalences:
One can easily produce new samples at desired class prevalence values:
```python
sample_size = 10
@@ -38,7 +38,7 @@ prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)
print('instances:', sample.instances)
print('labels:', sample.classes)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```
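When the same sample must be reused (e.g., to test several methods on identical data), one can instead draw only the sampling indexes and materialize the sample from them afterwards. The following is a minimal sketch, assuming the `sampling_index` / `sampling_from_index` pair of _LabelledCollection_ and reusing the variables defined above:
```python
# draw only the indexes of a sample at the desired prevalence values...
index = data.sampling_index(sample_size, *prev)

# ...and materialize the very same sample whenever needed
sample = data.sampling_from_index(index)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```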
@@ -64,32 +64,10 @@ for method in methods:
...
```
QuaPy also implements the artificial sampling protocol that produces (via a
Python generator) a series of _LabelledCollection_ objects with equidistant
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:
```python
for sample in data.artificial_sampling_generator(sample_size=100, n_prevalences=5):
print(F.strprev(sample.prevalence(), prec=2))
```
produces one sample for each (valid) combination of prevalence values resulting from
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
that is:
```
[0.00, 0.00, 1.00]
[0.00, 0.25, 0.75]
[0.00, 0.50, 0.50]
[0.00, 0.75, 0.25]
[0.00, 1.00, 0.00]
[0.25, 0.00, 0.75]
...
[1.00, 0.00, 0.00]
```
See the [Evaluation wiki](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) for
further details on how to use the artificial sampling protocol to properly
evaluate a quantification method.
However, generating samples for evaluation purposes is tackled in QuaPy
by means of the evaluation protocols (see the dedicated entries in the Wiki
for [evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) and
[protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)).
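For orientation, the snippet below sketches how one such protocol can be used to draw evaluation samples; it assumes the `APP` (artificial prevalence protocol) class from `quapy.protocol`, available in recent QuaPy versions, and reuses the `data` collection and `F` alias from the snippets above:
```python
from quapy.protocol import APP  # assumed import path for the artificial prevalence protocol

# draw samples of 100 documents at equidistant prevalence values spanning the simplex
prot = APP(data, sample_size=100, n_prevalences=5, repeats=1, random_state=0)
for instances, prev in prot():
    print(F.strprev(prev, prec=2))
```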
## Reviews Datasets
@@ -178,6 +156,8 @@ Some details can be found below:
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
## UCI Machine Learning
A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php)
@@ -273,6 +253,46 @@ standard Python packages like gzip or zip. This file would need to be uncompressed using
OS-dependent software manually. Information on how to do it will be printed the first
time the dataset is invoked.
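As a quick (hypothetical) usage sketch, any of these datasets can be fetched as a train/test _Dataset_ in a single call; the dataset name below is just an example:
```python
import quapy as qp

# downloads the dataset on first use and caches it for subsequent calls
dataset = qp.datasets.fetch_UCIDataset('yeast', verbose=True)
print(dataset.training.prevalence())
```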
## LeQua Datasets
QuaPy also provides the datasets used for the LeQua competition.
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
raw documents instead.
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
are multiclass quantification problems consisting of estimating the class prevalence
values of 28 different merchandise products.
Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training) and two generation protocols (for validation and test samples), as follows:
```python
training, val_generator, test_generator = fetch_lequa2022(task=task)
```
See the script `lequa2022_experiments.py` in the examples folder for further details on how to
carry out experiments using these datasets.
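The snippet below sketches one possible end-to-end run; the choice of task and quantifier is merely illustrative, and the evaluation call assumes QuaPy's `evaluate` function, which accepts a generation protocol directly:
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import PACC  # an example aggregative quantifier

# fetch task T1A (binary, vector form), train a quantifier, and evaluate it
training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task='T1A')
model = PACC(LogisticRegression()).fit(training)

# the test generation protocol is passed directly to the evaluation routine
mae = qp.evaluation.evaluate(model, protocol=test_generator, error_metric='mae')
print(f'test MAE = {mae:.4f}')
```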
The datasets are downloaded only once, and stored for fast reuse.
Some statistics are summarized below:
| Dataset | classes | train size | validation samples | test samples | docs per sample | type |
|---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:|
| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T2A | 2 | 5000 | 1000 | 5000 | 250 | text |
| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text |
For further details on the datasets, we refer to the original
[paper](https://ceur-ws.org/Vol-3180/paper-146.pdf):
```
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify.
```
## Adding Custom Datasets
QuaPy provides data loaders for simple formats dealing with
@@ -313,12 +333,15 @@ e.g.:
```python
import quapy as qp
train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'
def my_custom_loader(path):
with open(path, 'rb') as fin:
...
return instances, labels
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
```
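For illustration only, the loader could be fleshed out as follows for a hypothetical format with one `label<TAB>text` pair per line (the file layout is an assumption of this sketch, not a format required by QuaPy):
```python
import quapy as qp

def my_custom_loader(path):
    # hypothetical format: one "<label>\t<document text>" pair per line
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            label, text = line.rstrip('\n').split('\t', maxsplit=1)
            labels.append(label)
            instances.append(text)
    return instances, labels

# creates a Dataset whose training and test sets are LabelledCollection objects
data = qp.data.Dataset.load('../my_data/train.dat', '../my_data/test.dat', my_custom_loader)
```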

View File

@@ -123,6 +123,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#reviews-datasets">Reviews Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#twitter-sentiment-datasets">Twitter Sentiment Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#uci-machine-learning">UCI Machine Learning</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#lequa-datasets">LeQua Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#adding-custom-datasets">Adding Custom Datasets</a></li>
</ul>
</li>

File diff suppressed because one or more lines are too long

View File

@@ -60,9 +60,6 @@ class LabelCollectionTestCase(unittest.TestCase):
combined = qp.data.LabelledCollection.join(data4, data5)
self.assertEqual(len(combined), len(data4) + len(data5))
# data2.instances = csr_matrix()
if __name__ == '__main__':
unittest.main()