forked from moreo/QuaPy
adding documentation
parent 49fc486c53
commit 9aa53db6ef
|
@ -80,13 +80,13 @@ Take a look at the following code:</p>
|
|||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.17</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">]</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>One can easily produce new samples at desired class prevalences:</p>
|
||||
<p>One can easily produce new samples at desired class prevalence values:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">sample_size</span> <span class="o">=</span> <span class="mi">10</span>
|
||||
<span class="n">prev</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
|
||||
<span class="n">sample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">)</span>
|
||||
|
||||
<span class="nb">print</span><span class="p">(</span><span class="s1">'instances:'</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="s1">'labels:'</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">classes</span><span class="p">)</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="s1">'labels:'</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="s1">'prevalence:'</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
|
@ -109,29 +109,10 @@ the indexes, that can then be used to generate the sample:</p>
|
|||
<span class="o">...</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>QuaPy also implements the artificial sampling protocol that produces (via a
|
||||
Python generator) a series of <em>LabelledCollection</em> objects with equidistant
|
||||
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">artificial_sampling_generator</span><span class="p">(</span><span class="n">sample_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevalences</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
|
||||
<span class="nb">print</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>produces one sampling for each (valid) combination of prevalences originating from
|
||||
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
|
||||
that is:</p>
|
||||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
|
||||
<span class="p">[</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
|
||||
<span class="o">...</span>
|
||||
<span class="p">[</span><span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>See the <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">Evaluation wiki</a> for
|
||||
further details on how to use the artificial sampling protocol to properly
|
||||
evaluate a quantification method.</p>
|
||||
<p>However, generating samples for evaluation purposes is tackled in QuaPy
|
||||
by means of the evaluation protocols (see the dedicated entries in the Wiki
|
||||
for <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluation</a> and
|
||||
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">protocols</a>).</p>
|
||||
<section id="reviews-datasets">
|
||||
<h2>Reviews Datasets<a class="headerlink" href="#reviews-datasets" title="Permalink to this heading">¶</a></h2>
|
||||
<p>Three datasets of reviews about Kindle devices, Harry Potter’s series, and
|
||||
|
@ -636,6 +617,78 @@ time the dataset is invoked.</p></li>
|
|||
</ul>
|
||||
</section>
|
||||
</section>
|
||||
<section id="lequa-datasets">
|
||||
<h2>LeQua Datasets<a class="headerlink" href="#lequa-datasets" title="Permalink to this heading">¶</a></h2>
|
||||
<p>QuaPy also provides the datasets used for the LeQua competition.
|
||||
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
|
||||
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
|
||||
raw documents instead.
|
||||
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
|
||||
are multiclass quantification problems consisting of estimating the class prevalence
|
||||
values of 28 different merchandise products.</p>
|
||||
<p>Every task consists of a training set, a set of validation samples (for model selection)
|
||||
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
|
||||
(training) and two generation protocols (for validation and test samples), as follows:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">training</span><span class="p">,</span> <span class="n">val_generator</span><span class="p">,</span> <span class="n">test_generator</span> <span class="o">=</span> <span class="n">fetch_lequa2022</span><span class="p">(</span><span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>See <code class="docutils literal notranslate"><span class="pre">lequa2022_experiments.py</span></code> in the examples folder for further details on how to
|
||||
carry out experiments using these datasets.</p>
|
||||
<p>The datasets are downloaded only once, and stored for fast reuse.</p>
|
||||
<p>Some statistics are summarized below:</p>
|
||||
<table class="docutils align-default">
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Dataset</p></th>
|
||||
<th class="head text-center"><p>classes</p></th>
|
||||
<th class="head text-center"><p>train size</p></th>
|
||||
<th class="head text-center"><p>validation samples</p></th>
|
||||
<th class="head text-center"><p>test samples</p></th>
|
||||
<th class="head text-center"><p>docs per sample</p></th>
|
||||
<th class="head text-center"><p>type</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p>T1A</p></td>
|
||||
<td class="text-center"><p>2</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>250</p></td>
|
||||
<td class="text-center"><p>vector</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>T1B</p></td>
|
||||
<td class="text-center"><p>28</p></td>
|
||||
<td class="text-center"><p>20000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>vector</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>T2A</p></td>
|
||||
<td class="text-center"><p>2</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>250</p></td>
|
||||
<td class="text-center"><p>text</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>T2B</p></td>
|
||||
<td class="text-center"><p>28</p></td>
|
||||
<td class="text-center"><p>20000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>5000</p></td>
|
||||
<td class="text-center"><p>1000</p></td>
|
||||
<td class="text-center"><p>text</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>For further details on the datasets, we refer to the original
|
||||
<a class="reference external" href="https://ceur-ws.org/Vol-3180/paper-146.pdf">paper</a>:</p>
|
||||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">Esuli</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Moreo</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Sebastiani</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="p">,</span> <span class="o">&</span> <span class="n">Sperduti</span><span class="p">,</span> <span class="n">G</span><span class="o">.</span> <span class="p">(</span><span class="mi">2022</span><span class="p">)</span><span class="o">.</span>
|
||||
<span class="n">A</span> <span class="n">Detailed</span> <span class="n">Overview</span> <span class="n">of</span> <span class="n">LeQua</span><span class="o">@</span> <span class="n">CLEF</span> <span class="mi">2022</span><span class="p">:</span> <span class="n">Learning</span> <span class="n">to</span> <span class="n">Quantify</span><span class="o">.</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</section>
|
||||
<section id="adding-custom-datasets">
|
||||
<h2>Adding Custom Datasets<a class="headerlink" href="#adding-custom-datasets" title="Permalink to this heading">¶</a></h2>
|
||||
<p>QuaPy provides data loaders for simple formats dealing with
|
||||
|
@ -667,12 +720,15 @@ all classes to be present in the collection).</p>
|
|||
paths, in order to create a training and test pair of <em>LabelledCollection</em>,
|
||||
e.g.:</p>
|
||||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||||
|
||||
<span class="n">train_path</span> <span class="o">=</span> <span class="s1">'../my_data/train.dat'</span>
|
||||
<span class="n">test_path</span> <span class="o">=</span> <span class="s1">'../my_data/test.dat'</span>
|
||||
|
||||
<span class="k">def</span> <span class="nf">my_custom_loader</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
|
||||
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
|
||||
<span class="o">...</span>
|
||||
<span class="k">return</span> <span class="n">instances</span><span class="p">,</span> <span class="n">labels</span>
|
||||
|
||||
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">test_path</span><span class="p">,</span> <span class="n">my_custom_loader</span><span class="p">)</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
|
@ -707,6 +763,7 @@ that the column values have zero mean and unit variance).</p></li>
|
|||
<li><a class="reference internal" href="#issues">Issues:</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li><a class="reference internal" href="#lequa-datasets">LeQua Datasets</a></li>
|
||||
<li><a class="reference internal" href="#adding-custom-datasets">Adding Custom Datasets</a><ul>
|
||||
<li><a class="reference internal" href="#data-processing">Data Processing</a></li>
|
||||
</ul>
|
||||
|
|
|
@ -30,7 +30,7 @@ Output the class prevalences (showing 2 digit precision):
|
|||
[0.17, 0.50, 0.33]
|
||||
```
|
||||
|
||||
One can easily produce new samples at desired class prevalences:
|
||||
One can easily produce new samples at desired class prevalence values:
|
||||
|
||||
```python
|
||||
sample_size = 10
|
||||
|
@ -38,7 +38,7 @@ prev = [0.4, 0.1, 0.5]
|
|||
sample = data.sampling(sample_size, *prev)
|
||||
|
||||
print('instances:', sample.instances)
|
||||
print('labels:', sample.classes)
|
||||
print('labels:', sample.labels)
|
||||
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
|
||||
```
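The idea behind sampling at a fixed prevalence can be sketched in plain NumPy. This is only an illustration under stated assumptions, not QuaPy's implementation: the helper name `sample_indexes_at_prevalence`, its rounding rule, and its fallback to sampling with replacement are all hypothetical choices made for this example.

```python
import numpy as np

def sample_indexes_at_prevalence(labels, sample_size, prevalence, seed=0):
    """Hypothetical helper: pick indexes so class proportions match `prevalence`."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    rng = np.random.default_rng(seed)
    # Round the per-class counts, then let the last class absorb the rounding error
    counts = np.round(np.asarray(prevalence) * sample_size).astype(int)
    counts[-1] = sample_size - counts[:-1].sum()
    chosen = []
    for cls, n in zip(classes, counts):
        pool = np.flatnonzero(labels == cls)
        # Fall back to sampling with replacement only if the class is too small
        chosen.append(rng.choice(pool, size=n, replace=n > len(pool)))
    return np.concatenate(chosen)

# Toy label vector with 50, 30 and 20 instances of classes 0, 1 and 2
labels = np.repeat([0, 1, 2], [50, 30, 20])
idx = sample_indexes_at_prevalence(labels, sample_size=10, prevalence=[0.4, 0.1, 0.5])
observed = np.bincount(labels[idx], minlength=3) / len(idx)
print(observed)
```

Working with indexes rather than instances keeps the sketch independent of how the instances are stored (dense arrays, sparse matrices, or raw text).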
|
||||
|
||||
|
@ -64,32 +64,10 @@ for method in methods:
|
|||
...
|
||||
```
|
||||
|
||||
QuaPy also implements the artificial sampling protocol that produces (via a
|
||||
Python generator) a series of _LabelledCollection_ objects with equidistant
|
||||
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:
|
||||
|
||||
```python
|
||||
for sample in data.artificial_sampling_generator(sample_size=100, n_prevalences=5):
|
||||
print(F.strprev(sample.prevalence(), prec=2))
|
||||
```
|
||||
|
||||
produces one sampling for each (valid) combination of prevalences originating from
|
||||
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
|
||||
that is:
|
||||
```
|
||||
[0.00, 0.00, 1.00]
|
||||
[0.00, 0.25, 0.75]
|
||||
[0.00, 0.50, 0.50]
|
||||
[0.00, 0.75, 0.25]
|
||||
[0.00, 1.00, 0.00]
|
||||
[0.25, 0.00, 0.75]
|
||||
...
|
||||
[1.00, 0.00, 0.00]
|
||||
```
|
||||
|
||||
See the [Evaluation wiki](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) for
|
||||
further details on how to use the artificial sampling protocol to properly
|
||||
evaluate a quantification method.
|
||||
However, generating samples for evaluation purposes is tackled in QuaPy
|
||||
by means of the evaluation protocols (see the dedicated entries in the Wiki
|
||||
for [evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) and
|
||||
[protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)).
|
||||
|
||||
|
||||
## Reviews Datasets
|
||||
|
@ -178,6 +156,8 @@ Some details can be found below:
|
|||
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
|
||||
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
|
||||
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
|
||||
|
||||
|
||||
## UCI Machine Learning
|
||||
|
||||
A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php)
|
||||
|
@ -273,6 +253,46 @@ standard Pythons packages like gzip or zip. This file would need to be uncompres
|
|||
OS-dependent software manually. Information on how to do it will be printed the first
|
||||
time the dataset is invoked.
|
||||
|
||||
## LeQua Datasets
|
||||
|
||||
QuaPy also provides the datasets used for the LeQua competition.
|
||||
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
|
||||
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
|
||||
raw documents instead.
|
||||
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
|
||||
are multiclass quantification problems consisting of estimating the class prevalence
|
||||
values of 28 different merchandise products.
|
||||
|
||||
Every task consists of a training set, a set of validation samples (for model selection)
|
||||
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
|
||||
(training) and two generation protocols (for validation and test samples), as follows:
|
||||
|
||||
```python
|
||||
training, val_generator, test_generator = fetch_lequa2022(task=task)
|
||||
```
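To give a rough idea of how a set of samples with known prevalence values can be used for evaluation, the sketch below averages the absolute error of a naive baseline over a stream of samples. Everything in it is an assumption made for illustration: `toy_protocol` is a hypothetical stand-in that yields `(sample, true_prevalence)` pairs, and it does not reproduce the interface of the generators returned by `fetch_lequa2022`; refer to the examples folder for the actual usage.

```python
import numpy as np

def toy_protocol(n_samples, n_classes, seed=0):
    """Hypothetical stand-in for a generation protocol (not the QuaPy API)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        # Draw a random prevalence vector from the simplex
        true_prev = rng.dirichlet(np.ones(n_classes))
        sample = None  # placeholder: a real sample would hold the documents
        yield sample, true_prev

def mean_absolute_error(protocol, estimate):
    """Average, over all samples, the mean absolute error between prevalence vectors."""
    errors = [np.abs(estimate(sample) - true_prev).mean()
              for sample, true_prev in protocol]
    return float(np.mean(errors))

# Naive baseline: always predict the (assumed) training prevalence
train_prev = np.array([0.5, 0.5])
mae = mean_absolute_error(toy_protocol(n_samples=100, n_classes=2),
                          lambda sample: train_prev)
print(f'MAE of the baseline: {mae:.3f}')
```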
|
||||
|
||||
See `lequa2022_experiments.py` in the examples folder for further details on how to
|
||||
carry out experiments using these datasets.
|
||||
|
||||
The datasets are downloaded only once, and stored for fast reuse.
|
||||
|
||||
Some statistics are summarized below:
|
||||
|
||||
| Dataset | classes | train size | validation samples | test samples | docs per sample | type |
|
||||
|---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:|
|
||||
| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector |
|
||||
| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector |
|
||||
| T2A | 2 | 5000 | 1000 | 5000 | 250 | text |
|
||||
| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text |
|
||||
|
||||
For further details on the datasets, we refer to the original
|
||||
[paper](https://ceur-ws.org/Vol-3180/paper-146.pdf):
|
||||
|
||||
```
|
||||
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
|
||||
A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.
|
||||
```
|
||||
|
||||
## Adding Custom Datasets
|
||||
|
||||
QuaPy provides data loaders for simple formats dealing with
|
||||
|
@ -313,12 +333,15 @@ e.g.:
|
|||
|
||||
```python
|
||||
import quapy as qp
|
||||
|
||||
train_path = '../my_data/train.dat'
|
||||
test_path = '../my_data/test.dat'
|
||||
|
||||
def my_custom_loader(path):
|
||||
with open(path, 'rb') as fin:
|
||||
...
|
||||
return instances, labels
|
||||
|
||||
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
|
||||
```
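As a concrete illustration of what such a loader might look like, the sketch below parses a hypothetical file format with one `label<TAB>text` pair per line. The format, the function name `tab_separated_loader`, and the use of text mode (the snippet above opens the file in binary mode) are assumptions made for this example only.

```python
import os
import tempfile

def tab_separated_loader(path):
    """Hypothetical loader: each line holds an integer label, a tab, and the text."""
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            label, text = line.split('\t', 1)
            labels.append(int(label))
            instances.append(text)
    return instances, labels

# Write a tiny file in the assumed format and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.dat')
    with open(path, 'wt', encoding='utf-8') as fout:
        fout.write('1\tgreat product\n0\tnot worth the price\n')
    instances, labels = tab_separated_loader(path)

print(instances, labels)
```

The only contract the loader has to honour, as in the snippet above, is taking a path and returning the instances together with their labels.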
|
||||
|
||||
|
|
|
@ -123,6 +123,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
|
|||
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#reviews-datasets">Reviews Datasets</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#twitter-sentiment-datasets">Twitter Sentiment Datasets</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#uci-machine-learning">UCI Machine Learning</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#lequa-datasets">LeQua Datasets</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#adding-custom-datasets">Adding Custom Datasets</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
|
|
File diff suppressed because one or more lines are too long
|
@ -60,9 +60,6 @@ class LabelCollectionTestCase(unittest.TestCase):
|
|||
combined = qp.data.LabelledCollection.join(data4, data5)
|
||||
self.assertEqual(len(combined), len(data4) + len(data5))
|
||||
|
||||
# data2.instances = csr_matrix()
|
||||
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
|
|