adding documentation

This commit is contained in:
Alejandro Moreo Fernandez 2023-02-14 18:04:13 +01:00
parent 49fc486c53
commit 9aa53db6ef
5 changed files with 135 additions and 57 deletions

View File

@@ -80,13 +80,13 @@ Take a look at the following code:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.17</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">]</span>
</pre></div>
</div>
<p>One can easily produce new samples at desired class prevalences:</p>
<p>One can easily produce new samples at desired class prevalence values:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">sample_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">prev</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
<span class="n">sample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;instances:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">classes</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;prevalence:&#39;</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></div>
</div>
@@ -109,29 +109,10 @@ the indexes, that can then be used to generate the sample:</p>
<span class="o">...</span>
</pre></div>
</div>
<p>QuaPy also implements the artificial sampling protocol that produces (via a
Python generator) a series of <em>LabelledCollection</em> objects with equidistant
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">artificial_sampling_generator</span><span class="p">(</span><span class="n">sample_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevalences</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></div>
</div>
<p>produces one sample for each (valid) combination of prevalence values resulting from
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
that is:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
<span class="o">...</span>
<span class="p">[</span><span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
</pre></div>
</div>
<p>See the <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">Evaluation wiki</a> for
further details on how to use the artificial sampling protocol to properly
evaluate a quantification method.</p>
<p>However, generating samples for evaluation purposes is tackled in QuaPy
by means of the evaluation protocols (see the dedicated entries in the Wiki
for <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluation</a> and
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">protocols</a>).</p>
<section id="reviews-datasets">
<h2>Reviews Datasets<a class="headerlink" href="#reviews-datasets" title="Permalink to this heading"></a></h2>
<p>Three datasets of reviews about Kindle devices, Harry Potter's series, and
@@ -636,6 +617,78 @@ time the dataset is invoked.</p></li>
</ul>
</section>
</section>
<section id="lequa-datasets">
<h2>LeQua Datasets<a class="headerlink" href="#lequa-datasets" title="Permalink to this heading"></a></h2>
<p>QuaPy also provides the datasets used for the LeQua competition.
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
raw documents instead.
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
are multiclass quantification problems consisting of estimating the class prevalence
values of 28 different merchandise products.</p>
<p>Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training) and two generation protocols (for validation and test samples), as follows:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">training</span><span class="p">,</span> <span class="n">val_generator</span><span class="p">,</span> <span class="n">test_generator</span> <span class="o">=</span> <span class="n">fetch_lequa2022</span><span class="p">(</span><span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
</pre></div>
</div>
<p>See the script <code class="docutils literal notranslate"><span class="pre">lequa2022_experiments.py</span></code> in the examples folder for further details on how to
carry out experiments using these datasets.</p>
<p>The datasets are downloaded only once, and stored for fast reuse.</p>
<p>Some statistics are summarized below:</p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Dataset</p></th>
<th class="head text-center"><p>classes</p></th>
<th class="head text-center"><p>train size</p></th>
<th class="head text-center"><p>validation samples</p></th>
<th class="head text-center"><p>test samples</p></th>
<th class="head text-center"><p>docs by sample</p></th>
<th class="head text-center"><p>type</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>T1A</p></td>
<td class="text-center"><p>2</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>250</p></td>
<td class="text-center"><p>vector</p></td>
</tr>
<tr class="row-odd"><td><p>T1B</p></td>
<td class="text-center"><p>28</p></td>
<td class="text-center"><p>20000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>vector</p></td>
</tr>
<tr class="row-even"><td><p>T2A</p></td>
<td class="text-center"><p>2</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>250</p></td>
<td class="text-center"><p>text</p></td>
</tr>
<tr class="row-odd"><td><p>T2B</p></td>
<td class="text-center"><p>28</p></td>
<td class="text-center"><p>20000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>5000</p></td>
<td class="text-center"><p>1000</p></td>
<td class="text-center"><p>text</p></td>
</tr>
</tbody>
</table>
<p>For further details on the datasets, we refer to the original
<a class="reference external" href="https://ceur-ws.org/Vol-3180/paper-146.pdf">paper</a>:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">Esuli</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Moreo</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Sebastiani</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="p">,</span> <span class="o">&amp;</span> <span class="n">Sperduti</span><span class="p">,</span> <span class="n">G</span><span class="o">.</span> <span class="p">(</span><span class="mi">2022</span><span class="p">)</span><span class="o">.</span>
<span class="n">A</span> <span class="n">Detailed</span> <span class="n">Overview</span> <span class="n">of</span> <span class="n">LeQua</span><span class="o">@</span> <span class="n">CLEF</span> <span class="mi">2022</span><span class="p">:</span> <span class="n">Learning</span> <span class="n">to</span> <span class="n">Quantify</span><span class="o">.</span>
</pre></div>
</div>
</section>
<section id="adding-custom-datasets">
<h2>Adding Custom Datasets<a class="headerlink" href="#adding-custom-datasets" title="Permalink to this heading"></a></h2>
<p>QuaPy provides data loaders for simple formats dealing with
@@ -667,12 +720,15 @@ all classes to be present in the collection).</p>
paths, in order to create a training and test pair of <em>LabelledCollection</em>,
e.g.:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
<span class="n">train_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/train.dat&#39;</span>
<span class="n">test_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/test.dat&#39;</span>
<span class="k">def</span> <span class="nf">my_custom_loader</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
<span class="o">...</span>
<span class="k">return</span> <span class="n">instances</span><span class="p">,</span> <span class="n">labels</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">test_path</span><span class="p">,</span> <span class="n">my_custom_loader</span><span class="p">)</span>
</pre></div>
</div>
@@ -707,6 +763,7 @@ that the column values have zero mean and unit variance).</p></li>
<li><a class="reference internal" href="#issues">Issues:</a></li>
</ul>
</li>
<li><a class="reference internal" href="#lequa-datasets">LeQua Datasets</a></li>
<li><a class="reference internal" href="#adding-custom-datasets">Adding Custom Datasets</a><ul>
<li><a class="reference internal" href="#data-processing">Data Processing</a></li>
</ul>

View File

@@ -30,7 +30,7 @@ Output the class prevalences (showing 2 digit precision):
[0.17, 0.50, 0.33]
```
One can easily produce new samples at desired class prevalences:
One can easily produce new samples at desired class prevalence values:
```python
sample_size = 10
@@ -38,7 +38,7 @@ prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)
print('instances:', sample.instances)
print('labels:', sample.classes)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```
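When the same sample must be reused (e.g., to test several methods on identical data), one can instead draw only the sampling indexes and materialize the sample from them afterwards. The following is a minimal sketch, assuming the `sampling_index` / `sampling_from_index` pair of _LabelledCollection_ and reusing the variables defined above:
```python
# draw only the indexes of a sample at the desired prevalence values...
index = data.sampling_index(sample_size, *prev)

# ...and materialize the very same sample whenever needed
sample = data.sampling_from_index(index)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```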
@@ -64,32 +64,10 @@ for method in methods:
...
```
QuaPy also implements the artificial sampling protocol that produces (via a
Python generator) a series of _LabelledCollection_ objects with equidistant
prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:
```python
for sample in data.artificial_sampling_generator(sample_size=100, n_prevalences=5):
print(F.strprev(sample.prevalence(), prec=2))
```
produces one sample for each (valid) combination of prevalence values resulting from
splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
that is:
```
[0.00, 0.00, 1.00]
[0.00, 0.25, 0.75]
[0.00, 0.50, 0.50]
[0.00, 0.75, 0.25]
[0.00, 1.00, 0.00]
[0.25, 0.00, 0.75]
...
[1.00, 0.00, 0.00]
```
See the [Evaluation wiki](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) for
further details on how to use the artificial sampling protocol to properly
evaluate a quantification method.
However, generating samples for evaluation purposes is tackled in QuaPy
by means of the evaluation protocols (see the dedicated entries in the Wiki
for [evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) and
[protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)).
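For orientation, the snippet below sketches how one such protocol can be used to draw evaluation samples; it assumes the `APP` (artificial prevalence protocol) class from `quapy.protocol`, available in recent QuaPy versions, and reuses the `data` collection and `F` alias from the snippets above:
```python
from quapy.protocol import APP  # assumed import path for the artificial prevalence protocol

# draw samples of 100 documents at equidistant prevalence values spanning the simplex
prot = APP(data, sample_size=100, n_prevalences=5, repeats=1, random_state=0)
for instances, prev in prot():
    print(F.strprev(prev, prec=2))
```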
## Reviews Datasets
@@ -178,6 +156,8 @@ Some details can be found below:
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
## UCI Machine Learning
A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php)
@@ -273,6 +253,46 @@ standard Python packages like gzip or zip. This file would need to be uncompressed using
OS-dependent software manually. Information on how to do it will be printed the first
time the dataset is invoked.
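As a quick (hypothetical) usage sketch, any of these datasets can be fetched as a train/test _Dataset_ in a single call; the dataset name below is just an example:
```python
import quapy as qp

# downloads the dataset on first use and caches it for subsequent calls
dataset = qp.datasets.fetch_UCIDataset('yeast', verbose=True)
print(dataset.training.prevalence())
```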
## LeQua Datasets
QuaPy also provides the datasets used for the LeQua competition.
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
raw documents instead.
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
are multiclass quantification problems consisting of estimating the class prevalence
values of 28 different merchandise products.
Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training) and two generation protocols (for validation and test samples), as follows:
```python
training, val_generator, test_generator = fetch_lequa2022(task=task)
```
See the script `lequa2022_experiments.py` in the examples folder for further details on how to
carry out experiments using these datasets.
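The snippet below sketches one possible end-to-end run; the choice of task and quantifier is merely illustrative, and the evaluation call assumes QuaPy's `evaluate` function, which accepts a generation protocol directly:
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import PACC  # an example aggregative quantifier

# fetch task T1A (binary, vector form), train a quantifier, and evaluate it
training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task='T1A')
model = PACC(LogisticRegression()).fit(training)

# the test generation protocol is passed directly to the evaluation routine
mae = qp.evaluation.evaluate(model, protocol=test_generator, error_metric='mae')
print(f'test MAE = {mae:.4f}')
```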
The datasets are downloaded only once, and stored for fast reuse.
Some statistics are summarized below:
| Dataset | classes | train size | validation samples | test samples | docs per sample | type |
|---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:|
| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T2A | 2 | 5000 | 1000 | 5000 | 250 | text |
| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text |
For further details on the datasets, we refer to the original
[paper](https://ceur-ws.org/Vol-3180/paper-146.pdf):
```
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify.
```
## Adding Custom Datasets
QuaPy provides data loaders for simple formats dealing with
@@ -313,12 +333,15 @@ e.g.:
```python
import quapy as qp
train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'
def my_custom_loader(path):
with open(path, 'rb') as fin:
...
return instances, labels
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
```
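For illustration only, the loader could be fleshed out as follows for a hypothetical format with one `label<TAB>text` pair per line (the file layout is an assumption of this sketch, not a format required by QuaPy):
```python
import quapy as qp

def my_custom_loader(path):
    # hypothetical format: one "<label>\t<document text>" pair per line
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            label, text = line.rstrip('\n').split('\t', maxsplit=1)
            labels.append(label)
            instances.append(text)
    return instances, labels

# creates a Dataset whose training and test sets are LabelledCollection objects
data = qp.data.Dataset.load('../my_data/train.dat', '../my_data/test.dat', my_custom_loader)
```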

View File

@@ -123,6 +123,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#reviews-datasets">Reviews Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#twitter-sentiment-datasets">Twitter Sentiment Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#uci-machine-learning">UCI Machine Learning</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#lequa-datasets">LeQua Datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="Datasets.html#adding-custom-datasets">Adding Custom Datasets</a></li>
</ul>
</li>

File diff suppressed because one or more lines are too long

View File

@@ -60,9 +60,6 @@ class LabelCollectionTestCase(unittest.TestCase):
combined = qp.data.LabelledCollection.join(data4, data5)
self.assertEqual(len(combined), len(data4) + len(data5))
# data2.instances = csr_matrix()
if __name__ == '__main__':
unittest.main()