adding documentation

2023-02-14 18:04:13 +01:00 · 2023-02-14 18:04:13 +01:00 · 9aa53db6ef
parent 49fc486c53
commit 9aa53db6ef
5 changed files with 135 additions and 57 deletions
--- a/docs/build/html/Datasets.html
+++ b/docs/build/html/Datasets.html
@@ -80,13 +80,13 @@ Take a look at the following code:</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.17</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">]</span>
 </pre></div>
 </div>
-<p>One can easily produce new samples at desired class prevalences:</p>
+<p>One can easily produce new samples at desired class prevalence values:</p>
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">sample_size</span> <span class="o">=</span> <span class="mi">10</span>
 <span class="n">prev</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
 <span class="n">sample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;instances:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">instances</span><span class="p">)</span>
-<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">classes</span><span class="p">)</span>
+<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;labels:&#39;</span><span class="p">,</span> <span class="n">sample</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span>
 <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;prevalence:&#39;</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
 </pre></div>
 </div>
@@ -109,29 +109,10 @@ the indexes, that can then be used to generate the sample:</p>
    <span class="o">...</span>
 </pre></div>
 </div>
-<p>QuaPy also implements the artificial sampling protocol that produces (via a
-Python’s generator) a series of <em>LabelledCollection</em> objects with equidistant
-prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:</p>
-<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">artificial_sampling_generator</span><span class="p">(</span><span class="n">sample_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_prevalences</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
-   <span class="nb">print</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">strprev</span><span class="p">(</span><span class="n">sample</span><span class="o">.</span><span class="n">prevalence</span><span class="p">(),</span> <span class="n">prec</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
-</pre></div>
-</div>
-<p>produces one sampling for each (valid) combination of prevalences originating from
-splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]),
-that is:</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">]</span>
-<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
-<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">,</span> <span class="mf">0.50</span><span class="p">]</span>
-<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">]</span>
-<span class="p">[</span><span class="mf">0.00</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
-<span class="p">[</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">]</span>
-<span class="o">...</span>
-<span class="p">[</span><span class="mf">1.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">,</span> <span class="mf">0.00</span><span class="p">]</span>
-</pre></div>
-</div>
-<p>See the <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">Evaluation wiki</a> for
-further details on how to use the artificial sampling protocol to properly
-evaluate a quantification method.</p>
+<p>However, generating samples for evaluation purposes is tackled in QuaPy
+by means of the evaluation protocols (see the dedicated entries in the Wiki
+for <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluation</a> and
+<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">protocols</a>).</p>
 <section id="reviews-datasets">
 <h2>Reviews Datasets<a class="headerlink" href="#reviews-datasets" title="Permalink to this heading">¶</a></h2>
 <p>Three datasets of reviews about Kindle devices, Harry Potter’s series, and
@@ -636,6 +617,78 @@ time the dataset is invoked.</p></li>
 </ul>
 </section>
 </section>
 <section id="lequa-datasets">
 <h2>LeQua Datasets<a class="headerlink" href="#lequa-datasets" title="Permalink to this heading">¶</a></h2>
 <p>QuaPy also provides the datasets used for the LeQua competition.
 In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
 problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
 raw documents instead.
 Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
 are multiclass quantification problems consisting of estimating the class prevalence
 values of 28 different merchandise products.</p>
 <p>Every task consists of a training set, a set of validation samples (for model selection)
 and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
 (training) and two generation protocols (for validation and test samples), as follows:</p>
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">training</span><span class="p">,</span> <span class="n">val_generator</span><span class="p">,</span> <span class="n">test_generator</span> <span class="o">=</span> <span class="n">fetch_lequa2022</span><span class="p">(</span><span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
 </pre></div>
 </div>
 <p>See the <code class="docutils literal notranslate"><span class="pre">lequa2022_experiments.py</span></code> in the examples folder for further details on how to
 carry out experiments using these datasets.</p>
 <p>The datasets are downloaded only once, and stored for fast reuse.</p>
 <p>Some statistics are summarized below:</p>
 <table class="docutils align-default">
 <thead>
 <tr class="row-odd"><th class="head"><p>Dataset</p></th>
 <th class="head text-center"><p>classes</p></th>
 <th class="head text-center"><p>train size</p></th>
 <th class="head text-center"><p>validation samples</p></th>
 <th class="head text-center"><p>test samples</p></th>
 <th class="head text-center"><p>docs per sample</p></th>
 <th class="head text-center"><p>type</p></th>
 </tr>
 </thead>
 <tbody>
 <tr class="row-even"><td><p>T1A</p></td>
 <td class="text-center"><p>2</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>250</p></td>
 <td class="text-center"><p>vector</p></td>
 </tr>
 <tr class="row-odd"><td><p>T1B</p></td>
 <td class="text-center"><p>28</p></td>
 <td class="text-center"><p>20000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>vector</p></td>
 </tr>
 <tr class="row-even"><td><p>T2A</p></td>
 <td class="text-center"><p>2</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>250</p></td>
 <td class="text-center"><p>text</p></td>
 </tr>
 <tr class="row-odd"><td><p>T2B</p></td>
 <td class="text-center"><p>28</p></td>
 <td class="text-center"><p>20000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>5000</p></td>
 <td class="text-center"><p>1000</p></td>
 <td class="text-center"><p>text</p></td>
 </tr>
 </tbody>
 </table>
 <p>For further details on the datasets, we refer to the original
 <a class="reference external" href="https://ceur-ws.org/Vol-3180/paper-146.pdf">paper</a>:</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">Esuli</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Moreo</span><span class="p">,</span> <span class="n">A</span><span class="o">.</span><span class="p">,</span> <span class="n">Sebastiani</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="p">,</span> <span class="o">&amp;</span> <span class="n">Sperduti</span><span class="p">,</span> <span class="n">G</span><span class="o">.</span> <span class="p">(</span><span class="mi">2022</span><span class="p">)</span><span class="o">.</span>
 <span class="n">A</span> <span class="n">Detailed</span> <span class="n">Overview</span> <span class="n">of</span> <span class="n">LeQua</span><span class="o">@</span> <span class="n">CLEF</span> <span class="mi">2022</span><span class="p">:</span> <span class="n">Learning</span> <span class="n">to</span> <span class="n">Quantify</span><span class="o">.</span>
 </pre></div>
 </div>
 </section>
 <section id="adding-custom-datasets">
 <h2>Adding Custom Datasets<a class="headerlink" href="#adding-custom-datasets" title="Permalink to this heading">¶</a></h2>
 <p>QuaPy provides data loaders for simple formats dealing with
@@ -667,12 +720,15 @@ all classes to be present in the collection).</p>
 paths, in order to create a training and test pair of <em>LabelledCollection</em>,
 e.g.:</p>
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
 <span class="n">train_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/train.dat&#39;</span>
 <span class="n">test_path</span> <span class="o">=</span> <span class="s1">&#39;../my_data/test.dat&#39;</span>
 <span class="k">def</span> <span class="nf">my_custom_loader</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
        <span class="o">...</span>
    <span class="k">return</span> <span class="n">instances</span><span class="p">,</span> <span class="n">labels</span>
 <span class="n">data</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">test_path</span><span class="p">,</span> <span class="n">my_custom_loader</span><span class="p">)</span>
 </pre></div>
 </div>
@@ -707,6 +763,7 @@ that the column values have zero mean and unit variance).</p></li>
 <li><a class="reference internal" href="#issues">Issues:</a></li>
 </ul>
 </li>
 <li><a class="reference internal" href="#lequa-datasets">LeQua Datasets</a></li>
 <li><a class="reference internal" href="#adding-custom-datasets">Adding Custom Datasets</a><ul>
 <li><a class="reference internal" href="#data-processing">Data Processing</a></li>
 </ul>
--- a/docs/build/html/_sources/Datasets.md.txt
+++ b/docs/build/html/_sources/Datasets.md.txt
@@ -30,7 +30,7 @@ Output the class prevalences (showing 2 digit precision):
 [0.17, 0.50, 0.33]
 ```
-One can easily produce new samples at desired class prevalences:
+One can easily produce new samples at desired class prevalence values:
 ```python
 sample_size = 10
@@ -38,7 +38,7 @@ prev = [0.4, 0.1, 0.5]
 sample = data.sampling(sample_size, *prev)
 print('instances:', sample.instances)
-print('labels:', sample.classes)
+print('labels:', sample.labels)
 print('prevalence:', F.strprev(sample.prevalence(), prec=2))
 ```
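For readers unfamiliar with what `data.sampling(sample_size, *prev)` in this hunk is meant to do, here is a minimal standard-library sketch of drawing indexes at requested class prevalence values. It is not QuaPy's implementation: the function name `sample_indexes_at_prevalence`, its `seed` parameter, and the rounding strategy (largest remainder) are assumptions made purely for illustration.

```python
import random


def sample_indexes_at_prevalence(labels, size, prevs, seed=0):
    """Hypothetical helper: pick `size` indexes so that class k receives
    (up to rounding) size * prevs[k] of them."""
    classes = sorted(set(labels))
    if len(prevs) != len(classes):
        raise ValueError('one prevalence value per class is required')
    if abs(sum(prevs) - 1.0) > 1e-9:
        raise ValueError('prevalence values must sum to 1')

    # Largest-remainder rounding: floor every quota, then hand the
    # leftover units to the classes with the largest fractional parts.
    quotas = [int(size * p) for p in prevs]
    leftover = size - sum(quotas)
    by_remainder = sorted(range(len(prevs)),
                          key=lambda k: size * prevs[k] - quotas[k],
                          reverse=True)
    for k in by_remainder[:leftover]:
        quotas[k] += 1

    rng = random.Random(seed)
    picked = []
    for cls, quota in zip(classes, quotas):
        pool = [i for i, y in enumerate(labels) if y == cls]
        if quota <= len(pool):
            picked += rng.sample(pool, quota)     # without replacement
        else:
            picked += rng.choices(pool, k=quota)  # class too small: with replacement
    rng.shuffle(picked)
    return picked


# Toy labels standing in for a three-class collection.
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
indexes = sample_indexes_at_prevalence(toy_labels, 10, [0.4, 0.1, 0.5])
print(sorted(toy_labels[i] for i in indexes))
```

Returning indexes rather than the sampled items mirrors the index-based usage mentioned in the next hunk's context ("the indexes, that can then be used to generate the sample"), so the same draw can be reused across methods.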
@@ -64,32 +64,10 @@ for method in methods:
    ...
 ```
-QuaPy also implements the artificial sampling protocol that produces (via a
-Python's generator) a series of _LabelledCollection_ objects with equidistant 
-prevalences ranging across the entire prevalence spectrum in the simplex space, e.g.:
-
-```python
-for sample in data.artificial_sampling_generator(sample_size=100, n_prevalences=5):
-   print(F.strprev(sample.prevalence(), prec=2))
-```
-produces one sampling for each (valid) combination of prevalences originating from
-splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]), 
-that is:
-```
-[0.00, 0.00, 1.00]
-[0.00, 0.25, 0.75]
-[0.00, 0.50, 0.50]
-[0.00, 0.75, 0.25]
-[0.00, 1.00, 0.00]
-[0.25, 0.00, 0.75]
-...
-[1.00, 0.00, 0.00]
-```
-See the [Evaluation wiki](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) for 
-further details on how to use the artificial sampling protocol to properly
-evaluate a quantification method.
+However, generating samples for evaluation purposes is tackled in QuaPy
+by means of the evaluation protocols (see the dedicated entries in the Wiki
+for [evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) and 
+[protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)).
 ## Reviews Datasets
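The artificial sampling passage in this hunk describes a grid of "valid" prevalence combinations, i.e. those that sum to 1 after splitting [0,1] into `n_prevalences` points. A small standard-library sketch of how such a grid can be enumerated follows; the function name `prevalence_grid` is hypothetical and this is not the code QuaPy uses internally.

```python
from itertools import product


def prevalence_grid(n_classes, n_prevalences):
    """Yield every prevalence vector whose entries lie on the grid
    {0, 1/(n-1), ..., 1} and sum to exactly 1 (n = n_prevalences)."""
    steps = n_prevalences - 1
    # Enumerate integer step counts for the first n_classes-1 classes;
    # the last class takes whatever is left, so the sum is exact and
    # no floating-point tolerance is needed.
    for combo in product(range(steps + 1), repeat=n_classes - 1):
        if sum(combo) <= steps:
            full = combo + (steps - sum(combo),)
            yield [k / steps for k in full]


for prev in prevalence_grid(n_classes=3, n_prevalences=5):
    print('[' + ', '.join(f'{p:.2f}' for p in prev) + ']')
```

Working in integer steps and dividing only at the end avoids discarding valid combinations because of rounding error, which a naive `sum(prev) == 1.0` check on floats can do.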
@@ -178,6 +156,8 @@ Some details can be found below:
 | sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
 | wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
 | wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
 ## UCI Machine Learning
 A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php) 
@@ -273,6 +253,46 @@ standard Pythons packages like gzip or zip. This file would need to be uncompres
 OS-dependent software manually. Information on how to do it will be printed the first
 time the dataset is invoked. 
 ## LeQua Datasets
 QuaPy also provides the datasets used for the LeQua competition.
 In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
 problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide 
 raw documents instead.
 Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B 
 are multiclass quantification problems consisting of estimating the class prevalence 
 values of 28 different merchandise products.
 Every task consists of a training set, a set of validation samples (for model selection)
 and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
 (training) and two generation protocols (for validation and test samples), as follows:
 ```python
 training, val_generator, test_generator = fetch_lequa2022(task=task)
 ```
 See the `lequa2022_experiments.py` in the examples folder for further details on how to
 carry out experiments using these datasets.  
 The datasets are downloaded only once, and stored for fast reuse.
 Some statistics are summarized below:
 | Dataset | classes | train size | validation samples | test samples | docs per sample  |   type   |
 |---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:| 
 | T1A     |   2     |    5000    |        1000        |     5000     |       250        |  vector  | 
 | T1B     |   28    |   20000    |        1000        |     5000     |       1000       |  vector  |
 | T2A     |    2    |    5000    |        1000        |     5000     |       250        |   text   |
 | T2B     |   28    |   20000    |        1000        |     5000     |       1000       |   text   |
 For further details on the datasets, we refer to the original 
 [paper](https://ceur-ws.org/Vol-3180/paper-146.pdf):
 ```
 Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
 A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.
 ```
 ## Adding Custom Datasets
 QuaPy provides data loaders for simple formats dealing with 
@@ -313,12 +333,15 @@ e.g.:
 ```python
 import quapy as qp
 train_path = '../my_data/train.dat'
 test_path = '../my_data/test.dat'
 def my_custom_loader(path):
    with open(path, 'rb') as fin:
        ...
    return instances, labels
 data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
 ```
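The body of `my_custom_loader` is elided in the docs. As one possible way to fill it in, the sketch below assumes a hypothetical file layout with one document per line in the form `<integer label><TAB><text>`; that layout, the text-mode `open` (the docs use `'rb'`), and the UTF-8 encoding are all assumptions, not something the documentation specifies.

```python
def my_custom_loader(path):
    """Read a tab-separated file and return (instances, labels).

    Assumed (hypothetical) format: '<int label>\\t<raw text>' per line.
    """
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            line = line.rstrip('\n')
            if not line:
                continue  # tolerate blank lines
            # Split on the first tab only, so the text may contain tabs.
            label, text = line.split('\t', 1)
            labels.append(int(label))
            instances.append(text)
    return instances, labels
```

The only contract the docs state is that the loader takes a path and returns `instances, labels`; everything else here can be swapped for whatever parsing the actual data requires.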
--- a/docs/build/html/index.html
+++ b/docs/build/html/index.html
@@ -123,6 +123,7 @@ See the <a class="reference internal" href="Evaluation.html"><span class="doc">E
 <li class="toctree-l2"><a class="reference internal" href="Datasets.html#reviews-datasets">Reviews Datasets</a></li>
 <li class="toctree-l2"><a class="reference internal" href="Datasets.html#twitter-sentiment-datasets">Twitter Sentiment Datasets</a></li>
 <li class="toctree-l2"><a class="reference internal" href="Datasets.html#uci-machine-learning">UCI Machine Learning</a></li>
 <li class="toctree-l2"><a class="reference internal" href="Datasets.html#lequa-datasets">LeQua Datasets</a></li>
 <li class="toctree-l2"><a class="reference internal" href="Datasets.html#adding-custom-datasets">Adding Custom Datasets</a></li>
 </ul>
 </li>
--- a/docs/build/html/searchindex.js
+++ b/docs/build/html/searchindex.js
--- a/quapy/tests/test_labelcollection.py
+++ b/quapy/tests/test_labelcollection.py
@@ -60,9 +60,6 @@ class LabelCollectionTestCase(unittest.TestCase):
        combined = qp.data.LabelledCollection.join(data4, data5)
        self.assertEqual(len(combined), len(data4) + len(data5))
        # data2.instances = csr_matrix()
 if __name__ == '__main__':
    unittest.main()