Datasets¶
QuaPy makes available several datasets that have been used in the quantification literature, as well as an interface that allows anyone to import their own custom datasets.

A Dataset object in QuaPy is roughly a pair of LabelledCollection objects, one playing the role of the training set and the other of the test set. LabelledCollection is a data class consisting of the (iterable) instances and labels; this class handles most of the sampling functionality in QuaPy. Take a look at the following code:
import quapy as qp
import quapy.functional as F

instances = [
    '1st positive document', '2nd positive document',
    'the only negative document',
    '1st neutral document', '2nd neutral document', '3rd neutral document'
]
labels = [2, 2, 0, 1, 1, 1]

data = qp.data.LabelledCollection(instances, labels)
print(F.strprev(data.prevalence(), prec=2))

This prints the class prevalences (with 2-digit precision):

[0.17, 0.50, 0.33]

One can easily produce new samples at desired class prevalences:
sample_size = 10
prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)

print('instances:', sample.instances)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))

Which outputs:
instances: ['the only negative document' '2nd positive document'
 '2nd positive document' '2nd neutral document' '1st positive document'
 'the only negative document' 'the only negative document'
 'the only negative document' '2nd positive document'
 '1st positive document']
labels: [0 2 2 1 2 0 0 0 2 2]
prevalence: [0.40, 0.10, 0.50]

Samples can be made consistent across different runs (e.g., to test different methods on the exact same samples) by sampling and retaining the indexes, which can then be used to generate the sample:
index = data.sampling_index(sample_size, *prev)
for method in methods:
    sample = data.sampling_from_index(index)
    ...

QuaPy also implements the artificial sampling protocol, which produces (via a Python generator) a series of LabelledCollection objects with equidistant prevalences ranging across the entire prevalence spectrum of the simplex space, e.g.:
for sample in data.artificial_sampling_generator(sample_size=100, n_prevalences=5):
    print(F.strprev(sample.prevalence(), prec=2))

This produces one sample for each (valid) combination of prevalences obtained by splitting the range [0,1] into n_prevalences=5 points (i.e., [0, 0.25, 0.5, 0.75, 1]), that is:
[0.00, 0.00, 1.00]
[0.00, 0.25, 0.75]
[0.00, 0.50, 0.50]
[0.00, 0.75, 0.25]
[0.00, 1.00, 0.00]
[0.25, 0.00, 0.75]
...
[1.00, 0.00, 0.00]

See the Evaluation wiki for further details on how to use the artificial sampling protocol to properly evaluate a quantification method.

Reviews Datasets¶
Three datasets of reviews about Kindle devices, the Harry Potter series, and the well-known IMDb movie reviews can be fetched using a unified interface. For example:

import quapy as qp
data = qp.datasets.fetch_reviews('kindle')
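Since a Dataset is roughly a pair of LabelledCollection objects, the two splits can be inspected directly; here is a minimal sketch (the attribute names training and test are assumed here):

import quapy.functional as F

# the training and test splits are LabelledCollection objects (attribute names assumed)
train, test = data.training, data.test
print('training prevalence:', F.strprev(train.prevalence(), prec=2))
print('test prevalence:', F.strprev(test.prevalence(), prec=2))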
These datasets have been used in:
Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). A recurrent neural network for sentiment quantification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 1775-1778).

The list of review ids is available in:
qp.datasets.REVIEWS_SENTIMENT_DATASETS

Some statistics of the available datasets are summarized below:
| Dataset | classes | train size | test size | train prev | test prev | type |
|---|---|---|---|---|---|---|
| hp | 2 | 9533 | 18399 | [0.018, 0.982] | [0.065, 0.935] | text |
| kindle | 2 | 3821 | 21591 | [0.081, 0.919] | [0.063, 0.937] | text |
| imdb | 2 | 25000 | 25000 | [0.500, 0.500] | [0.500, 0.500] | text |

Twitter Sentiment Datasets¶
QuaPy provides 11 Twitter datasets for sentiment analysis. The raw text is not accessible; instead, the documents are made available in tf-idf format. Each dataset comes with two splits: a train/val split for model selection purposes, and a train+val/test split for model evaluation. The following code exemplifies how to load a Twitter dataset for model selection:

import quapy as qp
data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)
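Conversely, the train+val/test split used for model evaluation can presumably be loaded by disabling the flag (a sketch; for_model_selection=False is assumed here to return the evaluation split):

# train+val/test split for final model evaluation (assumed behavior of the flag)
data = qp.datasets.fetch_twitter('gasp', for_model_selection=False)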
The datasets were used in:
Gao, W., & Sebastiani, F. (2015, August). Tweet sentiment: From classification to quantification. In 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.

Three of the datasets (semeval13, semeval14, and semeval15) share the same training set (semeval), meaning that the training split one would get when requesting any of them is the same. The dataset “semeval” can only be requested with “for_model_selection=True”. The lists of the Twitter datasets’ ids can be consulted in:
# a list of 11 dataset ids that can be used for model selection or model evaluation
qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST

# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN
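For instance, the shared training set mentioned above can be requested as follows (recall that “semeval” is only available with for_model_selection=True):

# "semeval" is the training set shared by semeval13, semeval14, and semeval15
data = qp.datasets.fetch_twitter('semeval', for_model_selection=True)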
Some details can be found below:
| Dataset | classes | train size | test size | features | train prev | test prev | type |
|---|---|---|---|---|---|---|---|
| gasp | 3 | 8788 | 3765 | 694582 | [0.421, 0.496, 0.082] | [0.407, 0.507, 0.086] | sparse |
| hcr | 3 | 1594 | 798 | 222046 | [0.546, 0.211, 0.243] | [0.640, 0.167, 0.193] | sparse |
| omd | 3 | 1839 | 787 | 199151 | [0.463, 0.271, 0.266] | [0.437, 0.283, 0.280] | sparse |
| sanders | 3 | 2155 | 923 | 229399 | [0.161, 0.691, 0.148] | [0.164, 0.688, 0.148] | sparse |
| semeval13 | 3 | 11338 | 3813 | 1215742 | [0.159, 0.470, 0.372] | [0.158, 0.430, 0.412] | sparse |
| semeval14 | 3 | 11338 | 1853 | 1215742 | [0.159, 0.470, 0.372] | [0.109, 0.361, 0.530] | sparse |
| semeval15 | 3 | 11338 | 2390 | 1215742 | [0.159, 0.470, 0.372] | [0.153, 0.413, 0.434] | sparse |
| semeval16 | 3 | 8000 | 2000 | 889504 | [0.157, 0.351, 0.492] | [0.163, 0.341, 0.497] | sparse |
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |

UCI Machine Learning¶
A set of 32 datasets from the UCI Machine Learning repository, used in:

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100.

The list does not exactly coincide with that used in Pérez-Gállego et al. (2017), since we were unable to find the datasets with ids “diabetes” and “phoneme”.
These datasets can be loaded by calling, e.g.:

import quapy as qp
data = qp.datasets.fetch_UCIDataset('yeast', verbose=True)

This call will return a Dataset object in which the training and test splits are randomly drawn, in a stratified manner, from the whole collection, at 70% and 30% respectively. The verbose=True option indicates that the dataset description should be printed to standard output. The original data, however, is not split, and some papers submit the entire collection to a kFCV validation. In order to accommodate these practices, one could first instantiate the entire collection, and then create a generator that returns one training+test dataset at a time, following a kFCV protocol:
import quapy as qp
collection = qp.datasets.fetch_UCILabelledCollection("yeast")
for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
    ...

The above code allows one to conduct a 2x5FCV evaluation on the “yeast” dataset.

All datasets come in numerical form (dense matrices); some statistics are summarized below.
| Dataset | classes | instances | features | prev | type |
|---|---|---|---|---|---|
| acute.a | 2 | 120 | 6 | [0.508, 0.492] | dense |
| acute.b | 2 | 120 | 6 | [0.583, 0.417] | dense |
| balance.1 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| balance.2 | 2 | 625 | 4 | [0.922, 0.078] | dense |
| balance.3 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| breast-cancer | 2 | 683 | 9 | [0.350, 0.650] | dense |
| cmc.1 | 2 | 1473 | 9 | [0.573, 0.427] | dense |
| cmc.2 | 2 | 1473 | 9 | [0.774, 0.226] | dense |
| cmc.3 | 2 | 1473 | 9 | [0.653, 0.347] | dense |
| ctg.1 | 2 | 2126 | 22 | [0.222, 0.778] | dense |
| ctg.2 | 2 | 2126 | 22 | [0.861, 0.139] | dense |
| ctg.3 | 2 | 2126 | 22 | [0.917, 0.083] | dense |
| german | 2 | 1000 | 24 | [0.300, 0.700] | dense |
| haberman | 2 | 306 | 3 | [0.735, 0.265] | dense |
| ionosphere | 2 | 351 | 34 | [0.641, 0.359] | dense |
| iris.1 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.2 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.3 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| mammographic | 2 | 830 | 5 | [0.514, 0.486] | dense |
| pageblocks.5 | 2 | 5473 | 10 | [0.979, 0.021] | dense |
| semeion | 2 | 1593 | 256 | [0.901, 0.099] | dense |
| sonar | 2 | 208 | 60 | [0.534, 0.466] | dense |
| spambase | 2 | 4601 | 57 | [0.606, 0.394] | dense |
| spectf | 2 | 267 | 44 | [0.794, 0.206] | dense |
| tictactoe | 2 | 958 | 9 | [0.653, 0.347] | dense |
| transfusion | 2 | 748 | 4 | [0.762, 0.238] | dense |
| wdbc | 2 | 569 | 30 | [0.627, 0.373] | dense |
| wine.1 | 2 | 178 | 13 | [0.669, 0.331] | dense |
| wine.2 | 2 | 178 | 13 | [0.601, 0.399] | dense |
| wine.3 | 2 | 178 | 13 | [0.730, 0.270] | dense |
| wine-q-red | 2 | 1599 | 11 | [0.465, 0.535] | dense |
| wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
| yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |

Issues¶
All datasets will be downloaded automatically the first time they are requested, and stored in the quapy_data folder for faster subsequent reuse. However, some datasets require special actions that, at the moment, are not fully automated:
+-
+
Datasets with ids “ctg.1”, “ctg.2”, and “ctg.3” (Cardiotocography Data Set) load +an Excel file, which requires the user to install the xlrd Python module in order +to open it.
+The dataset with id “pageblocks.5” (Page Blocks Classification (5)) needs to +open a “unix compressed file” (extension .Z), which is not directly doable with +standard Pythons packages like gzip or zip. This file would need to be uncompressed using +OS-dependent software manually. Information on how to do it will be printed the first +time the dataset is invoked.
+
Adding Custom Datasets¶
QuaPy provides data loaders for simple formats dealing with text, following the format:

class-id \t first document's pre-processed text \n
class-id \t second document's pre-processed text \n
...

and sparse representations of the form:
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...

The code in charge of loading a LabelledCollection is:
@classmethod
def load(cls, path: str, loader_func: callable):
    return LabelledCollection(*loader_func(path))

indicating that any loader_func (e.g., a user-defined one) that returns valid arguments for initializing a LabelledCollection object will allow loading any collection. In particular, LabelledCollection receives as arguments the instances (as an iterable) and the labels (as an iterable); additionally, the number of classes can be specified (it would otherwise be inferred from the labels, but that requires at least one example of every class to be present in the collection).
The same loader_func can be passed to a Dataset, along with two paths, in order to create a training and test pair of LabelledCollection objects, e.g.:
import quapy as qp

train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'

def my_custom_loader(path):
    with open(path, 'rb') as fin:
        ...
    return instances, labels

data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
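For instance, a minimal loader for the tab-separated text format shown above could look as follows (a sketch; my_text_loader is a hypothetical user-defined function, not part of QuaPy):

def my_text_loader(path):
    # hypothetical loader for the "class-id \t document text" format described above;
    # assumes one document per line, with an integer class id before the tab
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            label, text = line.split('\t', 1)
            labels.append(int(label))
            instances.append(text)
    return instances, labels

# reusing train_path and test_path from the snippet above
data = qp.data.Dataset.load(train_path, test_path, my_text_loader)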

Data Processing¶
QuaPy implements a number of preprocessing functions in the package qp.data.preprocessing, including:
+-
+
text2tfidf: tfidf vectorization
+reduce_columns: reducing the number of columns based on term frequency
+standardize: transforms the column values into z-scores (i.e., subtract the mean and normalizes by the standard deviation, so +that the column values have zero mean and unit variance).
+index: transforms textual tokens into lists of numeric ids)
+
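As a quick illustration, the following sketch applies tfidf vectorization to a reviews dataset (the min_df parameter and the exact signature are assumptions based on typical usage, not guaranteed by this page):

import quapy as qp

# fetch a text dataset and convert it to tfidf-weighted vectors;
# min_df (assumed parameter) discards terms occurring in fewer than 5 documents
data = qp.datasets.fetch_reviews('kindle')
data = qp.data.preprocessing.text2tfidf(data, min_df=5)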