quapy.data package

Submodules

quapy.data.base

class quapy.data.base.Dataset(training: LabelledCollection, test: LabelledCollection, vocabulary: Optional[dict] = None, name='')

Bases: object

Abstraction of training and test LabelledCollection objects.

Parameters:
  • training – a LabelledCollection instance

  • test – a LabelledCollection instance

  • vocabulary – if indicated, is a dictionary of the terms used in this textual dataset

  • name – a string representing the name of the dataset

classmethod SplitStratified(collection: LabelledCollection, train_size=0.6)

Generates a Dataset from a stratified split of a LabelledCollection instance. See LabelledCollection.split_stratified()

Parameters:
  • collection – a LabelledCollection instance

  • train_size – the proportion of training documents (the rest forms the test split)

Returns:

an instance of Dataset

property binary

Returns True if the training collection is labelled according to two classes

Returns:

boolean

property classes_

The classes according to which the training collection is labelled

Returns:

The classes according to which the training collection is labelled

classmethod kFCV(data: LabelledCollection, nfolds=5, nrepeats=1, random_state=0)

Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.

Parameters:
  • nfolds – integer (default 5), the number of folds to generate

  • nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run

  • random_state – integer (default 0), guarantees that the folds generated are reproducible

Returns:

yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset

classmethod load(train_path, test_path, loader_func: callable, classes=None, **loader_kwargs)

Loads a training and a test labelled set of data and converts them into a Dataset instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.

Parameters:
  • train_path – string, the path to the file containing the training instances

  • test_path – string, the path to the file containing the test instances

  • loader_func – a custom function that implements the data loader and returns a tuple with instances and labels

  • classes – array-like, the classes according to which the instances are labelled

  • loader_kwargs – any argument that the loader_func function needs in order to read the instances. See LabelledCollection.load() for further details.

Returns:

a Dataset object

property n_classes

The number of classes according to which the training collection is labelled

Returns:

integer

reduce(n_train=100, n_test=100)

Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.

Parameters:
  • n_train – number of training documents to keep (default 100)

  • n_test – number of test documents to keep (default 100)

Returns:

self

stats(show=True)

Returns (and optionally prints) a dictionary with some stats of this dataset. E.g.:

>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]

Parameters:

show – if set to True (default), prints the stats in standard output

Returns:

a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)

property train_test

Alias to self.training and self.test

Returns:

the training and test collections

property vocabulary_size

If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary

Returns:

integer

class quapy.data.base.LabelledCollection(instances, labels, classes=None)

Bases: object

A LabelledCollection is a set of objects, each with a label attached. This class implements several sampling routines and other utilities.

Parameters:
  • instances – array-like (np.ndarray, list, or csr_matrix are supported)

  • labels – array-like with the same length as instances

  • classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)

property X

An alias to self.instances

Returns:

self.instances

property Xp

Gets the instances and the true prevalence. This is useful when implementing evaluation protocols from a LabelledCollection object.

Returns:

a tuple (instances, prevalence) from this collection

property Xy

Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:

>>> svm = LinearSVC().fit(*my_collection.Xy)
Returns:

a tuple (instances, labels) from this collection

property binary

Returns True if the number of classes is 2

Returns:

boolean

counts()

Returns the number of instances for each of the classes in the codeframe.

Returns:

a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_

classmethod join(*args: Iterable[LabelledCollection])

Returns a new LabelledCollection as the union of the collections given in input.

Parameters:

args – instances of LabelledCollection

Returns:

a LabelledCollection representing the union of all the given collections

kFCV(nfolds=5, nrepeats=1, random_state=None)

Generator of stratified folds to be used in k-fold cross validation.

Parameters:
  • nfolds – integer (default 5), the number of folds to generate

  • nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run

  • random_state – integer or None (default None, as in the signature above); if specified, guarantees that the folds generated are reproducible

Returns:

yields nfolds * nrepeats folds for k-fold cross validation
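
To make the "nfolds * nrepeats" behaviour concrete, the following standalone sketch generates repeated stratified folds over a plain list of labels. It is written in plain Python and is not QuaPy's implementation; the function name kfcv_sketch and the round-robin dealing strategy are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def kfcv_sketch(labels, nfolds=5, nrepeats=1, seed=0):
    """Yield (train_index, test_index) pairs for repeated stratified k-fold CV."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for repeat in range(nrepeats):
        rng = random.Random(seed + repeat)  # a different shuffle per round
        folds = [[] for _ in range(nfolds)]
        for cls in sorted(by_class):
            members = by_class[cls][:]
            rng.shuffle(members)
            # Deal the members of each class round-robin across the folds,
            # so every fold keeps roughly the original class proportions.
            for position, i in enumerate(members):
                folds[position % nfolds].append(i)
        for k in range(nfolds):
            test_index = folds[k]
            train_index = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_index, test_index

labels = [0] * 20 + [1] * 30
splits = list(kfcv_sketch(labels, nfolds=5, nrepeats=2))
print(len(splits))  # 10
```

Each yielded pair plays the role of one training/test fold; two rounds of 5 folds give 10 pairs in total.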

classmethod load(path: str, loader_func: callable, classes=None, **loader_kwargs)

Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.

Parameters:
  • path – string, the path to the file containing the labelled instances

  • loader_func – a custom function that implements the data loader and returns a tuple with instances and labels

  • classes – array-like, the classes according to which the instances are labelled

  • loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., these arguments are used to call loader_func(path, **loader_kwargs)

Returns:

a LabelledCollection object
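
As an illustration of what a custom loader_func might look like, here is a minimal sketch of a loader that returns a tuple of instances and labels and accepts extra keyword arguments. The function tsv_loader, its sep parameter, and the one-pair-per-line file layout are hypothetical and are not part of QuaPy.

```python
import os
import tempfile

def tsv_loader(path, sep="\t", encoding="utf-8"):
    """Hypothetical loader: one '<label><sep><text>' pair per line."""
    instances, labels = [], []
    with open(path, "rt", encoding=encoding) as fin:
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            label, text = line.split(sep, 1)
            instances.append(text)
            labels.append(int(label))
    return instances, labels

# Write a tiny file and read it back the way loader_func(path, **loader_kwargs)
# would be invoked.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "wt", encoding="utf-8") as fout:
        fout.write("1\tgreat book\n0\tnot for me\n")
    loader_kwargs = {"sep": "\t"}
    instances, labels = tsv_loader(path, **loader_kwargs)

print(instances)  # ['great book', 'not for me']
print(labels)     # [1, 0]
```

A function of this shape could then be passed as loader_func, with sep forwarded through loader_kwargs.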

property n_classes

The number of classes

Returns:

integer

property p

An alias to self.prevalence()

Returns:

self.prevalence()

prevalence()

Returns the prevalence, or relative frequency, of the classes in the codeframe.

Returns:

a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
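
The relationship between counts, prevalence, and the codeframe can be sketched in a few lines of plain Python. This is an independent illustration rather than QuaPy's code, and the helper names class_counts and class_prevalence are hypothetical. Note how a class with no examples still receives an entry, which is why such classes must be listed explicitly.

```python
from collections import Counter

def class_counts(labels, classes):
    """Number of instances per class, in the order given by `classes`."""
    tally = Counter(labels)
    return [tally.get(c, 0) for c in classes]

def class_prevalence(labels, classes):
    """Relative frequency per class, in the order given by `classes`."""
    counts = class_counts(labels, classes)
    total = sum(counts)
    return [c / total for c in counts]

labels = [0, 1, 1, 1, 0, 1, 1, 1]
classes = [0, 1, 2]  # class 2 belongs to the codeframe but has no examples
counts = class_counts(labels, classes)
prevs = class_prevalence(labels, classes)
print(counts)  # [2, 6, 0]
print(prevs)   # [0.25, 0.75, 0.0]
```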

sampling(size, *prevs, shuffle=True, random_state=None)

Return a random sample (an instance of LabelledCollection) of desired size and desired prevalence values. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.

Parameters:
  • size – integer, the requested size

  • prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p

  • shuffle – if set to True (default), shuffles the index before returning it

  • random_state – seed for reproducing sampling

Returns:

an instance of LabelledCollection with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)

sampling_from_index(index)

Returns an instance of LabelledCollection whose elements are sampled from this collection using the index.

Parameters:

index – np.ndarray

Returns:

an instance of LabelledCollection

sampling_index(size, *prevs, shuffle=True, random_state=None)

Returns an index to be used to extract a random sample of desired size and desired prevalence values. If the prevalence values are not specified, then returns the index of a uniform sampling. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.

Parameters:
  • size – integer, the requested size

  • prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p

  • shuffle – if set to True (default), shuffles the index before returning it

  • random_state – seed for reproducing sampling

Returns:

a np.ndarray of shape (size) with the indexes
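
The idea of drawing an index at a desired prevalence can be sketched as follows. This is a simplified, standalone illustration and not QuaPy's implementation: the function name sampling_index_sketch, the rounding rule, and the choice of switching to sampling with replacement when a class is asked for more instances than it has are all assumptions of this sketch.

```python
import random

def sampling_index_sketch(labels, classes, size, prevs, seed=None):
    rng = random.Random(seed)
    # The last prevalence may be omitted: it is whatever remains to reach 1.
    prevs = list(prevs)
    if len(prevs) == len(classes) - 1:
        prevs.append(1.0 - sum(prevs))
    # Instances requested per class; the last class absorbs rounding errors.
    requested = [round(size * p) for p in prevs[:-1]]
    requested.append(size - sum(requested))
    index = []
    for cls, n in zip(classes, requested):
        candidates = [i for i, y in enumerate(labels) if y == cls]
        if n > len(candidates):
            index += rng.choices(candidates, k=n)  # with replacement
        else:
            index += rng.sample(candidates, n)     # without replacement
    rng.shuffle(index)
    return index

labels = [0] * 8 + [1] * 2
idx = sampling_index_sketch(labels, classes=[0, 1], size=10, prevs=[0.3], seed=0)
sampled = [labels[i] for i in idx]
print(sampled.count(0), sampled.count(1))  # 3 7
```

Here class 1 has only two instances but seven are requested, so its indexes necessarily repeat.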

split_random(train_prop=0.6, random_state=None)

Returns two instances of LabelledCollection split randomly from this collection, at desired proportion.

Parameters:
  • train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).

  • random_state – if specified, guarantees reproducibility of the split.

Returns:

two instances of LabelledCollection, the first one with a proportion train_prop of the elements, and the second one with the remaining 1-train_prop

split_stratified(train_prop=0.6, random_state=None)

Returns two instances of LabelledCollection split with stratification from this collection, at desired proportion.

Parameters:
  • train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).

  • random_state – if specified, guarantees reproducibility of the split.

Returns:

two instances of LabelledCollection, the first one with a proportion train_prop of the elements, and the second one with the remaining 1-train_prop
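
What "split with stratification" means can be shown with a small standalone sketch: each class is split separately at the requested proportion, so both parts keep the original class prevalence. The function split_stratified_sketch is hypothetical and is not QuaPy's implementation.

```python
import random
from collections import defaultdict

def split_stratified_sketch(labels, train_prop=0.6, seed=None):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        cut = round(len(members) * train_prop)  # per-class cut point
        train_idx += members[:cut]
        test_idx += members[cut:]
    return train_idx, test_idx

labels = [0] * 10 + [1] * 90
train_idx, test_idx = split_stratified_sketch(labels, train_prop=0.6, seed=0)
train_prev = sum(labels[i] for i in train_idx) / len(train_idx)
test_prev = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))  # 60 40
```

Both train_prev and test_prev come out at 0.9, the prevalence of class 1 in the full list.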

stats(show=True)

Returns (and optionally prints) a dictionary with some stats of this collection. E.g.:

>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
#instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]

Parameters:

show – if set to True (default), prints the stats in standard output

Returns:

a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)

uniform_sampling(size, random_state=None)

Returns a uniform sample (an instance of LabelledCollection) of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.

Parameters:
  • size – integer, the requested size

  • random_state – if specified, guarantees reproducibility of the sampling.

Returns:

an instance of LabelledCollection with length == size

uniform_sampling_index(size, random_state=None)

Returns an index to be used to extract a uniform sample of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.

Parameters:
  • size – integer, the size of the uniform sample

  • random_state – if specified, guarantees reproducibility of the sampling.

Returns:

a np.ndarray of shape (size) with the indexes
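
The replacement rule described above can be illustrated with a short standalone sketch in plain Python (the function uniform_index_sketch is hypothetical and not part of QuaPy): indexes repeat only when more of them are requested than there are instances.

```python
import random

def uniform_index_sketch(n_instances, size, seed=None):
    rng = random.Random(seed)
    if size > n_instances:
        # More indexes requested than instances available: repeats are needed.
        return rng.choices(range(n_instances), k=size)
    # Otherwise every index appears at most once.
    return rng.sample(range(n_instances), size)

small = uniform_index_sketch(n_instances=10, size=4, seed=0)
large = uniform_index_sketch(n_instances=10, size=25, seed=0)
print(len(small), len(set(small)))  # 4 4
print(len(large))                   # 25
```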

property y

An alias to self.labels

Returns:

self.labels

quapy.data.datasets

quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) Dataset

Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and in Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further information on how to use these collections), so a train-test split is generated at the desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS

Parameters:
  • dataset_name – a dataset name

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • test_split – proportion of documents to be included in the test set. The rest forms the training set

  • verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets

Returns:

a quapy.data.base.Dataset instance

quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) LabelledCollection

Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and in Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split, so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5-fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:

>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCILabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
...     pass  # use data.training and data.test here

The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS

Parameters:
  • dataset_name – a dataset name

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets

Returns:

a quapy.data.base.LabelledCollection instance

quapy.data.datasets.fetch_lequa2022(task, data_home=None)

Loads the official datasets provided for the LeQua competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products. See Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022). A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify, for a detailed description of the tasks and datasets.

The datasets are downloaded only once, and stored for fast reuse.

See lequa2022_experiments.py, provided in the example folder, which can serve as a guide on how to use these datasets.

Parameters:
  • task – a string representing the task name; valid ones are T1A, T1B, T2A, and T2B

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

Returns:

a tuple (train, val_gen, test_gen) where train is an instance of quapy.data.base.LabelledCollection, val_gen and test_gen are instances of quapy.protocol.SamplesFromDir, i.e., are sampling protocols that return a series of samples labelled by prevalence.

quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) Dataset

Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS

Parameters:
  • dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’

  • tfidf – set to True to transform the raw documents into tfidf weighted matrices

  • min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations

Returns:

a quapy.data.base.Dataset instance

quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) Dataset

Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). Note that the datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set. The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while those corresponding to test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST

Parameters:
  • dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’

  • for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set

  • min_df – minimum number of documents that should contain a term in order for the term to be kept

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations

Returns:

a quapy.data.base.Dataset instance

quapy.data.datasets.warn(*args, **kwargs)

quapy.data.preprocessing

class quapy.data.preprocessing.IndexTransformer(**kwargs)

Bases: object

This class implements an sklearn-style transformer that indexes text as numerical ids for the tokens it contains, using the vocabulary that would be generated by sklearn’s CountVectorizer

Parameters:

kwargs – keyword arguments for CountVectorizer

add_word(word, id=None, nogaps=True)

Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.

Parameters:
  • word – string, surface form of the token

  • id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)

  • nogaps – if set to True (default) asserts that the id indicated leads to no numerical gaps with the preceding ids stored so far

Returns:

integer, the numerical id for the new token

fit(X)

Fits the transformer, i.e., decides on the vocabulary, given a list of strings.

Parameters:

X – a list of strings

Returns:

self

fit_transform(X, n_jobs=None)

Fits the transformer on X and transforms it.

Parameters:
  • X – a list of strings

  • n_jobs – the number of parallel workers to carry out this task

Returns:

a np.ndarray of numerical ids

transform(X, n_jobs=None)

Transforms the strings in X into lists of numerical ids

Parameters:
  • X – a list of strings

  • n_jobs – the number of parallel workers to carry out this task

Returns:

a np.ndarray of numerical ids

vocabulary_size()

Gets the length of the vocabulary according to which the document tokens have been indexed

Returns:

integer

quapy.data.preprocessing.index(dataset: Dataset, min_df=5, inplace=False, **kwargs)

Indexes the tokens of a textual quapy.data.base.Dataset of string documents. To index a document means to replace each different token by a unique numerical index. Rare words (i.e., words occurring fewer than min_df times) are replaced by a special token UNK

Parameters:
  • dataset – a quapy.data.base.Dataset object where the instances of training and test documents are lists of str

  • min_df – minimum number of occurrences below which the term is replaced by a UNK index

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

  • kwargs – the rest of the parameters of the transformation (as for sklearn’s CountVectorizer, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Returns:

a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) consisting of lists of integer values representing indices.
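
The indexing step can be pictured with the following standalone sketch. It is not QuaPy's implementation: the helpers build_vocabulary and index_docs are hypothetical, tokens are obtained by whitespace splitting, rarity is measured as document frequency in the training documents, and the id 0 for UNK is an arbitrary choice of this sketch.

```python
from collections import Counter

def build_vocabulary(train_docs, min_df=2, unk="UNK"):
    # Document frequency: number of training documents containing each token.
    df = Counter(tok for doc in train_docs for tok in set(doc.split()))
    vocab = {unk: 0}
    for tok in sorted(df):
        if df[tok] >= min_df:
            vocab[tok] = len(vocab)
    return vocab

def index_docs(docs, vocab, unk="UNK"):
    # Tokens missing from the vocabulary (rare or unseen) map to the UNK id.
    return [[vocab.get(tok, vocab[unk]) for tok in doc.split()] for doc in docs]

train_docs = ["good book", "good plot", "bad book"]
vocab = build_vocabulary(train_docs, min_df=2)
indexed = index_docs(["good bad book", "awful plot"], vocab)
print(vocab)    # {'UNK': 0, 'book': 1, 'good': 2}
print(indexed)  # [[2, 0, 1], [0, 0]]
```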

quapy.data.preprocessing.reduce_columns(dataset: Dataset, min_df=5, inplace=False)

Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set

Parameters:
  • dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)

  • min_df – integer, minimum number of instances below which the columns are removed

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

Returns:

a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
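
The column-reduction rule can be sketched without any sparse-matrix library by representing each row as a dictionary from column to value. This is only an illustration of the rule stated above, not QuaPy's code; the function reduce_columns_sketch and the dictionary representation are assumptions.

```python
from collections import Counter

def reduce_columns_sketch(train_rows, test_rows, min_df=2):
    """Rows are sparse dicts {column: value}; returns re-indexed copies."""
    # Count, for each column, the training rows in which it is non-zero.
    df = Counter(col for row in train_rows for col, val in row.items() if val != 0)
    kept = sorted(col for col, n in df.items() if n >= min_df)
    new_id = {col: j for j, col in enumerate(kept)}

    def project(rows):
        return [{new_id[c]: v for c, v in row.items() if c in new_id} for row in rows]

    return project(train_rows), project(test_rows), kept

train_rows = [{0: 1.0, 2: 0.5}, {0: 0.3, 1: 0.7}, {2: 0.9}]
test_rows = [{1: 0.4, 2: 0.2}]
new_train, new_test, kept = reduce_columns_sketch(train_rows, test_rows, min_df=2)
print(kept)      # [0, 2]
print(new_test)  # [{1: 0.2}]
```

The kept columns are decided on the training rows only, and the same selection is then applied to the test rows.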

quapy.data.preprocessing.standardize(dataset: Dataset, inplace=False)

Standardizes the real-valued columns of a quapy.data.base.Dataset. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.

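Parameters:
  • dataset – a quapy.data.base.Dataset object

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

Returns: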

an instance of quapy.data.base.Dataset
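
As a small worked example of z-scoring, the sketch below standardizes one column using only the standard library. It assumes that the mean and standard deviation are estimated on the training column and then applied to both splits, and that the population standard deviation is used; both are assumptions of this sketch, and the helpers fit_zscore and apply_zscore are hypothetical.

```python
from statistics import mean, pstdev

def fit_zscore(column):
    """Estimate the mean and (population) standard deviation of a column."""
    return mean(column), pstdev(column)

def apply_zscore(column, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(x - mu) / sigma for x in column]

train_col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_col = [5.0, 7.0]
mu, sigma = fit_zscore(train_col)
test_z = apply_zscore(test_col, mu, sigma)
print(mu, sigma)  # 5.0 2.0
print(test_z)     # [0.0, 1.0]
```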

quapy.data.preprocessing.text2tfidf(dataset: Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)

Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of tfidf weighted sparse vectors

Parameters:
  • dataset – a quapy.data.base.Dataset where the instances of training and test collections are lists of str

  • min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)

  • sublinear_tf – whether or not to apply the log scaling to the tf counters (default True)

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

  • kwargs – the rest of parameters of the transformation (as for sklearn’s TfidfVectorizer)

Returns:

a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
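
To give a feel for what the sublinear_tf option changes, here is a simplified tfidf sketch in plain Python. It mimics the default weighting of sklearn's TfidfVectorizer (smoothed idf, L2 normalization, and tf replaced by 1 + log(tf) when sublinear_tf is set), but tokenizes by whitespace only, so it is an approximation for illustration and not the code that text2tfidf runs. The function tfidf_sketch is hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(train_docs, docs, min_df=1, sublinear_tf=True):
    n = len(train_docs)
    # Vocabulary and idf are estimated on the training documents only.
    df = Counter(tok for doc in train_docs for tok in set(doc.split()))
    vocab = sorted(tok for tok, c in df.items() if c >= min_df)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(tok for tok in doc.split() if tok in idf)
        weights = {}
        for tok, count in tf.items():
            tf_weight = 1 + math.log(count) if sublinear_tf else count
            weights[tok] = tf_weight * idf[tok]
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm > 0 else {})
    return vocab, vectors

train_docs = ["good good book", "bad book", "good plot"]
vocab, sub_vectors = tfidf_sketch(train_docs, train_docs, sublinear_tf=True)
_, raw_vectors = tfidf_sketch(train_docs, train_docs, sublinear_tf=False)
print(vocab)  # ['bad', 'book', 'good', 'plot']
```

In the first document, "good" occurs twice and has the same idf as "book", so its weight is 2 times that of "book" with raw counts, but only 1 + log(2) times with sublinear scaling.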

quapy.data.reader

quapy.data.reader.binarize(y, pos_class)

Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.,:

>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
array([0, 1, 0, 0, 0, 0])

Parameters:
  • y – array-like of labels

  • pos_class – integer, the positive class

Returns:

a binary np.ndarray, in which the value 1 corresponds to positions in which y had pos_class labels, and 0 otherwise

quapy.data.reader.from_csv(path, encoding='utf-8')

Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>

Parameters:
  • path – path to the csv file

  • encoding – the text encoding used to open the file

Returns:

a np.ndarray for the labels and a ndarray (float) for the covariates

quapy.data.reader.from_sparse(path)

Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1>[s col(int):val(float)]

Parameters:

path – path to the labelled collection

Returns:

a csr_matrix containing the instances (rows), and a ndarray containing the labels
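
Reading the format string above as a label followed by whitespace-separated col:val pairs, a single line could be parsed as in the sketch below. This interpretation of the separator, the function parse_sparse_line, and the dictionary output are assumptions of this illustration; the actual reader returns a csr_matrix and an array of labels.

```python
def parse_sparse_line(line):
    """Parse '<label> col:val col:val ...' into (label, {col: val})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        col, val = pair.split(":")
        features[int(col)] = float(val)
    return int(label), features

label, features = parse_sparse_line("-1 3:0.5 17:1.25")
print(label, features)  # -1 {3: 0.5, 17: 1.25}
```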

quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)

Reads a labelled collection of documents. File format: <0 or 1> <document>

Parameters:
  • path – path to the labelled collection

  • encoding – the text encoding used to open the file

  • verbose – if >0 (default) shows some progress information in standard output

Returns:

a list of sentences, and a list of labels

quapy.data.reader.reindex_labels(y)

Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:

>>> reindex_labels(['B', 'B', 'A', 'C'])
(array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))

Parameters:

y – the list or array of original labels

Returns:

a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.
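
The re-indexing shown in the example above amounts to sorting the distinct labels and replacing each label by its position in that sorted list. A plain-Python equivalent, returning lists instead of arrays, might look as follows (reindex_labels_sketch is a hypothetical name, not the library function):

```python
def reindex_labels_sketch(y):
    """Map each label to its position among the sorted distinct labels."""
    classnames = sorted(set(y))
    position = {name: i for i, name in enumerate(classnames)}
    return [position[label] for label in y], classnames

indexes, classnames = reindex_labels_sketch(['B', 'B', 'A', 'C'])
print(indexes)     # [1, 1, 0, 2]
print(classnames)  # ['A', 'B', 'C']
```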

Module contents