quapy.data package

Submodules

quapy.data.base module

class quapy.data.base.Dataset(training: quapy.data.base.LabelledCollection, test: quapy.data.base.LabelledCollection, vocabulary: Optional[dict] = None, name='')

Bases: object

Abstraction of training and test LabelledCollection objects.

Parameters
  • training – a LabelledCollection instance

  • test – a LabelledCollection instance

  • vocabulary – if indicated, is a dictionary of the terms used in this textual dataset

  • name – a string representing the name of the dataset

classmethod SplitStratified(collection: quapy.data.base.LabelledCollection, train_size=0.6)

Generates a Dataset from a stratified split of a LabelledCollection instance. See LabelledCollection.split_stratified()

Parameters
  • collection – a LabelledCollection instance

  • train_size – the proportion of training documents (the rest form the test split)

Returns

an instance of Dataset

property binary

Returns True if the training collection is labelled according to two classes

Returns

boolean

property classes_

The classes according to which the training collection is labelled

Returns

The classes according to which the training collection is labelled

classmethod kFCV(data: quapy.data.base.LabelledCollection, nfolds=5, nrepeats=1, random_state=0)

Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.

Parameters
  • nfolds – integer (default 5), the number of folds to generate

  • nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run

  • random_state – integer (default 0), guarantees that the folds generated are reproducible

Returns

yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset

classmethod load(train_path, test_path, loader_func: callable, classes=None, **loader_kwargs)

Loads a training and a test labelled set of data and converts them into a Dataset instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.

Parameters
  • train_path – string, the path to the file containing the training instances

  • test_path – string, the path to the file containing the test instances

  • loader_func – a custom function that implements the data loader and returns a tuple with instances and labels

  • classes – array-like, the classes according to which the instances are labelled

  • loader_kwargs – any argument that the loader_func function needs in order to read the instances. See LabelledCollection.load() for further details.

Returns

a Dataset object

property n_classes

The number of classes according to which the training collection is labelled

Returns

integer

stats(show)

Returns (and optionally prints) a dictionary with some stats of this dataset. E.g.:

>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]

Parameters

show – if set to True (default), prints the stats in standard output

Returns

a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)

property vocabulary_size

If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary

Returns

integer

class quapy.data.base.LabelledCollection(instances, labels, classes_=None)

Bases: object

A LabelledCollection is a set of objects, each with a label associated with it. This class implements many sampling routines.

Parameters
  • instances – array-like (np.ndarray, list, or csr_matrix are supported)

  • labels – array-like with the same length as instances

  • classes_ – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)

property Xy

Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:

>>> svm = LinearSVC().fit(*my_collection.Xy)

Returns

a tuple (instances, labels) from this collection

artificial_sampling_generator(sample_size, n_prevalences=101, repeats=1)

A generator of samples that implements the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values containing n_prevalences points (e.g., [0, 0.05, 0.1, 0.15, …, 1], if n_prevalences=21), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of samples for each valid combination of prevalence values is indicated by repeats.

Parameters
  • sample_size – the number of instances in each sample

  • n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]

  • repeats – the number of samples to generate for each valid combination of prevalence values (default 1)

Returns

yields samples generated at artificially controlled prevalence values
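To make the grid concrete, the following dependency-free sketch enumerates the valid prevalence combinations that the APP explores. It does not call QuaPy; the function name prevalence_grid and the enumeration order are assumptions made for illustration only.

```python
from itertools import product


def prevalence_grid(n_classes, n_prevalences):
    """Enumerate every prevalence vector on a regular grid that sums to 1."""
    steps = n_prevalences - 1  # e.g. 21 points -> increments of 1/20 = 0.05
    vectors = []
    # Pick integer grid steps for the first n_classes-1 classes; the last
    # class takes whatever is left, so it is fully constrained.
    for head in product(range(steps + 1), repeat=n_classes - 1):
        if sum(head) <= steps:
            counts = head + (steps - sum(head),)
            vectors.append(tuple(c / steps for c in counts))
    return vectors


grid = prevalence_grid(n_classes=3, n_prevalences=21)
print(len(grid))   # number of valid combinations for 3 classes
print(grid[:2])    # first two combinations
print(grid[-1])    # last combination
```

Under this sketch, 3 classes and n_prevalences=21 give 231 valid combinations, so a generator following the same grid would yield 231 * repeats samples.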

artificial_sampling_index_generator(sample_size, n_prevalences=101, repeats=1)

A generator of sample indexes implementing the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values (e.g., [0, 0.05, 0.1, 0.15, …, 1]), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of sample indexes for each valid combination of prevalence values is indicated by repeats.

Parameters
  • sample_size – the number of instances in each sample (i.e., length of each index)

  • n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]

  • repeats – the number of samples to generate for each valid combination of prevalence values (default 1)

Returns

yields the indexes that generate the samples according to APP

property binary

Returns True if the number of classes is 2

Returns

boolean

counts()

Returns the number of instances for each of the classes of interest.

Returns

a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_

kFCV(nfolds=5, nrepeats=1, random_state=0)

Generator of stratified folds to be used in k-fold cross validation.

Parameters
  • nfolds – integer (default 5), the number of folds to generate

  • nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run

  • random_state – integer (default 0), guarantees that the folds generated are reproducible

Returns

yields nfolds * nrepeats folds for k-fold cross validation
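The idea of repeated stratified k-fold cross validation can be sketched without QuaPy as follows. The function stratified_folds is hypothetical and works on index lists only; how QuaPy builds its folds internally is not stated here, so the round-robin assignment is an assumption of this sketch.

```python
import random
from collections import defaultdict


def stratified_folds(labels, nfolds=5, nrepeats=1, random_state=0):
    """Yield (train_index, test_index) pairs, nfolds * nrepeats in total."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(nrepeats):
        folds = [[] for _ in range(nfolds)]
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            # Deal the instances of each class round-robin so that every
            # fold keeps (approximately) the class proportions.
            for pos, i in enumerate(shuffled):
                folds[pos % nfolds].append(i)
        for k in range(nfolds):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


labels = [0] * 20 + [1] * 10
splits = list(stratified_folds(labels, nfolds=5, nrepeats=2))
print(len(splits))  # nfolds * nrepeats
```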

classmethod load(path: str, loader_func: callable, classes=None, **loader_kwargs)

Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.

Parameters
  • path – string, the path to the file containing the labelled instances

  • loader_func – a custom function that implements the data loader and returns a tuple with instances and labels

  • classes – array-like, the classes according to which the instances are labelled

  • loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., these arguments are used to call loader_func(path, **loader_kwargs)

Returns

a LabelledCollection object
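As an illustration of what a custom loader_func might look like, the sketch below reads a file with one "<label><TAB><text>" pair per line and returns a tuple with instances and labels. The function name, the file format, and the encoding keyword are assumptions chosen for this example; they are not part of QuaPy.

```python
import os
import tempfile


def tab_separated_loader(path, encoding="utf-8"):
    """Hypothetical loader: returns (instances, labels) read from `path`."""
    instances, labels = [], []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            label, text = line.split("\t", 1)
            labels.append(int(label))
            instances.append(text)
    return instances, labels


# Small self-contained demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("1\tgreat book\n0\tnot my thing\n")
    docs, y = tab_separated_loader(path, encoding="utf-8")

print(docs, y)
```

Following the description above, such a function would be passed as loader_func, and extra keyword arguments (here encoding) would be forwarded to it as loader_func(path, **loader_kwargs).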

property n_classes

The number of classes

Returns

integer

natural_sampling_generator(sample_size, repeats=100)

A generator of samples that implements the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.

Parameters
  • sample_size – integer, the number of instances in each sample

  • repeats – the number of samples to generate

Returns

yields instances of LabelledCollection

natural_sampling_index_generator(sample_size, repeats=100)

A generator of sample indexes according to the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.

Parameters
  • sample_size – integer, the number of instances in each sample (i.e., the length of each index)

  • repeats – the number of indexes to generate

Returns

yields repeats instances of np.ndarray with shape (sample_size,)
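A minimal sketch of the NPP on indexes, using only the standard library. The function natural_index_samples and its seed parameter are hypothetical; the sketch assumes that repetitions are allowed only when the requested sample is larger than the collection.

```python
import random


def natural_index_samples(n_instances, sample_size, repeats=100, seed=0):
    """Yield `repeats` lists of indexes drawn uniformly at random."""
    rng = random.Random(seed)
    for _ in range(repeats):
        if sample_size > n_instances:
            # More indexes requested than instances available:
            # repetitions are unavoidable.
            yield [rng.randrange(n_instances) for _ in range(sample_size)]
        else:
            yield rng.sample(range(n_instances), sample_size)


samples = list(natural_index_samples(n_instances=50, sample_size=10, repeats=3))
print(len(samples), [len(s) for s in samples])
```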

prevalence()

Returns the prevalence, or relative frequency, of the classes of interest.

Returns

a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
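The relation between counts and prevalence values can be shown with a short plain-Python sketch. The helper names class_counts and class_prevalence are illustrative and return lists rather than np.ndarray; note how a class without examples gets a prevalence of 0, which is why the classes must be given explicitly in that case.

```python
from collections import Counter


def class_counts(labels, classes):
    """Number of instances per class, in the order given by `classes`."""
    tally = Counter(labels)
    return [tally[c] for c in classes]


def class_prevalence(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    n = len(labels)
    return [k / n for k in class_counts(labels, classes)]


labels = [0, 1, 1, 1, 0, 1, 1, 1]
classes = [0, 1, 2]  # class 2 has no examples
print(class_counts(labels, classes))      # [2, 6, 0]
print(class_prevalence(labels, classes))  # [0.25, 0.75, 0.0]
```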

sampling(size, *prevs, shuffle=True)

Returns a random sample (an instance of LabelledCollection) of desired size and desired prevalence values. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.

Parameters
  • size – integer, the requested size

  • prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p

  • shuffle – if set to True (default), shuffles the index before returning it

Returns

an instance of LabelledCollection with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)

sampling_from_index(index)

Returns an instance of LabelledCollection whose elements are sampled from this collection using the index.

Parameters

index – np.ndarray

Returns

an instance of LabelledCollection

sampling_index(size, *prevs, shuffle=True)

Returns an index to be used to extract a random sample of desired size and desired prevalence values. If the prevalence values are not specified, then returns the index of a uniform sampling. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.

Parameters
  • size – integer, the requested size

  • prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p

  • shuffle – if set to True (default), shuffles the index before returning it

Returns

a np.ndarray of shape (size) with the indexes
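The following sketch shows one way an index at controlled prevalence could be built. It is not QuaPy's implementation: the function prevalence_index, its seed parameter, and the rounding scheme are assumptions, and the uniform fallback for unspecified prevalence values is omitted. Repetitions are used only when a class has fewer instances than requested.

```python
import random


def prevalence_index(labels, classes, size, *prevs, shuffle=True, seed=0):
    """Return a list of indexes whose labels follow the requested prevalences."""
    rng = random.Random(seed)
    prevs = list(prevs)
    if len(prevs) == len(classes) - 1:
        prevs.append(1.0 - sum(prevs))  # the last class is constrained
    assert len(prevs) == len(classes) and abs(sum(prevs) - 1.0) < 1e-9
    # Convert prevalences into per-class counts through cumulative rounding,
    # which guarantees that the counts are non-negative and sum to `size`.
    requested, cumulative, previous_cut = [], 0.0, 0
    for p in prevs:
        cumulative += p
        cut = round(size * cumulative)
        requested.append(cut - previous_cut)
        previous_cut = cut
    index = []
    for cls, n_req in zip(classes, requested):
        candidates = [i for i, y in enumerate(labels) if y == cls]
        if n_req > len(candidates):
            index += rng.choices(candidates, k=n_req)  # with replacement
        else:
            index += rng.sample(candidates, n_req)     # without replacement
    if shuffle:
        rng.shuffle(index)
    return index


labels = [0] * 90 + [1] * 10
balanced = prevalence_index(labels, [0, 1], 20, 0.5)
skewed = prevalence_index(labels, [0, 1], 20, 0.2)  # needs 16 of only 10 positives
print(len(balanced), sum(labels[i] for i in balanced))
print(len(skewed), sum(labels[i] for i in skewed))
```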

split_stratified(train_prop=0.6, random_state=None)

Returns two instances of LabelledCollection split with stratification from this collection, at desired proportion.

Parameters
  • train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The remaining elements are included in the right-most returned collection (typically used as a test collection).

  • random_state – if specified, guarantees reproducibility of the split.

Returns

two instances of LabelledCollection, the first containing a train_prop proportion of the elements and the second containing the remaining 1-train_prop proportion
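A stratified split can be sketched on index lists as follows. The function stratified_split_index is hypothetical and independent of how QuaPy performs the split; it only illustrates that each class is divided at the same proportion.

```python
import random
from collections import defaultdict


def stratified_split_index(labels, train_prop=0.6, random_state=None):
    """Return (train_index, test_index), splitting each class at train_prop."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_prop)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 10
train_idx, test_idx = stratified_split_index(labels, train_prop=0.6, random_state=0)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))
```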

stats(show=True)

Returns (and optionally prints) a dictionary with some stats of this collection. E.g.:

>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
#instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]

Parameters

show – if set to True (default), prints the stats in standard output

Returns

a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)

uniform_sampling(size)

Returns a uniform sample (an instance of LabelledCollection) of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.

Parameters

size – integer, the requested size

Returns

an instance of LabelledCollection with length == size

uniform_sampling_index(size)

Returns an index to be used to extract a uniform sample of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.

Parameters

size – integer, the size of the uniform sample

Returns

a np.ndarray of shape (size) with the indexes

quapy.data.base.isbinary(data)

Returns True if data is either a binary Dataset or a binary LabelledCollection

Parameters

data – a Dataset or a LabelledCollection object

Returns

True if labelled according to two classes

quapy.data.datasets module

quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) → quapy.data.base.Dataset

Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and in Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further information on how to use these collections), so a train-test split is generated at the desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS.

Parameters
  • dataset_name – a dataset name

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • test_split – proportion of instances to be included in the test set. The rest form the training set

  • verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets

Returns

a quapy.data.base.Dataset instance

quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) → quapy.data.base.Dataset

Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and in Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split, so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5-fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:

>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCILabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
...     ...

The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS

Parameters
  • dataset_name – a dataset name

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets

Returns

a quapy.data.base.LabelledCollection instance

quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset

Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS.

Parameters
  • dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’

  • tfidf – set to True to transform the raw documents into tfidf weighted matrices

  • min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations

Returns

a quapy.data.base.Dataset instance

quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset

Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). Note that the datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set. The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST.

Parameters
  • dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’

  • for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set

  • min_df – minimum number of documents that should contain a term in order for the term to be kept

  • data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)

  • pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations

Returns

a quapy.data.base.Dataset instance

quapy.data.datasets.warn(*args, **kwargs)

quapy.data.preprocessing module

class quapy.data.preprocessing.IndexTransformer(**kwargs)

Bases: object

This class implements a sklearn-style transformer that indexes text as the numerical ids of the tokens it contains, using the vocabulary that would be generated by sklearn’s CountVectorizer.

Parameters

kwargs – keyword arguments for CountVectorizer

add_word(word, id=None, nogaps=True)

Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.

Parameters
  • word – string, surface form of the token

  • id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)

  • nogaps – if set to True (default) asserts that the id indicated leads to no numerical gaps with the preceding ids stored so far

Returns

integer, the numerical id for the new token

fit(X)

Fits the transformer, i.e., decides on the vocabulary, given a list of strings.

Parameters

X – a list of strings

Returns

self

fit_transform(X, n_jobs=-1)

Fits the transform on X and transforms it.

Parameters
  • X – a list of strings

  • n_jobs – the number of parallel workers to carry out this task

Returns

a np.ndarray of numerical ids

transform(X, n_jobs=-1)

Transforms the strings in X into lists of numerical ids

Parameters
  • X – a list of strings

  • n_jobs – the number of parallel workers to carry out this task

Returns

a np.ndarray of numerical ids

vocabulary_size()

Gets the length of the vocabulary according to which the document tokens have been indexed

Returns

integer

quapy.data.preprocessing.index(dataset: quapy.data.base.Dataset, min_df=5, inplace=False, **kwargs)

Indexes the tokens of a textual quapy.data.base.Dataset of string documents. To index a document means to replace each distinct token by a unique numerical index. Rare words (i.e., words occurring fewer than min_df times) are replaced by a special token UNK.

Parameters
  • dataset – a quapy.data.base.Dataset object where the instances of training and test documents are lists of str

  • min_df – minimum number of occurrences below which the term is replaced by a UNK index

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

  • kwargs – the rest of parameters of the transformation (as for sklearn’s CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Returns

a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) consisting of lists of integer values representing indices.
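The indexing step can be illustrated with a small standard-library sketch. It is not QuaPy's implementation: the function index_documents is hypothetical, tokens are obtained by plain whitespace splitting, min_df is interpreted as a document frequency computed on the training documents, and the UNK id is placed at the end of the vocabulary. All of these are assumptions.

```python
from collections import Counter


def index_documents(train_docs, test_docs, min_df=2, unk="UNK"):
    """Replace tokens by integer ids; rare or unseen tokens map to the UNK id."""
    # Document frequency of each token in the training documents.
    df = Counter(tok for doc in train_docs for tok in set(doc.split()))
    kept = sorted(tok for tok, n in df.items() if n >= min_df)
    vocab = {tok: i for i, tok in enumerate(kept)}
    vocab[unk] = len(vocab)

    def encode(doc):
        return [vocab.get(tok, vocab[unk]) for tok in doc.split()]

    return [encode(d) for d in train_docs], [encode(d) for d in test_docs], vocab


train_ids, test_ids, vocab = index_documents(
    ["good book", "good film", "bad film"], ["good play"], min_df=2
)
print(vocab)      # {'film': 0, 'good': 1, 'UNK': 2}
print(train_ids)  # [[1, 2], [1, 0], [2, 0]]
print(test_ids)   # [[1, 2]]
```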

quapy.data.preprocessing.reduce_columns(dataset: quapy.data.base.Dataset, min_df=5, inplace=False)

Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set

Parameters
  • dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)

  • min_df – integer, minimum number of instances below which the columns are removed

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

Returns

a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
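The column filter described above can be sketched in plain Python. To stay dependency-free, each sparse row is represented here as a {column: value} dictionary instead of a csr_matrix; the function filter_columns and this representation are assumptions made for illustration.

```python
def filter_columns(train_rows, test_rows, n_columns, min_df=5):
    """Keep only columns that are non-zero in at least min_df training rows."""
    df = [0] * n_columns
    for row in train_rows:
        for col, val in row.items():
            if val != 0:
                df[col] += 1
    keep = [c for c in range(n_columns) if df[c] >= min_df]
    remap = {old: new for new, old in enumerate(keep)}

    def project(rows):
        # Drop removed columns and renumber the surviving ones.
        return [{remap[c]: v for c, v in row.items() if c in remap} for row in rows]

    return project(train_rows), project(test_rows), keep


train = [{0: 1.0, 1: 2.0}, {0: 3.0}, {2: 1.0, 0: 0.5}]
test = [{1: 1.0, 0: 2.0}]
new_train, new_test, kept = filter_columns(train, test, n_columns=3, min_df=2)
print(kept)       # [0]
print(new_train)  # [{0: 1.0}, {0: 3.0}, {0: 0.5}]
print(new_test)   # [{0: 2.0}]
```

Note that the document frequencies are computed on the training rows only, and the same columns are then removed from the test rows.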

quapy.data.preprocessing.standardize(dataset: quapy.data.base.Dataset, inplace=False)

Standardizes the real-valued columns of a quapy.data.base.Dataset. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.

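Parameters
  • dataset – a quapy.data.base.Dataset object

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

Returns

a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) with standardized real-valued columns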
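As a worked illustration of z-scoring, the sketch below standardizes each column of a small dense matrix. The function zscore_columns is hypothetical; it assumes that the mean and standard deviation are estimated on the training rows and then applied to the test rows, and that the population standard deviation is used. Neither assumption is confirmed by the text above.

```python
from statistics import fmean, pstdev


def zscore_columns(train, test):
    """Standardize columns using statistics estimated on the training rows."""
    columns = list(zip(*train))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]

    def scale(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

    return scale(train), scale(test)


z_train, z_test = zscore_columns([[1.0, 10.0], [3.0, 10.0]], [[2.0, 12.0]])
print(z_train)  # [[-1.0, 0.0], [1.0, 0.0]]
print(z_test)   # [[0.0, 2.0]]
```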

quapy.data.preprocessing.text2tfidf(dataset: quapy.data.base.Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)

Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of tfidf weighted sparse vectors

Parameters
  • dataset – a quapy.data.base.Dataset where the instances of training and test collections are lists of str

  • min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)

  • sublinear_tf – whether or not to apply log scaling to the tf counts (default True)

  • inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)

  • kwargs – the rest of parameters of the transformation (as for sklearn’s TfidfVectorizer)

Returns

a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
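The effect of sublinear_tf can be seen in a small standalone sketch of tfidf weighting. It follows the smoothed idf, log-scaled tf, and L2 normalization that scikit-learn's TfidfVectorizer applies by default, but it is a simplified illustration: the function tfidf_vectors is hypothetical, tokens are split on whitespace, min_df is ignored, and vectors are returned as dictionaries rather than a csr_matrix.

```python
import math
from collections import Counter


def tfidf_vectors(docs, sublinear_tf=True):
    """Return one {token: weight} dictionary per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}
    vectors = []
    for toks in tokenized:
        weights = {}
        for tok, tf in Counter(toks).items():
            if sublinear_tf:
                tf = 1 + math.log(tf)  # dampen the effect of repeated terms
            weights[tok] = tf * idf[tok]
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in weights.items()})
    return vectors


docs = ["a a b", "a c"]
sub = tfidf_vectors(docs, sublinear_tf=True)
lin = tfidf_vectors(docs, sublinear_tf=False)
print(round(sub[0]["a"], 3), round(lin[0]["a"], 3))
```

With log scaling, the repeated token "a" in the first document receives a smaller share of the vector than it does with raw counts.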

quapy.data.reader module

quapy.data.reader.binarize(y, pos_class)

Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.:

>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
array([0, 1, 0, 0, 0, 0])

Parameters
  • y – array-like of labels

  • pos_class – integer, the positive class

Returns

a binary np.ndarray, in which the value 1 corresponds to positions in which y had pos_class labels, and 0 otherwise

quapy.data.reader.from_csv(path, encoding='utf-8')

Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>

Parameters
  • path – path to the csv file

  • encoding – the text encoding used to open the file

Returns

a np.ndarray for the labels and a ndarray (float) for the covariates

quapy.data.reader.from_sparse(path)

Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1>[s col(int):val(float)]

Parameters

path – path to the labelled collection

Returns

a csr_matrix containing the instances (rows), and a ndarray containing the labels
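To illustrate the file format, the sketch below parses a single line into a label and a {column: value} dictionary. The function parse_sparse_line is hypothetical, and it assumes that the separator written as "s" in the format stands for whitespace; whether column indexes start at 0 or 1 is not stated above, so they are kept as read.

```python
def parse_sparse_line(line):
    """Parse '<label> col:val col:val ...' into (label, {col: val})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        col, val = item.split(":")
        features[int(col)] = float(val)
    return label, features


print(parse_sparse_line("1 3:0.5 10:1.25"))  # (1, {3: 0.5, 10: 1.25})
print(parse_sparse_line("-1"))               # (-1, {})
```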

quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)

Reads a labelled collection of documents. File format: <0 or 1> <document>

Parameters
  • path – path to the labelled collection

  • encoding – the text encoding used to open the file

  • verbose – if >0 (default) shows some progress information in standard output

Returns

a list of sentences, and a list of labels

quapy.data.reader.reindex_labels(y)

Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:

>>> reindex_labels(['B', 'B', 'A', 'C'])
(array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))

Parameters

y – the list or array of original labels

Returns

a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.
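The behaviour shown in the example can be reproduced with a short plain-Python sketch. The function reindex is illustrative only and returns lists instead of ndarray objects; it assumes that class names are ordered by sorting, which matches the example above.

```python
def reindex(y):
    """Map each label to the position of its class name in sorted order."""
    classnames = sorted(set(y))
    position = {name: i for i, name in enumerate(classnames)}
    return [position[label] for label in y], classnames


indexes, classnames = reindex(["B", "B", "A", "C"])
print(indexes)     # [1, 1, 0, 2]
print(classnames)  # ['A', 'B', 'C']
```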

Module contents