quapy.data package¶
Submodules¶
quapy.data.base module¶
- class quapy.data.base.Dataset(training: quapy.data.base.LabelledCollection, test: quapy.data.base.LabelledCollection, vocabulary: Optional[dict] = None, name='')¶
Bases: object
Abstraction of training and test LabelledCollection objects.
- Parameters
training – a LabelledCollection instance
test – a LabelledCollection instance
vocabulary – if indicated, a dictionary of the terms used in this textual dataset
name – a string representing the name of the dataset
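A minimal construction sketch (the toy documents and labels below are made up for illustration):
>>> from quapy.data.base import Dataset, LabelledCollection
>>> train = LabelledCollection(['doc 1', 'doc 2', 'doc 3', 'doc 4'], [0, 1, 0, 1])
>>> test = LabelledCollection(['doc 5', 'doc 6'], [1, 0])
>>> data = Dataset(train, test, name='toy')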
- classmethod SplitStratified(collection: quapy.data.base.LabelledCollection, train_size=0.6)¶
Generates a Dataset from a stratified split of a LabelledCollection instance. See LabelledCollection.split_stratified().
- Parameters
collection – a LabelledCollection instance
train_size – the proportion of training documents (the rest constitutes the test split)
- Returns
an instance of Dataset
- property binary¶
Returns True if the training collection is labelled according to two classes
- Returns
boolean
- property classes_¶
The classes according to which the training collection is labelled
- Returns
The classes according to which the training collection is labelled
- classmethod kFCV(data: quapy.data.base.LabelledCollection, nfolds=5, nrepeats=1, random_state=0)¶
Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.
- Parameters
data – the LabelledCollection from which the folds are generated
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
- Returns
yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset
- classmethod load(train_path, test_path, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a training and a test labelled set of data and converts them into a Dataset instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
- Parameters
train_path – string, the path to the file containing the training instances
test_path – string, the path to the file containing the test instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances. See LabelledCollection.load() for further details.
- Returns
a Dataset object
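A usage sketch (the file paths are hypothetical; any reading function from quapy.data.reader, or a custom one, can be passed as loader_func):
>>> from quapy.data.base import Dataset
>>> from quapy.data.reader import from_text
>>> data = Dataset.load('./train.txt', './test.txt', loader_func=from_text)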
- property n_classes¶
The number of classes according to which the training collection is labelled
- Returns
integer
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this dataset. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
>>> Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]
- Parameters
show – if set to True (default), prints the stats in standard output
- Returns
a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- property vocabulary_size¶
If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary
- Returns
integer
- class quapy.data.base.LabelledCollection(instances, labels, classes_=None)¶
Bases: object
A LabelledCollection is a set of objects, each associated with a label. This class implements many sampling routines.
- Parameters
instances – array-like (np.ndarray, list, or csr_matrix are supported)
labels – array-like with the same length of instances
classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)
- property Xy¶
Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:
>>> svm = LinearSVC().fit(*my_collection.Xy)
- Returns
a tuple (instances, labels) from this collection
- artificial_sampling_generator(sample_size, n_prevalences=101, repeats=1)¶
A generator of samples that implements the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values containing n_prevalences points (e.g., [0, 0.05, 0.1, 0.15, …, 1], if n_prevalences=21), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of samples for each valid combination of prevalence values is indicated by repeats.
- Parameters
sample_size – the number of instances in each sample
n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]
repeats – the number of samples to generate for each valid combination of prevalence values (default 1)
- Returns
yields samples generated at artificially controlled prevalence values
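A usage sketch, assuming my_collection is a binary LabelledCollection:
>>> for sample in my_collection.artificial_sampling_generator(sample_size=100, n_prevalences=11, repeats=1):
>>>     print(sample.prevalence())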
- artificial_sampling_index_generator(sample_size, n_prevalences=101, repeats=1)¶
A generator of sample indexes implementing the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values (e.g., [0, 0.05, 0.1, 0.15, …, 1]), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of sample indexes for each valid combination of prevalence values is indicated by repeats
- Parameters
sample_size – the number of instances in each sample (i.e., length of each index)
n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]
repeats – the number of samples to generate for each valid combination of prevalence values (default 1)
- Returns
yields the indexes that generate the samples according to the APP
- property binary¶
Returns True if the number of classes is 2
- Returns
boolean
- counts()¶
Returns the number of instances for each of the classes of interest.
- Returns
a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_
- kFCV(nfolds=5, nrepeats=1, random_state=0)¶
Generator of stratified folds to be used in k-fold cross validation.
- Parameters
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
- Returns
yields nfolds * nrepeats folds for k-fold cross validation
- classmethod load(path: str, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
- Parameters
path – string, the path to the file containing the labelled instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., these arguments are used to call loader_func(path, **loader_kwargs)
- Returns
a LabelledCollection object
- property n_classes¶
The number of classes
- Returns
integer
- natural_sampling_generator(sample_size, repeats=100)¶
A generator of samples that implements the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
- Parameters
sample_size – integer, the number of instances in each sample
repeats – the number of samples to generate
- Returns
yields instances of LabelledCollection
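A usage sketch, assuming my_collection is a LabelledCollection; each yielded sample should exhibit a prevalence close to that of the original collection:
>>> for sample in my_collection.natural_sampling_generator(sample_size=100, repeats=10):
>>>     print(sample.prevalence())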
- natural_sampling_index_generator(sample_size, repeats=100)¶
A generator of sample indexes according to the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
- Parameters
sample_size – integer, the number of instances in each sample (i.e., the length of each index)
repeats – the number of indexes to generate
- Returns
yields repeats instances of np.ndarray with shape (sample_size,)
- prevalence()¶
Returns the prevalence, or relative frequency, of the classes of interest.
- Returns
a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
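A small illustration with made-up data (the instances and labels are arbitrary):
>>> from quapy.data.base import LabelledCollection
>>> lc = LabelledCollection([[0.0], [1.0], [2.0], [3.0]], [0, 0, 0, 1])
>>> lc.counts()      # array([3, 1])
>>> lc.prevalence()  # array([0.75, 0.25])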
- sampling(size, *prevs, shuffle=True)¶
Returns a random sample (an instance of LabelledCollection) of the desired size and desired prevalence values. For each class, the sampling is drawn with replacement if the number of instances required for that class exceeds the number of available instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
- Returns
an instance of LabelledCollection with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)
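A usage sketch, assuming my_collection is a binary LabelledCollection (the prevalence value 0.3 is arbitrary):
>>> sample = my_collection.sampling(250, 0.3)  # 250 instances, ~30% of the first class in self.classes_
>>> sample.prevalence()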
- sampling_from_index(index)¶
Returns an instance of LabelledCollection whose elements are sampled from this collection using the index.
- Parameters
index – np.ndarray
- Returns
an instance of LabelledCollection
- sampling_index(size, *prevs, shuffle=True)¶
Returns an index to be used to extract a random sample of the desired size and desired prevalence values. If the prevalence values are not specified, then the index of a uniform sampling is returned. For each class, the sampling is drawn with replacement if the number of instances required for that class exceeds the number of available instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
- Returns
a np.ndarray of shape (size) with the indexes
- split_stratified(train_prop=0.6, random_state=None)¶
Returns two instances of LabelledCollection obtained as a stratified split of this collection, at the desired proportion.
- Parameters
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The remaining elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
- Returns
two instances of LabelledCollection, the first one containing a train_prop proportion of the elements, and the second one containing the remaining 1-train_prop proportion
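A usage sketch, assuming my_collection is a LabelledCollection:
>>> training, test = my_collection.split_stratified(train_prop=0.7, random_state=0)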
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this collection. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
>>> #instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]
- Parameters
show – if set to True (default), prints the stats in standard output
- Returns
a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- uniform_sampling(size)¶
Returns a uniform sample (an instance of LabelledCollection) of the desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
- Returns
an instance of LabelledCollection with length == size
- uniform_sampling_index(size)¶
Returns an index to be used to extract a uniform sample of the desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, and without replacement otherwise.
- Parameters
size – integer, the size of the uniform sample
- Returns
a np.ndarray of shape (size) with the indexes
- quapy.data.base.isbinary(data)¶
Returns True if data is either a binary Dataset or a binary LabelledCollection
- Parameters
data – a Dataset or a LabelledCollection object
- Returns
True if the data is labelled according to two classes
quapy.data.datasets module¶
- quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) quapy.data.base.Dataset ¶
Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further information on how to use these collections), and so a train-test split is generated at the desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest constitutes the training set.
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns
a quapy.data.base.Dataset instance
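A usage sketch (‘yeast’ is one of the valid names in quapy.data.datasets.UCI_DATASETS):
>>> import quapy as qp
>>> data = qp.datasets.fetch_UCIDataset('yeast', test_split=0.3)
>>> data.training.stats()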
- quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) quapy.data.base.LabelledCollection ¶
Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split, and so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5-fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:
>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCILabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
>>>     ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns
a quapy.data.base.LabelledCollection instance
- quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) quapy.data.base.Dataset ¶
Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS
- Parameters
dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a quapy.data.base.Dataset instance
- quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) quapy.data.base.Dataset ¶
Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). Note that the datasets ‘semeval13’, ‘semeval14’, and ‘semeval15’ share the same training set. The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST
- Parameters
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a quapy.data.base.Dataset instance
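A usage sketch (‘semeval16’ is one of the valid dataset names listed above):
>>> import quapy as qp
>>> data = qp.datasets.fetch_twitter('semeval16', min_df=5, pickle=True)
>>> data.stats()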
- quapy.data.datasets.warn(*args, **kwargs)¶
quapy.data.preprocessing module¶
- class quapy.data.preprocessing.IndexTransformer(**kwargs)¶
Bases: object
This class implements a sklearn-style transformer that indexes text as numerical ids for the tokens it contains, where the tokens are those that would be generated by sklearn’s CountVectorizer
- Parameters
kwargs – keyword arguments passed to CountVectorizer
- add_word(word, id=None, nogaps=True)¶
Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.
- Parameters
word – string, surface form of the token
id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)
nogaps – if set to True (default), asserts that the indicated id leaves no numerical gaps with respect to the ids stored so far
- Returns
integer, the numerical id for the new token
- fit(X)¶
Fits the transformer, i.e., decides on the vocabulary, given a list of strings.
- Parameters
X – a list of strings
- Returns
self
- fit_transform(X, n_jobs=- 1)¶
Fits the transformer on X and transforms it.
- Parameters
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns
a np.ndarray of numerical ids
- transform(X, n_jobs=- 1)¶
Transforms the strings in X into lists of numerical ids
- Parameters
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns
a np.ndarray of numerical ids
- vocabulary_size()¶
Gets the length of the vocabulary according to which the document tokens have been indexed
- Returns
integer
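A minimal sketch of the typical fit/transform workflow (the toy texts are made up; min_df is forwarded to CountVectorizer):
>>> from quapy.data.preprocessing import IndexTransformer
>>> texts = ['a first document', 'a second document', 'and a third one']
>>> indexer = IndexTransformer(min_df=1)
>>> X = indexer.fit_transform(texts)  # numerical ids, one sequence of ids per document
>>> indexer.vocabulary_size()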
- quapy.data.preprocessing.index(dataset: quapy.data.base.Dataset, min_df=5, inplace=False, **kwargs)¶
Indexes the tokens of a textual quapy.data.base.Dataset of string documents. To index a document means to replace each different token by a unique numerical index. Rare words (i.e., words occurring less than min_df times) are replaced by a special token UNK
- Parameters
dataset – a quapy.data.base.Dataset object where the instances of training and test documents are lists of str
min_df – minimum number of occurrences below which the term is replaced by a UNK index
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- Returns
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (if inplace=True) consisting of lists of integer values representing indices
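A usage sketch on raw text (fetch_reviews with tfidf=False keeps the documents as str):
>>> import quapy as qp
>>> from quapy.data.preprocessing import index
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=False)
>>> data = index(data, min_df=5, inplace=False)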
- quapy.data.preprocessing.reduce_columns(dataset: quapy.data.base.Dataset, min_df=5, inplace=False)¶
Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set
- Parameters
dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – integer, minimum number of instances below which the columns are removed
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
- Returns
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (if inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
- quapy.data.preprocessing.standardize(dataset: quapy.data.base.Dataset, inplace=False)¶
Standardizes the real-valued columns of a quapy.data.base.Dataset. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.
- Parameters
dataset – a quapy.data.base.Dataset object
inplace – set to True if the transformation is to be applied inplace, or to False (default) if a new quapy.data.base.Dataset is to be returned
- Returns
the standardized quapy.data.base.Dataset (a new one if inplace=False, or a reference to the current one if inplace=True)
- quapy.data.preprocessing.text2tfidf(dataset: quapy.data.base.Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)¶
Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of tfidf weighted sparse vectors
- Parameters
dataset – a quapy.data.base.Dataset where the instances of training and test collections are lists of str
min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)
sublinear_tf – whether or not to apply log scaling to the tf counters (default True)
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s TfidfVectorizer)
- Returns
a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
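A usage sketch (assuming a textual dataset, e.g., the one returned by fetch_reviews with tfidf=False):
>>> import quapy as qp
>>> from quapy.data.preprocessing import text2tfidf
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=False)
>>> data = text2tfidf(data, min_df=3)
>>> data.training.instances  # a csr_matrix of tfidf scores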
quapy.data.reader module¶
- quapy.data.reader.binarize(y, pos_class)¶
Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.,:
>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
>>> array([0, 1, 0, 0, 0, 0])
- Parameters
y – array-like of labels
pos_class – integer, the positive class
- Returns
a binary np.ndarray, in which values of 1 correspond to positions in which y had pos_class labels, and 0 otherwise
- quapy.data.reader.from_csv(path, encoding='utf-8')¶
Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>
- Parameters
path – path to the csv file
encoding – the text encoding used to open the file
- Returns
a np.ndarray for the labels and a ndarray (float) for the covariates
- quapy.data.reader.from_sparse(path)¶
Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1> followed by whitespace-separated col(int):val(float) pairs
- Parameters
path – path to the labelled collection
- Returns
a csr_matrix containing the instances (rows), and a ndarray containing the labels
- quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)¶
Reads a labelled collection of documents. File format: <0 or 1> <document>
- Parameters
path – path to the labelled collection
encoding – the text encoding used to open the file
verbose – if >0 (default) shows some progress information in standard output
- Returns
a list of sentences, and a list of labels
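A usage sketch showing how a reading function is typically combined with LabelledCollection.load() (the path is hypothetical):
>>> from quapy.data.base import LabelledCollection
>>> from quapy.data.reader import from_text
>>> collection = LabelledCollection.load('./reviews.txt', loader_func=from_text)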
- quapy.data.reader.reindex_labels(y)¶
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:
>>> reindex_labels(['B', 'B', 'A', 'C'])
>>> (array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))
- Parameters
y – the list or array of original labels
- Returns
a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.