quapy.data package¶
Submodules¶
quapy.data.base module¶
- class quapy.data.base.Dataset(training: quapy.data.base.LabelledCollection, test: quapy.data.base.LabelledCollection, vocabulary: Optional[dict] = None, name='')¶
Bases: object
Abstraction of training and test LabelledCollection objects.
- Parameters
training – a LabelledCollection instance
test – a LabelledCollection instance
vocabulary – if indicated, a dictionary of the terms used in this textual dataset
name – a string representing the name of the dataset
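A minimal construction sketch (the toy documents and labels below are made up for illustration):
>>> from quapy.data.base import Dataset, LabelledCollection
>>> train = LabelledCollection(['doc 1', 'doc 2', 'doc 3', 'doc 4'], [0, 1, 0, 1])
>>> test = LabelledCollection(['doc 5', 'doc 6'], [1, 0])
>>> data = Dataset(train, test, name='toy')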
- classmethod SplitStratified(collection: quapy.data.base.LabelledCollection, train_size=0.6)¶
Generates a Dataset from a stratified split of a LabelledCollection instance. See LabelledCollection.split_stratified().
- Parameters
collection – a LabelledCollection instance
train_size – the proportion of training documents (the rest constitutes the test split)
- Returns
an instance of Dataset
- property binary¶
Returns True if the training collection is labelled according to two classes
- Returns
boolean
- property classes_¶
The classes according to which the training collection is labelled
- Returns
The classes according to which the training collection is labelled
- classmethod kFCV(data: quapy.data.base.LabelledCollection, nfolds=5, nrepeats=1, random_state=0)¶
Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.
- Parameters
data – the LabelledCollection from which the folds are generated
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
- Returns
yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset
- classmethod load(train_path, test_path, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a training and a test labelled set of data and converts them into a Dataset instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
- Parameters
train_path – string, the path to the file containing the training instances
test_path – string, the path to the file containing the test instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances. See LabelledCollection.load() for further details.
- Returns
a Dataset object
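A usage sketch (the file paths are hypothetical; any reading function from quapy.data.reader, or a custom one, can be passed as loader_func):
>>> from quapy.data.base import Dataset
>>> from quapy.data.reader import from_text
>>> data = Dataset.load('./train.txt', './test.txt', loader_func=from_text)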
- property n_classes¶
The number of classes according to which the training collection is labelled
- Returns
integer
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this dataset. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
>>> Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]
- Parameters
show – if set to True (default), prints the stats in standard output
- Returns
a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- property vocabulary_size¶
If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary
- Returns
integer
- class quapy.data.base.LabelledCollection(instances, labels, classes_=None)¶
Bases: object
A LabelledCollection is a set of objects, each associated with a label. This class implements many sampling routines.
- Parameters
instances – array-like (np.ndarray, list, or csr_matrix are supported)
labels – array-like with the same length of instances
classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)
- property Xy¶
Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:
>>> svm = LinearSVC().fit(*my_collection.Xy)
- Returns
a tuple (instances, labels) from this collection
- artificial_sampling_generator(sample_size, n_prevalences=101, repeats=1)¶
A generator of samples that implements the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values containing n_prevalences points (e.g., [0, 0.05, 0.1, 0.15, …, 1], if n_prevalences=21), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of samples for each valid combination of prevalence values is indicated by repeats.
- Parameters
sample_size – the number of instances in each sample
n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]
repeats – the number of samples to generate for each valid combination of prevalence values (default 1)
- Returns
yields samples generated at artificially controlled prevalence values
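A usage sketch, assuming my_collection is a binary LabelledCollection:
>>> for sample in my_collection.artificial_sampling_generator(sample_size=100, n_prevalences=11, repeats=1):
>>>     print(sample.prevalence())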
- artificial_sampling_index_generator(sample_size, n_prevalences=101, repeats=1)¶
A generator of sample indexes implementing the artificial prevalence protocol (APP). The APP consists of exploring a grid of prevalence values (e.g., [0, 0.05, 0.1, 0.15, …, 1]), and generating all valid combinations of prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, [1, 0, 0] prevalence values of size sample_size will be yielded). The number of sample indexes for each valid combination of prevalence values is indicated by repeats
- Parameters
sample_size – the number of instances in each sample (i.e., length of each index)
n_prevalences – the number of prevalence points to be taken from the [0,1] interval (including the limits {0,1}). E.g., if n_prevalences=11, then the prevalence points to take are [0, 0.1, 0.2, …, 1]
repeats – the number of samples to generate for each valid combination of prevalence values (default 1)
- Returns
yields the indexes that generate the samples according to the APP
- property binary¶
Returns True if the number of classes is 2
- Returns
boolean
- counts()¶
Returns the number of instances for each of the classes of interest.
- Returns
a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_
- kFCV(nfolds=5, nrepeats=1, random_state=0)¶
Generator of stratified folds to be used in k-fold cross validation.
- Parameters
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
- Returns
yields nfolds * nrepeats folds for k-fold cross validation
- classmethod load(path: str, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
- Parameters
path – string, the path to the file containing the labelled instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., these arguments are used to call loader_func(path, **loader_kwargs)
- Returns
a LabelledCollection object
- property n_classes¶
The number of classes
- Returns
integer
- natural_sampling_generator(sample_size, repeats=100)¶
A generator of samples that implements the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
- Parameters
sample_size – integer, the number of instances in each sample
repeats – the number of samples to generate
- Returns
yields instances of LabelledCollection
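A usage sketch, assuming my_collection is a LabelledCollection; each yielded sample should exhibit a prevalence close to that of the original collection:
>>> for sample in my_collection.natural_sampling_generator(sample_size=100, repeats=10):
>>>     print(sample.prevalence())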
- natural_sampling_index_generator(sample_size, repeats=100)¶
A generator of sample indexes according to the natural prevalence protocol (NPP). The NPP consists of drawing samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
- Parameters
sample_size – integer, the number of instances in each sample (i.e., the length of each index)
repeats – the number of indexes to generate
- Returns
yields repeats instances of np.ndarray with shape (sample_size,)
- prevalence()¶
Returns the prevalence, or relative frequency, of the classes of interest.
- Returns
a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
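A small illustration with made-up data (the instances and labels are arbitrary):
>>> from quapy.data.base import LabelledCollection
>>> lc = LabelledCollection([[0.0], [1.0], [2.0], [3.0]], [0, 0, 0, 1])
>>> lc.counts()      # array([3, 1])
>>> lc.prevalence()  # array([0.75, 0.25])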
- sampling(size, *prevs, shuffle=True)¶
Returns a random sample (an instance of LabelledCollection) of the desired size and desired prevalence values. For each class, the sampling is drawn with replacement if the number of instances required for that class exceeds the number of available instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
- Returns
an instance of LabelledCollection with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)
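A usage sketch, assuming my_collection is a binary LabelledCollection (the prevalence value 0.3 is arbitrary):
>>> sample = my_collection.sampling(250, 0.3)  # 250 instances, ~30% of the first class in self.classes_
>>> sample.prevalence()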
- sampling_from_index(index)¶
Returns an instance of LabelledCollection whose elements are sampled from this collection using the index.
- Parameters
index – np.ndarray
- Returns
an instance of LabelledCollection
- sampling_index(size, *prevs, shuffle=True)¶
Returns an index to be used to extract a random sample of the desired size and desired prevalence values. If the prevalence values are not specified, then the index of a uniform sampling is returned. For each class, the sampling is drawn with replacement if the number of instances required for that class exceeds the number of available instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
- Returns
a np.ndarray of shape (size) with the indexes
- split_stratified(train_prop=0.6, random_state=None)¶
Returns two instances of LabelledCollection obtained as a stratified split of this collection, at the desired proportion.
- Parameters
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The remaining elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
- Returns
two instances of LabelledCollection, the first one containing a train_prop proportion of the elements, and the second one containing the remaining 1-train_prop proportion
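A usage sketch, assuming my_collection is a LabelledCollection:
>>> training, test = my_collection.split_stratified(train_prop=0.7, random_state=0)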
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this collection. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
>>> #instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]
- Parameters
show – if set to True (default), prints the stats in standard output
- Returns
a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- uniform_sampling(size)¶
Returns a uniform sample (an instance of LabelledCollection) of the desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, and without replacement otherwise.
- Parameters
size – integer, the requested size
- Returns
an instance of LabelledCollection with length == size
- uniform_sampling_index(size)¶
Returns an index to be used to extract a uniform sample of the desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, and without replacement otherwise.
- Parameters
size – integer, the size of the uniform sample
- Returns
a np.ndarray of shape (size) with the indexes
- quapy.data.base.isbinary(data)¶
Returns True if data is either a binary Dataset or a binary LabelledCollection
- Parameters
data – a Dataset or a LabelledCollection object
- Returns
True if the data is labelled according to two classes
quapy.data.datasets module¶
- quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) quapy.data.base.Dataset ¶
Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further information on how to use these collections), and so a train-test split is generated at the desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest constitutes the training set.
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns
a quapy.data.base.Dataset instance
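A usage sketch (‘yeast’ is one of the valid names in quapy.data.datasets.UCI_DATASETS):
>>> import quapy as qp
>>> data = qp.datasets.fetch_UCIDataset('yeast', test_split=0.3)
>>> data.training.stats()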
- quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) quapy.data.base.LabelledCollection ¶
Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split, and so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5-fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:
>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCILabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
>>>     ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns
a quapy.data.base.LabelledCollection instance
- quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) quapy.data.base.Dataset ¶
Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS
- Parameters
dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a quapy.data.base.Dataset instance
- quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) quapy.data.base.Dataset ¶
Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). Note that the datasets ‘semeval13’, ‘semeval14’, and ‘semeval15’ share the same training set. The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST
- Parameters
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a quapy.data.base.Dataset instance
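A usage sketch (‘semeval16’ is one of the valid dataset names listed above):
>>> import quapy as qp
>>> data = qp.datasets.fetch_twitter('semeval16', min_df=5, pickle=True)
>>> data.stats()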
- quapy.data.datasets.warn(*args, **kwargs)¶
quapy.data.preprocessing module¶
- class quapy.data.preprocessing.IndexTransformer(**kwargs)¶
Bases: object
This class implements a sklearn-style transformer that indexes text as numerical ids for the tokens it contains, where the tokens are those that would be generated by sklearn’s CountVectorizer
- Parameters
kwargs – keyword arguments passed to CountVectorizer
- add_word(word, id=None, nogaps=True)¶
Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.
- Parameters
word – string, surface form of the token
id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)
nogaps – if set to True (default), asserts that the indicated id leaves no numerical gaps with respect to the ids stored so far
- Returns
integer, the numerical id for the new token
- fit(X)¶
Fits the transformer, i.e., decides on the vocabulary, given a list of strings.
- Parameters
X – a list of strings
- Returns
self
- fit_transform(X, n_jobs=- 1)¶
Fits the transformer on X and transforms it.
- Parameters
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns
a np.ndarray of numerical ids
- transform(X, n_jobs=- 1)¶
Transforms the strings in X into lists of numerical ids
- Parameters
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns
a np.ndarray of numerical ids
- vocabulary_size()¶
Gets the length of the vocabulary according to which the document tokens have been indexed
- Returns
integer
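A minimal sketch of the typical fit/transform workflow (the toy texts are made up; min_df is forwarded to CountVectorizer):
>>> from quapy.data.preprocessing import IndexTransformer
>>> texts = ['a first document', 'a second document', 'and a third one']
>>> indexer = IndexTransformer(min_df=1)
>>> X = indexer.fit_transform(texts)  # numerical ids, one sequence of ids per document
>>> indexer.vocabulary_size()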
- quapy.data.preprocessing.index(dataset: quapy.data.base.Dataset, min_df=5, inplace=False, **kwargs)¶
Indexes the tokens of a textual quapy.data.base.Dataset of string documents. To index a document means to replace each different token by a unique numerical index. Rare words (i.e., words occurring less than min_df times) are replaced by a special token UNK
- Parameters
dataset – a quapy.data.base.Dataset object where the instances of training and test documents are lists of str
min_df – minimum number of occurrences below which the term is replaced by a UNK index
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- Returns
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (if inplace=True) consisting of lists of integer values representing indices
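A usage sketch on raw text (fetch_reviews with tfidf=False keeps the documents as str):
>>> import quapy as qp
>>> from quapy.data.preprocessing import index
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=False)
>>> data = index(data, min_df=5, inplace=False)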
- quapy.data.preprocessing.reduce_columns(dataset: quapy.data.base.Dataset, min_df=5, inplace=False)¶
Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set
- Parameters
dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – integer, minimum number of instances below which the columns are removed
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
- Returns
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (if inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
- quapy.data.preprocessing.standardize(dataset: quapy.data.base.Dataset, inplace=False)¶
Standardizes the real-valued columns of a quapy.data.base.Dataset. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.
- Parameters
dataset – a quapy.data.base.Dataset object
inplace – set to True if the transformation is to be applied inplace, or to False (default) if a new quapy.data.base.Dataset is to be returned
- Returns
the standardized quapy.data.base.Dataset (a new one if inplace=False, or a reference to the current one if inplace=True)
- quapy.data.preprocessing.text2tfidf(dataset: quapy.data.base.Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)¶
Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of tfidf weighted sparse vectors
- Parameters
dataset – a quapy.data.base.Dataset where the instances of training and test collections are lists of str
min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)
sublinear_tf – whether or not to apply log scaling to the tf counters (default True)
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s TfidfVectorizer)
- Returns
a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
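A usage sketch (assuming a textual dataset, e.g., the one returned by fetch_reviews with tfidf=False):
>>> import quapy as qp
>>> from quapy.data.preprocessing import text2tfidf
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=False)
>>> data = text2tfidf(data, min_df=3)
>>> data.training.instances  # a csr_matrix of tfidf scores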
quapy.data.reader module¶
- quapy.data.reader.binarize(y, pos_class)¶
Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.,:
>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
>>> array([0, 1, 0, 0, 0, 0])
- Parameters
y – array-like of labels
pos_class – integer, the positive class
- Returns
a binary np.ndarray, in which values of 1 correspond to positions in which y had pos_class labels, and 0 otherwise
- quapy.data.reader.from_csv(path, encoding='utf-8')¶
Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>
- Parameters
path – path to the csv file
encoding – the text encoding used to open the file
- Returns
a np.ndarray for the labels and a ndarray (float) for the covariates
- quapy.data.reader.from_sparse(path)¶
Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1> followed by whitespace-separated col(int):val(float) pairs
- Parameters
path – path to the labelled collection
- Returns
a csr_matrix containing the instances (rows), and a ndarray containing the labels
- quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)¶
Reads a labelled collection of documents. File format: <0 or 1> <document>
- Parameters
path – path to the labelled collection
encoding – the text encoding used to open the file
verbose – if >0 (default) shows some progress information in standard output
- Returns
a list of sentences, and a list of labels
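A usage sketch showing how a reading function is typically combined with LabelledCollection.load() (the path is hypothetical):
>>> from quapy.data.base import LabelledCollection
>>> from quapy.data.reader import from_text
>>> collection = LabelledCollection.load('./reviews.txt', loader_func=from_text)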
- quapy.data.reader.reindex_labels(y)¶
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:
>>> reindex_labels(['B', 'B', 'A', 'C'])
>>> (array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))
- Parameters
y – the list or array of original labels
- Returns
a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.