quapy.data package¶
Submodules¶
quapy.data.base¶
- class quapy.data.base.Dataset(training: LabelledCollection, test: LabelledCollection, vocabulary: Optional[dict] = None, name='')¶
Bases:
object
Abstraction of training and test
LabelledCollection
objects.
- Parameters:
training – a
LabelledCollection
instance
test – a
LabelledCollection
instance
vocabulary – if indicated, is a dictionary of the terms used in this textual dataset
name – a string representing the name of the dataset
- classmethod SplitStratified(collection: LabelledCollection, train_size=0.6)¶
Generates a
Dataset
from a stratified split of a LabelledCollection
instance. See LabelledCollection.split_stratified()
- Parameters:
collection –
LabelledCollection
train_size – the proportion of training documents (the rest form the test split)
- Returns:
an instance of
Dataset
- property binary¶
Returns True if the training collection is labelled according to two classes
- Returns:
boolean
- property classes_¶
The classes according to which the training collection is labelled
- Returns:
The classes according to which the training collection is labelled
- classmethod kFCV(data: LabelledCollection, nfolds=5, nrepeats=1, random_state=0)¶
Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around
LabelledCollection.kFCV()
that returns Dataset
instances made of training and test folds.
- Parameters:
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
- Returns:
yields nfolds * nrepeats folds for k-fold cross validation as instances of
Dataset
- classmethod load(train_path, test_path, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a training and a test labelled set of data and converts them into a
Dataset
instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader
module.
- Parameters:
train_path – string, the path to the file containing the training instances
test_path – string, the path to the file containing the test instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances. See
LabelledCollection.load()
for further details.
- Returns:
a
Dataset
object
- property n_classes¶
The number of classes according to which the training collection is labelled
- Returns:
integer
- reduce(n_train=100, n_test=100)¶
Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.
- Parameters:
n_train – number of training documents to keep (default 100)
n_test – number of test documents to keep (default 100)
- Returns:
self
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this dataset, e.g.:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]
- Parameters:
show – if set to True (default), prints the stats in standard output
- Returns:
a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- property train_test¶
Alias to self.training and self.test
- Returns:
the training and test collections
- property vocabulary_size¶
If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary
- Returns:
integer
- class quapy.data.base.LabelledCollection(instances, labels, classes=None)¶
Bases:
object
A LabelledCollection is a set of objects, each with a label attached. This class implements several sampling routines and other utilities.
- Parameters:
instances – array-like (np.ndarray, list, or csr_matrix are supported)
labels – array-like with the same length as instances
classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)
- property X¶
An alias to self.instances
- Returns:
self.instances
- property Xp¶
Gets the instances and the true prevalence. This is useful when implementing evaluation protocols from a
LabelledCollection
object.
- Returns:
a tuple (instances, prevalence) from this collection
- property Xy¶
Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:
>>> svm = LinearSVC().fit(*my_collection.Xy)
- Returns:
a tuple (instances, labels) from this collection
- property binary¶
Returns True if the number of classes is 2
- Returns:
boolean
- counts()¶
Returns the number of instances for each of the classes in the codeframe.
- Returns:
a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_
- classmethod join(*args: Iterable[LabelledCollection])¶
Returns a new
LabelledCollection
as the union of the collections given in input.
- Parameters:
args – instances of
LabelledCollection
- Returns:
a
LabelledCollection
representing the union of all the given collections
- kFCV(nfolds=5, nrepeats=1, random_state=None)¶
Generator of stratified folds to be used in k-fold cross validation.
- Parameters:
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default None); if set, guarantees that the folds generated are reproducible
- Returns:
yields nfolds * nrepeats folds for k-fold cross validation
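The fold generation described above can be pictured with a small pure-Python sketch. This is not quapy's implementation: the function name `stratified_kfold`, the round-robin dealing of shuffled indexes, and the use of plain label lists are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, nfolds=5, nrepeats=1, seed=0):
    """Yield (train_idx, test_idx) pairs, nfolds * nrepeats in total (illustrative only)."""
    rng = random.Random(seed)
    # group the instance positions by class label
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(nrepeats):
        folds = [[] for _ in range(nfolds)]
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)
            # deal the shuffled indexes of each class round-robin across the folds
            for pos, i in enumerate(idx):
                folds[pos % nfolds].append(i)
        for k in range(nfolds):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train_idx, test_idx


labels = [0] * 10 + [1] * 40
splits = list(stratified_kfold(labels, nfolds=5, nrepeats=2))
```

Each test fold in this toy run keeps the 20%/80% class ratio of the full label list, which is the property that stratification is meant to preserve.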
- classmethod load(path: str, loader_func: callable, classes=None, **loader_kwargs)¶
Loads a labelled set of data and converts it into a
LabelledCollection
instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader
module.
- Parameters:
path – string, the path to the file containing the labelled instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., these arguments are used to call loader_func(path, **loader_kwargs)
- Returns:
a
LabelledCollection
object
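As a rough illustration of the loader contract described above (a function called as loader_func(path, **loader_kwargs) that returns a tuple with instances and labels), a custom loader might look like the sketch below. The function name `load_tsv`, the tab-separated file layout, and the `sep` keyword are hypothetical and chosen only for this example.

```python
import os
import tempfile


def load_tsv(path, sep="\t", encoding="utf-8"):
    """Hypothetical loader: reads '<label><sep><text>' lines, returns (instances, labels)."""
    instances, labels = [], []
    with open(path, "rt", encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            label, text = line.split(sep, 1)
            instances.append(text)
            labels.append(int(label))
    return instances, labels


# write a toy file and read it back with the loader
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.tsv")
    with open(path, "wt", encoding="utf-8") as f:
        f.write("1\tgreat book\n0\tterrible plot\n")
    instances, labels = load_tsv(path, sep="\t")
```

Under the signature documented above, such a function would presumably be passed as loader_func, with sep forwarded through loader_kwargs.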
- property n_classes¶
The number of classes
- Returns:
integer
- property p¶
An alias to self.prevalence()
- Returns:
self.prevalence()
- prevalence()¶
Returns the prevalence, or relative frequency, of the classes in the codeframe.
- Returns:
a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
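The relation between counts() and prevalence() can be sketched in plain Python. This is an illustration rather than the library code; the helper names and the toy labels are assumptions. It also shows why the classes need to be given explicitly when some class has no examples.

```python
from collections import Counter


def counts(labels, classes):
    """Number of instances per class, in the order given by `classes`."""
    c = Counter(labels)
    return [c[k] for k in classes]  # a Counter returns 0 for unseen classes


def prevalence(labels, classes):
    """Relative frequency per class, in the order given by `classes`."""
    n = len(labels)
    return [cnt / n for cnt in counts(labels, classes)]


labels = ["neg", "pos", "pos", "pos"]
classes = ["neg", "neu", "pos"]  # "neu" has no examples, so its prevalence is 0
class_counts = counts(labels, classes)
class_prevs = prevalence(labels, classes)
```

Had the classes been inferred from the labels alone, "neu" would not appear in either result.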
- sampling(size, *prevs, shuffle=True, random_state=None)¶
Return a random sample (an instance of
LabelledCollection
) of desired size and desired prevalence values. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.
- Parameters:
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
- Returns:
an instance of
LabelledCollection
with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)
- sampling_from_index(index)¶
Returns an instance of
LabelledCollection
whose elements are sampled from this collection using the index.
- Parameters:
index – np.ndarray
- Returns:
an instance of
LabelledCollection
- sampling_index(size, *prevs, shuffle=True, random_state=None)¶
Returns an index to be used to extract a random sample of desired size and desired prevalence values. If the prevalence values are not specified, then returns the index of a uniform sampling. For each class, the sampling is drawn with replacement if the requested prevalence is larger than the actual prevalence of the class, or without replacement otherwise.
- Parameters:
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
- Returns:
a np.ndarray of shape (size) with the indexes
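A minimal sketch of prevalence-controlled index sampling follows. It is not quapy's code. The function name and the per-class rounding rule are assumptions, and as a simplification it switches to sampling with replacement whenever the requested number of instances of a class exceeds the number available.

```python
import random


def sampling_index_sketch(labels, classes, size, *prevs, seed=0):
    """Return `size` positions whose labels follow the requested prevalences."""
    rng = random.Random(seed)
    prevs = list(prevs)
    if len(prevs) == len(classes) - 1:
        prevs.append(1.0 - sum(prevs))  # the last class is constrained
    # requested instances per class; the last class absorbs the rounding error
    n_req = [round(size * p) for p in prevs[:-1]]
    n_req.append(size - sum(n_req))
    index = []
    for cls, n in zip(classes, n_req):
        pool = [i for i, y in enumerate(labels) if y == cls]
        if n > len(pool):
            index.extend(rng.choices(pool, k=n))  # with replacement
        else:
            index.extend(rng.sample(pool, n))  # without replacement
    rng.shuffle(index)
    return index


labels = [0] * 5 + [1] * 95
idx = sampling_index_sketch(labels, [0, 1], 20, 0.5)
```

Here class 0 has only 5 instances but 10 are requested, so its positions repeat, while the 10 positions of class 1 are all distinct.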
- split_random(train_prop=0.6, random_state=None)¶
Returns two instances of
LabelledCollection
split randomly from this collection, at desired proportion.
- Parameters:
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
- Returns:
two instances of
LabelledCollection
, the first one containing a proportion train_prop of the elements, and the second one containing the remaining 1-train_prop
- split_stratified(train_prop=0.6, random_state=None)¶
Returns two instances of
LabelledCollection
split with stratification from this collection, at desired proportion.
- Parameters:
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
- Returns:
two instances of
LabelledCollection
, the first one containing a proportion train_prop of the elements, and the second one containing the remaining 1-train_prop
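To make the difference with a purely random split concrete, the sketch below performs a stratified split over a plain list of labels. It is an illustration only: the function name, the per-class rounding, and the fixed seed are assumptions and do not reflect how quapy implements the split.

```python
import random
from collections import defaultdict


def split_stratified_sketch(labels, train_prop=0.6, seed=0):
    """Return (train_idx, test_idx), splitting each class separately at train_prop."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        cut = round(len(idx) * train_prop)  # per-class training size
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 10 + [1] * 40
train_idx, test_idx = split_stratified_sketch(labels, train_prop=0.6)
```

Because the cut is applied class by class, both parts keep the 20%/80% class ratio of the original labels.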
- stats(show=True)¶
Returns (and optionally prints) a dictionary with some stats of this collection, e.g.:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
#instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]
- Parameters:
show – if set to True (default), prints the stats in standard output
- Returns:
a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
- uniform_sampling(size, random_state=None)¶
Returns a uniform sample (an instance of
LabelledCollection
) of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.
- Parameters:
size – integer, the requested size
random_state – if specified, guarantees reproducibility of the sample.
- Returns:
an instance of
LabelledCollection
with length == size
- uniform_sampling_index(size, random_state=None)¶
Returns an index to be used to extract a uniform sample of desired size. The sampling is drawn with replacement if the requested size is greater than the number of instances, or without replacement otherwise.
- Parameters:
size – integer, the size of the uniform sample
random_state – if specified, guarantees reproducibility of the sample.
- Returns:
a np.ndarray of shape (size) with the indexes
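The replacement rule stated above can be sketched with the standard library alone. The function name and the use of Python lists instead of np.ndarray are assumptions made for illustration.

```python
import random


def uniform_sampling_index_sketch(n_instances, size, seed=None):
    """Return `size` positions in range(n_instances), chosen uniformly at random."""
    rng = random.Random(seed)
    population = range(n_instances)
    if size > n_instances:
        # more positions requested than available: sample with replacement
        return rng.choices(population, k=size)
    # otherwise sample without replacement, so no position repeats
    return rng.sample(population, size)


small = uniform_sampling_index_sketch(10, 4, seed=0)
large = uniform_sampling_index_sketch(10, 25, seed=0)
```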
- property y¶
An alias to self.labels
- Returns:
self.labels
quapy.data.datasets¶
- quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) Dataset ¶
Loads a UCI dataset as an instance of
quapy.data.base.Dataset
, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection()
for further information on how to use these collections), and so a train-test split is generated at the desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters:
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest form the training set
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns:
a
quapy.data.base.Dataset
instance
- quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) Dataset ¶
Loads a UCI collection as an instance of
quapy.data.base.LabelledCollection
, as used in Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100, and Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. The datasets do not come with a predefined train-test split, and so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5-fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV()
, e.g.:
>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCILabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
>>>     ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
- Parameters:
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
- Returns:
a
quapy.data.base.LabelledCollection
instance
- quapy.data.datasets.fetch_lequa2022(task, data_home=None)¶
Loads the official datasets provided for the LeQua competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products. We refer to Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022). A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify. for a detailed description of the tasks and datasets.
The datasets are downloaded only once, and stored for fast reuse.
See lequa2022_experiments.py, provided in the example folder, which can serve as a guide on how to use these datasets.
- Parameters:
task – a string representing the task name; valid ones are T1A, T1B, T2A, and T2B
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
- Returns:
a tuple (train, val_gen, test_gen) where train is an instance of
quapy.data.base.LabelledCollection
, val_gen and test_gen are instances of quapy.protocol.SamplesFromDir
, i.e., are sampling protocols that return a series of samples labelled by prevalence.
- quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) Dataset ¶
Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS
- Parameters:
dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns:
a
quapy.data.base.Dataset
instance
- quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) Dataset ¶
Loads a Twitter dataset as a
quapy.data.base.Dataset
instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). Note that the datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set. The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST
- Parameters:
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns:
a
quapy.data.base.Dataset
instance
- quapy.data.datasets.warn(*args, **kwargs)¶
quapy.data.preprocessing¶
- class quapy.data.preprocessing.IndexTransformer(**kwargs)¶
Bases:
object
This class implements a sklearn-style transformer that indexes text as the numerical ids of the tokens it contains, using the vocabulary that would be generated by sklearn’s CountVectorizer
- Parameters:
kwargs –
keyword arguments from CountVectorizer
- add_word(word, id=None, nogaps=True)¶
Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.
- Parameters:
word – string, surface form of the token
id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)
nogaps – if set to True (default) asserts that the id indicated leads to no numerical gaps with precedent ids stored so far
- Returns:
integer, the numerical id for the new token
- fit(X)¶
Fits the transformer, i.e., decides on the vocabulary, given a list of strings.
- Parameters:
X – a list of strings
- Returns:
self
- fit_transform(X, n_jobs=None)¶
Fits the transformer on X and transforms it.
- Parameters:
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns:
a np.ndarray of numerical ids
- transform(X, n_jobs=None)¶
Transforms the strings in X into lists of numerical ids
- Parameters:
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
- Returns:
a np.ndarray of numerical ids
- vocabulary_size()¶
Gets the length of the vocabulary according to which the document tokens have been indexed
- Returns:
integer
- quapy.data.preprocessing.index(dataset: Dataset, min_df=5, inplace=False, **kwargs)¶
Indexes the tokens of a textual
quapy.data.base.Dataset
of string documents. To index a document means to replace each different token by a unique numerical index. Rare words (i.e., words occurring fewer than min_df times) are replaced by a special token UNK
- Parameters:
dataset – a
quapy.data.base.Dataset
object where the instances of training and test documents are lists of str
min_df – minimum number of occurrences below which the term is replaced by a UNK index
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- Returns:
a new
quapy.data.base.Dataset
(if inplace=False) or a reference to the current quapy.data.base.Dataset
(inplace=True) consisting of lists of integer values representing indices.
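The idea of indexing with an UNK token can be sketched as follows. This is not the library's implementation: the helper names, whitespace tokenization, counting min_df as a document frequency on the training documents, and reserving id 0 for UNK are all assumptions made for illustration.

```python
from collections import Counter


def build_index(train_docs, min_df=2, unk="UNK"):
    """Map each sufficiently frequent training token to an integer id (id 0 is UNK here)."""
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.split()))  # count each token once per document
    vocab = {unk: 0}
    for tok in sorted(t for t, c in df.items() if c >= min_df):
        vocab[tok] = len(vocab)
    return vocab


def index_docs(docs, vocab, unk="UNK"):
    """Replace each token by its id; tokens outside the vocabulary get the UNK id."""
    return [[vocab.get(tok, vocab[unk]) for tok in doc.split()] for doc in docs]


train_docs = ["good book", "good plot", "bad plot"]
vocab = build_index(train_docs, min_df=2)
indexed = index_docs(["good bad plot"], vocab)
```

In this toy run "book" and "bad" appear in a single training document each, so both fall back to the UNK id.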
- quapy.data.preprocessing.reduce_columns(dataset: Dataset, min_df=5, inplace=False)¶
Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set
- Parameters:
dataset – a
quapy.data.base.Dataset
in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – integer, minimum number of instances below which the columns are removed
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
- Returns:
a new
quapy.data.base.Dataset
(if inplace=False) or a reference to the current quapy.data.base.Dataset
(inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
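The column filtering described above can be illustrated on small dense matrices. The real function operates on sparse matrices; the dense list-of-lists representation and the function name below are assumptions made to keep the sketch dependency-free.

```python
def reduce_columns_sketch(train_rows, test_rows, min_df=2):
    """Keep only the columns that are non-zero in at least min_df training rows."""
    n_cols = len(train_rows[0])
    # document frequency of each column, measured on the training rows only
    df = [sum(1 for row in train_rows if row[j] != 0) for j in range(n_cols)]
    keep = [j for j in range(n_cols) if df[j] >= min_df]

    def select(rows):
        return [[row[j] for j in keep] for row in rows]

    return select(train_rows), select(test_rows), keep


train = [[1, 0, 2], [3, 0, 0], [0, 4, 5]]
test = [[7, 8, 9]]
train_red, test_red, kept = reduce_columns_sketch(train, test, min_df=2)
```

The same set of columns is removed from both parts, so training and test instances stay aligned in the reduced space.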
- quapy.data.preprocessing.standardize(dataset: Dataset, inplace=False)¶
Standardizes the real-valued columns of a
quapy.data.base.Dataset
. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.
- Parameters:
dataset – a
quapy.data.base.Dataset
object
inplace – set to True if the transformation is to be applied inplace, or to False (default) if a new
quapy.data.base.Dataset
is to be returned
- Returns:
an instance of
quapy.data.base.Dataset
- quapy.data.preprocessing.text2tfidf(dataset: Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)¶
Transforms a
quapy.data.base.Dataset
of textual instances into a quapy.data.base.Dataset
of tfidf weighted sparse vectors
- Parameters:
dataset – a
quapy.data.base.Dataset
where the instances of training and test collections are lists of str
min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)
sublinear_tf – whether or not to apply the log scaling to the tf counters (default True)
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s TfidfVectorizer)
- Returns:
a new
quapy.data.base.Dataset
in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
quapy.data.reader¶
- quapy.data.reader.binarize(y, pos_class)¶
Binarizes a categorical array-like collection of labels towards the positive class pos_class, e.g.:
>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
array([0, 1, 0, 0, 0, 0])
- Parameters:
y – array-like of labels
pos_class – integer, the positive class
- Returns:
a binary np.ndarray, in which value 1 corresponds to positions in which y had pos_class labels, and 0 otherwise
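The behaviour documented above amounts to a one-line comparison. The sketch below uses plain lists and a hypothetical name to avoid any dependency; it is not the library function, which returns a np.ndarray.

```python
def binarize_sketch(y, pos_class):
    """Return 1 where the label equals pos_class and 0 elsewhere."""
    return [1 if label == pos_class else 0 for label in y]


binary = binarize_sketch([1, 2, 3, 1, 1, 0], pos_class=2)
```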
- quapy.data.reader.from_csv(path, encoding='utf-8')¶
Reads a csv file in which columns are separated by ‘,’. File format <label>,<feat1>,<feat2>,…,<featn>
- Parameters:
path – path to the csv file
encoding – the text encoding used to open the file
- Returns:
a np.ndarray for the labels and a ndarray (float) for the covariates
- quapy.data.reader.from_sparse(path)¶
Reads a labelled collection of real-valued instances expressed in sparse format. File format <-1 or 0 or 1>[s col(int):val(float)]
- Parameters:
path – path to the labelled collection
- Returns:
a csr_matrix containing the instances (rows), and a ndarray containing the labels
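Assuming the file format above means a label followed by whitespace-separated col:val pairs (the format string is terse, so this reading is an assumption), one line could be parsed as in the sketch below. The function name and the dictionary output are hypothetical; the real reader builds a csr_matrix.

```python
def parse_sparse_line(line):
    """Parse '<label> col:val col:val ...' into (label, {col: val})."""
    label, *pairs = line.split()
    feats = {}
    for pair in pairs:
        col, val = pair.split(":")
        feats[int(col)] = float(val)  # column index -> feature value
    return int(label), feats


label, feats = parse_sparse_line("-1 3:0.5 7:1.25")
```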
- quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)¶
Reads a labelled collection of documents. File format <0 or 1> <document>
- Parameters:
path – path to the labelled collection
encoding – the text encoding used to open the file
verbose – if >0 (default) shows some progress information in standard output
- Returns:
a list of sentences, and a list of labels
- quapy.data.reader.reindex_labels(y)¶
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:
>>> reindex_labels(['B', 'B', 'A', 'C'])
(array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))
- Parameters:
y – the list or array of original labels
- Returns:
a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.
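The re-indexing shown in the example above can be reproduced with a short pure-Python sketch. The function name is hypothetical, and plain lists are used instead of ndarrays; sorting the distinct labels is assumed here because it matches the documented output.

```python
def reindex_labels_sketch(y):
    """Return (indexes, classnames) so that classnames[indexes[i]] == y[i]."""
    classnames = sorted(set(y))
    position = {name: i for i, name in enumerate(classnames)}
    return [position[label] for label in y], classnames


indexes, classnames = reindex_labels_sketch(["B", "B", "A", "C"])
```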