quapy.data package¶
Submodules¶
quapy.data.base module¶
- class quapy.data.base.Dataset(training: quapy.data.base.LabelledCollection, test: quapy.data.base.LabelledCollection, vocabulary: Optional[dict] = None, name='')¶
Bases: object
- classmethod SplitStratified(collection: quapy.data.base.LabelledCollection, train_size=0.6)¶
- property binary¶
- property classes_¶
- classmethod kFCV(data: quapy.data.base.LabelledCollection, nfolds=5, nrepeats=1, random_state=0)¶
- classmethod load(train_path, test_path, loader_func: callable)¶
- property n_classes¶
- stats()¶
- property vocabulary_size¶
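A brief usage sketch for the Dataset class above (not part of the generated reference): it assumes two hypothetical files in the plain-text format read by quapy.data.reader.from_text, passed as the loader_func of the load classmethod.
>>> from quapy.data.base import Dataset
>>> from quapy.data.reader import from_text
>>> # 'train.txt' and 'test.txt' are hypothetical paths; each line holds "<0 or 1> <document>"
>>> data = Dataset.load('train.txt', 'test.txt', loader_func=from_text)
>>> train, test = data.training, data.test   # the two LabelledCollection splits
>>> data.stats()                             # prints basic statistics of both splits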
- class quapy.data.base.LabelledCollection(instances, labels, classes_=None)¶
Bases: object
A LabelledCollection is a set of objects, each associated with a label; a usage sketch follows the member list below.
- property Xy¶
- artificial_sampling_generator(sample_size, n_prevalences=101, repeats=1)¶
- artificial_sampling_index_generator(sample_size, n_prevalences=101, repeats=1)¶
- property binary¶
- counts()¶
- kFCV(nfolds=5, nrepeats=1, random_state=0)¶
- classmethod load(path: str, loader_func: callable, classes=None)¶
- property n_classes¶
- natural_sampling_generator(sample_size, repeats=100)¶
- natural_sampling_index_generator(sample_size, repeats=100)¶
- prevalence()¶
- sampling(size, *prevs, shuffle=True)¶
- sampling_from_index(index)¶
- sampling_index(size, *prevs, shuffle=True)¶
- split_stratified(train_prop=0.6, random_state=None)¶
- stats(show=True)¶
- uniform_sampling(size)¶
- uniform_sampling_index(size)¶
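As referenced above, a minimal sketch of building and sampling a LabelledCollection; the toy instances, labels and prevalence value are illustrative only.
>>> from quapy.data.base import LabelledCollection
>>> X = ['good', 'great', 'nice', 'bad', 'awful', 'poor']   # toy instances
>>> y = [1, 1, 1, 0, 0, 0]                                  # toy labels
>>> lc = LabelledCollection(X, y)
>>> prevs = lc.prevalence()                     # class prevalences of the collection
>>> sample = lc.sampling(4, 0.5, shuffle=True)  # sample of size 4 at the requested prevalence
>>> train, val = lc.split_stratified(train_prop=0.6)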
- quapy.data.base.isbinary(data)¶
quapy.data.datasets module¶
- quapy.data.datasets.df_replace(df, col, repl={'no': 0, 'yes': 1}, astype=<class 'float'>)¶
- quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) → quapy.data.base.Dataset¶
- quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) → quapy.data.base.Dataset¶
- quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset¶
Load a Reviews dataset as a Dataset instance, as used in: Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018.
- Parameters
dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a Dataset instance
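A hedged example of the intended call (the dataset name and min_df value are arbitrary choices):
>>> from quapy.data.datasets import fetch_reviews
>>> data = fetch_reviews('kindle', tfidf=True, min_df=5, pickle=True)
>>> training, test = data.training, data.test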
- quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset¶
Load a Twitter dataset as a Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). The datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set.
- Parameters
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, returns the train split as the training set and the devel split as the test set; if False, returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a Dataset instance
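A hedged example of the two typical calls (the dataset name and min_df value are arbitrary choices):
>>> from quapy.data.datasets import fetch_twitter
>>> # model selection: train on the train split, evaluate on the devel split
>>> devel_data = fetch_twitter('semeval16', for_model_selection=True, min_df=5)
>>> # final evaluation: train on train+devel, evaluate on the test split
>>> eval_data = fetch_twitter('semeval16', for_model_selection=False, min_df=5)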
- quapy.data.datasets.warn(*args, **kwargs)¶
quapy.data.preprocessing module¶
- class quapy.data.preprocessing.IndexTransformer(**kwargs)¶
Bases: object
- add_word(word, id=None, nogaps=True)¶
- fit(X)¶
- Parameters
X – a list of strings
- Returns
self
- fit_transform(X, n_jobs=-1)¶
- index(documents)¶
- transform(X, n_jobs=-1)¶
- vocabulary_size()¶
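A sketch of the fit/transform cycle for IndexTransformer above, assuming X is a list of strings as stated in fit; the toy corpus is illustrative only.
>>> from quapy.data.preprocessing import IndexTransformer
>>> corpus = ['the cat sat', 'the dog barked', 'a cat and a dog']
>>> it = IndexTransformer()
>>> indexed = it.fit_transform(corpus, n_jobs=1)   # documents as lists of integer token indices
>>> n_terms = it.vocabulary_size()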
- quapy.data.preprocessing.index(dataset: quapy.data.base.Dataset, min_df=5, inplace=False, **kwargs)¶
Indexes a dataset of strings. To index a document means to replace each different token with a unique numerical index. Rare words (i.e., words occurring fewer than min_df times) are replaced by a special UNK token.
- Parameters
dataset – a Dataset where the instances are lists of str
min_df – minimum number of instances below which the term is replaced by a UNK index
inplace – whether to apply the transformation in place or to a new copy
kwargs – the rest of the parameters of the transformation (as for sklearn.feature_extraction.text.CountVectorizer)
- Returns
a new Dataset (if inplace=False) or a reference to the current Dataset (if inplace=True) consisting of lists of integer values representing indices
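For instance (a sketch; the chosen dataset and min_df value are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import index
>>> data = fetch_reviews('hp')                       # raw documents (tfidf=False)
>>> indexed = index(data, min_df=5, inplace=False)   # instances become lists of int indices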
- quapy.data.preprocessing.reduce_columns(dataset: quapy.data.base.Dataset, min_df=5, inplace=False)¶
Reduces the dimensionality of the csr_matrix by removing the columns of words that are not present in at least min_df instances.
- Parameters
dataset – a Dataset in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – minimum number of instances below which the columns are removed
inplace – whether to apply the transformation in place or to a new copy
- Returns
a new Dataset (if inplace=False) or a reference to the current Dataset (if inplace=True) where the dimensions corresponding to infrequent terms have been removed
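For instance, on a dataset already in sparse format (a sketch; the dataset and min_df value are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import reduce_columns
>>> data = fetch_reviews('imdb', tfidf=True)                  # instances as csr_matrix rows
>>> reduced = reduce_columns(data, min_df=10, inplace=False)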
- quapy.data.preprocessing.standardize(dataset: quapy.data.base.Dataset, inplace=True)¶
- quapy.data.preprocessing.text2tfidf(dataset: quapy.data.base.Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)¶
Transforms a Dataset of textual instances into a Dataset of tfidf weighted sparse vectors.
- Parameters
dataset – a Dataset where the instances are lists of str
min_df – minimum number of occurrences for a word to be considered part of the vocabulary
sublinear_tf – whether to apply log scaling to the tf counters
inplace – whether to apply the transformation in place or to a new copy
kwargs – the rest of the parameters of the transformation (as for sklearn.feature_extraction.text.TfidfVectorizer)
- Returns
a new Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
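For instance (a sketch; the chosen dataset and parameter values are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import text2tfidf
>>> data = fetch_reviews('kindle')                    # raw documents
>>> tfidf_data = text2tfidf(data, min_df=3, sublinear_tf=True, inplace=False)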
quapy.data.reader module¶
- quapy.data.reader.binarize(y, pos_class)¶
- quapy.data.reader.from_csv(path, encoding='utf-8')¶
Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>
- Parameters
path – path to the csv file
- Returns
a ndarray for the labels and a ndarray (float) for the covariates
- quapy.data.reader.from_sparse(path)¶
Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1>[\s col(int):val(float)]
- Parameters
path – path to the labelled collection
- Returns
a csr_matrix containing the instances (rows), and a ndarray containing the labels
- quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)¶
Reads a labelled collection of documents. File format: <0 or 1> <document>
- Parameters
path – path to the labelled collection
- Returns
a list of sentences, and a list of labels
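These readers are typically passed as the loader_func of LabelledCollection.load or Dataset.load rather than called directly; a sketch with a hypothetical path:
>>> from quapy.data.base import LabelledCollection
>>> from quapy.data.reader import from_text
>>> # 'reviews.txt' is a hypothetical file with one "<0 or 1> <document>" line per instance
>>> lc = LabelledCollection.load('reviews.txt', loader_func=from_text)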
- quapy.data.reader.reindex_labels(y)¶
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g., y=[‘B’, ‘B’, ‘A’, ‘C’] -> [1,1,0,2], [‘A’,’B’,’C’] :param y: the list or array of original labels :return: a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.