quapy.data package¶
Submodules¶
quapy.data.base module¶
- class quapy.data.base.Dataset(training: quapy.data.base.LabelledCollection, test: quapy.data.base.LabelledCollection, vocabulary: Optional[dict] = None, name='')¶
Bases: object
- classmethod SplitStratified(collection: quapy.data.base.LabelledCollection, train_size=0.6)¶
- property binary¶
- property classes_¶
- classmethod kFCV(data: quapy.data.base.LabelledCollection, nfolds=5, nrepeats=1, random_state=0)¶
- classmethod load(train_path, test_path, loader_func: callable)¶
- property n_classes¶
- stats()¶
- property vocabulary_size¶
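A brief usage sketch for the Dataset class above (not part of the generated reference): it assumes two hypothetical files in the plain-text format read by quapy.data.reader.from_text, passed as the loader_func of the load classmethod.
>>> from quapy.data.base import Dataset
>>> from quapy.data.reader import from_text
>>> # 'train.txt' and 'test.txt' are hypothetical paths; each line holds "<0 or 1> <document>"
>>> data = Dataset.load('train.txt', 'test.txt', loader_func=from_text)
>>> train, test = data.training, data.test   # the two LabelledCollection splits
>>> data.stats()                             # prints basic statistics of both splits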
- class quapy.data.base.LabelledCollection(instances, labels, classes_=None)¶
Bases: object
A LabelledCollection is a set of objects, each associated with a label; a usage sketch follows the member list below.
- property Xy¶
- artificial_sampling_generator(sample_size, n_prevalences=101, repeats=1)¶
- artificial_sampling_index_generator(sample_size, n_prevalences=101, repeats=1)¶
- property binary¶
- counts()¶
- kFCV(nfolds=5, nrepeats=1, random_state=0)¶
- classmethod load(path: str, loader_func: callable, classes=None)¶
- property n_classes¶
- natural_sampling_generator(sample_size, repeats=100)¶
- natural_sampling_index_generator(sample_size, repeats=100)¶
- prevalence()¶
- sampling(size, *prevs, shuffle=True)¶
- sampling_from_index(index)¶
- sampling_index(size, *prevs, shuffle=True)¶
- split_stratified(train_prop=0.6, random_state=None)¶
- stats(show=True)¶
- uniform_sampling(size)¶
- uniform_sampling_index(size)¶
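As referenced above, a minimal sketch of building and sampling a LabelledCollection; the toy instances, labels and prevalence value are illustrative only.
>>> from quapy.data.base import LabelledCollection
>>> X = ['good', 'great', 'nice', 'bad', 'awful', 'poor']   # toy instances
>>> y = [1, 1, 1, 0, 0, 0]                                  # toy labels
>>> lc = LabelledCollection(X, y)
>>> prevs = lc.prevalence()                     # class prevalences of the collection
>>> sample = lc.sampling(4, 0.5, shuffle=True)  # sample of size 4 at the requested prevalence
>>> train, val = lc.split_stratified(train_prop=0.6)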
- quapy.data.base.isbinary(data)¶
quapy.data.datasets module¶
- quapy.data.datasets.df_replace(df, col, repl={'no': 0, 'yes': 1}, astype=<class 'float'>)¶
- quapy.data.datasets.fetch_UCIDataset(dataset_name, data_home=None, test_split=0.3, verbose=False) → quapy.data.base.Dataset¶
- quapy.data.datasets.fetch_UCILabelledCollection(dataset_name, data_home=None, verbose=False) → quapy.data.base.Dataset¶
- quapy.data.datasets.fetch_reviews(dataset_name, tfidf=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset¶
Load a Reviews dataset as a Dataset instance, as used in: Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018.
- Parameters
dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a Dataset instance
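A hedged example of the intended call (the dataset name and min_df value are arbitrary choices):
>>> from quapy.data.datasets import fetch_reviews
>>> data = fetch_reviews('kindle', tfidf=True, min_df=5, pickle=True)
>>> training, test = data.training, data.test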
- quapy.data.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=None, data_home=None, pickle=False) → quapy.data.base.Dataset¶
Load a Twitter dataset as a Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016). The datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set.
- Parameters
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, ‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, returns the train split as the training set and the devel split as the test set; if False, returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
- Returns
a Dataset instance
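A hedged example of the two typical calls (the dataset name and min_df value are arbitrary choices):
>>> from quapy.data.datasets import fetch_twitter
>>> # model selection: train on the train split, evaluate on the devel split
>>> devel_data = fetch_twitter('semeval16', for_model_selection=True, min_df=5)
>>> # final evaluation: train on train+devel, evaluate on the test split
>>> eval_data = fetch_twitter('semeval16', for_model_selection=False, min_df=5)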
- quapy.data.datasets.warn(*args, **kwargs)¶
quapy.data.preprocessing module¶
- class quapy.data.preprocessing.IndexTransformer(**kwargs)¶
Bases: object
- add_word(word, id=None, nogaps=True)¶
- fit(X)¶
- Parameters
X – a list of strings
- Returns
self
- fit_transform(X, n_jobs=-1)¶
- index(documents)¶
- transform(X, n_jobs=-1)¶
- vocabulary_size()¶
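A sketch of the fit/transform cycle for IndexTransformer above, assuming X is a list of strings as stated in fit; the toy corpus is illustrative only.
>>> from quapy.data.preprocessing import IndexTransformer
>>> corpus = ['the cat sat', 'the dog barked', 'a cat and a dog']
>>> it = IndexTransformer()
>>> indexed = it.fit_transform(corpus, n_jobs=1)   # documents as lists of integer token indices
>>> n_terms = it.vocabulary_size()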
- quapy.data.preprocessing.index(dataset: quapy.data.base.Dataset, min_df=5, inplace=False, **kwargs)¶
Indexes a dataset of strings. To index a document means to replace each different token with a unique numerical index. Rare words (i.e., words occurring fewer than min_df times) are replaced by a special UNK token.
- Parameters
dataset – a Dataset where the instances are lists of str
min_df – minimum number of instances below which the term is replaced by a UNK index
inplace – whether to apply the transformation in place or to a new copy
kwargs – the rest of the parameters of the transformation (as for sklearn.feature_extraction.text.CountVectorizer)
- Returns
a new Dataset (if inplace=False) or a reference to the current Dataset (if inplace=True) consisting of lists of integer values representing indices
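For instance (a sketch; the chosen dataset and min_df value are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import index
>>> data = fetch_reviews('hp')                       # raw documents (tfidf=False)
>>> indexed = index(data, min_df=5, inplace=False)   # instances become lists of int indices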
- quapy.data.preprocessing.reduce_columns(dataset: quapy.data.base.Dataset, min_df=5, inplace=False)¶
Reduces the dimensionality of the csr_matrix by removing the columns of words that are not present in at least min_df instances.
- Parameters
dataset – a Dataset in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – minimum number of instances below which the columns are removed
inplace – whether to apply the transformation in place or to a new copy
- Returns
a new Dataset (if inplace=False) or a reference to the current Dataset (if inplace=True) where the dimensions corresponding to infrequent terms have been removed
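For instance, on a dataset already in sparse format (a sketch; the dataset and min_df value are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import reduce_columns
>>> data = fetch_reviews('imdb', tfidf=True)                  # instances as csr_matrix rows
>>> reduced = reduce_columns(data, min_df=10, inplace=False)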
- quapy.data.preprocessing.standardize(dataset: quapy.data.base.Dataset, inplace=True)¶
- quapy.data.preprocessing.text2tfidf(dataset: quapy.data.base.Dataset, min_df=3, sublinear_tf=True, inplace=False, **kwargs)¶
Transforms a Dataset of textual instances into a Dataset of tfidf weighted sparse vectors.
- Parameters
dataset – a Dataset where the instances are lists of str
min_df – minimum number of occurrences for a word to be considered part of the vocabulary
sublinear_tf – whether to apply log scaling to the tf counters
inplace – whether to apply the transformation in place or to a new copy
kwargs – the rest of the parameters of the transformation (as for sklearn.feature_extraction.text.TfidfVectorizer)
- Returns
a new Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
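For instance (a sketch; the chosen dataset and parameter values are arbitrary):
>>> from quapy.data.datasets import fetch_reviews
>>> from quapy.data.preprocessing import text2tfidf
>>> data = fetch_reviews('kindle')                    # raw documents
>>> tfidf_data = text2tfidf(data, min_df=3, sublinear_tf=True, inplace=False)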
quapy.data.reader module¶
- quapy.data.reader.binarize(y, pos_class)¶
- quapy.data.reader.from_csv(path, encoding='utf-8')¶
Reads a csv file in which columns are separated by ‘,’. File format: <label>,<feat1>,<feat2>,…,<featn>
- Parameters
path – path to the csv file
- Returns
a ndarray for the labels and a ndarray (float) for the covariates
- quapy.data.reader.from_sparse(path)¶
Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1>[\s col(int):val(float)]
- Parameters
path – path to the labelled collection
- Returns
a csr_matrix containing the instances (rows), and a ndarray containing the labels
- quapy.data.reader.from_text(path, encoding='utf-8', verbose=1, class2int=True)¶
Reads a labelled collection of documents. File format: <0 or 1> <document>
- Parameters
path – path to the labelled collection
- Returns
a list of sentences, and a list of labels
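These readers are typically passed as the loader_func of LabelledCollection.load or Dataset.load rather than called directly; a sketch with a hypothetical path:
>>> from quapy.data.base import LabelledCollection
>>> from quapy.data.reader import from_text
>>> # 'reviews.txt' is a hypothetical file with one "<0 or 1> <document>" line per instance
>>> lc = LabelledCollection.load('reviews.txt', loader_func=from_text)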
- quapy.data.reader.reindex_labels(y)¶
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g., y=[‘B’, ‘B’, ‘A’, ‘C’] -> [1,1,0,2], [‘A’,’B’,’C’] :param y: the list or array of original labels :return: a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.