quapy.classification package

Submodules

quapy.classification.methods module

class quapy.classification.methods.LowRankLogisticRegression(n_components=100, **kwargs)

Bases: sklearn.base.BaseEstimator

An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), such as those required by quapy.method.neural.QuaNet. This is a mock method that makes it easy to instantiate quapy.method.neural.QuaNet on array-like real-valued instances. The transformation applies sklearn.decomposition.TruncatedSVD, and classification is performed by sklearn.linear_model.LogisticRegression on the resulting low-rank space.

Parameters
  • n_components – the number of principal components to retain

  • kwargs – parameters for the Logistic Regression classifier

fit(X, y)

Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.

Parameters
  • X – array-like of shape (n_samples, n_features) with the instances

  • y – array-like of shape (n_samples,) with the class labels

Returns

self

get_params()

Get hyper-parameters for this estimator.

Returns

a dictionary with parameter names mapped to their values

predict(X)

Predicts labels for the instances X embedded into the low-rank space.

Parameters

X – array-like of shape (n_samples, n_features) instances to classify

Returns

a numpy array of length n containing the label predictions, where n is the number of instances in X

predict_proba(X)

Predicts posterior probabilities for the instances X embedded into the low-rank space.

Parameters

X – array-like of shape (n_samples, n_features) instances to classify

Returns

array-like of shape (n_samples, n_classes) with the posterior probabilities

set_params(**params)

Set the parameters of this estimator.

Parameters

params – a **kwargs dictionary with the estimator parameters for Logistic Regression and, optionally, n_components for TruncatedSVD
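One way to picture how a single keyword dictionary can serve both components is sketched below. The helper name split_params is hypothetical and is not part of quapy; the sketch only assumes that n_components is the one key not meant for the Logistic Regression classifier.

```python
def split_params(params):
    """Separate n_components from the keyword arguments meant for the classifier.

    Hypothetical helper for illustration only; not quapy's implementation.
    """
    classifier_params = dict(params)  # copy so the caller's dict is left untouched
    n_components = classifier_params.pop("n_components", None)
    return n_components, classifier_params


n_components, lr_params = split_params({"C": 10, "n_components": 50})
print(n_components, lr_params)
```

When n_components is absent, the sketch returns None for it, which a caller could interpret as "keep the current value".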

transform(X)

Returns the low-rank approximation of X with n_components dimensions, or X unaltered if n_components >= X.shape[1].

Parameters

X – array-like of shape (n_samples, n_features) instances to embed

Returns

array-like of shape (n_samples, n_components) with the embedded instances, or X itself, of shape (n_samples, n_features), if n_components >= X.shape[1]

quapy.classification.neural module

class quapy.classification.neural.CNNnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, kernel_heights=[3, 5, 7], stride=1, padding=0, drop_p=0.5)

Bases: quapy.classification.neural.TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.

Parameters
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers

  • stride – convolutional stride (default 1)

  • padding – convolutional padding (default 0)

  • drop_p – drop probability for dropout (default 0.5)
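As a rough illustration of how kernel_heights, stride, and padding interact, the sketch below computes how many positions each kernel yields along the token axis, using the standard output-length formula for a convolution without dilation. The function name conv_output_length and the document length of 300 tokens are assumptions made for this example; they are not taken from quapy.

```python
def conv_output_length(n_tokens, kernel_height, stride=1, padding=0):
    """Number of positions a kernel of the given height yields over n_tokens.

    Standard formula for a convolution without dilation, applied here to the
    token axis only (an assumption of this sketch).
    """
    span = n_tokens + 2 * padding - kernel_height
    if span < 0:
        raise ValueError("kernel is taller than the (padded) document")
    return span // stride + 1


# One output length per kernel height, for a hypothetical 300-token document
lengths = {k: conv_output_length(300, k) for k in [3, 5, 7]}
print(lengths)
```

Taller kernels cover more consecutive tokens and therefore yield fewer positions for the same document length.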

document_embedding(input)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters

input – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()

Get hyper-parameters for this estimator

Returns

a dictionary with parameter names mapped to their values

property vocabulary_size

Return the size of the vocabulary

Returns

integer

class quapy.classification.neural.LSTMnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, lstm_class_nlayers=1, drop_p=0.5)

Bases: quapy.classification.neural.TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Long Short Term Memory networks.

Parameters
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • lstm_class_nlayers – number of LSTM layers (default 1)

  • drop_p – drop probability for dropout (default 0.5)

document_embedding(x)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()

Get hyper-parameters for this estimator

Returns

a dictionary with parameter names mapped to their values

property vocabulary_size

Return the size of the vocabulary

Returns

integer

class quapy.classification.neural.NeuralClassifierTrainer(net: quapy.classification.neural.TextClassifierNet, lr=0.001, weight_decay=0, patience=10, epochs=200, batch_size=64, batch_size_test=512, padding_length=300, device='cpu', checkpointpath='../checkpoint/classifier_net.dat')

Bases: object

Trains a neural network for text classification.

Parameters
  • net – an instance of TextClassifierNet implementing the forward pass

  • lr – learning rate (default 1e-3)

  • weight_decay – weight decay (default 0)

  • patience – number of epochs that do not show any improvement in validation to wait before applying early stop (default 10)

  • epochs – maximum number of training epochs (default 200)

  • batch_size – batch size for training (default 64)

  • batch_size_test – batch size for test (default 512)

  • padding_length – maximum number of tokens to consider in a document (default 300)

  • device – specify ‘cpu’ (default) or ‘cuda’ for enabling gpu

  • checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
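The patience and epochs parameters together describe early stopping. A minimal sketch of that logic follows, assuming a validation score for which higher is better. The function name epochs_run and the sample scores are hypothetical, and this is not quapy's actual training loop.

```python
def epochs_run(val_scores, patience=10, max_epochs=200):
    """Return (epochs executed, index of the best epoch).

    Assumes higher validation scores are better (an assumption of this sketch).
    """
    best_score = float("-inf")
    best_epoch = -1
    waited = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_scores[:max_epochs]):
        if score > best_score:
            # A real trainer would save a checkpoint of the model here
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch + 1, best_epoch  # early stop
    return min(len(val_scores), max_epochs), best_epoch


scores = [0.60, 0.65, 0.64, 0.63, 0.66, 0.65, 0.65, 0.64]
print(epochs_run(scores, patience=2))
```

With a small patience the loop stops before reaching the later, better epoch, which is the trade-off the patience parameter controls.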

property device

Gets the device on which the network is allocated

Returns

device

fit(instances, labels, val_split=0.3)

Fits the model according to the given training data.

Parameters
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples,) with the class labels

  • val_split – proportion of training documents to be taken as the validation set (default 0.3)

Returns

get_params()

Get hyper-parameters for this estimator

Returns

a dictionary with parameter names mapped to their values

predict(instances)

Predicts labels for the instances

Parameters

instances – list of lists of indexed tokens

Returns

a numpy array of length n containing the label predictions, where n is the number of instances

predict_proba(instances)

Predicts posterior probabilities for the instances

Parameters

instances – list of lists of indexed tokens

Returns

array-like of shape (n_samples, n_classes) with the posterior probabilities

reset_net_params(vocab_size, n_classes)

Reinitialize the network parameters

Parameters
  • vocab_size – the size of the vocabulary

  • n_classes – the number of target classes

set_params(**params)

Set the parameters of this trainer and the learner it is training. In the current version, parameter names for the trainer and the learner should be disjoint.

Parameters

params – a **kwargs dictionary with the parameters

transform(instances)

Returns the embeddings of the instances

Parameters

instances – list of lists of indexed tokens

Returns

array-like of shape (n_samples, embed_size) with the embedded instances, where embed_size is defined by the classification network

class quapy.classification.neural.TextClassifierNet

Bases: torch.nn.modules.module.Module

Abstract Text classifier (torch.nn.Module)

dimensions()

Gets the number of dimensions of the embedding space

Returns

integer

abstract document_embedding(x)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

forward(x)

Performs the forward pass.

Parameters

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns

a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes

abstract get_params()

Get hyper-parameters for this estimator

Returns

a dictionary with parameter names mapped to their values

predict_proba(x)

Predicts posterior probabilities for the instances in x

Parameters

x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch and pad_length is the padded length of the documents in the batch

Returns

array-like of shape (n_samples, n_classes) with the posterior probabilities
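The text above does not spell out how the decision scores returned by forward relate to the posteriors returned by predict_proba. A common choice, assumed in this sketch, is to apply a softmax to each row of scores. The example uses plain Python lists instead of torch tensors so that it stays self-contained.

```python
import math


def softmax(scores):
    """Convert one row of decision scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical instances with three classes each
decision_scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
posteriors = [softmax(row) for row in decision_scores]
print(posteriors)
```

Equal scores map to a uniform distribution, and larger scores receive larger probabilities.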

property vocabulary_size

Return the size of the vocabulary

Returns

integer

xavier_uniform()

Performs Xavier initialization of the network parameters

class quapy.classification.neural.TorchDataset(instances, labels=None)

Bases: torch.utils.data.dataset.Dataset

Transforms labelled instances into a torch.utils.data.DataLoader object

Parameters
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples,) with the class labels

asDataloader(batch_size, shuffle, pad_length, device)

Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch

Parameters
  • batch_size – batch size

  • shuffle – whether or not to shuffle instances

  • pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to its length, and not to pad_length)

  • device – whether to allocate tensors in cpu or in cuda

Returns

a torch.utils.data.DataLoader object
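The dynamic padding described for pad_length can be sketched in plain Python as follows. The function name pad_batch, the padding index 0, and the choice to pad on the left are assumptions made for illustration; the text above does not specify them.

```python
def pad_batch(batch, pad_length, pad_index=0):
    """Pad (or truncate) every document in a batch to a common length.

    The common length is that of the longest document in the batch,
    capped at pad_length. Left-padding with pad_index is an assumption.
    """
    target = min(pad_length, max(len(doc) for doc in batch))
    padded = []
    for doc in batch:
        doc = doc[:target]  # truncate documents longer than the target
        padded.append([pad_index] * (target - len(doc)) + doc)
    return padded


batch = [[5, 6, 7], [8, 9], [1, 2, 3, 4, 5]]
print(pad_batch(batch, pad_length=300))  # padded to 5, not to 300
print(pad_batch(batch, pad_length=4))    # capped at pad_length
```

Padding to the longest document in the batch, and not always to pad_length, keeps batches of short documents small.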

quapy.classification.svmperf module

class quapy.classification.svmperf.SVMperf(svmperf_base, C=0.01, verbose=False, loss='01')

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.

Parameters
  • svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify

  • C – trade-off between training error and margin (default 0.01)

  • verbose – set to True to print svm-perf std outputs

  • loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.

decision_function(X, y=None)

Evaluate the decision function for the samples in X.

Parameters
  • X – array-like of shape (n_samples, n_features) containing the instances to classify

  • y – unused

Returns

array-like of shape (n_samples,) containing the decision scores of the instances
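Assuming the usual SVM convention that the sign of the decision score indicates the predicted class, binary labels could be derived from the scores as sketched below. The zero threshold, the 0/1 label encoding, and the function name labels_from_scores are assumptions of this example, not documented behaviour of the wrapper.

```python
def labels_from_scores(scores, threshold=0.0):
    """Map each decision score to 1 (positive class) or 0 (negative class).

    Scores strictly above the threshold are treated as positive (an assumption).
    """
    return [1 if s > threshold else 0 for s in scores]


print(labels_from_scores([-1.2, 0.3, 0.0, 2.5]))
```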

fit(X, y)

Trains the SVM for the multivariate performance loss

Parameters
  • X – training instances

  • y – a binary vector of labels

Returns

self

predict(X)

Predicts labels for the instances X.

Parameters

X – array-like of shape (n_samples, n_features) instances to classify

Returns

a numpy array of length n containing the label predictions, where n is the number of instances in X

set_params(**parameters)

Set the hyper-parameters for svm-perf. Currently, only the C parameter is supported.

Parameters

parameters – a **kwargs dictionary {‘C’: <float>}

valid_losses = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27, 'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}
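The valid_losses mapping pairs each loss name with an integer code. A minimal sketch of how a loss name could be validated and translated into its code is given below. The helper loss_code is hypothetical and not part of quapy; only the dictionary contents are taken from the attribute above.

```python
# Copied from the valid_losses attribute documented above
VALID_LOSSES = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27,
                'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}


def loss_code(loss):
    """Return the integer code for a loss name, rejecting unknown names.

    Hypothetical helper for illustration only.
    """
    if loss not in VALID_LOSSES:
        raise ValueError(
            f"unsupported loss {loss!r}; valid losses are {sorted(VALID_LOSSES)}"
        )
    return VALID_LOSSES[loss]


print(loss_code('kld'))
```

Rejecting unknown names early gives a clearer error than passing an invalid value on to the external binaries.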

Module contents