quapy.classification package

Submodules

quapy.classification.calibration

New in version 0.1.7.

class quapy.classification.calibration.BCTSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)

Bases: RecalibratedProbabilisticClassifierBase

Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a fraction p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output
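As a rough illustration of what BCTS does at prediction time: in the formulation of Alexandari et al., the classifier's logits are divided by a single temperature and shifted by a per-class bias before the softmax is re-applied. The sketch below is plain Python and is not the abstention implementation; the function name `bcts_transform` and the values chosen for `temperature` and `bias` are hypothetical (in practice both are learned from held-out posteriors).

```python
import math

def softmax(scores):
    # subtract the maximum for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bcts_transform(posteriors, temperature, bias):
    """Recalibrate one posterior vector as softmax(logits / T + b)."""
    # log-posteriors serve as logits (they are defined up to an additive constant)
    logits = [math.log(max(p, 1e-12)) for p in posteriors]
    scaled = [z / temperature + b for z, b in zip(logits, bias)]
    return softmax(scaled)

# hypothetical parameters: T > 1 softens an over-confident prediction
calibrated = bcts_transform([0.9, 0.08, 0.02], temperature=2.0, bias=[0.0, 0.3, 0.0])
print(calibrated)
```

With `temperature=1.0` and an all-zero `bias`, the transformation leaves the posteriors unchanged.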

class quapy.classification.calibration.NBVSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)

Bases: RecalibratedProbabilisticClassifierBase

Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a fraction p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output

class quapy.classification.calibration.RecalibratedProbabilisticClassifier

Bases: object

Abstract class for the (re)calibration methods from abstention.calibration, as defined in: Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.

class quapy.classification.calibration.RecalibratedProbabilisticClassifierBase(classifier, calibrator, val_split=5, n_jobs=None, verbose=False)

Bases: BaseEstimator, RecalibratedProbabilisticClassifier

Applies a (re)calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)

  • val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a fraction p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None

  • verbose – whether or not to display information in the standard output

property classes_

Returns the classes on which the classifier has been trained

Returns:

array-like of shape (n_classes,)

fit(X, y)

Fits the calibration for the probabilistic classifier.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self
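Judging from the description of val_split, fit is expected to choose between the two strategies documented below: an integer selects cross-validation (fit_cv) and a float in (0,1) selects a train/validation split (fit_tr_val). The helper below is a hypothetical sketch of that dispatch; the name `choose_fit_strategy`, the minimum of 2 folds and the error messages are assumptions, not part of quapy.

```python
def choose_fit_strategy(val_split):
    """Return the name of the fitting routine implied by val_split."""
    # bool is a subclass of int in Python, so rule it out explicitly
    if isinstance(val_split, bool):
        raise ValueError("val_split must be an int or a float")
    if isinstance(val_split, int):
        if val_split < 2:
            raise ValueError("k-fold cross-validation needs at least 2 folds")
        return "fit_cv"
    if isinstance(val_split, float):
        if not 0.0 < val_split < 1.0:
            raise ValueError("a float val_split must lie in (0, 1)")
        return "fit_tr_val"
    raise ValueError("val_split must be an int or a float")

print(choose_fit_strategy(5))    # fit_cv
print(choose_fit_strategy(0.3))  # fit_tr_val
```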

fit_cv(X, y)

Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all training instances via cross-validation, and then retrains the classifier on all training instances. The posterior probabilities thus generated are used for calibrating the outputs of the classifier.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self
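The cross-validation scheme described for fit_cv can be sketched in plain Python as follows. This is an illustration only: `kfold_indices` and `class_frequencies` are hypothetical helpers, the folds are interleaved rather than stratified, and the toy "classifier" simply predicts the class frequencies of its training fold.

```python
from collections import Counter

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def class_frequencies(labels, classes):
    # toy stand-in for a probabilistic classifier: it ignores the features
    # and predicts the class frequencies observed in its training labels
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
classes = sorted(set(y))

# 1) posteriors for every training instance, each produced by a model
#    that never saw that instance
out_of_fold = [None] * len(y)
for train_idx, val_idx in kfold_indices(len(y), k=5):
    posterior = class_frequencies([y[j] for j in train_idx], classes)
    for j in val_idx:
        out_of_fold[j] = posterior

# 2) the out-of-fold posteriors would now be used to fit the calibrator,
#    and the classifier is retrained on all training instances
final_model = class_frequencies(y, classes)
```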

fit_tr_val(X, y)

Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier, which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the data instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self
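A stratified train/validation partition of the kind described for fit_tr_val might look like the sketch below. It is plain Python with a hypothetical helper `stratified_split`; holding out at least one instance per class and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction, seed=0):
    """Hold out roughly val_fraction of the instances of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # assumption: keep at least one validation instance per class
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

y = [0] * 8 + [1] * 2
train_idx, val_idx = stratified_split(y, val_fraction=0.5)
# the classifier would be trained on train_idx, and its posteriors on
# val_idx would be used to fit the calibrator
```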

predict(X)

Predicts class labels for the data instances in X

Parameters:

X – array-like of shape (n_samples, n_features) with the data instances

Returns:

array-like of shape (n_samples,) with the class label predictions

predict_proba(X)

Generates posterior probabilities for the data instances in X

Parameters:

X – array-like of shape (n_samples, n_features) with the data instances

Returns:

array-like of shape (n_samples, n_classes) with posterior probabilities

class quapy.classification.calibration.TSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)

Bases: RecalibratedProbabilisticClassifierBase

Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a fraction p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output
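Temperature scaling fits a single scalar T > 0 so that softmax(logits / T) minimises the negative log-likelihood of held-out labels. The sketch below illustrates the idea in plain Python; a coarse grid search stands in for a proper numerical optimizer, and `fit_temperature`, the grid and the toy data are assumptions made for the example, not the abstention implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        probs = softmax([z / temperature for z in logits])
        total -= math.log(probs[label])
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid=None):
    # grid search over T in {0.1, 0.2, ..., 10.0}
    grid = grid or [0.1 * i for i in range(1, 101)]
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# over-confident toy classifier: large margins, yet one of four predictions is wrong
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
print(T)  # a value above 1, i.e., the posteriors are softened
```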

class quapy.classification.calibration.VSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)

Bases: RecalibratedProbabilisticClassifierBase

Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al. (2020).

Parameters:
  • classifier – a scikit-learn probabilistic classifier

  • val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a fraction p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.

  • n_jobs – indicate the number of parallel workers (only when val_split is an integer)

  • verbose – whether or not to display information in the standard output

quapy.classification.methods

class quapy.classification.methods.LowRankLogisticRegression(n_components=100, **kwargs)

Bases: BaseEstimator

An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), as required by quapy.method.neural.QuaNet. This is a mock method that makes it easy to instantiate quapy.method.neural.QuaNet on array-like real-valued instances. The transformation consists of applying sklearn.decomposition.TruncatedSVD, while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.

Parameters:
  • n_components – the number of principal components to retain

  • kwargs – parameters for the Logistic Regression classifier
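The interface described above is duck-typed: any object exposing fit, predict, predict_proba and transform qualifies. As a hedged illustration, the helper below checks for those four methods; `missing_methods` and `DummyEmbeddingClassifier` are hypothetical names used only for this sketch and are not part of quapy.

```python
REQUIRED_METHODS = ("fit", "predict", "predict_proba", "transform")

def missing_methods(obj):
    """List the required methods that obj does not implement."""
    return [name for name in REQUIRED_METHODS
            if not callable(getattr(obj, name, None))]

class DummyEmbeddingClassifier:
    """Toy classifier that embeds each instance as its first two features."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self

    def transform(self, X):
        return [row[:2] for row in X]

    def predict_proba(self, X):
        # uniform posteriors: this toy model learns nothing from the data
        uniform = 1.0 / len(self.classes_)
        return [[uniform] * len(self.classes_) for _ in X]

    def predict(self, X):
        return [self.classes_[0] for _ in X]

print(missing_methods(DummyEmbeddingClassifier()))  # []
```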

fit(X, y)

Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.

Parameters:
  • X – array-like of shape (n_samples, n_features) with the instances

  • y – array-like of shape (n_samples,) with the class labels

Returns:

self

get_params()

Get hyper-parameters for this estimator.

Returns:

a dictionary with parameter names mapped to their values

predict(X)

Predicts labels for the instances X embedded into the low-rank space.

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances in X

predict_proba(X)

Predicts posterior probabilities for the instances X embedded into the low-rank space.

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

set_params(**params)

Set the parameters of this estimator.

Parameters:

params – a **kwargs dictionary with the estimator parameters for Logistic Regression and, optionally, n_components for TruncatedSVD

transform(X)

Returns the low-rank approximation of X with n_components dimensions, or X unaltered if n_components >= X.shape[1].

Parameters:

X – array-like of shape (n_samples, n_features) instances to embed

Returns:

array-like of shape (n_samples, n_components) with the embedded instances

quapy.classification.neural

class quapy.classification.neural.CNNnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, kernel_heights=[3, 5, 7], stride=1, padding=0, drop_p=0.5)

Bases: TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.

Parameters:
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers

  • stride – convolutional stride (default 1)

  • padding – convolutional padding (default 0)

  • drop_p – drop probability for dropout (default 0.5)
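To make the roles of kernel_heights, stride and padding concrete, the standard output-size formula for a convolution without dilation gives the number of token windows that each kernel produces along the sequence axis. The helper `conv_output_length` is illustrative only (it is not part of quapy), and the 300-token document length is an assumed input.

```python
def conv_output_length(n_tokens, kernel_height, stride=1, padding=0):
    """Number of positions a kernel of the given height visits along the token axis."""
    if n_tokens + 2 * padding < kernel_height:
        raise ValueError("document is shorter than the kernel")
    return (n_tokens + 2 * padding - kernel_height) // stride + 1

# default kernel heights, stride and padding of CNNnet on a 300-token document
for height in [3, 5, 7]:
    print(height, conv_output_length(300, height))
# 3 298
# 5 296
# 7 294
```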

document_embedding(input)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

input – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

training: bool

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

class quapy.classification.neural.LSTMnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, lstm_class_nlayers=1, drop_p=0.5)

Bases: TextClassifierNet

An implementation of quapy.classification.neural.TextClassifierNet based on Long Short-Term Memory networks.

Parameters:
  • vocabulary_size – the size of the vocabulary

  • n_classes – number of target classes

  • embedding_size – the dimensionality of the word embeddings space (default 100)

  • hidden_size – the dimensionality of the hidden space (default 256)

  • repr_size – the dimensionality of the document embeddings space (default 100)

  • lstm_class_nlayers – number of LSTM layers (default 1)

  • drop_p – drop probability for dropout (default 0.5)

document_embedding(x)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

get_params()

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

training: bool

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

class quapy.classification.neural.NeuralClassifierTrainer(net: TextClassifierNet, lr=0.001, weight_decay=0, patience=10, epochs=200, batch_size=64, batch_size_test=512, padding_length=300, device='cpu', checkpointpath='../checkpoint/classifier_net.dat')

Bases: object

Trains a neural network for text classification.

Parameters:
  • net – an instance of TextClassifierNet implementing the forward pass

  • lr – learning rate (default 1e-3)

  • weight_decay – weight decay (default 0)

  • patience – number of epochs that do not show any improvement in validation to wait before applying early stop (default 10)

  • epochs – maximum number of training epochs (default 200)

  • batch_size – batch size for training (default 64)

  • batch_size_test – batch size for test (default 512)

  • padding_length – maximum number of tokens to consider in a document (default 300)

  • device – specify ‘cpu’ (default) or ‘cuda’ for enabling gpu

  • checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
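The interplay between patience, epochs and the checkpoint can be sketched as a plain early-stopping loop. This is not quapy's training code: `train_with_early_stopping` is a hypothetical helper, and the monitored quantity is assumed to be a validation loss (lower is better) supplied as a precomputed list.

```python
def train_with_early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch_run) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: this is where the model checkpoint would be saved
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # early stop
    return best_epoch, len(val_losses) - 1  # ran out of epochs

losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7]
print(train_with_early_stopping(losses, patience=3))  # (2, 5)
```

With the default patience of 10, the same sequence would run through all seven epochs and still report epoch 2 as the best one.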

property device

Gets the device on which the network is allocated

Returns:

device

fit(instances, labels, val_split=0.3)

Fits the model according to the given training data.

Parameters:
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples,) with the class labels

  • val_split – proportion of training documents to be taken as the validation set (default 0.3)

Returns:

get_params()

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

predict(instances)

Predicts labels for the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances

predict_proba(instances)

Predicts posterior probabilities for the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

reset_net_params(vocab_size, n_classes)

Reinitialize the network parameters

Parameters:
  • vocab_size – the size of the vocabulary

  • n_classes – the number of target classes

set_params(**params)

Set the parameters of this trainer and the learner it is training. In the current version, parameter names for the trainer and the learner should be disjoint.

Parameters:

params – a **kwargs dictionary with the parameters

transform(instances)

Returns the embeddings of the instances

Parameters:

instances – list of lists of indexed tokens

Returns:

array-like of shape (n_samples, embed_size) with the embedded instances, where embed_size is defined by the classification network

class quapy.classification.neural.TextClassifierNet

Bases: Module

Abstract Text classifier (torch.nn.Module)

dimensions()

Gets the number of dimensions of the embedding space

Returns:

integer

abstract document_embedding(x)

Embeds documents (i.e., performs the forward pass up to the next-to-last layer).

Parameters:

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns:

a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding

forward(x)

Performs the forward pass.

Parameters:

x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)

Returns:

a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes

abstract get_params()

Get hyper-parameters for this estimator

Returns:

a dictionary with parameter names mapped to their values

predict_proba(x)

Predicts posterior probabilities for the instances in x

Parameters:

x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch and pad_length is the length to which the batch has been padded

Returns:

array-like of shape (n_samples, n_classes) with the posterior probabilities

training: bool

property vocabulary_size

Return the size of the vocabulary

Returns:

integer

xavier_uniform()

Performs Xavier initialization of the network parameters

class quapy.classification.neural.TorchDataset(instances, labels=None)

Bases: Dataset

Transforms labelled instances into a torch.utils.data.DataLoader object

Parameters:
  • instances – list of lists of indexed tokens

  • labels – array-like of shape (n_samples,) with the class labels

asDataloader(batch_size, shuffle, pad_length, device)

Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch

Parameters:
  • batch_size – batch size

  • shuffle – whether or not to shuffle instances

  • pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to the length of that document, and not to pad_length)

  • device – whether to allocate tensors in cpu or in cuda

Returns:

a torch.utils.data.DataLoader object
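Dynamic padding, as described for pad_length, can be illustrated on plain Python lists. The helper `pad_batch` is hypothetical; padding on the left, truncating from the right and using 0 as the padding index are assumptions of this sketch rather than documented behaviour.

```python
def pad_batch(batch, pad_length, pad_index=0):
    """Pad (or truncate) every document to min(longest document in batch, pad_length)."""
    target = min(max(len(doc) for doc in batch), pad_length)
    padded = []
    for doc in batch:
        doc = doc[:target]  # truncate documents longer than the target
        padded.append([pad_index] * (target - len(doc)) + doc)
    return padded

batch = [[5, 8, 2], [7], [4, 4, 9, 1, 3]]

# the longest document (5 tokens) is shorter than pad_length: pad to 5, not to 300
print(pad_batch(batch, pad_length=300))
# [[0, 0, 5, 8, 2], [0, 0, 0, 0, 7], [4, 4, 9, 1, 3]]

# pad_length caps the batch length when documents are longer
print(pad_batch(batch, pad_length=4))
# [[0, 5, 8, 2], [0, 0, 0, 7], [4, 4, 9, 1]]
```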

quapy.classification.svmperf

class quapy.classification.svmperf.SVMperf(svmperf_base, C=0.01, verbose=False, loss='01', host_folder=None)

Bases: BaseEstimator, ClassifierMixin

A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.

Parameters:
  • svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify

  • C – trade-off between training error and margin (default 0.01)

  • verbose – set to True to print the standard output of svm-perf

  • loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.

  • host_folder – directory where to store the trained model; set to None (default) to use a temporary directory (temporary directories are automatically deleted)

decision_function(X, y=None)

Evaluate the decision function for the samples in X.

Parameters:
  • X – array-like of shape (n_samples, n_features) containing the instances to classify

  • y – unused

Returns:

array-like of shape (n_samples,) containing the decision scores of the instances

fit(X, y)

Trains the SVM for the multivariate performance loss

Parameters:
  • X – training instances

  • y – a binary vector of labels

Returns:

self

predict(X)

Predicts labels for the instances X

Parameters:

X – array-like of shape (n_samples, n_features) instances to classify

Returns:

a numpy array of length n containing the label predictions, where n is the number of instances in X

valid_losses = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27, 'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}
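The valid_losses dictionary maps each loss name accepted by the loss parameter to an integer code, presumably the identifier that the (patched) SVM-perf binaries expect. A minimal sketch of validating a loss name against this mapping follows; the helper `loss_code` and its error message are hypothetical and not part of quapy.

```python
# copy of the mapping documented above
VALID_LOSSES = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27,
                'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}

def loss_code(loss):
    """Translate a loss name into its integer code, rejecting unknown names."""
    try:
        return VALID_LOSSES[loss]
    except KeyError:
        raise ValueError(
            f"unsupported loss {loss!r}; valid losses are {sorted(VALID_LOSSES)}"
        ) from None

print(loss_code('kld'))  # 12
```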

Module contents