quapy.classification package
Submodules
quapy.classification.calibration
New in version 0.1.7.
- class quapy.classification.calibration.BCTSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)¶
Bases:
RecalibratedProbabilisticClassifierBase
Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al.
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
- class quapy.classification.calibration.NBVSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)¶
Bases:
RecalibratedProbabilisticClassifierBase
Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al.
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
- class quapy.classification.calibration.RecalibratedProbabilisticClassifier¶
Bases:
object
Abstract class for (re)calibration methods from abstention.calibration, as defined in Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.
- class quapy.classification.calibration.RecalibratedProbabilisticClassifierBase(classifier, calibrator, val_split=5, n_jobs=None, verbose=False)¶
Bases:
BaseEstimator, RecalibratedProbabilisticClassifier
Applies a (re)calibration method from abstention.calibration, as defined in the paper by Alexandari et al.
- Parameters:
classifier – a scikit-learn probabilistic classifier
calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)
val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None
verbose – whether or not to display information in the standard output
- property classes_¶
Returns the classes on which the classifier has been trained
- Returns:
array-like of shape (n_classes)
- fit(X, y)¶
Fits the calibration for the probabilistic classifier.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
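The way val_split selects between the two fitting routines documented below can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the helper name choose_fit_routine is hypothetical, and the exact validation rules (for instance, requiring at least two folds) are assumed.

```python
def choose_fit_routine(val_split):
    """Return the name of the routine that a given val_split would select.

    Hypothetical helper: an integer k is read as a number of folds
    (cross-validation), a float p in (0,1) as a validation proportion.
    """
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(val_split, bool):
        raise ValueError("val_split must be an int or a float")
    if isinstance(val_split, int):
        if val_split < 2:  # assumption: at least 2 folds are needed
            raise ValueError("the number of folds must be at least 2")
        return "fit_cv"
    if isinstance(val_split, float):
        if not 0 < val_split < 1:
            raise ValueError("the validation proportion must be in (0,1)")
        return "fit_tr_val"
    raise ValueError("val_split must be an int or a float")


print(choose_fit_routine(5))    # integer: k-fold cross-validation
print(choose_fit_routine(0.4))  # float: stratified train/validation split
```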
- fit_cv(X, y)¶
Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all training instances via cross-validation, and then retrains the classifier on all training instances. The posterior probabilities thus generated are used for calibrating the outputs of the classifier.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
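The cross-validation idea behind fit_cv (one held-out prediction per training instance) can be sketched in plain Python. The names kfold_indices, out_of_fold_predictions, and positive_rate are hypothetical, and the contiguous, non-stratified folds are an assumption made for brevity; a real implementation would typically stratify by class and use an actual classifier.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size


def out_of_fold_predictions(X, y, k, fit_predict):
    """Collect one held-out prediction per training instance.

    fit_predict(X_train, y_train, X_val) is a hypothetical callable that
    trains on the first two arguments and returns predictions for X_val.
    """
    predictions = [None] * len(X)
    for train_idx, val_idx in kfold_indices(len(X), k):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_val = [X[i] for i in val_idx]
        for i, pred in zip(val_idx, fit_predict(X_train, y_train, X_val)):
            predictions[i] = pred
    return predictions


# Toy "classifier": predicts the positive-class rate seen in training.
def positive_rate(X_train, y_train, X_val):
    rate = sum(y_train) / len(y_train)
    return [rate] * len(X_val)


X = list(range(10))
y = [0, 1, 0, 1, 1, 1, 0, 1, 0, 1]
oof = out_of_fold_predictions(X, y, k=5, fit_predict=positive_rate)
print(oof)
```

Every instance is predicted by a model that never saw it during training, which is what makes these predictions usable for fitting a calibrator.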
- fit_tr_val(X, y)¶
Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier, which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset.
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
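A stratified train/validation partition of the kind described for fit_tr_val can be sketched with the standard library alone. The helper stratified_split is hypothetical, and the rounding rule and fixed seed are assumptions; it only illustrates that each class contributes the same proportion of instances to the validation set.

```python
import random
from collections import defaultdict


def stratified_split(y, val_proportion, seed=0):
    """Return (train_idx, val_idx) keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, val_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # assumption: round to the nearest integer within each class
        n_val = round(len(indices) * val_proportion)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


y = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(y, val_proportion=0.25)
print(len(train_idx), len(val_idx))
```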
- predict(X)¶
Predicts class labels for the data instances in X
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
- Returns:
array-like of shape (n_samples,) with the class label predictions
- predict_proba(X)¶
Generates posterior probabilities for the data instances in X
- Parameters:
X – array-like of shape (n_samples, n_features) with the data instances
- Returns:
array-like of shape (n_samples, n_classes) with posterior probabilities
- class quapy.classification.calibration.TSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)¶
Bases:
RecalibratedProbabilisticClassifierBase
Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al.
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
- class quapy.classification.calibration.VSCalibration(classifier, val_split=5, n_jobs=None, verbose=False)¶
Bases:
RecalibratedProbabilisticClassifierBase
Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in the paper by Alexandari et al.
- Parameters:
classifier – a scikit-learn probabilistic classifier
val_split – an integer k to obtain the posterior probabilities via k-fold cross-validation (kFCV), or a float p in (0,1) to obtain them on a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
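The four scaling variants in this module (TS, BCTS, VS, NBVS) differ only in which parameters rescale the classifier's log-probabilities before a softmax is applied. The sketch below illustrates that relationship in plain Python, following the parametrization described by Alexandari et al. It is not the abstention implementation: the function names are hypothetical, the parameter values are made up rather than fitted, and the posteriors are assumed to be strictly positive so that their logarithm exists.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rescale(posteriors, weights=None, bias=None):
    """Apply softmax(weights * log(p) + bias) to one posterior vector.

    weights: per-class multipliers (all equal to 1/T for temperature scaling)
    bias:    per-class offsets (omitted in the variants without bias)
    """
    n = len(posteriors)
    weights = weights if weights is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    logits = [math.log(p) for p in posteriors]
    return softmax([w * z + b for w, z, b in zip(weights, logits, bias)])


p = [0.7, 0.2, 0.1]
T = 2.0  # made-up temperature; T > 1 softens the distribution
b = [0.0, 0.3, 0.0]  # made-up per-class bias
w = [0.5, 1.0, 1.5]  # made-up per-class weights

ts = rescale(p, weights=[1 / T] * 3)             # TS: one shared temperature
bcts = rescale(p, weights=[1 / T] * 3, bias=b)   # BCTS: temperature plus bias
nbvs = rescale(p, weights=w)                     # NBVS: per-class weights, no bias
vs = rescale(p, weights=w, bias=b)               # VS: per-class weights plus bias

print([round(v, 3) for v in ts])
```

In practice the temperature, weights, and bias are fitted on held-out posteriors, which is why the classes above need a validation split or cross-validation.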
quapy.classification.methods
- class quapy.classification.methods.LowRankLogisticRegression(n_components=100, **kwargs)¶
Bases:
BaseEstimator
An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), as required by quapy.method.neural.QuaNet. This is a mock method that makes it easy to instantiate quapy.method.neural.QuaNet on array-like real-valued instances. The transformation consists of applying sklearn.decomposition.TruncatedSVD, while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.
- Parameters:
n_components – the number of principal components to retain
kwargs – parameters for the Logistic Regression classifier
- fit(X, y)¶
Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.
- Parameters:
X – array-like of shape (n_samples, n_features) with the instances
y – array-like of shape (n_samples,) with the class labels
- Returns:
self
- get_params()¶
Get hyper-parameters for this estimator.
- Returns:
a dictionary with parameter names mapped to their values
- predict(X)¶
Predicts labels for the instances X embedded into the low-rank space.
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances in X
- predict_proba(X)¶
Predicts posterior probabilities for the instances X embedded into the low-rank space.
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- set_params(**params)¶
Set the parameters of this estimator.
- Parameters:
params – a **kwargs dictionary with the estimator parameters for Logistic Regression and, optionally, n_components for TruncatedSVD
- transform(X)¶
Returns the low-rank approximation of X with n_components dimensions, or X unaltered if n_components >= X.shape[1].
- Parameters:
X – array-like of shape (n_samples, n_features) instances to embed
- Returns:
array-like of shape (n_samples, n_components) with the embedded instances
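The interface described for this class (fit, predict, predict_proba, plus transform) can be checked by duck typing. The sketch below is only an illustration: missing_methods and the two toy classes are hypothetical and are not part of QuaPy.

```python
def missing_methods(classifier):
    """List the methods of the four-method interface that `classifier` lacks."""
    required = ("fit", "predict", "predict_proba", "transform")
    return [name for name in required if not callable(getattr(classifier, name, None))]


class ToyEmbeddingClassifier:
    """Made-up stand-in that exposes all four methods."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]

    def transform(self, X):
        return X


class PlainClassifier:
    """Made-up stand-in without transform."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]


print(missing_methods(ToyEmbeddingClassifier()))  # nothing missing
print(missing_methods(PlainClassifier()))         # lacks transform
```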
quapy.classification.neural
- class quapy.classification.neural.CNNnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, kernel_heights=[3, 5, 7], stride=1, padding=0, drop_p=0.5)¶
Bases:
TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.
- Parameters:
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers
stride – convolutional stride (default 1)
padding – convolutional padding (default 0)
drop_p – drop probability for dropout (default 0.5)
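To make the roles of kernel_heights, stride, and padding concrete, the sketch below counts how many positions a kernel visits along the token axis of a document. It assumes the standard convolution arithmetic without dilation; the helper n_kernel_positions is hypothetical, and the document length of 300 tokens is borrowed from the default padding_length of NeuralClassifierTrainer.

```python
def n_kernel_positions(n_tokens, kernel_height, stride=1, padding=0):
    """Number of positions a kernel visits along the token axis."""
    span = n_tokens + 2 * padding - kernel_height
    if span < 0:
        raise ValueError("the (padded) document is shorter than the kernel")
    return span // stride + 1


# Default kernel heights, stride 1, no padding, 300-token documents
for k in (3, 5, 7):
    print(k, n_kernel_positions(300, k))
```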
- document_embedding(input)¶
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
input – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- get_params()¶
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- training: bool¶
- property vocabulary_size¶
Return the size of the vocabulary
- Returns:
integer
- class quapy.classification.neural.LSTMnet(vocabulary_size, n_classes, embedding_size=100, hidden_size=256, repr_size=100, lstm_class_nlayers=1, drop_p=0.5)¶
Bases:
TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Long Short-Term Memory networks.
- Parameters:
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
lstm_class_nlayers – number of LSTM layers (default 1)
drop_p – drop probability for dropout (default 0.5)
- document_embedding(x)¶
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- get_params()¶
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- training: bool¶
- property vocabulary_size¶
Return the size of the vocabulary
- Returns:
integer
- class quapy.classification.neural.NeuralClassifierTrainer(net: TextClassifierNet, lr=0.001, weight_decay=0, patience=10, epochs=200, batch_size=64, batch_size_test=512, padding_length=300, device='cpu', checkpointpath='../checkpoint/classifier_net.dat')¶
Bases:
object
Trains a neural network for text classification.
- Parameters:
net – an instance of TextClassifierNet implementing the forward pass
lr – learning rate (default 1e-3)
weight_decay – weight decay (default 0)
patience – number of epochs that do not show any improvement in validation to wait before applying early stop (default 10)
epochs – maximum number of training epochs (default 200)
batch_size – batch size for training (default 64)
batch_size_test – batch size for test (default 512)
padding_length – maximum number of tokens to consider in a document (default 300)
device – specify ‘cpu’ (default) or ‘cuda’ for enabling gpu
checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
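The interplay between patience, epochs, and the checkpoint of the best model can be sketched as a plain early-stopping loop. This is an illustration under assumptions, not the trainer's code: early_stopping_epoch is a hypothetical helper, the validation scores are made up, and higher scores are assumed to be better.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Epochs are numbered from 1. Higher scores are assumed to be better.
    """
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # a strict improvement: this is where a checkpoint would be saved
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # the maximum number of epochs was reached without triggering early stop
    return len(val_scores), best_epoch


scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.66]
print(early_stopping_epoch(scores, patience=3))
```

With patience=3 the loop stops at epoch 6, after three epochs without a strict improvement over epoch 3, and the parameters kept would be those of epoch 3.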
- property device¶
Gets the device in which the network is allocated
- Returns:
device
- fit(instances, labels, val_split=0.3)¶
Fits the model according to the given training data.
- Parameters:
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples,) with the class labels
val_split – proportion of training documents to be taken as the validation set (default 0.3)
- Returns:
- get_params()¶
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- predict(instances)¶
Predicts labels for the instances
- Parameters:
instances – list of lists of indexed tokens
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances
- predict_proba(instances)¶
Predicts posterior probabilities for the instances
- Parameters:
instances – list of lists of indexed tokens
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- reset_net_params(vocab_size, n_classes)¶
Reinitialize the network parameters
- Parameters:
vocab_size – the size of the vocabulary
n_classes – the number of target classes
- set_params(**params)¶
Set the parameters of this trainer and the learner it is training. In the current version, parameter names for the trainer and the learner should be disjoint.
- Parameters:
params – a **kwargs dictionary with the parameters
- transform(instances)¶
Returns the embeddings of the instances
- Parameters:
instances – list of lists of indexed tokens
- Returns:
array-like of shape (n_samples, embed_size) with the embedded instances, where embed_size is defined by the classification network
- class quapy.classification.neural.TextClassifierNet¶
Bases:
Module
Abstract Text classifier (torch.nn.Module)
- dimensions()¶
Gets the number of dimensions of the embedding space
- Returns:
integer
- abstract document_embedding(x)¶
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
- Parameters:
x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
- forward(x)¶
Performs the forward pass.
- Parameters:
x – a batch of instances, typically generated by a torch DataLoader instance (see quapy.classification.neural.TorchDataset)
- Returns:
a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes
- abstract get_params()¶
Get hyper-parameters for this estimator
- Returns:
a dictionary with parameter names mapped to their values
- predict_proba(x)¶
Predicts posterior probabilities for the instances in x
- Parameters:
x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch and pad_length is the padded length of the documents in the batch
- Returns:
array-like of shape (n_samples, n_classes) with the posterior probabilities
- training: bool¶
- property vocabulary_size¶
Return the size of the vocabulary
- Returns:
integer
- xavier_uniform()¶
Performs Xavier initialization of the network parameters
- class quapy.classification.neural.TorchDataset(instances, labels=None)¶
Bases:
Dataset
Transforms labelled instances into a torch.utils.data.DataLoader object.
- Parameters:
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples,) with the class labels
- asDataloader(batch_size, shuffle, pad_length, device)¶
Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch
- Parameters:
batch_size – batch size
shuffle – whether or not to shuffle instances
pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to the length of that document, and not to pad_length)
device – whether to allocate tensors in cpu or in cuda
- Returns:
a torch.utils.data.DataLoader object
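The dynamic padding described for pad_length can be sketched without torch. The helper pad_batch is hypothetical; padding on the right and using 0 as the padding index are assumptions made for the illustration.

```python
def pad_batch(batch, pad_length, pad_index=0):
    """Truncate each document to pad_length, then right-pad to a common length."""
    truncated = [doc[:pad_length] for doc in batch]
    target = max(len(doc) for doc in truncated)  # never exceeds pad_length
    return [doc + [pad_index] * (target - len(doc)) for doc in truncated]


batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6, 2, 8]]
print(pad_batch(batch, pad_length=300))  # padded to 6, the longest document
print(pad_batch(batch, pad_length=4))    # truncated and padded to 4
```

Padding each batch only up to its own longest document avoids carrying pad_length columns of padding through the network when the documents are short.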
quapy.classification.svmperf
- class quapy.classification.svmperf.SVMperf(svmperf_base, C=0.01, verbose=False, loss='01', host_folder=None)¶
Bases:
BaseEstimator, ClassifierMixin
A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.
- Parameters:
svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify
C – trade-off between training error and margin (default 0.01)
verbose – set to True to print the standard output of svm-perf
loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.
host_folder – directory where to store the trained model; set to None (default) for using a tmp directory (temporary directories are automatically deleted)
- decision_function(X, y=None)¶
Evaluate the decision function for the samples in X.
- Parameters:
X – array-like of shape (n_samples, n_features) containing the instances to classify
y – unused
- Returns:
array-like of shape (n_samples,) containing the decision scores of the instances
- fit(X, y)¶
Trains the SVM for the multivariate performance loss
- Parameters:
X – training instances
y – a binary vector of labels
- Returns:
self
- predict(X)¶
Predicts labels for the instances X
- Parameters:
X – array-like of shape (n_samples, n_features) instances to classify
- Returns:
a numpy array of length n containing the label predictions, where n is the number of instances in X
- valid_losses = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27, 'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}¶
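The valid_losses mapping associates each accepted loss name with an integer code. A minimal sketch of validating a loss name against that mapping is given below; the dictionary is copied from the attribute above, while the helper loss_code is hypothetical and how SVMperf itself checks the loss argument is not shown in this reference.

```python
# Copied from the valid_losses class attribute documented above
VALID_LOSSES = {'01': 0, 'f1': 1, 'kld': 12, 'mae': 26, 'mrae': 27,
                'nkld': 13, 'q': 22, 'qacc': 23, 'qf1': 24, 'qgm': 25}


def loss_code(loss):
    """Return the integer code for a loss name, or raise ValueError."""
    if loss not in VALID_LOSSES:
        raise ValueError(
            f"unsupported loss {loss!r}; valid values are {sorted(VALID_LOSSES)}"
        )
    return VALID_LOSSES[loss]


print(loss_code("kld"))
```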