quapy.method package

Submodules

quapy.method.aggregative

class quapy.method.aggregative.ACC(classifier: BaseEstimator, val_split=0.4, n_jobs=None)

Bases: AggregativeQuantifier

Adjusted Classify & Count, the “adjusted” variant of CC, which corrects the predictions of CC according to the misclassification rates.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).
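
Example (a minimal usage sketch; train and test_instances are placeholder variables, not part of the library):

>>> from sklearn.linear_model import LogisticRegression
>>> from quapy.method.aggregative import ACC
>>>
>>> model = ACC(LogisticRegression(), val_split=0.4)
>>> model.fit(train)                               # train: a quapy.data.base.LabelledCollection
>>> prevalence_estimates = model.quantify(test_instances)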

aggregate(classif_predictions)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

classify(data)

Provides the label predictions for the given instances. The predictions should respect the format expected by aggregate(), i.e., posterior probabilities for probabilistic quantifiers, or crisp predictions for non-probabilistic quantifiers.

Parameters:

data – array-like

Returns:

np.ndarray of shape (n_instances,) with label predictions

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, int, LabelledCollection]] = None)

Trains an ACC quantifier.

Parameters:
  • data – the training set

  • fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)

  • val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection indicating the validation set itself, or an int indicating the number k of folds to be used in k-fold cross validation to estimate the parameters

Returns:

self

classmethod getPteCondEstim(classes, y, y_)

Estimates the matrix of misclassification rates PteCondEstim, with entry (i,j) being the estimate of \(P(y_i|y_j)\), computed from the true labels y and the predicted labels y_.

classmethod solve_adjustment(PteCondEstim, prevs_estim)

Solves the linear system \(Ax = B\) with \(A\) = PteCondEstim and \(B\) = prevs_estim

Parameters:
  • PteCondEstim – a np.ndarray of shape (n_classes,n_classes,) with entry (i,j) being the estimate of \(P(y_i|y_j)\), that is, the probability that an instance that belongs to \(y_j\) ends up being classified as belonging to \(y_i\)

  • prevs_estim – a np.ndarray of shape (n_classes,) with the class prevalence estimates

Returns:

an adjusted np.ndarray of shape (n_classes,) with the corrected class prevalence estimates
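
For intuition, the adjustment can be sketched in NumPy as follows (illustrative only; the example values and the final clipping/renormalization are assumptions; solve_adjustment() is the reference implementation):

>>> import numpy as np
>>> PteCondEstim = np.array([[0.9, 0.2],
>>>                          [0.1, 0.8]])     # entry (i,j) = P(classified as i | belongs to j)
>>> prevs_estim = np.array([0.62, 0.38])      # CC prevalence estimates
>>> adjusted = np.linalg.solve(PteCondEstim, prevs_estim)
>>> adjusted = np.clip(adjusted, 0, 1)
>>> adjusted / adjusted.sum()                 # array([0.6, 0.4])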

quapy.method.aggregative.AdjustedClassifyAndCount

alias of ACC

class quapy.method.aggregative.AggregativeProbabilisticQuantifier

Bases: AggregativeQuantifier

Abstract class for quantification methods that base their estimations on the aggregation of posterior probabilities as returned by a probabilistic classifier. Aggregative Probabilistic Quantifiers thus extend Aggregative Quantifiers by implementing a _posterior_probabilities_ method returning values in [0,1] – the posterior probabilities.

classify(instances)

Provides the label predictions for the given instances. The predictions should respect the format expected by aggregate(), i.e., posterior probabilities for probabilistic quantifiers, or crisp predictions for non-probabilistic quantifiers.

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_instances,) with label predictions

class quapy.method.aggregative.AggregativeQuantifier

Bases: BaseQuantifier

Abstract class for quantification methods that base their estimations on the aggregation of classification results. Aggregative Quantifiers thus implement a classify() method and maintain a classifier attribute. Subclasses of this abstract class must implement the method aggregate() which computes the aggregation of label predictions. The method quantify() comes with a default implementation based on classify() and aggregate().
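
Conceptually, the default implementation amounts to the following composition (a sketch, not the actual source):

>>> def quantify(self, instances):
>>>     # classify first, then aggregate the label predictions
>>>     return self.aggregate(self.classify(instances))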

abstract aggregate(classif_predictions: ndarray)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

property classes_

Class labels, in the same order in which class prevalence values are to be computed. This default implementation actually returns the class labels of the learner.

Returns:

array-like

property classifier

Gives access to the classifier

Returns:

the classifier (typically an sklearn’s Estimator)

classify(instances)

Provides the label predictions for the given instances. The predictions should respect the format expected by aggregate(), i.e., posterior probabilities for probabilistic quantifiers, or crisp predictions for non-probabilistic quantifiers.

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_instances,) with label predictions

abstract fit(data: LabelledCollection, fit_classifier=True)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

quantify(instances)

Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated by the classifier.

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_classes) with class prevalence estimates.

class quapy.method.aggregative.CC(classifier: BaseEstimator)

Bases: AggregativeQuantifier

The most basic Quantification method. One that simply classifies all instances and counts how many have been attributed to each of the classes in order to compute class prevalence estimates.

Parameters:

classifier – a sklearn’s Estimator that generates a classifier

aggregate(classif_predictions: ndarray)

Computes class prevalence estimates by counting the prevalence of each of the predicted labels.

Parameters:

classif_predictions – array-like with label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.
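
For intuition, the counting step can be sketched in NumPy as follows (illustrative only):

>>> import numpy as np
>>> classif_predictions = np.array([0, 1, 1, 0, 1])   # crisp label predictions
>>> prevalences = np.bincount(classif_predictions, minlength=2) / len(classif_predictions)
>>> prevalences   # array([0.4, 0.6])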

fit(data: LabelledCollection, fit_classifier=True)

Trains the Classify & Count method unless fit_classifier is False, in which case, the classifier is assumed to be already fit and there is nothing else to do.

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

quapy.method.aggregative.ClassifyAndCount

alias of CC

class quapy.method.aggregative.DistributionMatching(classifier, val_split=0.4, nbins=8, divergence: Union[str, Callable] = 'HD', cdf=False, n_jobs=None)

Bases: AggregativeProbabilisticQuantifier

Generic Distribution Matching quantifier for binary or multiclass quantification. This implementation takes as hyperparameters the number of bins, the divergence, and whether to work on the CDF instead of the PDF.

Parameters:
  • classifier – a sklearn’s Estimator that generates a probabilistic classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set to model the validation distribution. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the validation distribution should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

  • nbins – number of bins used to discretize the distributions (default 8)

  • divergence – a string representing a divergence measure (currently, “HD” and “topsoe” are implemented) or a callable function taking two ndarrays of the same dimension as input (default “HD”, meaning Hellinger Distance)

  • cdf – whether or not to use CDF instead of PDF (default False)

  • n_jobs – number of parallel workers (default None)
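
Example (a minimal usage sketch; train and test_instances are placeholder variables, not part of the library):

>>> from sklearn.linear_model import LogisticRegression
>>> from quapy.method.aggregative import DistributionMatching
>>>
>>> dm = DistributionMatching(LogisticRegression(), val_split=0.4, nbins=8, divergence='HD')
>>> dm.fit(train)                              # train: a quapy.data.base.LabelledCollection
>>> prevalence_estimates = dm.quantify(test_instances)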

aggregate(posteriors: ndarray)

Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution (the mixture) that best matches the test distribution, in terms of the divergence measure of choice. In the multiclass case, with n the number of classes, the test and mixture distributions contain n channels (proper distributions of binned posterior probabilities), on which the divergence is computed independently. The matching is computed as an average of the divergence across all channels.

Parameters:

posteriors – posterior probabilities of the instances in the sample

Returns:

a vector of class prevalence estimates

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, LabelledCollection]] = None)

Trains the classifier (if requested) and generates the validation distributions out of the training data. The validation distributions have shape (n, ch, nbins), with n the number of classes, ch the number of channels, and nbins the number of bins. In particular, let V be the validation distributions; di=V[i] are the distributions obtained from training data labelled with class i; dij = di[j] is the discrete distribution of posterior probabilities P(Y=j|X=x) for training data labelled with class i, and dij[k] is the fraction of instances with a value in the k-th bin.

Parameters:
  • data – the training set

  • fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)

  • val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection indicating the validation set itself, or an int indicating the number k of folds to be used in kFCV to estimate the parameters

Returns:

self

class quapy.method.aggregative.DyS(classifier: BaseEstimator, val_split=0.4, n_bins=8, divergence: Union[str, Callable] = 'HD', tol=1e-05)

Bases: AggregativeProbabilisticQuantifier, BinaryQuantifier

DyS framework (DyS). DyS is a generalization of the HDy method that uses a ternary search to find the prevalence that minimizes the distance between distributions. Details of the ternary search are taken from <https://dl.acm.org/doi/pdf/10.1145/3219819.3220059>.

Parameters:
  • classifier – a sklearn’s Estimator that generates a binary classifier

  • val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself).

  • n_bins – an int with the number of bins to use to compute the histograms.

  • divergence – a str indicating the name of the divergence (currently supported ones are “HD” and “topsoe”), or a callable function that computes the divergence between two distributions (two equally sized arrays).

  • tol – a float with the tolerance for the ternary search algorithm.
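
The following is a generic sketch of a ternary search minimizing a unimodal function f over [0,1], the kind of search DyS relies on (illustrative; not the library’s internal routine):

>>> def ternary_search(f, left=0.0, right=1.0, tol=1e-5):
>>>     # repeatedly discard the third of the interval that cannot contain the minimum
>>>     while right - left > tol:
>>>         m1 = left + (right - left) / 3
>>>         m2 = right - (right - left) / 3
>>>         if f(m1) < f(m2):
>>>             right = m2
>>>         else:
>>>             left = m1
>>>     return (left + right) / 2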

aggregate(classif_posteriors)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, LabelledCollection]] = None)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

class quapy.method.aggregative.EMQ(classifier: BaseEstimator, exact_train_prev=True, recalib=None)

Bases: AggregativeProbabilisticQuantifier

Expectation Maximization for Quantification (EMQ), aka Saerens-Latinne-Decaestecker (SLD) algorithm. EMQ consists of using the well-known Expectation Maximization algorithm to iteratively update the posterior probabilities generated by a probabilistic classifier and the class prevalence estimates obtained via maximum-likelihood estimation, in a mutually recursive way, until convergence.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • exact_train_prev – set to True (default) for using, as the initial observation, the true training prevalence; or set to False for computing the training prevalence as an estimate, akin to PCC, i.e., as the expected value of the posterior probabilities of the training instances, as suggested by Alexandari et al.

  • recalib – a string indicating the method of recalibration. Available choices include “nbvs” (No-Bias Vector Scaling), “bcts” (Bias-Corrected Temperature Scaling), “ts” (Temperature Scaling), and “vs” (Vector Scaling). The default value is None, indicating no recalibration.

classmethod EM(tr_prev, posterior_probabilities, epsilon=0.0001)

Computes the Expectation Maximization routine.

Parameters:
  • tr_prev – array-like, the training prevalence

  • posterior_probabilities – np.ndarray of shape (n_instances, n_classes,) with the posterior probabilities

  • epsilon – float, the convergence threshold: the loop stops once the difference between two consecutive iterations falls below epsilon

Returns:

a tuple with the estimated prevalence values (shape (n_classes,)) and the corrected posterior probabilities (shape (n_instances, n_classes,))
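
For intuition, the mutually recursive updates can be sketched in NumPy as follows (illustrative only; EM() is the reference implementation):

>>> import numpy as np
>>> def em_sketch(tr_prev, posteriors, epsilon=1e-4, max_iter=1000):
>>>     qs = np.copy(tr_prev)
>>>     for _ in range(max_iter):
>>>         # E-step: rescale the posteriors by the ratio of current to training prevalence
>>>         ps = posteriors * (qs / tr_prev)
>>>         ps /= ps.sum(axis=1, keepdims=True)
>>>         # M-step: re-estimate the prevalence as the mean of the corrected posteriors
>>>         qs_new = ps.mean(axis=0)
>>>         if np.abs(qs_new - qs).max() < epsilon:
>>>             return qs_new, ps
>>>         qs = qs_new
>>>     return qs, ps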

EPSILON = 0.0001
MAX_ITER = 1000
aggregate(classif_posteriors, epsilon=0.0001)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

predict_proba(instances, epsilon=0.0001)

Returns the posterior probabilities of the given instances, corrected via the EM routine.

quapy.method.aggregative.ExpectationMaximizationQuantifier

alias of EMQ

class quapy.method.aggregative.HDy(classifier: BaseEstimator, val_split=0.4)

Bases: AggregativeProbabilisticQuantifier, BinaryQuantifier

Hellinger Distance y (HDy). HDy is a probabilistic method for training binary quantifiers, that models quantification as the problem of minimizing the divergence (in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier. One of the distributions is generated from the unlabelled examples and the other is generated from a validation set. This latter distribution is defined as a mixture of the class-conditional distributions of the posterior probabilities returned for the positive and negative validation examples, respectively. The parameters of the mixture thus represent the estimates of the class prevalence values.

Parameters:
  • classifier – a sklearn’s Estimator that generates a binary classifier

  • val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself).
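
A sketch of the mixture search over the positive-class prevalence alpha follows (Px_pos, Px_neg and Px_test are placeholders for binned histograms of posterior probabilities; the actual implementation differs in its details):

>>> import numpy as np
>>> def hellinger(p, q):
>>>     # Hellinger distance between two binned distributions (constant factors omitted)
>>>     return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
>>>
>>> alphas = np.linspace(0, 1, 101)
>>> alpha = min(alphas, key=lambda a: hellinger(a * Px_pos + (1 - a) * Px_neg, Px_test))
>>> prevalence = np.array([1 - alpha, alpha])   # [negative, positive]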

aggregate(classif_posteriors)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, LabelledCollection]] = None)

Trains a HDy quantifier.

Parameters:
  • data – the training set

  • fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)

  • val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a quapy.data.base.LabelledCollection indicating the validation set itself

Returns:

self

quapy.method.aggregative.HellingerDistanceY

alias of HDy

class quapy.method.aggregative.MAX(classifier: BaseEstimator, val_split=0.4)

Bases: ThresholdOptimization

Threshold Optimization variant for ACC as proposed by Forman 2006 and Forman 2008 that looks for the threshold that maximizes tpr-fpr. The goal is to bring improved stability to the denominator of the adjustment.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

class quapy.method.aggregative.MS(classifier: BaseEstimator, val_split=0.4)

Bases: ThresholdOptimization

Median Sweep. Threshold Optimization variant for ACC as proposed by Forman 2006 and Forman 2008 that generates class prevalence estimates for all decision thresholds and returns the median of them all. The goal is to bring improved stability to the denominator of the adjustment.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

class quapy.method.aggregative.MS2(classifier: BaseEstimator, val_split=0.4)

Bases: MS

Median Sweep 2. Threshold Optimization variant for ACC as proposed by Forman 2006 and Forman 2008 that generates class prevalence estimates for all decision thresholds and returns the median of those cases in which tpr-fpr > 0.25. The goal is to bring improved stability to the denominator of the adjustment.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

quapy.method.aggregative.MedianSweep

alias of MS

quapy.method.aggregative.MedianSweep2

alias of MS2

class quapy.method.aggregative.OneVsAllAggregative(binary_quantifier, n_jobs=None, parallel_backend='multiprocessing')

Bases: OneVsAllGeneric, AggregativeQuantifier

Allows any binary quantifier to perform quantification on single-label datasets. The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the class prevalences sum up to 1. This variant was used, along with the EMQ quantifier, in Gao and Sebastiani, 2016.

Parameters:
  • binary_quantifier – a quantifier (binary) that will be employed to work on multiclass model in a one-vs-all manner

  • n_jobs – number of parallel workers

  • parallel_backend – the parallel backend for joblib (default “multiprocessing”); exposing this parameter is helpful for some quantifiers (e.g., ELM-based ones) that cannot be run under certain backends, since the temp dir they create during fit is removed and no longer available at predict time.

aggregate(classif_predictions)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

classify(instances)

If the base quantifier is not probabilistic, returns a matrix of shape (n,m,) with n the number of instances and m the number of classes. The entry (i,j) is a binary value indicating whether instance i belongs to class j. The binary classifications are independent of each other, meaning that an instance can end up being attributed to 0, 1, or more classes. If the base quantifier is probabilistic, returns a matrix of shape (n,m,2) with n the number of instances and m the number of classes. The entry (i,j,1) (resp. (i,j,0)) is a value in [0,1] indicating the posterior probability that instance i belongs (resp. does not belong) to class j. The posterior probabilities are independent of each other, meaning that, in general, they do not sum up to one.

Parameters:

instances – array-like

Returns:

np.ndarray

class quapy.method.aggregative.PACC(classifier: BaseEstimator, val_split=0.4, n_jobs=None)

Bases: AggregativeProbabilisticQuantifier

Probabilistic Adjusted Classify & Count, the probabilistic variant of ACC that relies on the posterior probabilities returned by a probabilistic classifier.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

  • n_jobs – number of parallel workers

aggregate(classif_posteriors)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

classify(data)

Provides the label predictions for the given instances. The predictions should respect the format expected by aggregate(), i.e., posterior probabilities for probabilistic quantifiers, or crisp predictions for non-probabilistic quantifiers.

Parameters:

data – array-like

Returns:

np.ndarray of shape (n_instances,) with label predictions

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, int, LabelledCollection]] = None)

Trains a PACC quantifier.

Parameters:
  • data – the training set

  • fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)

  • val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection indicating the validation set itself, or an int indicating the number k of folds to be used in kFCV to estimate the parameters

Returns:

self

classmethod getPteCondEstim(classes, y, y_)

Estimates the matrix of conditional probabilities, with entry (i,j) being the estimate of \(P(y_i|y_j)\), computed from the true labels y and the classifier outputs y_.

class quapy.method.aggregative.PCC(classifier: BaseEstimator)

Bases: AggregativeProbabilisticQuantifier

Probabilistic Classify & Count, the probabilistic variant of CC that relies on the posterior probabilities returned by a probabilistic classifier.

Parameters:

classifier – a sklearn’s Estimator that generates a classifier

aggregate(classif_posteriors)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

quapy.method.aggregative.ProbabilisticAdjustedClassifyAndCount

alias of PACC

quapy.method.aggregative.ProbabilisticClassifyAndCount

alias of PCC

quapy.method.aggregative.SLD

alias of EMQ

class quapy.method.aggregative.SMM(classifier: BaseEstimator, val_split=0.4)

Bases: AggregativeProbabilisticQuantifier, BinaryQuantifier

SMM method (SMM). SMM is a simplification of distribution matching methods where the representation of the examples is created using the mean instead of a histogram.

Parameters:
  • classifier – a sklearn’s Estimator that generates a binary classifier.

  • val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself).

aggregate(classif_posteriors)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, LabelledCollection]] = None)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

class quapy.method.aggregative.T50(classifier: BaseEstimator, val_split=0.4)

Bases: ThresholdOptimization

Threshold Optimization variant for ACC as proposed by Forman 2006 and Forman 2008 that looks for the threshold that makes tpr closest to 0.5. The goal is to bring improved stability to the denominator of the adjustment.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

class quapy.method.aggregative.ThresholdOptimization(classifier: BaseEstimator, val_split=0.4, n_jobs=None)

Bases: AggregativeQuantifier, BinaryQuantifier

Abstract class of Threshold Optimization variants for ACC as proposed by Forman 2006 and Forman 2008. The goal is to bring improved stability to the denominator of the adjustment. The different variants are based on different heuristics for choosing a decision threshold that would allow for more true positives and many more false positives, on the grounds this would deliver larger denominators.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).
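
For intuition, the binary ACC-style adjustment that these variants aim to stabilize can be sketched as follows (illustrative only; cc_estim, tpr and fpr are placeholder names):

>>> def adjust(cc_estim, tpr, fpr):
>>>     # corrected positive prevalence; a larger denominator (tpr - fpr) is more stable
>>>     if tpr == fpr:
>>>         return cc_estim
>>>     return min(1.0, max(0.0, (cc_estim - fpr) / (tpr - fpr)))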

aggregate(classif_predictions)

Implements the aggregation of label predictions.

Parameters:

classif_predictions – np.ndarray of label predictions

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

fit(data: LabelledCollection, fit_classifier=True, val_split: Optional[Union[float, int, LabelledCollection]] = None)

Trains the aggregative quantifier

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • fit_classifier – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.

Returns:

self

class quapy.method.aggregative.X(classifier: BaseEstimator, val_split=0.4)

Bases: ThresholdOptimization

Threshold Optimization variant for ACC as proposed by Forman 2006 and Forman 2008 that looks for the threshold that yields tpr=1-fpr. The goal is to bring improved stability to the denominator of the adjustment.

Parameters:
  • classifier – a sklearn’s Estimator that generates a classifier

  • val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a quapy.data.base.LabelledCollection (the split itself).

quapy.method.aggregative.cross_generate_predictions(data, classifier, val_split, probabilistic, fit_classifier, n_jobs)
quapy.method.aggregative.newELM(svmperf_base=None, loss='01', C=1)

Explicit Loss Minimization (ELM) quantifiers. Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is equivalent to:

>>> CC(SVMperf(svmperf_base, loss, C))
Parameters:
  • svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) this path will be obtained from qp.environ['SVMPERF_HOME']

  • loss – the loss to optimize (see quapy.classification.svmperf.SVMperf.valid_losses)

  • C – trade-off between training error and margin (default 1)

Returns:

returns an instance of CC set to work with SVMperf (with loss and C set properly) as the underlying classifier

quapy.method.aggregative.newSVMAE(svmperf_base=None, C=1)

SVM(AE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:

>>> CC(SVMperf(svmperf_base, loss='mae', C=C))

Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))

Parameters:
  • svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) this path will be obtained from qp.environ['SVMPERF_HOME']

  • C – trade-off between training error and margin (default 1)

Returns:

returns an instance of CC set to work with SVMperf (with loss and C set properly) as the underlying classifier

quapy.method.aggregative.newSVMKLD(svmperf_base=None, C=1)

SVM(KLD) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Kullback-Leibler Divergence normalized via the logistic function, as proposed by Esuli et al. 2015. Equivalent to:

>>> CC(SVMperf(svmperf_base, loss='nkld', C=C))

Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))

Parameters:
  • svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) this path will be obtained from qp.environ['SVMPERF_HOME']

  • C – trade-off between training error and margin (default 1)

Returns:

returns an instance of CC set to work with SVMperf (with loss and C set properly) as the underlying classifier

quapy.method.aggregative.newSVMQ(svmperf_base=None, C=1)

SVM(Q) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Q loss combining a classification-oriented loss and a quantification-oriented loss, as proposed by Barranquero et al. 2015. Equivalent to:

>>> CC(SVMperf(svmperf_base, loss='q', C=C))

Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))

Parameters:
  • svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) this path will be obtained from qp.environ['SVMPERF_HOME']

  • C – trade-off between training error and margin (default 1)

Returns:

returns an instance of CC set to work with SVMperf (with loss and C set properly) as the underlying classifier

quapy.method.aggregative.newSVMRAE(svmperf_base=None, C=1)

SVM(RAE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Relative Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:

>>> CC(SVMperf(svmperf_base, loss='mrae', C=C))

Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))

Parameters:
  • svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) this path will be obtained from qp.environ['SVMPERF_HOME']

  • C – trade-off between training error and margin (default 1)

Returns:

returns an instance of CC set to work with SVMperf (with loss and C set properly) as the underlying classifier

quapy.method.base

class quapy.method.base.BaseQuantifier

Bases: BaseEstimator

Abstract Quantifier. A quantifier is defined as an object of a class that implements the method fit() on quapy.data.base.LabelledCollection, the method quantify(), and the set_params() and get_params() for model selection (see quapy.model_selection.GridSearchQ())
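
For illustration, a minimal custom quantifier implementing this interface could look as follows (hypothetical example that simply memorizes the training prevalence):

>>> from quapy.method.base import BaseQuantifier
>>>
>>> class TrivialQuantifier(BaseQuantifier):
>>>     def fit(self, data):
>>>         self.prev_ = data.prevalence()   # a LabelledCollection reports its own prevalence
>>>         return self
>>>     def quantify(self, instances):
>>>         return self.prev_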

abstract fit(data: LabelledCollection)

Trains a quantifier.

Parameters:

data – a quapy.data.base.LabelledCollection consisting of the training data

Returns:

self

abstract quantify(instances)

Generate class prevalence estimates for the sample’s instances

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

class quapy.method.base.BinaryQuantifier

Bases: BaseQuantifier

Abstract class of binary quantifiers, i.e., quantifiers estimating class prevalence values for only two classes (typically, to be interpreted as one class and its complement).

class quapy.method.base.OneVsAll

Bases: object

class quapy.method.base.OneVsAllGeneric(binary_quantifier, n_jobs=None)

Bases: OneVsAll, BaseQuantifier

Allows any binary quantifier to perform quantification on single-label datasets. The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the class prevalence values sum up to 1.

property classes_
fit(data: LabelledCollection, fit_classifier=True)

Trains a quantifier.

Parameters:

data – a quapy.data.base.LabelledCollection consisting of the training data

Returns:

self

quantify(instances)

Generate class prevalence estimates for the sample’s instances

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

quapy.method.base.newOneVsAll(binary_quantifier, n_jobs=None)

quapy.method.meta

quapy.method.meta.EACC(classifier, param_grid=None, optim=None, param_mod_sel=None, **kwargs)

Implements an ensemble of quapy.method.aggregative.ACC quantifiers, as used by Pérez-Gállego et al., 2019.

Equivalent to:

>>> ensembleFactory(classifier, ACC, param_grid, optim, param_mod_sel, **kwargs)

See ensembleFactory() for further details.

Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_mod_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

quapy.method.meta.ECC(classifier, param_grid=None, optim=None, param_mod_sel=None, **kwargs)

Implements an ensemble of quapy.method.aggregative.CC quantifiers, as used by Pérez-Gállego et al., 2019.

Equivalent to:

>>> ensembleFactory(classifier, CC, param_grid, optim, param_mod_sel, **kwargs)

See ensembleFactory() for further details.

Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_mod_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

quapy.method.meta.EEMQ(classifier, param_grid=None, optim=None, param_mod_sel=None, **kwargs)

Implements an ensemble of quapy.method.aggregative.EMQ quantifiers.

Equivalent to:

>>> ensembleFactory(classifier, EMQ, param_grid, optim, param_mod_sel, **kwargs)

See ensembleFactory() for further details.

Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_mod_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

quapy.method.meta.EHDy(classifier, param_grid=None, optim=None, param_mod_sel=None, **kwargs)

Implements an ensemble of quapy.method.aggregative.HDy quantifiers, as used by Pérez-Gállego et al., 2019.

Equivalent to:

>>> ensembleFactory(classifier, HDy, param_grid, optim, param_mod_sel, **kwargs)

See ensembleFactory() for further details.

Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_mod_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

quapy.method.meta.EPACC(classifier, param_grid=None, optim=None, param_mod_sel=None, **kwargs)

Implements an ensemble of quapy.method.aggregative.PACC quantifiers.

Equivalent to:

>>> ensembleFactory(classifier, PACC, param_grid, optim, param_mod_sel, **kwargs)

See ensembleFactory() for further details.

Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_mod_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

class quapy.method.meta.Ensemble(quantifier: BaseQuantifier, size=50, red_size=25, min_pos=5, policy='ave', max_sample_size=None, val_split: Optional[Union[float, LabelledCollection]] = None, n_jobs=None, verbose=False)

Bases: BaseQuantifier

VALID_POLICIES = {'ave', 'ds', 'mae', 'mkld', 'mnkld', 'mrae', 'mse', 'ptr'}

Implementation of the Ensemble methods for quantification described by Pérez-Gállego et al., 2017 and Pérez-Gállego et al., 2019. The policies implemented include:

  • Average (policy=’ave’): computes class prevalence estimates as the average of the estimates returned by the base quantifiers.

  • Training Prevalence (policy=’ptr’): applies a dynamic selection to the ensemble’s members by retaining only those members such that the class prevalence values in the samples they use as training set are closest to preliminary class prevalence estimates computed as the average of the estimates of all the members. The final estimate is recomputed by considering only the selected members.

  • Distribution Similarity (policy=’ds’): performs a dynamic selection of base members by retaining the members trained on samples whose distribution of posterior probabilities is closest, in terms of the Hellinger Distance, to the distribution of posterior probabilities in the test sample

  • Accuracy (policy=’<valid error name>’): performs a static selection of the ensemble members by retaining those that minimize a quantification error measure, which is passed as an argument.

Example:

>>> model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
Parameters:
  • quantifier – base quantification member of the ensemble

  • size – number of members

  • red_size – number of members to retain after selection (depending on the policy)

  • min_pos – minimum number of positive instances to consider a sample as valid

  • policy – the selection policy; available policies include: ave (default), ptr, ds, and accuracy (which is instantiated via a valid error name, e.g., mae)

  • max_sample_size – maximum number of instances to consider in the samples (set to None to indicate no limit, default)

  • val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation split, or a quapy.data.base.LabelledCollection (the split itself).

  • n_jobs – number of parallel workers (default None)

  • verbose – set to True (default is False) to get some information in standard output

property aggregative

Indicates that the quantifier is not aggregative.

Returns:

False

fit(data: LabelledCollection, val_split: Optional[Union[float, LabelledCollection]] = None)

Trains a quantifier.

Parameters:
  • data – a quapy.data.base.LabelledCollection consisting of the training data

  • val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation split, or a quapy.data.base.LabelledCollection (the split itself)

Returns:

self

get_params(deep=True)

This function should not be used within quapy.model_selection.GridSearchQ (it is included here for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))), with Q a quantifier class that has a classifier l optimized for classification (not recommended).

Parameters:

deep – for compatibility with scikit-learn

Returns:

raises an Exception

property probabilistic

Indicates that the quantifier is not probabilistic.

Returns:

False

quantify(instances)

Generate class prevalence estimates for the sample’s instances

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

set_params(**parameters)

This function should not be used within quapy.model_selection.GridSearchQ (it is included here for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))), with Q a quantifier class that has a classifier l optimized for classification (not recommended).

Parameters:

parameters – dictionary

Returns:

raises an Exception

quapy.method.meta.ensembleFactory(classifier, base_quantifier_class, param_grid=None, optim=None, param_model_sel: Optional[dict] = None, **kwargs)

Ensemble factory. Provides a unified interface for instantiating ensembles that can be optimized (via model selection for quantification) for a given evaluation metric using quapy.model_selection.GridSearchQ. If the evaluation metric is classification-oriented (instead of quantification-oriented), then the optimization will be carried out via sklearn’s GridSearchCV.

Example to instantiate an Ensemble based on quapy.method.aggregative.PACC in which the base members are optimized for quapy.error.mae() via quapy.model_selection.GridSearchQ. The ensemble follows the policy Accuracy based on quapy.error.mae() (the same measure being optimized), meaning that a static selection of members of the ensemble is made based on their performance in terms of this error.

>>> param_grid = {
>>>     'C': np.logspace(-3,3,7),
>>>     'class_weight': ['balanced', None]
>>> }
>>> param_mod_sel = {
>>>     'sample_size': 500,
>>>     'protocol': 'app'
>>> }
>>> common={
>>>     'max_sample_size': 1000,
>>>     'n_jobs': -1,
>>>     'param_grid': param_grid,
>>>     'param_mod_sel': param_mod_sel,
>>> }
>>>
>>> ensembleFactory(LogisticRegression(), PACC, optim='mae', policy='mae', **common)
Parameters:
  • classifier – sklearn’s Estimator that generates a classifier

  • base_quantifier_class – a class of quantifiers

  • param_grid – a dictionary with the grid of parameters to optimize for

  • optim – a valid quantification or classification error, or a string name of it

  • param_model_sel – a dictionary containing any keyworded argument to pass to quapy.model_selection.GridSearchQ

  • kwargs – kwargs for the class Ensemble

Returns:

an instance of Ensemble

quapy.method.meta.get_probability_distribution(posterior_probabilities, bins=8)

Gets a histogram out of the posterior probabilities (only for the binary case).

Parameters:
  • posterior_probabilities – array-like of shape (n_instances, 2,)

  • bins – integer

Returns:

np.ndarray with the relative frequencies for each bin (for the positive class only)
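
A sketch of the computation (illustrative only; posterior_probabilities is the documented argument):

>>> import numpy as np
>>> positives = posterior_probabilities[:, 1]            # posterior of the positive class
>>> counts, _ = np.histogram(positives, bins=8, range=(0, 1))
>>> distribution = counts / counts.sum()                 # relative frequency per bin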

quapy.method.neural

class quapy.method.neural.QuaNetModule(doc_embedding_size, n_classes, stats_size, lstm_hidden_size=64, lstm_nlayers=1, ff_layers=[1024, 512], bidirectional=True, qdrop_p=0.5, order_by=0)

Bases: Module

Implements the QuaNet forward pass. See QuaNetTrainer for training QuaNet.

Parameters:
  • doc_embedding_size – integer, the dimensionality of the document embeddings

  • n_classes – integer, number of classes

  • stats_size – integer, number of statistics estimated by simple quantification methods

  • lstm_hidden_size – integer, hidden dimensionality of the LSTM cell

  • lstm_nlayers – integer, number of LSTM layers

  • ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the quantification embedding

  • bidirectional – boolean, whether or not to use bidirectional LSTM

  • qdrop_p – float, dropout probability

  • order_by – integer, class for which the document embeddings are to be sorted

property device
forward(doc_embeddings, doc_posteriors, statistics)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class quapy.method.neural.QuaNetTrainer(classifier, sample_size=None, n_epochs=100, tr_iter_per_poch=500, va_iter_per_poch=100, lr=0.001, lstm_hidden_size=64, lstm_nlayers=1, ff_layers=[1024, 512], bidirectional=True, qdrop_p=0.5, patience=10, checkpointdir='../checkpoint', checkpointname=None, device='cuda')

Bases: BaseQuantifier

Implementation of QuaNet, a neural network for quantification. This implementation uses PyTorch and can take advantage of GPU for speeding-up the training phase.

Example:

>>> import quapy as qp
>>> from quapy.method.meta import QuaNet
>>> from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
>>>
>>> # use samples of 100 elements
>>> qp.environ['SAMPLE_SIZE'] = 100
>>>
>>> # load the kindle dataset as text, and convert words to numerical indexes
>>> dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
>>> qp.data.preprocessing.index(dataset, min_df=5, inplace=True)
>>>
>>> # the text classifier is a CNN trained by NeuralClassifierTrainer
>>> cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
>>> classifier = NeuralClassifierTrainer(cnn, device='cuda')
>>>
>>> # train QuaNet (QuaNet is an alias to QuaNetTrainer)
>>> model = QuaNet(classifier, qp.environ['SAMPLE_SIZE'], device='cuda')
>>> model.fit(dataset.training)
>>> estim_prevalence = model.quantify(dataset.test.instances)
Parameters:
  • classifier – an object implementing fit (i.e., that can be trained on labelled data), predict_proba (i.e., that can generate posterior probabilities of unlabelled examples) and transform (i.e., that can generate embedded representations of the unlabelled instances).

  • sample_size – integer, the sample size; default is None, meaning that the sample size should be taken from qp.environ[“SAMPLE_SIZE”]

  • n_epochs – integer, maximum number of training epochs

  • tr_iter_per_poch – integer, number of training iterations before considering an epoch complete

  • va_iter_per_poch – integer, number of validation iterations to perform after each epoch

  • lr – float, the learning rate

  • lstm_hidden_size – integer, hidden dimensionality of the LSTM cells

  • lstm_nlayers – integer, number of LSTM layers

  • ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the quantification embedding

  • bidirectional – boolean, indicates whether the LSTM is bidirectional or not

  • qdrop_p – float, dropout probability

  • patience – integer, number of epochs showing no improvement in the validation set before stopping the training phase (early stopping)

  • checkpointdir – string, a path where to store models’ checkpoints

  • checkpointname – string (optional), the name of the model’s checkpoint

  • device – string, indicate “cpu” or “cuda”

property classes_
clean_checkpoint()

Removes the checkpoint

clean_checkpoint_dir()

Removes anything contained in the checkpoint directory

fit(data: LabelledCollection, fit_classifier=True)

Trains QuaNet.

Parameters:
  • data – the training data on which to train QuaNet. If fit_classifier=True, the data will be split in 40/40/20 for training the classifier, training QuaNet, and validating QuaNet, respectively. If fit_classifier=False, the data will be split in 66/34 for training QuaNet and validating it, respectively.

  • fit_classifier – if True, trains the classifier on a split containing 40% of the data

Returns:

self

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

quantify(instances)

Generate class prevalence estimates for the sample’s instances

Parameters:

instances – array-like

Returns:

np.ndarray of shape (n_classes,) with class prevalence estimates.

set_params(**parameters)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

quapy.method.neural.mae_loss(output, target)

Torch-like wrapper for the Mean Absolute Error

Parameters:
  • output – predictions

  • target – ground truth values

Returns:

mean absolute error loss

quapy.method.non_aggregative

class quapy.method.non_aggregative.MaximumLikelihoodPrevalenceEstimation

Bases: BaseQuantifier

The Maximum Likelihood Prevalence Estimation (MLPE) method is a lazy method that assumes there is no prior probability shift between training and test instances (put another way, that the i.i.d. assumption holds). The estimation of class prevalence values for any test sample is always (i.e., irrespective of the test sample itself) the class prevalence seen during training. This method is considered to be a lower-bound quantifier that any quantification method should beat.
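
Example (a minimal usage sketch; train and test_instances are placeholders):

>>> from quapy.method.non_aggregative import MaximumLikelihoodPrevalenceEstimation
>>>
>>> mlpe = MaximumLikelihoodPrevalenceEstimation()
>>> mlpe.fit(train)                       # stores the training prevalence
>>> mlpe.quantify(test_instances)         # returns it, irrespective of the input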

fit(data: LabelledCollection)

Computes the training prevalence and stores it.

Parameters:

data – the training sample

Returns:

self

quantify(instances)

Ignores the input instances and returns, as the class prevalence estimates, the training prevalence.

Parameters:

instances – array-like (ignored)

Returns:

the class prevalence seen during training

Module contents