quapy.method package¶
Submodules¶
quapy.method.aggregative module¶
- class quapy.method.aggregative.ACC(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.AggregativeQuantifier
Adjusted Classify & Count, the “adjusted” variant of CC, that corrects the predictions of CC according to the misclassification rates.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
- aggregate(classif_predictions)¶
Implements the aggregation of label predictions.
- Parameters
classif_predictions – np.ndarray of label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- classify(data)¶
Provides the label predictions for the given instances.
- Parameters
data – array-like with the instances to classify
- Returns
np.ndarray of shape (n_instances,) with label predictions
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True, val_split: Optional[Union[float, int, quapy.data.base.LabelledCollection]] = None)¶
Trains an ACC quantifier.
- Parameters
data – the training set
fit_learner – set to False to bypass the training (the learner is assumed to be already fit)
val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection indicating the validation set itself, or an int indicating the number k of folds to be used in k-fold cross validation to estimate the parameters
- Returns
self
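As a rough illustration of what estimating the misclassification rates on a held-out validation set involves, the sketch below builds the rate matrix from validation labels and predictions. The helper name `estimate_misclassification_rates` and the toy arrays are hypothetical and not part of QuaPy; labels are assumed to be integers in `0..n_classes-1`.

```python
import numpy as np

def estimate_misclassification_rates(y_true, y_pred, n_classes):
    """Hypothetical helper: entry (i, j) estimates P(predicted = i | true = j)."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[predicted_label, true_label] += 1.0
    column_totals = counts.sum(axis=0, keepdims=True)
    # Assumption: classes absent from the validation set keep an all-zero column
    column_totals[column_totals == 0] = 1.0
    return counts / column_totals

# Toy validation split: 4 instances of class 0 and 6 instances of class 1
y_val_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_val_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
rates = estimate_misclassification_rates(y_val_true, y_val_pred, n_classes=2)
```

Each column of `rates` describes how the instances of one true class are distributed over the predicted classes, so every column of a class present in the validation set sums to 1.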
- classmethod solve_adjustment(PteCondEstim, prevs_estim)¶
Solves the linear system \(Ax = B\) with \(A\) = PteCondEstim and \(B\) = prevs_estim
- Parameters
PteCondEstim – a np.ndarray of shape (n_classes,n_classes,) with entry (i,j) being the estimate of \(P(y_i|y_j)\), that is, the probability that an instance that belongs to \(y_j\) ends up being classified as belonging to \(y_i\)
prevs_estim – a np.ndarray of shape (n_classes,) with the class prevalence estimates
- Returns
an adjusted np.ndarray of shape (n_classes,) with the corrected class prevalence estimates
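The adjustment can be pictured with plain NumPy. This is a minimal sketch under stated assumptions, not the library's code: the function name `solve_adjustment_sketch`, the clipping and renormalization of the solution, and the fallback to the unadjusted estimates for a singular matrix are all illustrative choices.

```python
import numpy as np

def solve_adjustment_sketch(pte_cond_estim, prevs_estim):
    """Hypothetical sketch: solve A x = B for the corrected prevalences x."""
    A = np.asarray(pte_cond_estim, dtype=float)
    B = np.asarray(prevs_estim, dtype=float)
    try:
        adjusted = np.linalg.solve(A, B)
    except np.linalg.LinAlgError:
        # Assumption: with a singular matrix, keep the unadjusted estimates
        return B
    # Assumption: force the solution back onto the probability simplex
    adjusted = np.clip(adjusted, 0.0, 1.0)
    return adjusted / adjusted.sum()

# Entry (i, j) is P(classified as i | belongs to j); columns sum to 1
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])
cc_estimates = np.array([0.5, 0.5])   # prevalences obtained by plain counting
adjusted = solve_adjustment_sketch(A, cc_estimates)
```

With these toy numbers the counted prevalences [0.5, 0.5] are exactly what a sample with true prevalences [0.4, 0.6] would produce under the given rates, and solving the system recovers those values.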
- quapy.method.aggregative.AdjustedClassifyAndCount¶
alias of
quapy.method.aggregative.ACC
- class quapy.method.aggregative.AggregativeProbabilisticQuantifier¶
Bases:
quapy.method.aggregative.AggregativeQuantifier
Abstract class for quantification methods that base their estimations on the aggregation of posterior probabilities as returned by a probabilistic classifier. Aggregative Probabilistic Quantifiers thus extend Aggregative Quantifiers by implementing a posterior_probabilities() method returning values in [0,1] – the posterior probabilities.
- posterior_probabilities(instances)¶
- predict_proba(instances)¶
- property probabilistic¶
Indicates whether the quantifier is of type probabilistic or not
- Returns
False (to be overridden)
- quantify(instances)¶
Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated by the classifier.
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- set_params(**parameters)¶
Set the parameters of the quantifier.
- Parameters
parameters – dictionary of param-value pairs
- class quapy.method.aggregative.AggregativeQuantifier¶
Bases:
quapy.method.base.BaseQuantifier
Abstract class for quantification methods that base their estimations on the aggregation of classification results. Aggregative Quantifiers thus implement a classify() method and maintain a learner attribute. Subclasses of this abstract class must implement the method aggregate(), which computes the aggregation of label predictions. The method quantify() comes with a default implementation based on classify() and aggregate().
- abstract aggregate(classif_predictions: numpy.ndarray)¶
Implements the aggregation of label predictions.
- Parameters
classif_predictions – np.ndarray of label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- property aggregative¶
Returns True, indicating the quantifier is of type aggregative.
- Returns
True
- property classes_¶
Class labels, in the same order in which class prevalence values are to be computed. This default implementation actually returns the class labels of the learner.
- Returns
array-like
- classify(instances)¶
Provides the label predictions for the given instances.
- Parameters
instances – array-like
- Returns
np.ndarray of shape (n_instances,) with label predictions
- abstract fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- get_params(deep=True)¶
Return the current parameters of the quantifier.
- Parameters
deep – for compatibility with sklearn
- Returns
a dictionary of param-value pairs
- property learner¶
Gives access to the classifier
- Returns
the classifier (typically an sklearn’s Estimator)
- quantify(instances)¶
Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated by the classifier.
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
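The default behaviour of quantify() described above, i.e. classify first and aggregate afterwards, can be sketched with a toy class. `ToyAggregativeQuantifier` is a hypothetical stand-in that does not inherit from any QuaPy class; its threshold "classifier" only exists to keep the example self-contained.

```python
import numpy as np

class ToyAggregativeQuantifier:
    """Hypothetical stand-in illustrating quantify = aggregate(classify(...))."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.classes_ = np.array([0, 1])

    def classify(self, instances):
        # Toy classifier: label 1 when the single feature exceeds the threshold
        return (np.asarray(instances) > self.threshold).astype(int)

    def aggregate(self, classif_predictions):
        # Fraction of predictions assigned to each class, in classes_ order
        counts = np.array([(classif_predictions == c).sum() for c in self.classes_])
        return counts / counts.sum()

    def quantify(self, instances):
        # Default composition: label predictions first, aggregation second
        return self.aggregate(self.classify(instances))

quantifier = ToyAggregativeQuantifier(threshold=0.5)
prevalences = quantifier.quantify([0.1, 0.9, 0.7, 0.2, 0.8])
```

Keeping the two steps separate means the label predictions can be computed once and aggregated in different ways.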
- set_params(**parameters)¶
Set the parameters of the quantifier.
- Parameters
parameters – dictionary of param-value pairs
- class quapy.method.aggregative.CC(learner: sklearn.base.BaseEstimator)¶
Bases:
quapy.method.aggregative.AggregativeQuantifier
The most basic quantification method, one that simply classifies all instances and counts how many have been attributed to each of the classes in order to compute class prevalence estimates.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
- aggregate(classif_predictions: numpy.ndarray)¶
Computes class prevalence estimates by counting the prevalence of each of the predicted labels.
- Parameters
classif_predictions – array-like with label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
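Counting the prevalence of each predicted label amounts to a per-class frequency. The sketch below is illustrative only; the function name `classify_and_count` and the explicit `classes` argument are assumptions made here so that classes never predicted still receive a prevalence of zero.

```python
import numpy as np

def classify_and_count(classif_predictions, classes):
    """Hypothetical sketch: fraction of predictions falling into each class."""
    predictions = np.asarray(classif_predictions)
    counts = np.array([np.sum(predictions == c) for c in classes], dtype=float)
    return counts / len(predictions)

# "neu" is never predicted, yet it keeps its slot in the output
prevalences = classify_and_count(
    ["pos", "neg", "pos", "pos"], classes=["neg", "neu", "pos"]
)
```

The output follows the order of `classes`, mirroring the role that classes_ plays for the quantifier.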
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the Classify & Count method unless fit_learner is False, in which case, the classifier is assumed to be already fit and there is nothing else to do.
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – if False, the classifier is assumed to be fit
- Returns
self
- quapy.method.aggregative.ClassifyAndCount¶
alias of
quapy.method.aggregative.CC
- class quapy.method.aggregative.ELM(svmperf_base=None, loss='01', **kwargs)¶
Bases:
quapy.method.aggregative.AggregativeQuantifier, quapy.method.base.BinaryQuantifier
Class of Explicit Loss Minimization (ELM) quantifiers. Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script).
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
loss – the loss to optimize (see quapy.classification.svmperf.SVMperf.valid_losses)
kwargs – rest of SVM perf’s parameters
- aggregate(classif_predictions: numpy.ndarray)¶
Implements the aggregation of label predictions.
- Parameters
classif_predictions – np.ndarray of label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- classify(X, y=None)¶
Provides the label predictions for the given instances.
- Parameters
X – array-like with the instances to classify
- Returns
np.ndarray of shape (n_instances,) with label predictions
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- class quapy.method.aggregative.EMQ(learner: sklearn.base.BaseEstimator)¶
Bases:
quapy.method.aggregative.AggregativeProbabilisticQuantifier
Expectation Maximization for Quantification (EMQ), aka Saerens-Latinne-Decaestecker (SLD) algorithm. EMQ consists of using the well-known Expectation Maximization algorithm to iteratively update the posterior probabilities generated by a probabilistic classifier and the class prevalence estimates obtained via maximum-likelihood estimation, in a mutually recursive way, until convergence.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
- classmethod EM(tr_prev, posterior_probabilities, epsilon=0.0001)¶
Computes the Expectation Maximization routine.
- Parameters
tr_prev – array-like, the training prevalence
posterior_probabilities – np.ndarray of shape (n_instances, n_classes,) with the posterior probabilities
epsilon – float, the threshold on the difference between two consecutive iterations below which the loop stops
- Returns
a tuple with the estimated prevalence values (shape (n_classes,)) and the corrected posterior probabilities (shape (n_instances, n_classes,))
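The mutual update of posteriors and prevalences can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's routine: the name `em_sketch`, the use of the mean absolute difference between consecutive prevalence vectors as the stopping criterion, and the `max_iter` cap are choices made for the example.

```python
import numpy as np

def em_sketch(tr_prev, posterior_probabilities, epsilon=1e-4, max_iter=1000):
    """Hypothetical sketch of the EM loop; returns (prevalences, posteriors)."""
    tr_prev = np.asarray(tr_prev, dtype=float)
    posteriors = np.asarray(posterior_probabilities, dtype=float)
    prev = tr_prev.copy()
    corrected = posteriors
    for _ in range(max_iter):
        # E-step: rescale posteriors by the ratio of current to training priors
        unnormalized = posteriors * (prev / tr_prev)
        corrected = unnormalized / unnormalized.sum(axis=1, keepdims=True)
        # M-step: the new prevalence is the mean of the corrected posteriors
        new_prev = corrected.mean(axis=0)
        converged = np.abs(new_prev - prev).mean() < epsilon
        prev = new_prev
        if converged:
            break
    return prev, corrected

# Toy sample: 6 instances leaning towards class 1, 4 leaning towards class 0
toy_posteriors = np.array([[0.1, 0.9]] * 6 + [[0.9, 0.1]] * 4)
prev, corrected = em_sketch([0.5, 0.5], toy_posteriors)
```

The returned tuple has the same layout as described above: prevalence values of shape (n_classes,) and corrected posteriors of shape (n_instances, n_classes).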
- EPSILON = 0.0001¶
- MAX_ITER = 1000¶
- aggregate(classif_posteriors, epsilon=0.0001)¶
Implements the aggregation of label predictions.
- Parameters
classif_posteriors – np.ndarray of posterior probabilities
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- predict_proba(instances, epsilon=0.0001)¶
- quapy.method.aggregative.ExpectationMaximizationQuantifier¶
alias of
quapy.method.aggregative.EMQ
- quapy.method.aggregative.ExplicitLossMinimisation¶
alias of
quapy.method.aggregative.ELM
- class quapy.method.aggregative.HDy(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.AggregativeProbabilisticQuantifier, quapy.method.base.BinaryQuantifier
Hellinger Distance y (HDy). HDy is a probabilistic method for training binary quantifiers, that models quantification as the problem of minimizing the divergence (in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier. One of the distributions is generated from the unlabelled examples and the other is generated from a validation set. This latter distribution is defined as a mixture of the class-conditional distributions of the posterior probabilities returned for the positive and negative validation examples, respectively. The parameters of the mixture thus represent the estimates of the class prevalence values.
- Parameters
learner – a sklearn’s Estimator that generates a binary classifier
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a
quapy.data.base.LabelledCollection
(the split itself).
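The mixture search behind HDy can be sketched as below. This is a simplified illustration, not the library's implementation: it uses plain normalized histograms with a single bin count and a grid of 101 candidate prevalences, and the names `hellinger` and `hdy_sketch` are hypothetical.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger-style distance between two discrete distributions."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy_sketch(pos_val_posteriors, neg_val_posteriors, test_posteriors, n_bins=10):
    """Hypothetical sketch: pick the mixture weight closest to the test histogram."""
    def histogram(values):
        counts, _ = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
        return counts / counts.sum()

    pos_hist = histogram(pos_val_posteriors)
    neg_hist = histogram(neg_val_posteriors)
    test_hist = histogram(test_posteriors)
    candidates = np.linspace(0.0, 1.0, 101)
    distances = [
        hellinger(a * pos_hist + (1.0 - a) * neg_hist, test_hist)
        for a in candidates
    ]
    pos_prev = candidates[int(np.argmin(distances))]
    return np.array([1.0 - pos_prev, pos_prev])

# Posteriors for the positive class on positive and negative validation examples
pos_val = np.array([0.91, 0.82, 0.85, 0.72, 0.95, 0.63, 0.75, 0.88, 0.55, 0.93])
neg_val = np.array([0.11, 0.22, 0.15, 0.31, 0.05, 0.42, 0.25, 0.12, 0.45, 0.08])
# Toy test sample built as an exact 30% / 70% mixture of the two groups
test = np.concatenate([np.tile(pos_val, 3), np.tile(neg_val, 7)])
estimate = hdy_sketch(pos_val, neg_val, test)
```

Because the toy test sample is an exact mixture of the validation posteriors, the distance vanishes at a positive prevalence of 0.3; on real data the minimum is only approximate.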
- aggregate(classif_posteriors)¶
Implements the aggregation of label predictions.
- Parameters
classif_posteriors – np.ndarray of posterior probabilities
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True, val_split: Optional[Union[float, quapy.data.base.LabelledCollection]] = None)¶
Trains a HDy quantifier.
- Parameters
data – the training set
fit_learner – set to False to bypass the training (the learner is assumed to be already fit)
val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a
quapy.data.base.LabelledCollection
indicating the validation set itself
- Returns
self
- quapy.method.aggregative.HellingerDistanceY¶
alias of
quapy.method.aggregative.HDy
- class quapy.method.aggregative.MAX(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.ThresholdOptimization
Threshold Optimization variant for ACC, as proposed by Forman 2006 and Forman 2008, that looks for the threshold that maximizes tpr-fpr. The goal is to bring improved stability to the denominator of the adjustment.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
- class quapy.method.aggregative.MS(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.ThresholdOptimization
Median Sweep. Threshold Optimization variant for ACC, as proposed by Forman 2006 and Forman 2008, that generates class prevalence estimates for all decision thresholds and returns the median of them all. The goal is to bring improved stability to the denominator of the adjustment.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
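The idea of sweeping over thresholds and taking the median can be sketched as follows. This is a simplified illustration with hypothetical names (`median_sweep_sketch`, the score arrays, the threshold grid); skipping thresholds with tpr-fpr <= 0 and clipping each estimate to [0, 1] are assumptions made for the example.

```python
import numpy as np

def median_sweep_sketch(val_scores, val_labels, test_scores, thresholds):
    """Hypothetical sketch: adjusted estimate per threshold, then the median."""
    estimates = []
    for t in thresholds:
        val_pred = val_scores >= t
        tpr = val_pred[val_labels == 1].mean()
        fpr = val_pred[val_labels == 0].mean()
        if tpr - fpr <= 0:
            continue  # assumption: degenerate thresholds are ignored
        counted = (test_scores >= t).mean()          # classify-and-count estimate
        adjusted = (counted - fpr) / (tpr - fpr)     # binary adjustment
        estimates.append(np.clip(adjusted, 0.0, 1.0))
    pos_prev = float(np.median(estimates))
    return np.array([1.0 - pos_prev, pos_prev])

pos_scores = np.array([0.42, 0.55, 0.72, 0.91])
neg_scores = np.array([0.11, 0.22, 0.35, 0.45])
val_scores = np.concatenate([pos_scores, neg_scores])
val_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Toy test sample with 30% positives, built from the same score values
test_scores = np.concatenate([np.tile(pos_scores, 3), np.tile(neg_scores, 7)])
estimate = median_sweep_sketch(
    val_scores, val_labels, test_scores, thresholds=np.linspace(0.1, 0.9, 9)
)
```

The sketch assumes at least one threshold has tpr > fpr; otherwise the list of estimates is empty and the median is undefined.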
- class quapy.method.aggregative.MS2(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.MS
Median Sweep 2. Threshold Optimization variant for ACC, as proposed by Forman 2006 and Forman 2008, that generates class prevalence estimates for all decision thresholds and returns the median of those estimates for cases in which tpr-fpr>0.25. The goal is to bring improved stability to the denominator of the adjustment.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
- quapy.method.aggregative.MedianSweep¶
alias of
quapy.method.aggregative.MS
- quapy.method.aggregative.MedianSweep2¶
alias of
quapy.method.aggregative.MS2
- class quapy.method.aggregative.OneVsAll(binary_quantifier, n_jobs=-1)¶
Bases:
quapy.method.aggregative.AggregativeQuantifier
Allows any binary quantifier to perform quantification on single-label datasets. The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the class prevalences sum up to 1. This variant was used, along with the EMQ quantifier, in Gao and Sebastiani, 2016.
- Parameters
binary_quantifier – the binary quantifier to be used for each class
n_jobs – number of parallel workers
- aggregate(classif_predictions_bin)¶
Implements the aggregation of label predictions.
- Parameters
classif_predictions – np.ndarray of label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- property binary¶
Informs that the quantifier is not binary
- Returns
False
- property classes_¶
Class labels, in the same order in which class prevalence values are to be computed. This default implementation actually returns the class labels of the learner.
- Returns
array-like
- classify(instances)¶
Returns a matrix of shape (n,m,) with n the number of instances and m the number of classes. The entry (i,j) is a binary value indicating whether instance i belongs to class j. The binary classifications are independent of each other, meaning that an instance can end up being attributed to 0, 1, or more classes.
- Parameters
instances – array-like
- Returns
np.ndarray
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- get_params(deep=True)¶
Return the current parameters of the quantifier.
- Parameters
deep – for compatibility with sklearn
- Returns
a dictionary of param-value pairs
- posterior_probabilities(instances)¶
Returns a matrix of shape (n,m,2) with n the number of instances and m the number of classes. The entry (i,j,1) (resp. (i,j,0)) is a value in [0,1] indicating the posterior probability that instance i belongs (resp. does not belong) to class j. The posterior probabilities are independent of each other, meaning that, in general, they do not sum up to one.
- Parameters
instances – array-like
- Returns
np.ndarray
- property probabilistic¶
Indicates if the classifier is probabilistic or not (depending on the nature of the base classifier).
- Returns
boolean
- quantify(X)¶
Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated by the classifier.
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
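Since each binary quantifier estimates the prevalence of its own class independently, the raw outputs need not sum to 1, which is why the l1-normalization is applied. A minimal sketch, assuming a hypothetical helper `one_vs_all_normalize` and an illustrative uniform fallback when all estimates are zero:

```python
import numpy as np

def one_vs_all_normalize(positive_prevalences):
    """Hypothetical sketch: l1-normalize per-class positive prevalence estimates."""
    estimates = np.asarray(positive_prevalences, dtype=float)
    total = estimates.sum()
    if total == 0:
        # Assumption: fall back to a uniform distribution when nothing is detected
        return np.full(len(estimates), 1.0 / len(estimates))
    return estimates / total

# One positive-class estimate per binary quantifier (three classes here)
prevalences = one_vs_all_normalize([0.5, 0.25, 0.5])
```

The three independent estimates add up to 1.25, and dividing by that total yields a proper distribution over the classes.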
- set_params(**parameters)¶
Set the parameters of the quantifier.
- Parameters
parameters – dictionary of param-value pairs
- class quapy.method.aggregative.PACC(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.AggregativeProbabilisticQuantifier
Probabilistic Adjusted Classify & Count, the probabilistic variant of ACC that relies on the posterior probabilities returned by a probabilistic classifier.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
- aggregate(classif_posteriors)¶
Implements the aggregation of label predictions.
- Parameters
classif_posteriors – np.ndarray of posterior probabilities
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
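One way to picture the probabilistic adjustment is to replace hard counts with averaged posteriors, both for the validation-based rate matrix and for the test estimate. The sketch below is an assumption-laden illustration (the name `pacc_sketch`, integer labels, and the clip-and-renormalize step are not taken from the library).

```python
import numpy as np

def pacc_sketch(val_posteriors, val_labels, test_posteriors):
    """Hypothetical sketch: adjust mean test posteriors with 'soft' rates."""
    n_classes = val_posteriors.shape[1]
    soft_rates = np.zeros((n_classes, n_classes))
    for j in range(n_classes):
        # Column j: average posterior vector of validation instances of class j
        soft_rates[:, j] = val_posteriors[val_labels == j].mean(axis=0)
    unadjusted = test_posteriors.mean(axis=0)
    adjusted = np.linalg.solve(soft_rates, unadjusted)
    # Assumption: force the solution back onto the probability simplex
    adjusted = np.clip(adjusted, 0.0, 1.0)
    return adjusted / adjusted.sum()

val_posteriors = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
val_labels = np.array([0, 0, 1, 1])
# Toy test sample: 25% class 0 and 75% class 1, reusing the validation posteriors
test_posteriors = np.vstack([val_posteriors[:2], np.tile(val_posteriors[2:], (3, 1))])
estimate = pacc_sketch(val_posteriors, val_labels, test_posteriors)
```

Here the unadjusted mean posterior is [0.325, 0.675], and the adjustment moves it to the prevalences used to build the toy sample.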
- classify(data)¶
Provides the label predictions for the given instances.
- Parameters
data – array-like with the instances to classify
- Returns
np.ndarray of shape (n_instances,) with label predictions
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True, val_split: Optional[Union[float, int, quapy.data.base.LabelledCollection]] = None)¶
Trains a PACC quantifier.
- Parameters
data – the training set
fit_learner – set to False to bypass the training (the learner is assumed to be already fit)
val_split – either a float in (0,1) indicating the proportion of training instances to use for validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection indicating the validation set itself, or an int indicating the number k of folds to be used in k-fold cross validation to estimate the parameters
- Returns
self
- class quapy.method.aggregative.PCC(learner: sklearn.base.BaseEstimator)¶
Bases:
quapy.method.aggregative.AggregativeProbabilisticQuantifier
Probabilistic Classify & Count, the probabilistic variant of CC that relies on the posterior probabilities returned by a probabilistic classifier.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
- aggregate(classif_posteriors)¶
Implements the aggregation of label predictions.
- Parameters
classif_posteriors – np.ndarray of posterior probabilities
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
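The difference between counting hard labels and aggregating posteriors is easy to see on a toy matrix. The sketch below assumes, for illustration, that the probabilistic aggregation is the column-wise mean of the posteriors; the array and variable names are hypothetical.

```python
import numpy as np

# Posterior probabilities for 4 instances and 2 classes (toy values)
posteriors = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.55, 0.45],
    [0.20, 0.80],
])

# Probabilistic aggregation: average the posteriors per class
pcc_prevalences = posteriors.mean(axis=0)

# Hard counting, for comparison: take the most likely label and count
hard_labels = posteriors.argmax(axis=1)
cc_prevalences = np.bincount(hard_labels, minlength=posteriors.shape[1]) / len(hard_labels)
```

The two borderline instances count fully towards class 0 in the hard version, while the probabilistic version only credits class 0 with their posterior mass.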
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- quapy.method.aggregative.ProbabilisticAdjustedClassifyAndCount¶
alias of
quapy.method.aggregative.PACC
- quapy.method.aggregative.ProbabilisticClassifyAndCount¶
alias of
quapy.method.aggregative.PCC
- quapy.method.aggregative.SLD¶
alias of
quapy.method.aggregative.EMQ
- class quapy.method.aggregative.SVMAE(svmperf_base=None, **kwargs)¶
Bases:
quapy.method.aggregative.ELM
SVM(AE), which attempts to minimize Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:
>>> ELM(svmperf_base, loss='mae', **kwargs)
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
kwargs – rest of SVM perf’s parameters
- class quapy.method.aggregative.SVMKLD(svmperf_base=None, **kwargs)¶
Bases:
quapy.method.aggregative.ELM
SVM(KLD), which attempts to minimize the Kullback-Leibler Divergence as proposed by Esuli et al. 2015. Equivalent to:
>>> ELM(svmperf_base, loss='kld', **kwargs)
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
kwargs – rest of SVM perf’s parameters
- class quapy.method.aggregative.SVMNKLD(svmperf_base=None, **kwargs)¶
Bases:
quapy.method.aggregative.ELM
SVM(NKLD), which attempts to minimize a version of the Kullback-Leibler Divergence normalized via the logistic function, as proposed by Esuli et al. 2015. Equivalent to:
>>> ELM(svmperf_base, loss='nkld', **kwargs)
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
kwargs – rest of SVM perf’s parameters
- class quapy.method.aggregative.SVMQ(svmperf_base=None, **kwargs)¶
Bases:
quapy.method.aggregative.ELM
SVM(Q), which attempts to minimize the Q loss combining a classification-oriented loss and a quantification-oriented loss, as proposed by Barranquero et al. 2015. Equivalent to:
>>> ELM(svmperf_base, loss='q', **kwargs)
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
kwargs – rest of SVM perf’s parameters
- class quapy.method.aggregative.SVMRAE(svmperf_base=None, **kwargs)¶
Bases:
quapy.method.aggregative.ELM
SVM(RAE), which attempts to minimize Relative Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:
>>> ELM(svmperf_base, loss='mrae', **kwargs)
- Parameters
svmperf_base – path to the folder containing the binary files of SVM perf
kwargs – rest of SVM perf’s parameters
- class quapy.method.aggregative.T50(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.ThresholdOptimization
Threshold Optimization variant for ACC, as proposed by Forman 2006 and Forman 2008, that looks for the threshold that makes tpr closest to 0.5. The goal is to bring improved stability to the denominator of the adjustment.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
- class quapy.method.aggregative.ThresholdOptimization(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.AggregativeQuantifier, quapy.method.base.BinaryQuantifier
Abstract class of Threshold Optimization variants for ACC, as proposed by Forman 2006 and Forman 2008. The goal is to bring improved stability to the denominator of the adjustment. The different variants are based on different heuristics for choosing a decision threshold that would allow for more true positives and many more false positives, on the grounds this would deliver larger denominators.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
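The heuristics used by the variants documented in this module (maximizing tpr-fpr, tpr closest to 0.5, tpr closest to 1-fpr) can be compared on a toy validation set. The sketch below is illustrative only; `candidate_rates`, `pick_threshold`, the policy strings and the threshold grid are hypothetical names and values, not the library's API.

```python
import numpy as np

def candidate_rates(val_scores, val_labels, thresholds):
    """Hypothetical helper: (threshold, tpr, fpr) for each candidate threshold."""
    rates = []
    for t in thresholds:
        predicted_positive = val_scores >= t
        tpr = predicted_positive[val_labels == 1].mean()
        fpr = predicted_positive[val_labels == 0].mean()
        rates.append((t, tpr, fpr))
    return rates

def pick_threshold(rates, policy):
    """Hypothetical helper: choose a (threshold, tpr, fpr) triple by heuristic."""
    if policy == "max":    # largest denominator tpr - fpr
        key = lambda r: -(r[1] - r[2])
    elif policy == "t50":  # tpr closest to 0.5
        key = lambda r: abs(r[1] - 0.5)
    elif policy == "x":    # tpr closest to 1 - fpr
        key = lambda r: abs(r[1] - (1.0 - r[2]))
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min(rates, key=key)

val_scores = np.array([0.10, 0.20, 0.35, 0.45, 0.40, 0.55, 0.70, 0.90])
val_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rates = candidate_rates(val_scores, val_labels, thresholds=[0.3, 0.5, 0.65, 0.8])

best_max = pick_threshold(rates, "max")
best_t50 = pick_threshold(rates, "t50")
best_x = pick_threshold(rates, "x")
```

Whatever the heuristic, the selected tpr and fpr then define the denominator tpr - fpr of the binary adjustment, which is the quantity these variants try to keep away from zero.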
- aggregate(classif_predictions)¶
Implements the aggregation of label predictions.
- Parameters
classif_predictions – np.ndarray of label predictions
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True, val_split: Optional[Union[float, int, quapy.data.base.LabelledCollection]] = None)¶
Trains the aggregative quantifier
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
fit_learner – whether or not to train the learner (default is True). Set to False if the learner has been trained outside the quantifier.
- Returns
self
- class quapy.method.aggregative.X(learner: sklearn.base.BaseEstimator, val_split=0.4)¶
Bases:
quapy.method.aggregative.ThresholdOptimization
Threshold Optimization variant for ACC, as proposed by Forman 2006 and Forman 2008, that looks for the threshold that yields tpr=1-fpr. The goal is to bring improved stability to the denominator of the adjustment.
- Parameters
learner – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the misclassification rates are to be estimated. This parameter can be indicated as a real value (between 0 and 1, default 0.4), representing a proportion of validation data, or as an integer, indicating that the misclassification rates should be estimated via k-fold cross validation (this integer stands for the number of folds k), or as a
quapy.data.base.LabelledCollection
(the split itself).
quapy.method.base module¶
- class quapy.method.base.BaseQuantifier¶
Bases:
object
Abstract Quantifier. A quantifier is defined as an object of a class that implements the method fit() on quapy.data.base.LabelledCollection, the method quantify(), and the set_params() and get_params() methods for model selection (see quapy.model_selection.GridSearchQ()).
- property aggregative¶
Indicates whether the quantifier is of type aggregative or not
- Returns
False (to be overridden)
- property binary¶
Indicates whether the quantifier is binary or not.
- Returns
False (to be overridden)
- abstract property classes_¶
Class labels, in the same order in which class prevalence values are to be computed.
- Returns
array-like
- abstract fit(data: quapy.data.base.LabelledCollection)¶
Trains a quantifier.
- Parameters
data – a quapy.data.base.LabelledCollection consisting of the training data
- Returns
self
- abstract get_params(deep=True)¶
Return the current parameters of the quantifier.
- Parameters
deep – for compatibility with sklearn
- Returns
a dictionary of param-value pairs
- property n_classes¶
Returns the number of classes
- Returns
integer
- property probabilistic¶
Indicates whether the quantifier is of type probabilistic or not
- Returns
False (to be overridden)
- abstract quantify(instances)¶
Generate class prevalence estimates for the sample’s instances
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- abstract set_params(**parameters)¶
Set the parameters of the quantifier.
- Parameters
parameters – dictionary of param-value pairs
- class quapy.method.base.BinaryQuantifier¶
Bases:
quapy.method.base.BaseQuantifier
Abstract class of binary quantifiers, i.e., quantifiers estimating class prevalence values for only two classes (typically, to be interpreted as one class and its complement).
- property binary¶
Informs that the quantifier is binary
- Returns
True
- quapy.method.base.isaggregative(model: quapy.method.base.BaseQuantifier)¶
Alias for property aggregative
- Parameters
model – the model
- Returns
True if the model is aggregative, False otherwise
- quapy.method.base.isbinary(model: quapy.method.base.BaseQuantifier)¶
Alias for property binary
- Parameters
model – the model
- Returns
True if the model is binary, False otherwise
- quapy.method.base.isprobabilistic(model: quapy.method.base.BaseQuantifier)¶
Alias for property probabilistic
- Parameters
model – the model
- Returns
True if the model is probabilistic, False otherwise
quapy.method.meta module¶
- quapy.method.meta.EACC(learner, param_grid=None, optim=None, param_mod_sel=None, **kwargs)¶
Implements an ensemble of quapy.method.aggregative.ACC quantifiers, as used by Pérez-Gállego et al., 2019.
Equivalent to:
>>> ensembleFactory(learner, ACC, param_grid, optim, param_mod_sel, **kwargs)
See ensembleFactory() for further details.
- Parameters
learner – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyword argument to pass to quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
- Returns
an instance of Ensemble
- quapy.method.meta.ECC(learner, param_grid=None, optim=None, param_mod_sel=None, **kwargs)¶
Implements an ensemble of quapy.method.aggregative.CC quantifiers, as used by Pérez-Gállego et al., 2019.
Equivalent to:
>>> ensembleFactory(learner, CC, param_grid, optim, param_mod_sel, **kwargs)
See ensembleFactory() for further details.
- Parameters
learner – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyword argument to pass to quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
- Returns
an instance of Ensemble
- quapy.method.meta.EEMQ(learner, param_grid=None, optim=None, param_mod_sel=None, **kwargs)¶
Implements an ensemble of
quapy.method.aggregative.EMQ
quantifiers.Equivalent to:
>>> ensembleFactory(learner, EMQ, param_grid, optim, param_mod_sel, **kwargs)
See
ensembleFactory()
for further details.
- Parameters
learner – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyworded argument to pass to
quapy.model_selection.GridSearchQ
kwargs – kwargs for the class
Ensemble
- Returns
an instance of
Ensemble
- quapy.method.meta.EHDy(learner, param_grid=None, optim=None, param_mod_sel=None, **kwargs)¶
Implements an ensemble of
quapy.method.aggregative.HDy
quantifiers, as used by Pérez-Gállego et al., 2019. Equivalent to:
>>> ensembleFactory(learner, HDy, param_grid, optim, param_mod_sel, **kwargs)
See
ensembleFactory()
for further details.
- Parameters
learner – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyworded argument to pass to
quapy.model_selection.GridSearchQ
kwargs – kwargs for the class
Ensemble
- Returns
an instance of
Ensemble
- quapy.method.meta.EPACC(learner, param_grid=None, optim=None, param_mod_sel=None, **kwargs)¶
Implements an ensemble of
quapy.method.aggregative.PACC
quantifiers. Equivalent to:
>>> ensembleFactory(learner, PACC, param_grid, optim, param_mod_sel, **kwargs)
See
ensembleFactory()
for further details.
- Parameters
learner – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyworded argument to pass to
quapy.model_selection.GridSearchQ
kwargs – kwargs for the class
Ensemble
- Returns
an instance of
Ensemble
- class quapy.method.meta.Ensemble(quantifier: quapy.method.base.BaseQuantifier, size=50, red_size=25, min_pos=5, policy='ave', max_sample_size=None, val_split: Optional[Union[float, quapy.data.base.LabelledCollection]] = None, n_jobs=1, verbose=False)¶
Bases:
quapy.method.base.BaseQuantifier
- VALID_POLICIES = {'ave', 'ds', 'mae', 'mkld', 'mnkld', 'mrae', 'mse', 'ptr'}¶
Implementation of the Ensemble methods for quantification described by Pérez-Gállego et al., 2017 and Pérez-Gállego et al., 2019. The policies implemented include:
Average (policy='ave'): computes class prevalence estimates as the average of the estimates returned by the base quantifiers.
Training Prevalence (policy='ptr'): applies a dynamic selection to the ensemble’s members by retaining only those members such that the class prevalence values in the samples they use as training set are closest to preliminary class prevalence estimates computed as the average of the estimates of all the members. The final estimate is recomputed by considering only the selected members.
Distribution Similarity (policy='ds'): performs a dynamic selection of base members by retaining the members trained on samples whose distribution of posterior probabilities is closest, in terms of the Hellinger Distance, to the distribution of posterior probabilities in the test sample.
Accuracy (policy='<valid error name>'): performs a static selection of the ensemble members by retaining those that minimize a quantification error measure, which is passed as an argument.
Example:
>>> model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
- Parameters
quantifier – base quantification member of the ensemble
size – number of members
red_size – number of members to retain after selection (depending on the policy)
min_pos – minimum number of positive instances to consider a sample as valid
policy – the selection policy; available policies include: ave (default), ptr, ds, and accuracy (which is instantiated via a valid error name, e.g., mae)
max_sample_size – maximum number of instances to consider in the samples (set to None to indicate no limit, default)
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation split, or a
quapy.data.base.LabelledCollection
(the split itself).
n_jobs – number of parallel workers (default 1)
verbose – set to True (default is False) to get some information in standard output
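The `ave` and `ptr` policies described above can be illustrated without the library. The following is a minimal sketch in plain Python, not QuaPy's implementation: the helper names `average_policy` and `ptr_policy` are hypothetical, and using the mean squared difference to decide which training prevalences are "closest" is an assumption made for illustration.

```python
def average_policy(member_estimates):
    """Average the prevalence vectors returned by all ensemble members."""
    n_members = len(member_estimates)
    n_classes = len(member_estimates[0])
    return [sum(est[c] for est in member_estimates) / n_members
            for c in range(n_classes)]


def ptr_policy(member_estimates, training_prevalences, red_size):
    """Keep the red_size members whose training prevalence is closest to the
    preliminary (average) estimate, then average only their estimates."""
    preliminary = average_policy(member_estimates)

    def distance(prev):
        # assumed closeness measure: mean squared difference
        return sum((p - q) ** 2 for p, q in zip(prev, preliminary)) / len(prev)

    ranked = sorted(range(len(member_estimates)),
                    key=lambda i: distance(training_prevalences[i]))
    selected = ranked[:red_size]
    return average_policy([member_estimates[i] for i in selected])


# three hypothetical members of a binary ensemble
estimates = [[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]]    # estimates on the test sample
train_prevs = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]  # prevalence of each training sample

ave = average_policy(estimates)
ptr = ptr_policy(estimates, train_prevs, red_size=2)
```

In this toy case the third member, trained on a sample far from the preliminary estimate, is discarded by the `ptr` sketch, so the final estimate moves away from the plain average.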
- property aggregative¶
Indicates that the quantifier is not aggregative.
- Returns
False
- property binary¶
Returns a boolean indicating whether the base quantifiers are binary or not
- Returns
boolean
- property classes_¶
Class labels, in the same order in which class prevalence values are to be computed.
- Returns
array-like
- fit(data: quapy.data.base.LabelledCollection, val_split: Optional[Union[float, quapy.data.base.LabelledCollection]] = None)¶
Trains a quantifier.
- Parameters
data – a
quapy.data.base.LabelledCollection
consisting of the training data
- Returns
self
- get_params(deep=True)¶
This function should not be used within
quapy.model_selection.GridSearchQ
(it is here only for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))), with Q a quantifier class that has a learner l optimized for classification (not recommended).
- Returns
raises an Exception
- property probabilistic¶
Indicates that the quantifier is not probabilistic.
- Returns
False
- quantify(instances)¶
Generate class prevalence estimates for the sample’s instances
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- set_params(**parameters)¶
This function should not be used within
quapy.model_selection.GridSearchQ
(it is here only for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))), with Q a quantifier class that has a learner l optimized for classification (not recommended).
- Parameters
parameters – dictionary
- Returns
raises an Exception
- quapy.method.meta.ensembleFactory(learner, base_quantifier_class, param_grid=None, optim=None, param_model_sel: Optional[dict] = None, **kwargs)¶
Ensemble factory. Provides a unified interface for instantiating ensembles that can be optimized (via model selection for quantification) for a given evaluation metric using
quapy.model_selection.GridSearchQ
. If the evaluation metric is classification-oriented (instead of quantification-oriented), then the optimization will be carried out via sklearn’s GridSearchCV.
Example to instantiate an
Ensemble
based on
quapy.method.aggregative.PACC
in which the base members are optimized for
quapy.error.mae()
via
quapy.model_selection.GridSearchQ
. The ensemble follows the policy Accuracy based on
quapy.error.mae()
(the same measure being optimized), meaning that a static selection of members of the ensemble is made based on their performance in terms of this error.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from quapy.method.aggregative import PACC
>>> from quapy.method.meta import ensembleFactory
>>>
>>> param_grid = {
>>>     'C': np.logspace(-3,3,7),
>>>     'class_weight': ['balanced', None]
>>> }
>>> param_mod_sel = {
>>>     'sample_size': 500,
>>>     'protocol': 'app'
>>> }
>>> common={
>>>     'max_sample_size': 1000,
>>>     'n_jobs': -1,
>>>     'param_grid': param_grid,
>>>     'param_model_sel': param_mod_sel,
>>> }
>>>
>>> ensembleFactory(LogisticRegression(), PACC, optim='mae', policy='mae', **common)
- Parameters
learner – sklearn’s Estimator that generates a classifier
base_quantifier_class – a class of quantifiers
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
quapy.model_selection.GridSearchQ
kwargs – kwargs for the class
Ensemble
- Returns
an instance of
Ensemble
- quapy.method.meta.get_probability_distribution(posterior_probabilities, bins=8)¶
Gets a histogram out of the posterior probabilities (only for the binary case).
- Parameters
posterior_probabilities – array-like of shape (n_instances, 2,)
bins – integer
- Returns
np.ndarray with the relative frequencies for each bin (for the positive class only)
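As an illustration of what such a histogram looks like, and of how two histograms could be compared with the Hellinger Distance mentioned for the `ds` policy, here is a minimal sketch in plain Python. It is not the library's code: the function names are hypothetical, equal-width bins over [0, 1] are assumed, the positive class is assumed to sit in column 1, and the Hellinger Distance is written in its common normalized form (values in [0, 1]).

```python
import math


def positive_class_histogram(posterior_probabilities, bins=8):
    """Relative frequency of the positive-class posterior in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for row in posterior_probabilities:
        p_pos = row[1]  # assumption: column 1 holds the positive class
        # a posterior of exactly 1.0 is placed in the last bin
        index = min(int(p_pos * bins), bins - 1)
        counts[index] += 1
    total = len(posterior_probabilities)
    return [c / total for c in counts]


def hellinger_distance(p, q):
    """Normalized Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


# four hypothetical instances with (negative, positive) posteriors
posteriors = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6], [0.0, 1.0]]
hist = positive_class_histogram(posteriors, bins=4)
```

A distance of 0 means the two histograms are identical, while 1 means they have no mass in common.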
quapy.method.neural module¶
- class quapy.method.neural.QuaNetModule(doc_embedding_size, n_classes, stats_size, lstm_hidden_size=64, lstm_nlayers=1, ff_layers=[1024, 512], bidirectional=True, qdrop_p=0.5, order_by=0)¶
Bases:
torch.nn.modules.module.Module
Implements the QuaNet forward pass. See
QuaNetTrainer
for training QuaNet.
- Parameters
doc_embedding_size – integer, the dimensionality of the document embeddings
n_classes – integer, number of classes
stats_size – integer, number of statistics estimated by simple quantification methods
lstm_hidden_size – integer, hidden dimensionality of the LSTM cell
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the quantification embedding
bidirectional – boolean, whether or not to use bidirectional LSTM
qdrop_p – float, dropout probability
order_by – integer, class for which the document embeddings are to be sorted
- property device¶
- forward(doc_embeddings, doc_posteriors, statistics)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class quapy.method.neural.QuaNetTrainer(learner, sample_size, n_epochs=100, tr_iter_per_poch=500, va_iter_per_poch=100, lr=0.001, lstm_hidden_size=64, lstm_nlayers=1, ff_layers=[1024, 512], bidirectional=True, qdrop_p=0.5, patience=10, checkpointdir='../checkpoint', checkpointname=None, device='cuda')¶
Bases:
quapy.method.base.BaseQuantifier
Implementation of QuaNet, a neural network for quantification. This implementation uses PyTorch and can take advantage of a GPU to speed up the training phase.
Example:
>>> import quapy as qp
>>> from quapy.method.meta import QuaNet
>>> from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
>>>
>>> # use samples of 100 elements
>>> qp.environ['SAMPLE_SIZE'] = 100
>>>
>>> # load the kindle dataset as text, and convert words to numerical indexes
>>> dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
>>> qp.data.preprocessing.index(dataset, min_df=5, inplace=True)
>>>
>>> # the text classifier is a CNN trained by NeuralClassifierTrainer
>>> cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
>>> learner = NeuralClassifierTrainer(cnn, device='cuda')
>>>
>>> # train QuaNet (QuaNet is an alias to QuaNetTrainer)
>>> model = QuaNet(learner, qp.environ['SAMPLE_SIZE'], device='cuda')
>>> model.fit(dataset.training)
>>> estim_prevalence = model.quantify(dataset.test.instances)
- Parameters
learner – an object implementing fit (i.e., that can be trained on labelled data), predict_proba (i.e., that can generate posterior probabilities of unlabelled examples) and transform (i.e., that can generate embedded representations of the unlabelled instances).
sample_size – integer, the sample size
n_epochs – integer, maximum number of training epochs
tr_iter_per_poch – integer, number of training iterations before considering an epoch complete
va_iter_per_poch – integer, number of validation iterations to perform after each epoch
lr – float, the learning rate
lstm_hidden_size – integer, hidden dimensionality of the LSTM cells
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the quantification embedding
bidirectional – boolean, indicates whether the LSTM is bidirectional or not
qdrop_p – float, dropout probability
patience – integer, number of epochs showing no improvement in the validation set before stopping the training phase (early stopping)
checkpointdir – string, a path where to store models’ checkpoints
checkpointname – string (optional), the name of the model’s checkpoint
device – string, indicate “cpu” or “cuda”
- property classes_¶
Class labels, in the same order in which class prevalence values are to be computed.
- Returns
array-like
- clean_checkpoint()¶
Removes the checkpoint
- clean_checkpoint_dir()¶
Removes anything contained in the checkpoint directory
- fit(data: quapy.data.base.LabelledCollection, fit_learner=True)¶
Trains QuaNet.
- Parameters
data – the training data on which to train QuaNet. If fit_learner=True, the data will be split in 40/40/20 for training the classifier, training QuaNet, and validating QuaNet, respectively. If fit_learner=False, the data will be split in 66/34 for training QuaNet and validating it, respectively.
fit_learner – if True, trains the classifier on a split containing 40% of the data
- Returns
self
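The split proportions described for `fit` can be made concrete with a small sketch. This is only an illustration in plain Python, not how the library partitions the data: the function name `quanet_style_split` is hypothetical, and the split here is a simple random one (no stratification is attempted).

```python
import random


def quanet_style_split(n_instances, fit_learner=True, seed=0):
    """Return index lists following the proportions described for QuaNetTrainer.fit."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    if fit_learner:
        # 40% classifier training, 40% QuaNet training, 20% QuaNet validation
        n_classifier = round(0.40 * n_instances)
        n_quanet = round(0.40 * n_instances)
        return {
            'classifier_train': indices[:n_classifier],
            'quanet_train': indices[n_classifier:n_classifier + n_quanet],
            'quanet_valid': indices[n_classifier + n_quanet:],
        }
    # classifier already fit: 66% QuaNet training, 34% QuaNet validation
    n_quanet = round(0.66 * n_instances)
    return {
        'quanet_train': indices[:n_quanet],
        'quanet_valid': indices[n_quanet:],
    }


with_learner = quanet_style_split(1000, fit_learner=True)
without_learner = quanet_style_split(1000, fit_learner=False)
```

With 1000 training instances, fitting the learner leaves 400 instances for QuaNet itself, whereas reusing an already fitted learner leaves 660.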
- get_params(deep=True)¶
Return the current parameters of the quantifier.
- Parameters
deep – for compatibility with sklearn
- Returns
a dictionary of param-value pairs
- quantify(instances)¶
Generate class prevalence estimates for the sample’s instances
- Parameters
instances – array-like
- Returns
np.ndarray of shape (self.n_classes_,) with class prevalence estimates.
- set_params(**parameters)¶
Set the parameters of the quantifier.
- Parameters
parameters – dictionary of param-value pairs
- quapy.method.neural.mae_loss(output, target)¶
Torch-like wrapper for the Mean Absolute Error
- Parameters
output – predictions
target – ground truth values
- Returns
mean absolute error loss
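For reference, the quantity being wrapped can be sketched in plain Python. This is not the torch-based function documented above, only the underlying arithmetic; the name `mean_absolute_error` and the example vectors are made up for illustration.

```python
def mean_absolute_error(output, target):
    """Mean absolute difference between predicted and true prevalence vectors."""
    return sum(abs(o - t) for o, t in zip(output, target)) / len(target)


# hypothetical predicted vs. true prevalence values for three classes
error = mean_absolute_error([0.2, 0.3, 0.5], [0.1, 0.4, 0.5])
```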
quapy.method.non_aggregative module¶
- class quapy.method.non_aggregative.MaximumLikelihoodPrevalenceEstimation¶
Bases:
quapy.method.base.BaseQuantifier
The Maximum Likelihood Prevalence Estimation (MLPE) method is a lazy method that assumes there is no prior probability shift between training and test instances (put another way, that the i.i.d. assumption holds). The estimation of class prevalence values for any test sample is always (i.e., irrespective of the test sample itself) the class prevalence seen during training. This method is considered to be a lower-bound quantifier that any quantification method should beat.
- property classes_¶
Number of classes
- Returns
integer
- fit(data: quapy.data.base.LabelledCollection)¶
Computes the training prevalence and stores it.
- Parameters
data – the training sample
- Returns
self
- get_params(deep=True)¶
Does nothing, since this learner has no parameters.
- Parameters
deep – for compatibility with sklearn
- Returns
None
- quantify(instances)¶
Ignores the input instances and returns, as the class prevalence estimates, the training prevalence.
- Parameters
instances – array-like (ignored)
- Returns
the class prevalence seen during training
- set_params(**parameters)¶
Does nothing, since this learner has no parameters.
- Parameters
parameters – dictionary of param-value pairs (ignored)
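The behaviour of this baseline is simple enough to sketch in a few lines. The class below is a hypothetical stand-in written in plain Python, not the library class: it takes a plain list of labels instead of a `LabelledCollection`, and its attribute names are assumptions.

```python
from collections import Counter


class TrainingPrevalenceBaseline:
    """Hypothetical stand-in for an MLPE-style quantifier."""

    def fit(self, labels):
        # store the class prevalence observed in the training labels
        counts = Counter(labels)
        self.classes_ = sorted(counts)
        self.estimated_prevalence = [counts[c] / len(labels) for c in self.classes_]
        return self

    def quantify(self, instances):
        # the test instances are ignored by design
        return self.estimated_prevalence


baseline = TrainingPrevalenceBaseline().fit(['neg', 'pos', 'pos', 'neg', 'pos'])
estimate = baseline.quantify(['any', 'unlabelled', 'documents'])
```

Whatever is passed to `quantify`, the estimate is the training prevalence, which is why such a method only makes sense as a reference point for other quantifiers.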