# Quantification Methods

Quantification methods can be categorized as belonging to the _aggregative_ and _non-aggregative_ groups. Most methods included in QuaPy at the moment are _aggregative_ (though we plan to add many more methods in the near future), i.e., methods in which quantification is performed as an aggregation function of the individual outputs of a classifier.

Any quantifier in QuaPy should extend the class _BaseQuantifier_ and implement some abstract methods:

```python
@abstractmethod
def fit(self, data: LabelledCollection): ...

@abstractmethod
def quantify(self, instances): ...

@abstractmethod
def set_params(self, **parameters): ...

@abstractmethod
def get_params(self, deep=True): ...
```

The meaning of these functions should be familiar to anyone used to working with scikit-learn, since the class structure of QuaPy is directly inspired by scikit-learn's _Estimators_. Functions _fit_ and _quantify_ are used to train the model and to produce class prevalence estimations. (The reason why scikit-learn's structure has not been adopted _as is_ in QuaPy is that scikit-learn's _predict_ function is expected to return one output for each input element --e.g., a predicted label for each instance in a sample-- while in quantification the output for a sample is one single array of class prevalences.) Functions _set_params_ and _get_params_ allow a [model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection) to automate the process of hyperparameter search.

## Aggregative Methods

All quantification methods are implemented as part of the _qp.method_ package. In particular, _aggregative_ methods are defined in _qp.method.aggregative_, and extend _AggregativeQuantifier(BaseQuantifier)_. The methods that any _aggregative_ quantifier must implement are:

```python
@abstractmethod
def fit(self, data: LabelledCollection, fit_learner=True): ...

@abstractmethod
def aggregate(self, classif_predictions: np.ndarray): ...
```

since, as mentioned before, aggregative methods base their predictions on the individual predictions of a classifier. Indeed, a default implementation of _quantify_ is already provided by _AggregativeQuantifier_, which looks like:

```python
def quantify(self, instances):
    classif_predictions = self.preclassify(instances)
    return self.aggregate(classif_predictions)
```

Aggregative quantifiers are expected to maintain a classifier (accessed through the _@property_ _learner_). This classifier is given as input to the quantifier, and can either come already fit on external data (in which case the _fit_learner_ argument should be set to False) or be fit by the quantifier's own _fit_ method (the default).

Another class of _aggregative_ methods are the _probabilistic_ aggregative methods, which should inherit from the abstract class _AggregativeProbabilisticQuantifier(AggregativeQuantifier)_. The particularity of _probabilistic_ aggregative methods (w.r.t. non-probabilistic ones) is that the default _quantify_ is defined in terms of the posterior probabilities returned by a probabilistic classifier, rather than the crisp decisions of a hard classifier; i.e.:

```python
def quantify(self, instances):
    classif_posteriors = self.posterior_probabilities(instances)
    return self.aggregate(classif_posteriors)
```

One advantage of _aggregative_ methods (probabilistic or not) is that evaluation according to any sampling procedure (e.g., the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)) can be carried out very efficiently, since the entire set can be pre-classified once, and the quantification estimations for different samples can directly reuse these predictions, without having to classify each element every time. QuaPy leverages this property to speed up any procedure involving quantification over samples, as is customarily done in model selection or in evaluation.
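To illustrate what an _aggregate_ function does in the simplest case, here is a minimal, self-contained sketch of CC-style aggregation (the function name `cc_aggregate` is ours, not part of QuaPy's API): given the crisp label predictions for a sample, the estimated prevalences are simply the relative frequencies of each predicted class.

```python
import numpy as np

def cc_aggregate(classif_predictions, n_classes):
    # CC-style aggregation: the prevalence of each class is the
    # fraction of instances the classifier assigned to that class
    counts = np.bincount(classif_predictions, minlength=n_classes)
    return counts / counts.sum()

# e.g., 6 label predictions over 3 classes
preds = np.array([0, 1, 1, 2, 1, 0])
print(cc_aggregate(preds, 3))  # [0.333... 0.5 0.166...]
```

Any more sophisticated aggregative method replaces this counting step with a different aggregation (an adjustment, a mixture fit, an EM loop, etc.) while keeping the classify-then-aggregate decomposition.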
### The Classify & Count variants

QuaPy implements the four CC variants:

* _CC_ (Classify & Count), the simplest aggregative quantifier, which simply relies on the label predictions of a classifier to deliver class estimates.
* _ACC_ (Adjusted Classify & Count), the adjusted variant of CC.
* _PCC_ (Probabilistic Classify & Count), the probabilistic variant of CC, which relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier.
* _PACC_ (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC.

The following code serves as a complete example using CC equipped with an SVM as the classifier:

```python
import quapy as qp
import quapy.functional as F
from sklearn.svm import LinearSVC

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
training = dataset.training
test = dataset.test

# instantiate a classifier learner, in this case an SVM
svm = LinearSVC()

# instantiate a Classify & Count with the SVM
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
model = qp.method.aggregative.CC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```

The same code could be used to instantiate an ACC, by simply replacing the instantiation of the model with:

```python
model = qp.method.aggregative.ACC(svm)
```

Note that the adjusted variants (ACC and PACC) need to estimate some parameters for performing the adjustment (e.g., the _true positive rate_ and the _false positive rate_ in the binary case), which are estimated on a validation split of the labelled set. To this end, the __init__ method of ACC defines an additional parameter, _val_split_, which defaults to 0.4, meaning that 40% of the labelled data will be used for estimating the parameters that adjust the predictions.
This parameter can also be set to an integer, in which case the parameters are estimated by means of _k_-fold cross-validation, with the integer indicating the number _k_ of folds. Finally, _val_split_ can be set to a specific held-out validation set (i.e., an instance of _LabelledCollection_). The specification of _val_split_ can also be postponed to the invocation of the fit method (if _val_split_ was also set in the constructor, the one specified at fit time prevails), e.g.:

```python
model = qp.method.aggregative.ACC(svm)
# perform 5-fold cross validation for estimating ACC's parameters
# (overrides the default val_split=0.4 in the constructor)
model.fit(training, val_split=5)
```

The following code illustrates the case in which PCC is used:

```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.learner)
```

In this case, QuaPy will print:

```
The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated.
classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5)
```

The first output indicates that the learner (_LinearSVC_ in this case) is not a probabilistic classifier (i.e., it does not implement the _predict_proba_ method) and so it will be converted into a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html). As a result, the classifier printed in the second line is a _CalibratedClassifierCV_ instance. Note that calibration can only be applied to hard classifiers when _fit_learner=True_; an exception will be raised otherwise.

Lastly, everything we said about ACC and PCC applies to PACC as well.

### Expectation Maximization (EMQ)

The Expectation Maximization Quantifier (EMQ), also known as SLD, is available at _qp.method.aggregative.EMQ_ or via the alias _qp.method.aggregative.ExpectationMaximizationQuantifier_.
The method is described in:

_Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41._

EMQ works with a probabilistic classifier (if the classifier given as input is a hard one, calibration will be attempted). Although this method was originally proposed for improving the posterior probabilities of a probabilistic classifier, rather than for improving the estimation of prior probabilities, EMQ almost always ranks among the most effective quantifiers in the experiments we have carried out. An example of use can be found below:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

model = qp.method.aggregative.EMQ(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

### Hellinger Distance y (HDy)

The method HDy is described in:

_González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164._

It is implemented in _qp.method.aggregative.HDy_ (also accessible through the alias _qp.method.aggregative.HellingerDistanceY_). This method works with a probabilistic classifier (hard classifiers can be used as well and will be calibrated) and requires a validation set to estimate the parameters of the mixture model. Just like ACC and PACC, this quantifier receives a _val_split_ argument in the constructor (or in the fit method, in which case the previous value is overridden) that can either be a float indicating the proportion of training data to be taken as the validation set (in a random stratified split), or a validation set itself (i.e., an instance of _LabelledCollection_).
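For reference, HDy searches for the mixture parameter (the sought class prevalence) that minimizes the Hellinger distance between the histogram of posterior probabilities on the test sample and a mixture of the class-conditional validation histograms. The distance itself, for two discrete distributions, can be sketched as follows (an illustrative numpy rendering of the standard formula, not QuaPy's internal implementation):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions:
    # HD(p, q) = sqrt(1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2)
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0 (identical distributions)
```

The distance ranges from 0 (identical distributions) to 1 (disjoint supports), which makes it convenient as a matching criterion for the mixture search.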
HDy was proposed as a binary quantifier, and the implementation provided in QuaPy accepts only binary datasets. The following code shows an example of use:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# load a binary dataset
dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

model = qp.method.aggregative.HDy(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

### Explicit Loss Minimization

Explicit Loss Minimization (ELM) represents a family of methods based on structured output learning, i.e., quantifiers relying on classifiers that have been optimized targeting a quantification-oriented evaluation measure. In QuaPy, the following methods, all relying on Joachims' [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) implementation, are available in _qp.method.aggregative_:

* SVMQ (SVM-Q), a quantification method optimizing the metric _Q_ defined in _Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based on reliable classifiers. Pattern Recognition, 48(2):591–604._
* SVMKLD (SVM for Kullback-Leibler Divergence), proposed in _Esuli, A. and Sebastiani, F. (2015). Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27._
* SVMNKLD (SVM for Normalized Kullback-Leibler Divergence), proposed in the same paper.
* SVMAE (SVM for Mean Absolute Error)
* SVMRAE (SVM for Mean Relative Absolute Error)

The last two methods (SVMAE and SVMRAE) have been implemented in QuaPy in order to make available ELM variants for what are nowadays considered the best-behaved evaluation metrics in quantification.
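For intuition on these last two metrics, the absolute error between a true and an estimated prevalence vector, in its commonly used form, averages the per-class absolute differences (an illustrative sketch under that definition; QuaPy ships its own error functions in _qp.error_):

```python
import numpy as np

def absolute_error(true_prev, estim_prev):
    # mean absolute difference between true and estimated class prevalences
    true_prev, estim_prev = np.asarray(true_prev), np.asarray(estim_prev)
    return np.mean(np.abs(estim_prev - true_prev))

print(absolute_error([0.8, 0.2], [0.6, 0.4]))  # 0.2
```

SVMAE optimizes the classifier directly for this kind of loss (and SVMRAE for its relative counterpart), instead of optimizing a classification-oriented loss and hoping the prevalence estimates come out right.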
In order to make these models work, you would need to run the script _prepare_svmperf.sh_ (distributed along with QuaPy), which downloads the _SVMperf_ source code, applies a patch that implements the quantification-oriented losses, and compiles the sources.

If you want to add a custom loss, you would need to modify the source code of _SVMperf_ to implement it, and assign a valid loss code to it. Then you must re-compile the whole thing and instantiate the quantifier in QuaPy as follows:

```python
# you can either set the path to your custom svm_perf_quantification implementation
# in the environment variable, or as an argument to the constructor of ELM
qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification'

# assign an alias to your custom loss and the id you have assigned to it
svmperf = qp.classification.svmperf.SVMperf
svmperf.valid_losses['mycustomloss'] = 28

# instantiate the ELM method indicating the loss
model = qp.method.aggregative.ELM(loss='mycustomloss')
```

All ELM quantifiers are binary, since they rely on _SVMperf_, which currently supports only binary classification. ELM variants (and binary quantifiers in general) can be trivially extended to operate in single-label scenarios by adopting a "one-vs-all" strategy (as, e.g., in _Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining, 6(19):1–22_).
In QuaPy this is possible by using the _OneVsAll_ class:

```python
import quapy as qp
from quapy.method.aggregative import SVMQ, OneVsAll

# load a single-label dataset (this one contains 3 classes)
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

# let qp know where svmperf is
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

model = OneVsAll(SVMQ(), n_jobs=-1)  # run them in parallel
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

## Meta Models

By _meta_ models we mean quantification methods that are defined on top of other quantification methods, and that therefore do not squarely belong to either the aggregative or the non-aggregative group (indeed, _meta_ models could use quantifiers from any of those groups). _Meta_ models are implemented in the _qp.method.meta_ module.

### Ensembles

QuaPy implements (some of) the variants proposed in:

* _Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100._
* _Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15._

The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC) quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using _average_ as the aggregation policy (see the original articles for further details). The last parameter indicates to use all processors for parallelization.
```python
import quapy as qp
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_UCIDataset('haberman')

model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

Other aggregation policies implemented in QuaPy include:

* 'ptr', for applying a dynamic selection based on the training prevalence of the ensemble's members
* 'ds', for applying a dynamic selection based on the Hellinger Distance
* _any valid quantification measure_ (e.g., 'mse'), for performing a static selection based on the performance estimated for each member of the ensemble in terms of that evaluation metric.

When using any of the above options, it is important to set the _red_size_ parameter, which indicates the number of members to retain.

Please check the [model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection) wiki if you want to optimize the hyperparameters of the ensemble for classification or quantification.

### The QuaNet neural network

QuaPy offers an implementation of QuaNet, a deep learning model presented in:

_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). A recurrent neural network for sentiment quantification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 1775-1778)._

This model requires _torch_ to be installed. QuaNet also requires a classifier that can provide embedded representations of the inputs. In the original paper, QuaNet was tested using an LSTM as the base classifier.
In the following example, we show an instantiation of QuaNet that instead uses a CNN as the probabilistic classifier, taking its last-layer representation as the document embedding:

```python
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

# use samples of 100 elements
qp.environ['SAMPLE_SIZE'] = 100

# load the kindle dataset as text, and convert words to numerical indexes
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

# the text classifier is a CNN trained by NeuralClassifierTrainer
cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(cnn, device='cuda')

# train QuaNet
model = QuaNet(learner, qp.environ['SAMPLE_SIZE'], device='cuda')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```