# Quantification Methods

Quantification methods can be categorized as belonging to the _aggregative_ or the _non-aggregative_ group. Most methods included in QuaPy at the moment are of type _aggregative_ (though we plan to add many more methods in the near future), i.e., methods characterized by the fact that quantification is performed as an aggregation function of the individual classification predictions.

Any quantifier in QuaPy should extend the class _BaseQuantifier_ and implement the following abstract methods:
```python
@abstractmethod
def fit(self, data: LabelledCollection): ...

@abstractmethod
def quantify(self, instances): ...

@abstractmethod
def set_params(self, **parameters): ...

@abstractmethod
def get_params(self, deep=True): ...
```
The meaning of those functions should be familiar to those used to working with scikit-learn, since the class structure of QuaPy is directly inspired by scikit-learn's _Estimators_. Functions _fit_ and _quantify_ are used to train the model and to provide class estimations (the reason why scikit-learn's structure has not been adopted _as is_ in QuaPy is that scikit-learn's _predict_ function is expected to return one output for each input element --e.g., a predicted label for each instance in a sample-- while in quantification the output for a sample is one single array of class prevalences), while functions _set_params_ and _get_params_ allow a [model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection) to automate the process of hyperparameter search.
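
As an illustration, a toy (hypothetical, not part of QuaPy) quantifier that ignores the test sample and always returns the class prevalences observed during training could be sketched as follows, assuming _BaseQuantifier_ lives in _quapy.method.base_ and that _LabelledCollection_ exposes a _prevalence()_ method:

```python
from quapy.data import LabelledCollection
from quapy.method.base import BaseQuantifier


class TrainPrevalenceQuantifier(BaseQuantifier):
    """Toy quantifier: predicts, for any test sample, the prevalences seen at training time."""

    def fit(self, data: LabelledCollection):
        # memorize the class prevalences of the training set
        self.train_prevalence = data.prevalence()
        return self

    def quantify(self, instances):
        # ignore the instances and return the memorized prevalences
        return self.train_prevalence

    def set_params(self, **parameters):
        pass  # this toy quantifier has no hyperparameters

    def get_params(self, deep=True):
        return {}
```
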
## Aggregative Methods

All quantification methods are implemented as part of the _qp.method_ package. In particular, _aggregative_ methods are defined in _qp.method.aggregative_, and extend _AggregativeQuantifier(BaseQuantifier)_. The methods that any _aggregative_ quantifier must implement are:

```python
@abstractmethod
def fit(self, data: LabelledCollection, fit_learner=True): ...

@abstractmethod
def aggregate(self, classif_predictions: np.ndarray): ...
```

since, as mentioned before, aggregative methods base their predictions on the individual predictions of a classifier. Indeed, a default implementation of _quantify_ is already provided for aggregative quantifiers, which looks like:

```python
def quantify(self, instances):
    classif_predictions = self.preclassify(instances)
    return self.aggregate(classif_predictions)
```

Aggregative quantifiers are expected to maintain a classifier (which is accessed through the _@property_ _learner_). This classifier is given as input to the quantifier, and can either be already fit on external data (in which case the _fit_learner_ argument should be set to False) or be fit by the quantifier's own fit method (the default).
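
For example, a classifier that has already been fit on external data could be wrapped without retraining it, along these lines (a minimal sketch: _external_X_, _external_y_, and _training_ are placeholders, and CC, introduced below, serves as a concrete aggregative quantifier):

```python
import quapy as qp
from sklearn.svm import LinearSVC

# assume external_X, external_y constitute some external labelled data,
# and training is a LabelledCollection
svm = LinearSVC().fit(external_X, external_y)

# fit_learner=False tells the quantifier not to retrain the classifier;
# only the quantifier-specific parameters (if any) are learned
model = qp.method.aggregative.CC(svm)
model.fit(training, fit_learner=False)
```
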
Another class of _aggregative_ methods is that of the _probabilistic_ aggregative methods, which should inherit from the abstract class _AggregativeProbabilisticQuantifier(AggregativeQuantifier)_. The particularity of _probabilistic_ aggregative methods (w.r.t. non-probabilistic ones) is that the default _quantify_ method is defined in terms of the posterior probabilities returned by a probabilistic classifier, and not of the crisp decisions of a hard classifier; i.e.:

```python
def quantify(self, instances):
    classif_posteriors = self.posterior_probabilities(instances)
    return self.aggregate(classif_posteriors)
```

One advantage of _aggregative_ methods (whether probabilistic or not) is that evaluation according to any sampling procedure (e.g., the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)) can be carried out very efficiently, since the entire set can be pre-classified once, and the quantification estimations for different samples can directly reuse these predictions, without having to classify each element again every time. QuaPy leverages this property to speed up any procedure having to do with quantification over samples, as is customarily done in model selection or in evaluation.

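The following sketch illustrates the idea using only the methods named above (_preclassify_ and _aggregate_); _model_, _test_, and the sample-index generator _sample_indexes_ are placeholders:

```python
# classify the entire test set only once
classif_predictions = model.preclassify(test.instances)

# quantify many different samples by reusing the cached predictions;
# sample_indexes stands for any generator of index vectors (one per sample)
for idx in sample_indexes:
    estim_prevalence = model.aggregate(classif_predictions[idx])
```
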
### The Classify & Count variants

QuaPy implements the four CC variants, i.e.:

* _CC_ (Classify & Count), the simplest aggregative quantifier; one that simply relies on the label predictions of a classifier to deliver class estimates.
* _ACC_ (Adjusted Classify & Count), the adjusted variant of CC.
* _PCC_ (Probabilistic Classify & Count), the probabilistic variant of CC that relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier.
* _PACC_ (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC.

The following code serves as a complete example using CC equipped with an SVM as the classifier:

```python
import quapy as qp
import quapy.functional as F
from sklearn.svm import LinearSVC

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
training = dataset.training
test = dataset.test

# instantiate a classifier learner, in this case an SVM
svm = LinearSVC()

# instantiate a Classify & Count with the SVM
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
model = qp.method.aggregative.CC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```

The same code could be used to instantiate an ACC, by simply replacing the instantiation of the model with:
```python
model = qp.method.aggregative.ACC(svm)
```

Note that the adjusted variants (ACC and PACC) need to estimate some parameters for performing the adjustment (e.g., the _true positive rate_ and the _false positive rate_ in the case of binary classification), which are estimated on a validation split of the labelled set. For this reason, the __init__ method of ACC defines an additional parameter, _val_split_, which is set to 0.4 by default, meaning that 40% of the labelled data will be used for estimating the parameters needed for adjusting the predictions. This parameter can also be set to an integer, indicating that the parameters should be estimated by means of _k_-fold cross-validation, with the integer indicating the number _k_ of folds. Finally, _val_split_ can be set to a specific held-out validation set (i.e., an instance of _LabelledCollection_).
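
For instance (with hypothetical values; _my_validation_ stands for a _LabelledCollection_ held out beforehand), the three modalities look as follows:

```python
model = qp.method.aggregative.ACC(svm, val_split=0.4)            # 40% random stratified split
model = qp.method.aggregative.ACC(svm, val_split=5)              # 5-fold cross-validation
model = qp.method.aggregative.ACC(svm, val_split=my_validation)  # a held-out LabelledCollection
```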

The specification of _val_split_ can be postponed to the invocation of the fit method (if _val_split_ was also set in the constructor, the one specified at fit time prevails), e.g.:

```python
model = qp.method.aggregative.ACC(svm)
# perform 5-fold cross validation for estimating ACC's parameters
# (overrides the default val_split=0.4 set in the constructor)
model.fit(training, val_split=5)
```

The following code illustrates the case in which PCC is used:

```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.learner)
```

In this case, QuaPy will print:
```
The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated.
classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5)
```

The first output indicates that the learner (_LinearSVC_ in this case) is not a probabilistic classifier (i.e., it does not implement the _predict_proba_ method) and so the classifier will be converted to a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html). As a result, the classifier that is printed in the second line points to a _CalibratedClassifierCV_ instance. Note that calibration can only be applied to hard classifiers when _fit_learner=True_; an exception will be raised otherwise.
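
Calibration can be avoided altogether by handing PCC a natively probabilistic classifier. A minimal variation of the example above (assuming the same _training_ and _test_ objects), using _LogisticRegression_, which already implements _predict_proba_:

```python
from sklearn.linear_model import LogisticRegression

# LogisticRegression implements predict_proba, so no calibration is needed
model = qp.method.aggregative.PCC(LogisticRegression())
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```
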
Lastly, everything we said about ACC and PCC applies to PACC as well.

### Expectation Maximization (EMQ)

The Expectation Maximization Quantifier (EMQ), also known as SLD, is available at _qp.method.aggregative.EMQ_ or via the alias _qp.method.aggregative.ExpectationMaximizationQuantifier_. The method is described in:

_Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41._

EMQ works with a probabilistic classifier (if the classifier given as input is a hard one, calibration will be attempted). Although this method was originally proposed for improving the posterior probabilities of a probabilistic classifier, and not for improving the estimation of prior probabilities, EMQ ranks almost always among the most effective quantifiers in the experiments we have carried out.

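For intuition, the core of the EM iteration can be sketched as follows (a simplified numpy re-implementation of the update described by Saerens et al., not QuaPy's actual code):

```python
import numpy as np

def em_update(train_prevalence, posteriors, epsilon=1e-4, max_iter=1000):
    """Iteratively rescales the test posteriors and re-estimates the test prevalence.

    train_prevalence: array of shape (n_classes,) with the training priors
    posteriors: array of shape (n_instances, n_classes) with the test posteriors
    """
    prev_estimate = np.copy(train_prevalence)
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio of estimated to training priors
        rescaled = posteriors * (prev_estimate / train_prevalence)
        rescaled /= rescaled.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean of the rescaled posteriors
        new_estimate = rescaled.mean(axis=0)
        if np.abs(new_estimate - prev_estimate).sum() < epsilon:
            break
        prev_estimate = new_estimate
    return prev_estimate
```
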
An example of use within QuaPy can be found below:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

model = qp.method.aggregative.EMQ(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

### Hellinger Distance y (HDy)

The method HDy is described in:

_González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164._

It is implemented in _qp.method.aggregative.HDy_ (also accessible through the alias _qp.method.aggregative.HellingerDistanceY_). This method works with a probabilistic classifier (hard classifiers can be used as well and will be calibrated) and requires a validation set to estimate the parameters of the mixture model. Just like ACC and PACC, this quantifier receives a _val_split_ argument in the constructor (or in the fit method, in which case the previous value is overridden) that can either be a float indicating the proportion of training data to be taken as the validation set (in a random stratified split), or a validation set (i.e., an instance of _LabelledCollection_) itself.

HDy was proposed as a binary quantification method, and the implementation provided in QuaPy accepts only binary datasets.

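For reference, HDy searches for the class prevalence whose induced mixture of validation-posterior histograms is closest, in terms of the Hellinger distance, to the histogram of test posteriors. A minimal sketch of this search (illustrative only, not QuaPy's code; the histograms are placeholders computed beforehand):

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (histograms)."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def hdy_search(pos_hist, neg_hist, test_hist, grid=101):
    """Returns the prevalence in [0, 1] whose mixture of the positive and
    negative validation histograms is closest to the test histogram."""
    candidates = np.linspace(0., 1., grid)
    distances = [hellinger_distance(alpha * pos_hist + (1 - alpha) * neg_hist, test_hist)
                 for alpha in candidates]
    return candidates[int(np.argmin(distances))]
```
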
The following code shows an example of use:
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# load a binary dataset
dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

model = qp.method.aggregative.HDy(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

### Explicit Loss Minimization

Explicit Loss Minimization (ELM) represents a family of methods based on structured output learning, i.e., quantifiers relying on classifiers that have been optimized to target a quantification-oriented evaluation measure.

In QuaPy, the following methods, all relying on Joachims' [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) implementation, are available in _qp.method.aggregative_:

* SVMQ (SVM-Q), a quantification method optimizing the metric _Q_ defined in _Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based on reliable classifiers. Pattern Recognition, 48(2):591–604._
* SVMKLD (SVM for Kullback-Leibler Divergence), proposed in _Esuli, A. and Sebastiani, F. (2015). Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27._
* SVMNKLD (SVM for Normalized Kullback-Leibler Divergence), also proposed in _Esuli, A. and Sebastiani, F. (2015). Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27._
* SVMAE (SVM for Mean Absolute Error)
* SVMRAE (SVM for Mean Relative Absolute Error)

The last two methods (SVMAE and SVMRAE) have been implemented in QuaPy in order to make available ELM variants for what are nowadays considered the best-behaved evaluation metrics in quantification.

In order to make these models work, you would need to run the script _prepare_svmperf.sh_ (distributed along with QuaPy), which downloads the _SVMperf_ source code, applies a patch that implements the quantification-oriented losses, and compiles the sources.

If you want to add a custom loss, you would need to modify the source code of _SVMperf_ in order to implement it, and assign a valid loss code to it. Then you must recompile the whole thing and instantiate the quantifier in QuaPy as follows:

```python
# you can either set the path to your custom svm_perf_quantification implementation
# in the environment variable, or as an argument to the constructor of ELM
qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification'

# assign an alias to your custom loss and the id you have assigned to it
svmperf = qp.classification.svmperf.SVMperf
svmperf.valid_losses['mycustomloss'] = 28

# instantiate the ELM method indicating the loss
model = qp.method.aggregative.ELM(loss='mycustomloss')
```

All ELM methods are binary quantifiers, since they rely on _SVMperf_, which currently supports only binary classification. However, ELM variants (and any binary quantifier in general) can be trivially extended to single-label scenarios by adopting a "one-vs-all" strategy (as, e.g., in _Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining, 6(19):1–22_). In QuaPy this is possible by using the _OneVsAll_ class:

```python
import quapy as qp
from quapy.method.aggregative import SVMQ, OneVsAll

# load a single-label dataset (this one contains 3 classes)
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

# let qp know where svmperf is
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

model = OneVsAll(SVMQ(), n_jobs=-1)  # run the binary quantifiers in parallel
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

## Meta Models

By _meta_ models we mean quantification methods that are defined on top of other quantification methods, and that thus belong squarely neither to the aggregative nor to the non-aggregative group (indeed, _meta_ models could use quantifiers from any of those groups). _Meta_ models are implemented in the _qp.method.meta_ module.

### Ensembles

QuaPy implements (some of) the variants proposed in:

* _Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100._
* _Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15._

The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC) quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the _average_ as the aggregation policy (see the original articles for further details). The last parameter indicates that all processors should be used for parallelization.

```python
import quapy as qp
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_UCIDataset('haberman')

model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

Other aggregation policies implemented in QuaPy include:
* 'ptr', for applying a dynamic selection based on the training prevalence of the ensemble's members
* 'ds', for applying a dynamic selection based on the Hellinger Distance
* _any valid quantification measure_ (e.g., 'mse'), for performing a static selection based on the performance estimated for each member of the ensemble in terms of that evaluation metric.

When using any of the above options, it is important to set the _red_size_ parameter, which specifies the number of members to retain, as sketched below.

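For instance, a minimal variation of the example above (assuming the same imports and dataset; the value _red_size=15_ is an arbitrary, hypothetical choice) keeps the 15 members selected dynamically by training prevalence:

```python
# dynamic selection by training prevalence ('ptr'), retaining 15 of the 30 members
model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, red_size=15, policy='ptr', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
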
Please check the [model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection) wiki if you want to optimize the hyperparameters of the ensemble for classification or quantification.

### The QuaNet neural network

QuaPy offers an implementation of QuaNet, a deep learning model presented in:

_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). A recurrent neural network for sentiment quantification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 1775-1778)._

This model requires _torch_ to be installed. QuaNet also requires a classifier that can provide embedded representations of the inputs. In the original paper, QuaNet was tested using an LSTM as the base classifier. In the following example, we show an instantiation of QuaNet that instead uses a CNN as a probabilistic classifier, taking its last-layer representation as the document embedding:

```python
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

# use samples of 100 elements
qp.environ['SAMPLE_SIZE'] = 100

# load the kindle dataset as text, and convert words to numerical indexes
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

# the text classifier is a CNN trained by NeuralClassifierTrainer
cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(cnn, device='cuda')

# train QuaNet
model = QuaNet(learner, qp.environ['SAMPLE_SIZE'], device='cuda')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```