# Quantification Methods

Quantification methods can be categorized as belonging to
_aggregative_ and _non-aggregative_ groups.
Most methods included in QuaPy at the moment are of type _aggregative_
(though we plan to add many more methods in the near future), i.e.,
methods in which quantification is performed as an aggregation function
of the individual outputs of a classifier.

Any quantifier in QuaPy should extend the class _BaseQuantifier_,
and implement some abstract methods:
```python
@abstractmethod
def fit(self, data: LabelledCollection): ...

@abstractmethod
def quantify(self, instances): ...

@abstractmethod
def set_params(self, **parameters): ...

@abstractmethod
def get_params(self, deep=True): ...
```
The meaning of these functions should be familiar to anyone
used to working with scikit-learn, since the class structure of QuaPy
is directly inspired by scikit-learn's _Estimators_. The functions
_fit_ and _quantify_ are used to train the model and to provide
class estimations, respectively (the reason why
scikit-learn's structure has not been adopted _as is_ in QuaPy is
that scikit-learn's _predict_ function is expected to return
one output for each input element --e.g., a predicted label for each
instance in a sample-- while in quantification the output for a sample
is one single array of class prevalences). The functions _set_params_
and _get_params_ allow a
[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
to automate the process of hyperparameter search.
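
As an illustration, the following minimal sketch shows what a custom quantifier extending _BaseQuantifier_ could look like; it simply returns the class prevalence values observed during training (the import paths and the attributes of _LabelledCollection_ used here are assumptions that might differ across QuaPy versions):

```python
import numpy as np
from quapy.data import LabelledCollection
from quapy.method.base import BaseQuantifier


class TrainPrevalenceQuantifier(BaseQuantifier):
    """Toy quantifier: always predicts the prevalence values seen in training."""

    def fit(self, data: LabelledCollection):
        # count how often each class appears among the training labels
        counts = np.bincount(data.labels, minlength=data.n_classes)
        self.train_prev_ = counts / counts.sum()
        return self

    def quantify(self, instances):
        # ignore the instances and return the training prevalence
        return self.train_prev_

    def set_params(self, **parameters):
        pass

    def get_params(self, deep=True):
        return {}
```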

## Aggregative Methods

All quantification methods are implemented as part of the
_qp.method_ package. In particular, _aggregative_ methods are defined in
_qp.method.aggregative_, and extend _AggregativeQuantifier(BaseQuantifier)_.
The methods that any _aggregative_ quantifier must implement are:

```python
@abstractmethod
def fit(self, data: LabelledCollection, fit_learner=True): ...

@abstractmethod
def aggregate(self, classif_predictions: np.ndarray): ...
```

since, as mentioned before, aggregative methods base their predictions on the
individual predictions of a classifier. Indeed, a default implementation
of _BaseQuantifier.quantify_ is already provided, which looks like:

```python
def quantify(self, instances):
    classif_predictions = self.preclassify(instances)
    return self.aggregate(classif_predictions)
```
Aggregative quantifiers are expected to maintain a classifier (which is
accessed through the _@property_ _learner_). This classifier is
given as input to the quantifier, and can already be fit
on external data (in which case the _fit_learner_ argument should
be set to False), or be fit by the quantifier's fit method (the default).
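
For example, a classifier that has already been trained on external data can be wrapped without re-fitting it (a sketch; _X_external_, _y_external_, and _training_ stand for your own data):

```python
import quapy as qp
from sklearn.svm import LinearSVC

# a classifier already fit on external labelled data
svm = LinearSVC().fit(X_external, y_external)

# wrap the pre-fit classifier in a quantifier, skipping the classifier's training
model = qp.method.aggregative.CC(svm)
model.fit(training, fit_learner=False)
```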

Another class of _aggregative_ methods are the _probabilistic_
aggregative methods, which should inherit from the abstract class
_AggregativeProbabilisticQuantifier(AggregativeQuantifier)_.
The particularity of _probabilistic_ aggregative methods (w.r.t.
non-probabilistic ones) is that the default quantification is defined
in terms of the posterior probabilities returned by a probabilistic
classifier, and not by the crisp decisions of a hard classifier; i.e.:

```python
def quantify(self, instances):
    classif_posteriors = self.posterior_probabilities(instances)
    return self.aggregate(classif_posteriors)
```

One advantage of _aggregative_ methods (either probabilistic or not)
is that the evaluation according to any sampling procedure (e.g.,
the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation))
can be carried out very efficiently, since the entire set can be pre-classified
once, and the quantification estimations for different samples can directly
reuse these predictions, without the need to classify each element every time.
QuaPy leverages this property to speed up any procedure having to do with
quantification over samples, as is customarily done in model selection or
in evaluation.
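
The idea can be sketched as follows (a conceptual illustration of what QuaPy does internally; here _model_ is any fitted aggregative quantifier, _test_ is a test collection, and the direct call to _preclassify_ mirrors the default _quantify_ implementation shown above):

```python
import numpy as np

# classify the entire test set only once
classif_predictions = model.preclassify(test.instances)

# quantify many random samples by reusing the stored predictions
for _ in range(100):
    sample_idx = np.random.choice(len(classif_predictions), size=500, replace=False)
    sample_prevalence = model.aggregate(classif_predictions[sample_idx])
```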

### The Classify & Count variants

QuaPy implements the four CC variants, i.e.:

* _CC_ (Classify & Count), the simplest aggregative quantifier; one that
simply relies on the label predictions of a classifier to deliver class estimates.
* _ACC_ (Adjusted Classify & Count), the adjusted variant of CC.
* _PCC_ (Probabilistic Classify & Count), the probabilistic variant of CC that
relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier.
* _PACC_ (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC.

The following code serves as a complete example using CC equipped
with a SVM as the classifier:

```python
import quapy as qp
import quapy.functional as F
from sklearn.svm import LinearSVC

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
training = dataset.training
test = dataset.test

# instantiate a classifier learner, in this case an SVM
svm = LinearSVC()

# instantiate a Classify & Count with the SVM
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
model = qp.method.aggregative.CC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```

The same code could be used to instantiate an ACC, by simply replacing
the instantiation of the model with:
```python
model = qp.method.aggregative.ACC(svm)
```
Note that the adjusted variants (ACC and PACC) need to estimate
some parameters for performing the adjustment (e.g., the
_true positive rate_ and the _false positive rate_ in the case of
binary classification), which are estimated on a validation split
of the labelled set. To this aim, the __init__ method of
ACC defines an additional parameter, _val_split_, which is set
to 0.4 by default, meaning that 40% of the labelled data
will be used for estimating the parameters for adjusting the
predictions. This parameter can also be set to an integer,
indicating that the parameters should be estimated by means of
_k_-fold cross-validation, in which case the integer indicates the
number _k_ of folds. Finally, _val_split_ can be set to a
specific held-out validation set (i.e., an instance of _LabelledCollection_).
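
In summary, the three admissible ways of specifying _val_split_ in the constructor would look as follows (a sketch; _my_validation_set_ stands for a _LabelledCollection_ of your own):

```python
# a fraction of the training set, taken via a random stratified split (0.4 is the default)
model = qp.method.aggregative.ACC(svm, val_split=0.4)

# k-fold cross-validation, here with k=5
model = qp.method.aggregative.ACC(svm, val_split=5)

# an explicit held-out validation set
model = qp.method.aggregative.ACC(svm, val_split=my_validation_set)
```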

The specification of _val_split_ can also be
postponed to the invocation of the fit method (if _val_split_ was also
set in the constructor, the one specified at fit time prevails),
e.g.:

```python
model = qp.method.aggregative.ACC(svm)
# perform 5-fold cross-validation for estimating ACC's parameters
# (overrides the default val_split=0.4 set in the constructor)
model.fit(training, val_split=5)
```
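
As a side note, for the binary case the correction that ACC applies to the CC estimate can be sketched as follows (a simplified illustration of the standard adjustment, not QuaPy's actual implementation):

```python
def adjusted_count_sketch(cc_prevalence, tpr, fpr):
    """Corrects the positive-class prevalence estimated by CC using the
    classifier's true positive rate and false positive rate (assuming tpr != fpr)."""
    adjusted = (cc_prevalence - fpr) / (tpr - fpr)
    # the corrected value may fall outside [0, 1], so it is clipped
    return min(1.0, max(0.0, adjusted))
```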

The following code illustrates the case in which PCC is used:

```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.learner)
```
In this case, QuaPy will print:
```
The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated.
classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5)
```
The first output indicates that the learner (_LinearSVC_ in this case)
is not a probabilistic classifier (i.e., it does not implement the
_predict_proba_ method) and so the classifier will be converted to
a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html).
As a result, the classifier that is printed in the second line points
to a _CalibratedClassifierCV_ instance. Note that calibration can only
be applied to hard classifiers when _fit_learner=True_; an exception
will be raised otherwise.
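
Note that calibration is not needed when the learner is natively probabilistic, i.e., when it implements _predict_proba_, as scikit-learn's _LogisticRegression_ does:

```python
from sklearn.linear_model import LogisticRegression

# LogisticRegression implements predict_proba, so no calibration is applied
model = qp.method.aggregative.PCC(LogisticRegression())
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```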

Lastly, everything we said about ACC and PCC
applies to PACC as well.


### Expectation Maximization (EMQ)

The Expectation Maximization Quantifier (EMQ), also known as
SLD, is available at _qp.method.aggregative.EMQ_ or via the
alias _qp.method.aggregative.ExpectationMaximizationQuantifier_.
The method is described in:

_Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier
to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41._

EMQ works with a probabilistic classifier (if the classifier
given as input is a hard one, calibration will be attempted).
Although this method was originally proposed for improving the
posterior probabilities of a probabilistic classifier, and not
for improving the estimation of prior probabilities, EMQ ranks
almost always among the most effective quantifiers in the
experiments we have carried out.

An example of use can be found below:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

model = qp.method.aggregative.EMQ(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
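
For intuition, the iterative procedure at the heart of EMQ can be sketched as follows (a simplified, standalone illustration of the algorithm by Saerens et al., not QuaPy's actual implementation):

```python
import numpy as np

def emq_sketch(posteriors, train_prevalence, epsilon=1e-4, max_iter=1000):
    """Iteratively rescales the posteriors P(y|x) by the ratio between the
    current prevalence estimate and the training prevalence, until convergence."""
    prev_estimate = np.copy(train_prevalence)
    for _ in range(max_iter):
        # E-step: reweight the posteriors towards the current prevalence estimate
        reweighted = (prev_estimate / train_prevalence) * posteriors
        reweighted /= reweighted.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean of the reweighted posteriors
        new_estimate = reweighted.mean(axis=0)
        if np.abs(new_estimate - prev_estimate).sum() < epsilon:
            break
        prev_estimate = new_estimate
    return prev_estimate
```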


### Hellinger Distance y (HDy)

The method HDy, based on the Hellinger Distance, is described in:

_González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146–164._

It is implemented in _qp.method.aggregative.HDy_ (also accessible
through the alias _qp.method.aggregative.HellingerDistanceY_).
This method works with a probabilistic classifier (hard classifiers
can be used as well and will be calibrated) and requires a validation
set to estimate the parameters of the mixture model. Just like
ACC and PACC, this quantifier receives a _val_split_ argument
in the constructor (or in the fit method, in which case the previous
value is overridden) that can either be a float indicating the proportion
of training data to be taken as the validation set (in a random
stratified split), or a validation set (i.e., an instance of
_LabelledCollection_) itself.

HDy was proposed as a binary quantification method, and the implementation
provided in QuaPy accepts only binary datasets.

The following code shows an example of use:
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# load a binary dataset
dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

model = qp.method.aggregative.HDy(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
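
For intuition, the core of HDy can be sketched as follows (a simplified, standalone illustration, not QuaPy's actual implementation): the distribution of the posterior probabilities in the test set is modelled as a mixture of the distributions observed, in the validation set, for the positive and the negative examples, and the mixture coefficient minimizing the Hellinger distance is returned as the estimated prevalence.

```python
import numpy as np

def hdy_sketch(val_pos_posteriors, val_neg_posteriors, test_posteriors, n_bins=10):
    """Searches for the mixture coefficient alpha that minimizes the Hellinger
    distance between the test histogram and the alpha-mixture of the
    positive/negative validation histograms."""
    def histogram(posteriors):
        counts = np.histogram(posteriors, bins=n_bins, range=(0, 1))[0]
        return counts / counts.sum()

    P_pos = histogram(val_pos_posteriors)
    P_neg = histogram(val_neg_posteriors)
    P_test = histogram(test_posteriors)

    def hellinger(P, Q):
        return np.sqrt(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2) / 2)

    alphas = np.linspace(0, 1, 101)
    distances = [hellinger(alpha * P_pos + (1 - alpha) * P_neg, P_test) for alpha in alphas]
    # the best alpha is the estimated prevalence of the positive class
    return alphas[int(np.argmin(distances))]
```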

### Explicit Loss Minimization

Explicit Loss Minimization (ELM) represents a family of methods
based on structured output learning, i.e., quantifiers relying on
classifiers that have been optimized targeting a
quantification-oriented evaluation measure.

In QuaPy, the following methods, all relying on Joachims'
[SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
implementation, are available in _qp.method.aggregative_:

* SVMQ (SVM-Q), a quantification method optimizing the metric _Q_ defined
in _Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591–604._
* SVMKLD (SVM for Kullback-Leibler Divergence), proposed in _Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27._
* SVMNKLD (SVM for Normalized Kullback-Leibler Divergence), proposed in _Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27._
* SVMAE (SVM for Mean Absolute Error)
* SVMRAE (SVM for Mean Relative Absolute Error)

The last two methods (SVMAE and SVMRAE) have been implemented in
QuaPy in order to make available ELM variants for what nowadays
are considered the most well-behaved evaluation metrics in quantification.

In order to make these models work, you would need to run the script
_prepare_svmperf.sh_ (distributed along with QuaPy), which
downloads the _SVMperf_ source code, applies a patch that
implements the quantification-oriented losses, and compiles the
sources.

If you want to add a custom loss, you would need to modify
the source code of _SVMperf_ in order to implement it,
assign a valid loss code to it, and re-compile
the whole thing. The quantifier can then be instantiated in QuaPy
as follows:

```python
# you can either set the path to your custom svm_perf_quantification implementation
# in the environment variable, or pass it as an argument to the constructor of ELM
qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification'

# assign an alias to your custom loss and the id you have assigned to it
svmperf = qp.classification.svmperf.SVMperf
svmperf.valid_losses['mycustomloss'] = 28

# instantiate the ELM method indicating the loss
model = qp.method.aggregative.ELM(loss='mycustomloss')
```

All ELM variants are binary quantifiers, since they rely on _SVMperf_, which
currently supports only binary classification.
ELM variants (and any binary quantifier in general) can be trivially extended
to operate in single-label scenarios by adopting a
"one-vs-all" strategy (as, e.g., in
_Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1–22_).
In QuaPy this is possible by using the _OneVsAll_ class:

```python
import quapy as qp
from quapy.method.aggregative import SVMQ, OneVsAll

# load a single-label dataset (this one contains 3 classes)
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

# let qp know where svmperf is
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

model = OneVsAll(SVMQ(), n_jobs=-1)  # run the binary quantifiers in parallel
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

## Meta Models

By _meta_ models we mean quantification methods that are defined on top of other
quantification methods, and that thus do not squarely belong to the aggregative nor
the non-aggregative group (indeed, _meta_ models could use quantifiers from any of those
groups).
_Meta_ models are implemented in the _qp.method.meta_ module.

### Ensembles

QuaPy implements (some of) the variants proposed in:

* _Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100._
* _Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15._

The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC)
quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the
_average_ as the aggregation policy (see the original articles for further details).
The last parameter indicates to use all processors for parallelization.

```python
import quapy as qp
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_UCIDataset('haberman')

model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

Other aggregation policies implemented in QuaPy include:
* 'ptr', for applying a dynamic selection based on the training prevalence of the ensemble's members
* 'ds', for applying a dynamic selection based on the Hellinger Distance
* _any valid quantification measure_ (e.g., 'mse'), for performing a static selection based on
the performance estimated for each member of the ensemble in terms of that evaluation metric

When using any of the above options, it is important to set the _red_size_ parameter, which
informs of the number of members to retain, as shown in the sketch below.
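
For instance, a static selection based on _mse_ that trains 50 members and retains the best 25 could be instantiated as follows (a sketch based on the parameters described above):

```python
# train 50 quantifiers, and retain the 25 with the lowest estimated mean squared error
model = Ensemble(quantifier=ACC(LogisticRegression()), size=50, red_size=25, policy='mse', n_jobs=-1)
```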

Please check the [model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
wiki if you want to optimize the hyperparameters of an ensemble for classification or quantification.

### The QuaNet neural network

QuaPy offers an implementation of QuaNet, a deep learning model presented in:

_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778)._

This model requires _torch_ to be installed.
QuaNet also requires a classifier that can provide embedded representations
of the inputs.
In the original paper, QuaNet was tested using an LSTM as the base classifier.
In the following example, we show an instantiation of QuaNet that instead uses a CNN
as a probabilistic classifier, taking its last layer representation as the document embedding:

```python
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

# use samples of 100 elements
qp.environ['SAMPLE_SIZE'] = 100

# load the kindle dataset as text, and convert words to numerical indexes
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

# the text classifier is a CNN trained by NeuralClassifierTrainer
cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(cnn, device='cuda')

# train QuaNet
model = QuaNet(learner, qp.environ['SAMPLE_SIZE'], device='cuda')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```