# Quantification Methods

Quantification methods can be categorized as belonging to the
_aggregative_ and _non-aggregative_ groups.
Most methods included in QuaPy at the moment are of type _aggregative_
(though we plan to add many more methods in the near future), i.e.,
they are methods characterized by the fact that
quantification is performed as an aggregation function of the individual
products of classification.

Any quantifier in QuaPy should extend the class _BaseQuantifier_,
and implement some abstract methods:
```python
@abstractmethod
def fit(self, data: LabelledCollection): ...

@abstractmethod
def quantify(self, instances): ...
```
The meaning of those functions should be familiar to those
used to working with scikit-learn, since the class structure of QuaPy
is directly inspired by scikit-learn's _Estimators_. Functions
_fit_ and _quantify_ are used to train the model and to provide
class estimations (the reason why
scikit-learn's structure has not been adopted _as is_ in QuaPy is that
scikit-learn's _predict_ function is expected to return
one output for each input element --e.g., a predicted label for each
instance in a sample-- while in quantification the output for a sample
is one single array of class prevalences).
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
to simplify the use of _set_params_ and _get_params_ used in
[model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection).
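
As a simple illustration, the following sketch implements a trivial quantifier that memorizes the training prevalence and returns it for every sample (the class name is ours; QuaPy ships a similar baseline as _MaximumLikelihoodPrevalenceEstimation_ in _qp.method.non_aggregative_):

```python
from quapy.data import LabelledCollection
from quapy.method.base import BaseQuantifier


class TrainPrevalenceQuantifier(BaseQuantifier):
    """Toy quantifier: always predicts the class prevalences observed during training."""

    def fit(self, data: LabelledCollection):
        # memorize the training prevalence vector
        self.estimated_prevalence = data.prevalence()
        return self

    def quantify(self, instances):
        # ignore the instances and return the memorized prevalence vector
        return self.estimated_prevalence
```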

## Aggregative Methods

All quantification methods are implemented as part of the
_qp.method_ package. In particular, _aggregative_ methods are defined in
_qp.method.aggregative_, and extend _AggregativeQuantifier(BaseQuantifier)_.
The methods that any _aggregative_ quantifier must implement are:

```python
@abstractmethod
def fit(self, data: LabelledCollection, fit_learner=True): ...

@abstractmethod
def aggregate(self, classif_predictions: np.ndarray): ...
```

since, as mentioned before, aggregative methods base their prediction on the
individual predictions of a classifier. Indeed, a default implementation
of _BaseQuantifier.quantify_ is already provided, which looks like:

```python
def quantify(self, instances):
    classif_predictions = self.classify(instances)
    return self.aggregate(classif_predictions)
```

Aggregative quantifiers are expected to maintain a classifier (which is
accessed through the _@property_ _classifier_). This classifier is
given as input to the quantifier, and can be already fit
on external data (in which case the _fit_learner_ argument should
be set to False), or be fit by the quantifier's fit method (the default).
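
For example, a classifier that has already been fit on external data can be wrapped without retraining it (a sketch; `X_external`, `y_external` and `training` are placeholder names for pre-existing data):

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# a classifier fitted beforehand on external labelled data
external_classifier = LogisticRegression().fit(X_external, y_external)

# wrap it into a quantifier without refitting the underlying classifier
model = qp.method.aggregative.CC(external_classifier)
model.fit(training, fit_learner=False)
```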

Another class of _aggregative_ methods are the _probabilistic_
aggregative methods, which should inherit from the abstract class
_AggregativeProbabilisticQuantifier(AggregativeQuantifier)_.
The particularity of _probabilistic_ aggregative methods (w.r.t.
non-probabilistic ones) is that the default _quantify_ method is defined
in terms of the posterior probabilities returned by a probabilistic
classifier, and not of the crisp decisions of a hard classifier.
In any case, the interface _classify(instances)_ remains unchanged.

One advantage of _aggregative_ methods (either probabilistic or not)
is that the evaluation according to any sampling procedure (e.g.,
the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation))
can be carried out very efficiently, since the entire set can be pre-classified
once, and the quantification estimations for different samples can directly
reuse these predictions, without requiring to classify each element every time.
QuaPy leverages this property to speed up any procedure having to do with
quantification over samples, as is customarily done in model selection or
in evaluation.
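
For instance, evaluating a fitted quantifier over many test samples incurs a single classification pass (a sketch; we assume here the protocol-based evaluation API of v0.1.7, with `APP` and `qp.evaluation.evaluate`; see the evaluation wiki for the exact interface of your version):

```python
import quapy as qp
from quapy.protocol import APP

qp.environ['SAMPLE_SIZE'] = 100

# 'model' is a fitted aggregative quantifier and 'test' a LabelledCollection;
# the test instances are classified only once, and all the samples generated
# by the protocol reuse those classifier predictions
mae = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')
```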

### The Classify & Count variants

QuaPy implements the four CC variants, i.e.:

* _CC_ (Classify & Count), the simplest aggregative quantifier; one that
simply relies on the label predictions of a classifier to deliver class estimates.
* _ACC_ (Adjusted Classify & Count), the adjusted variant of CC.
* _PCC_ (Probabilistic Classify & Count), the probabilistic variant of CC that
relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier.
* _PACC_ (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC.

The following code serves as a complete example using CC equipped
with an SVM as the classifier:

```python
import quapy as qp
import quapy.functional as F
from sklearn.svm import LinearSVC

training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test

# instantiate a classifier learner, in this case an SVM
svm = LinearSVC()

# instantiate a Classify & Count with the SVM
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
model = qp.method.aggregative.CC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
```

The same code could be used to instantiate an ACC, by simply replacing
the instantiation of the model with:
```python
model = qp.method.aggregative.ACC(svm)
```
Note that the adjusted variants (ACC and PACC) need to estimate
some parameters for performing the adjustment (e.g., the
_true positive rate_ and the _false positive rate_ in case of
binary classification) that are estimated on a validation split
of the labelled set. To this aim, the __init__ method of
ACC defines an additional parameter, _val_split_, which is set
to 0.4 by default, meaning that 40% of the labelled data
will be used for estimating the parameters for adjusting the
predictions. This parameter can also be set to an integer,
indicating that the parameters should be estimated by means of
_k_-fold cross-validation, with the integer indicating the
number _k_ of folds. Finally, _val_split_ can be set to a
specific held-out validation set (i.e., an instance of _LabelledCollection_).
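
In summary, the three admissible ways of specifying _val_split_ in the constructor look as follows (a short sketch; `val_data` stands for a held-out _LabelledCollection_ of your own):

```python
from sklearn.svm import LinearSVC
from quapy.method.aggregative import ACC

model = ACC(LinearSVC(), val_split=0.3)       # hold out 30% of the training set
model = ACC(LinearSVC(), val_split=5)         # 5-fold cross-validation
model = ACC(LinearSVC(), val_split=val_data)  # use a specific LabelledCollection
```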

The specification of _val_split_ can also be
postponed to the invocation of the fit method (if _val_split_ was also
set in the constructor, the one specified at fit time would prevail),
e.g.:

```python
model = qp.method.aggregative.ACC(svm)
# perform 5-fold cross validation for estimating ACC's parameters
# (overrides the default val_split=0.4 in the constructor)
model.fit(training, val_split=5)
```

The following code illustrates the case in which PCC is used:

```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.quantify(test.instances)
print('classifier:', model.classifier)
```
In this case, QuaPy will print:
```
The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated.
classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5)
```
The first output indicates that the learner (_LinearSVC_ in this case)
is not a probabilistic classifier (i.e., it does not implement the
_predict_proba_ method) and so the classifier will be converted to
a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html).
As a result, the classifier that is printed in the second line points
to a _CalibratedClassifierCV_ instance. Note that calibration can only
be applied to hard classifiers when _fit_learner=True_; an exception
will be raised otherwise.
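
Alternatively, calibration can be carried out beforehand, by handing the quantifier an already probabilistic classifier; a sketch using scikit-learn's _CalibratedClassifierCV_ directly:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# calibrate the hard classifier beforehand, so that it offers predict_proba
calibrated_svm = CalibratedClassifierCV(LinearSVC(), cv=5)

model = qp.method.aggregative.PCC(calibrated_svm)
model.fit(training)
```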

Lastly, everything we said about ACC and PCC
applies to PACC as well.

### Expectation Maximization (EMQ)

The Expectation Maximization Quantifier (EMQ), also known as
SLD, is available at _qp.method.aggregative.EMQ_ or via the
alias _qp.method.aggregative.ExpectationMaximizationQuantifier_.
The method is described in:

_Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier
to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41._

EMQ works with a probabilistic classifier (if the classifier
given as input is a hard one, a calibration will be attempted).
Although this method was originally proposed for improving the
posterior probabilities of a probabilistic classifier, and not
for improving the estimation of prior probabilities, EMQ ranks
almost always among the most effective quantifiers in the
experiments we have carried out.

An example of use can be found below:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

model = qp.method.aggregative.EMQ(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

_New in v0.1.7_: EMQ now accepts two new parameters in the construction method. The first is
_exact_train_prev_, which allows to use the true training prevalence as the departing
prevalence estimation (default behaviour), or instead an approximation of it as
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
(by setting _exact_train_prev=False_).
The other parameter is _recalib_, which allows to indicate a calibration method, among those
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
including Bias-Corrected Temperature Scaling, Vector Scaling, etc.
See the API documentation for further details.
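
A sketch of both options (we assume here that _recalib_ admits the string 'bcts' for Bias-Corrected Temperature Scaling; check the API documentation for the admissible values):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import EMQ

# depart from an approximated training prevalence and recalibrate the posteriors
model = EMQ(LogisticRegression(), exact_train_prev=False, recalib='bcts')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```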

### Hellinger Distance y (HDy)

Implementation of the method based on the Hellinger Distance y (HDy) proposed by
[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146–164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)

It is implemented in _qp.method.aggregative.HDy_ (also accessible
through the alias _qp.method.aggregative.HellingerDistanceY_).
This method works with a probabilistic classifier (hard classifiers
can be used as well and will be calibrated) and requires a validation
set to estimate the parameters of the mixture model. Just like
ACC and PACC, this quantifier receives a _val_split_ argument
in the constructor (or in the fit method, in which case the previous
value is overridden) that can either be a float indicating the proportion
of training data to be taken as the validation set (in a random
stratified split), or a validation set (i.e., an instance of
_LabelledCollection_) itself.

HDy was proposed as a binary quantifier and the implementation
provided in QuaPy accepts only binary datasets.

The following code shows an example of use:
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# load a binary dataset
dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

model = qp.method.aggregative.HDy(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
"Distribution Matching" approaches for multiclass, inspired by the framework
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
a variant of HDy for multiclass quantification as follows:

```python
multiclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
```

_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
framework proposed by [Maletzke et al. (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
and the "SMM" method proposed by [Hassan et al. (2019)](https://ieeexplore.ieee.org/document/9260028)
(thanks to _Pablo González_ for the contributions!)
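
A sketch instantiating both methods (we assume here that their constructors follow the same pattern as HDy's, taking a classifier and a _val_split_; see the API documentation for method-specific parameters, such as the number of bins in DyS):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import DyS, SMM

dys = DyS(LogisticRegression(), val_split=0.4)
smm = SMM(LogisticRegression(), val_split=0.4)
```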

### Threshold Optimization methods

_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
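
Like ACC and PACC, these are binary quantifiers that rely on a validation split to estimate the classifier's _tpr_ and _fpr_ at candidate thresholds; a sketch (we assume they are exposed as classes of _qp.method.aggregative_ under the names listed above):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import MS2

# Median Sweep 2 on a binary dataset, holding out 40% of the training data
model = MS2(LogisticRegression(), val_split=0.4)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```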

### Explicit Loss Minimization

Explicit Loss Minimization (ELM) represents a family of methods
based on structured output learning, i.e., quantifiers relying on
classifiers that have been optimized targeting a
quantification-oriented evaluation measure.
The original methods are implemented in QuaPy as classify & count (CC)
quantifiers that use Joachims' [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)
as the underlying classifier, properly set to optimize for the desired loss.

In QuaPy, this can be achieved by calling the functions:

* _newSVMQ_: returns the quantification method called SVM(Q) that optimizes for the metric _Q_ defined
in [_Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based
on reliable classifiers. Pattern Recognition, 48(2):591–604._](https://www.sciencedirect.com/science/article/pii/S003132031400291X)
* _newSVMKLD_ and _newSVMNKLD_: return the quantification methods called SVM(KLD) and SVM(nKLD), standing for
Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in [_Esuli, A. and Sebastiani, F. (2015).
Optimizing text quantifiers for multivariate loss functions.
ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27._](https://dl.acm.org/doi/abs/10.1145/2700406)
* _newSVMAE_ and _newSVMRAE_: return the quantification methods called SVM(AE) and SVM(RAE) that optimize for the (Mean) Absolute Error and for the
(Mean) Relative Absolute Error, respectively, as first used by
[_Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23._](https://arxiv.org/abs/2011.02552)

The last two methods (SVM(AE) and SVM(RAE)) have been implemented in
QuaPy in order to make available ELM variants for what nowadays
are considered the most well-behaved evaluation metrics in quantification.
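
For instance, SVM(AE) could be instantiated as follows (a sketch; it assumes _SVMperf_ has been patched and compiled as explained next):

```python
import quapy as qp
from quapy.method.aggregative import newSVMAE

# let QuaPy know where the patched SVMperf binaries live (see below)
qp.environ['SVMPERF_HOME'] = './svm_perf_quantification'

model = newSVMAE()
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```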

In order to make these models work, you would need to run the script
_prepare_svmperf.sh_ (distributed along with QuaPy), which
downloads _SVMperf_'s source code, applies a patch that
implements the quantification-oriented losses, and compiles the
sources.

If you want to add any custom loss, you would need to modify
the source code of _SVMperf_ in order to implement it, and
assign a valid loss code to it. Then you must re-compile
the whole thing and instantiate the quantifier in QuaPy
as follows:

```python
# you can either set the path to your custom svm_perf_quantification implementation
# in the environment variable, or as an argument to the constructor of ELM
qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification'

# assign an alias to your custom loss and the id you have assigned to it
svmperf = qp.classification.svmperf.SVMperf
svmperf.valid_losses['mycustomloss'] = 28

# instantiate the ELM method indicating the loss
model = qp.method.aggregative.ELM(loss='mycustomloss')
```

All ELM are binary quantifiers, since they rely on _SVMperf_, which
currently supports only binary classification.
ELM variants (and any binary quantifier in general) can be trivially extended
to operate in single-label scenarios by adopting a
"one-vs-all" strategy (as, e.g., in
[_Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment
analysis. Social Network Analysis and Mining, 6(19):1–22_](https://link.springer.com/article/10.1007/s13278-016-0327-z)).
In QuaPy this is possible by using the _OneVsAll_ class.

There are two ways of instantiating this class: _OneVsAllGeneric_, which works for
any quantifier, and _OneVsAllAggregative_, which is optimized for aggregative quantifiers.
In general, you can simply use the _getOneVsAll_ function and QuaPy will choose
the more convenient of the two.

```python
import quapy as qp
from quapy.method.aggregative import SVMQ
from quapy.method.base import getOneVsAll  # location of getOneVsAll assumed; see the examples below

# load a single-label dataset (this one contains 3 classes)
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)

# let qp know where svmperf is
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

model = getOneVsAll(SVMQ(), n_jobs=-1)  # run the binary quantifiers in parallel
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

Check the examples [explicit_loss_minimization.py](../examples/explicit_loss_minimization.py)
and [one_vs_all.py](../examples/one_vs_all.py) for more details.

## Meta Models

By _meta_ models we mean quantification methods that are defined on top of other
quantification methods, and that thus do not squarely belong to the aggregative nor
the non-aggregative group (indeed, _meta_ models could use quantifiers from any of those
groups).
_Meta_ models are implemented in the _qp.method.meta_ module.

### Ensembles

QuaPy implements (some of) the variants proposed in:

* [_Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
Information Fusion, 34, 87-100._](https://www.sciencedirect.com/science/article/pii/S1566253516300628)
* [_Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
Dynamic ensemble selection for quantification tasks.
Information Fusion, 45, 1-15._](https://www.sciencedirect.com/science/article/pii/S1566253517303652)

The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC)
quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the
_average_ as the aggregation policy (see the original article for further details).
The last parameter indicates to use all processors for parallelization.

```python
import quapy as qp
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression

dataset = qp.datasets.fetch_UCIDataset('haberman')

model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```

Other aggregation policies implemented in QuaPy include:
* 'ptr' for applying a dynamic selection based on the training prevalence of the ensemble's members
* 'ds' for applying a dynamic selection based on the Hellinger Distance
* _any valid quantification measure_ (e.g., 'mse') for performing a static selection based on
the performance estimated for each member of the ensemble in terms of that evaluation metric.

When using any of the above options, it is important to set the _red_size_ parameter, which
specifies the number of members to retain, as in the sketch below.
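
For instance, the following sketch builds an ensemble of 30 members and retains 10 of them under the dynamic 'ptr' policy (here _red_size=10_ is our choice of value):

```python
from quapy.method.meta import Ensemble
from quapy.method.aggregative import ACC
from sklearn.linear_model import LogisticRegression

model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, red_size=10, policy='ptr', n_jobs=-1)
```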

Please check the [model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection)
wiki if you want to optimize the hyperparameters of the ensemble for classification or quantification.

### The QuaNet neural network

QuaPy offers an implementation of QuaNet, a deep learning model presented in:

[_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778)._](https://dl.acm.org/doi/abs/10.1145/3269206.3269287)

This model requires _torch_ to be installed.
QuaNet also requires a classifier that can provide embedded representations
of the inputs.
In the original paper, QuaNet was tested using an LSTM as the base classifier.
In the following example, we show an instantiation of QuaNet that instead uses a CNN as a probabilistic classifier, taking its last layer representation as the document embedding:

```python
import quapy as qp
from quapy.method.meta import QuaNet
from quapy.classification.neural import NeuralClassifierTrainer, CNNnet

# use samples of 100 elements
qp.environ['SAMPLE_SIZE'] = 100

# load the kindle dataset as text, and convert words to numerical indexes
dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
qp.data.preprocessing.index(dataset, min_df=5, inplace=True)

# the text classifier is a CNN trained by NeuralClassifierTrainer
cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
learner = NeuralClassifierTrainer(cnn, device='cuda')

# train QuaNet
model = QuaNet(learner, device='cuda')
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```