# QuaPy/examples/18.ReadMe_for_text_analysis.py

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
import quapy as qp
from quapy.method.non_aggregative import ReadMe
import quapy.functional as F
from sklearn.pipeline import Pipeline
"""
This example showcases how to use the non-aggregative method ReadMe proposed by Hopkins and King.
This method is for text analysis, so let us first instantiate a dataset for sentiment quantification (we
use IMDb for this example). The method is quite computationally expensive, so we will restrict the training
set to 1000 documents only.
"""
reviews = qp.datasets.fetch_reviews('imdb').reduce(n_train=1000, random_state=0)
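# Optional quick look at the reduced dataset (not required by the rest of the example):
print(f'#training={len(reviews.training)}, #test={len(reviews.test)}, '
      f'training prevalence={F.strprev(reviews.training.prevalence())}')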
"""
We need to convert text to bag-of-words representations. Actually, ReadMe requires the representations to be
binary (i.e., storing a 1 whenever a document contains certain word, or 0 otherwise), so we will not use
TFIDF weighting. We will also retain the top 1000 most important features according to chi2.
"""
encode_0_1 = Pipeline([
    ('0_1_terms', CountVectorizer(min_df=5, binary=True)),
    ('feat_sel', SelectKBest(chi2, k=1000))
])
train, test = qp.data.preprocessing.instance_transformation(reviews, encode_0_1, inplace=True).train_test
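# Optional sanity check (an assumption about the transformed data, not required by the example):
# after the pipeline above, the instances should be sparse matrices with exactly 1000 binary columns.
print(f'training matrix: shape={train.X.shape}, max value={train.X.max()}')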
"""
We now instantiate ReadMe, with the prob_model='full' (default behaviour, implementing the Hopkins and King original
idea). This method consists of estimating Q(Y) by solving:
Q(X) = \sum_i Q(X|Y=i) Q(Y=i)
without resorting to estimating the posteriors Q(Y=i|X), by solving a linear least-squares problem.
However, since Q(X) and Q(X|Y=i) are matrices of shape (2^K, 1) and (2^K, n), with K the number of features
and n the number of classes, their calculation becomes intractable. ReadMe instead performs bagging (i.e., it
samples small sets of features and averages the results) thus reducing K to a few terms. In our example we
set K (bagging_range) to 20, and the number of bagging_trials to 100.
ReadMe also computes confidence intervals via bootstrap. We set the number of bootstrap trials to 100.
"""
readme = ReadMe(prob_model='full', bootstrap_trials=100, bagging_trials=100, bagging_range=20, random_state=0, verbose=True)
readme.fit(*train.Xy) # <- there is actually nothing happening here (only bootstrap resampling); the method is "lazy"
# and postpones most of the calculations to the test phase.
# Since the method is slow, we only test 3 cases with different degrees of class imbalance
few_negatives = [0.25, 0.75]
balanced = [0.5, 0.5]
few_positives = [0.75, 0.25]
for test_prev in [few_negatives, balanced, few_positives]:
    sample = reviews.test.sampling(500, *test_prev, random_state=0)  # draw a sample of 500 documents with the desired prevalence
    prev_estim, conf = readme.predict_conf(sample.X)
    err = qp.error.mae(sample.prevalence(), prev_estim)
    print(f'true-prevalence={F.strprev(sample.prevalence())},\n'
          f'predicted-prevalence={F.strprev(prev_estim)}, with confidence intervals {conf},\n'
          f'MAE={err:.4f}')