Recap Feb. 2021:

- Adapt everything to testing a classic neural training for AA (i.e., projector+classifier training) vs. applying
  Supervised Contrastive Learning (SCL) as a pretraining step for solving SAV, and then training a linear classifier
  with the projector network frozen. Reassess the work in terms of SAV and make connections with KTA and SVM. Maybe
  claim that SCL+SVM is the way to go. (See the sketch after this list.)
- Compare (Attribution):
  - S.Ruder's systems
  - My system (projector+classifier layer) as a reimplementation of S.Ruder's systems
  - Projector trained via SCL + Classifier layer trained alone.
  - Projector trained via SCL + SVM Classifier.
  - Projector trained via KTA + SVM Classifier.
- Compare (SAV):
  - My system (projector+binary-classifier layer)
  - Projector trained via SCL + Binary Classifier layer trained alone.
  - Projector trained via SCL + SVM Classifier.
  - Projector trained via KTA + SVM Classifier.
  - Other systems (maybe Diff-Vectors, maybe Impostors, maybe distance-based)
- Additional experiments:
  - show the kernel matrix
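
A minimal sketch of the SCL-pretraining + frozen-projector pipeline from the first item above. This is not the actual
code: supcon_loss follows the standard SupCon formulation, and projector, train_loader, emb_dim and num_authors are
placeholders for whatever the real modules and loaders are called (an SVM fitted on the frozen projections would
replace the second phase).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def supcon_loss(z, labels, tau=0.1):
        """Supervised contrastive loss over a batch of projections z (n, d) with author labels (n,)."""
        z = F.normalize(z, dim=1)
        sim = (z @ z.t()) / tau                                    # pairwise cosine similarities / temperature
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-similarity from the denominator
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_counts = pos_mask.sum(1)
        valid = pos_counts > 0                                     # anchors with at least one positive in the batch
        loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1)[valid] / pos_counts[valid]
        return loss.mean()

    # Phase 1: SCL pretraining of the projector on (document, author-id) batches.
    opt = torch.optim.Adam(projector.parameters(), lr=1e-3)
    for x, y in train_loader:
        loss = supcon_loss(projector(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: freeze the projector and train only a linear classifier on top.
    for p in projector.parameters():
        p.requires_grad = False
    classifier = nn.Linear(emb_dim, num_authors)
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    for x, y in train_loader:                                      # y: author ids as a long tensor
        with torch.no_grad():
            z = projector(x)
        loss = F.cross_entropy(classifier(z), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
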
Future:

- Test also in general TC? There are some torch datasets in torchtext that could simplify things... but that would
  blur the idea of SCL-SAV.

Code:

- redo the dataset handling in terms of pytorch's DataLoader (see the sketch below)
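
A possible shape for that refactoring; AuthorshipDataset, encoded_docs and labels are made-up names, and the collate
function pads each batch to its longest document (the padding policy itself is still an open point, see below).

    import torch
    from torch.utils.data import Dataset, DataLoader

    class AuthorshipDataset(Dataset):
        """Wraps pre-encoded documents (variable-length LongTensors) and their author ids."""
        def __init__(self, encoded_docs, labels):
            self.encoded_docs = encoded_docs
            self.labels = labels

        def __len__(self):
            return len(self.encoded_docs)

        def __getitem__(self, i):
            return self.encoded_docs[i], self.labels[i]

    def collate(batch):
        # pad every document in the batch to the longest one
        docs, labels = zip(*batch)
        padded = torch.nn.utils.rnn.pad_sequence(docs, batch_first=True, padding_value=0)
        return padded, torch.tensor(labels)

    loader = DataLoader(AuthorshipDataset(encoded_docs, labels), batch_size=64,
                        shuffle=True, collate_fn=collate)
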
---------------------
Things to clarify:

about the network:
==================
remove the .to() calls inside the Module and use self.on_cpu instead
process datasets and leave it as a generic parameter
padding could start at any random point in [0, length_i - pad_length] (see the sketch below):
- in training, pad to the shortest
- in test, pad to the largest
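
A sketch of the random-start idea, reading it as "take a window of pad_length tokens starting at a random offset";
the function name and the zero padding value are assumptions, not the current code.

    import random
    import torch

    def random_window(doc, pad_length):
        """Take a window of pad_length tokens starting at a random offset in
        [0, len(doc) - pad_length]; right-pad with zeros if the document is shorter."""
        if len(doc) <= pad_length:
            out = torch.zeros(pad_length, dtype=doc.dtype)
            out[:len(doc)] = doc
            return out
        start = random.randint(0, len(doc) - pad_length)
        return doc[start:start + pad_length]
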
about the loss and the KTA:
===========================
not clear whether we should define the loss as in "On kernel target alignment", i.e., with <K,Y>_F in the numerator
(flipping the sign so it can be minimized), or as the norm ||K-Y||_F. What about the denominator (right now the
normalization factor is n**2)? See the sketch below.
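
For reference, a sketch of the two candidate formulations, assuming K is the linear kernel of the L2-normalized batch
projections and Y is the ideal SAV kernel (+1 same author, -1 otherwise); neither is necessarily what the current
code does.

    import torch
    import torch.nn.functional as F

    def batch_kernels(z, labels):
        z = F.normalize(z, dim=1)
        K = z @ z.t()
        Y = (labels.unsqueeze(0) == labels.unsqueeze(1)).float() * 2.0 - 1.0
        return K, Y

    def neg_alignment_loss(K, Y):
        # "On kernel target alignment" style: <K,Y>_F / (||K||_F ||Y||_F), sign flipped to minimize
        return -(K * Y).sum() / (K.norm(p='fro') * Y.norm(p='fro'))

    def frobenius_loss(K, Y):
        # alternative: squared Frobenius distance, averaged over the n**2 entries
        # (matches the current n**2 normalization factor; the plain norm ||K-Y||_F is another option)
        n = K.size(0)
        return ((K - Y) ** 2).sum() / (n ** 2)
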
maybe the sav-loss is something that might make sense to impose, as a regularization, across several of the last
layers, and not only on the last one?

are the contributions of the two losses comparable, or does one contribute far more than the other?
is the TwoClassBatch the best way?
maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same
submatrices for alignment, and those may be mostly positive or mostly close to an identity?
SAV: how should the range of k(xi,xj) be interpreted? how to decide on a threshold value for returning -1 or +1?
I guess the best thing to do is to learn a simple threshold, i.e., a 1-to-1 feed-forward layer (see the sketch below).
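
A sketch of that learned threshold, as a single 1-to-1 affine map trained with binary cross-entropy on precomputed
kernel values; pair_loader and the {0,1} same-author labels are assumptions.

    import torch
    import torch.nn as nn

    thresholder = nn.Linear(1, 1)                      # learns a scale and a threshold on k(xi, xj)
    criterion = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(thresholder.parameters(), lr=1e-2)

    for k_values, same_author in pair_loader:          # k_values: (B,), same_author: (B,) in {0, 1}
        logits = thresholder(k_values.unsqueeze(1)).squeeze(1)
        loss = criterion(logits, same_author.float())
        opt.zero_grad()
        loss.backward()
        opt.step()

    def sav_decision(k_value):
        # return +1 when the learned logit is positive, -1 otherwise
        with torch.no_grad():
            return 1 if thresholder(torch.tensor([[float(k_value)]])).item() > 0 else -1
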
plot the kernel matrix with imshow, with rows/cols arranged by author, and check whether the KTA that SCL yields is
better than the one obtained with a traditional training for attribution (see the sketch below).
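
A sketch of that check; projections (n, d) and authors (n,) stand for the validation embeddings and author ids
produced by each trained model, so the same plot and alignment score can be compared across SCL and classic AA runs.

    import matplotlib.pyplot as plt
    import torch.nn.functional as F

    order = authors.argsort()                           # group rows/cols by author
    z = F.normalize(projections[order], dim=1)
    K = z @ z.t()
    Y = (authors[order].unsqueeze(0) == authors[order].unsqueeze(1)).float() * 2.0 - 1.0
    kta = (K * Y).sum() / (K.norm(p='fro') * Y.norm(p='fro'))
    print(f'alignment on this set: {kta.item():.4f}')   # compare SCL vs. classic AA training

    plt.imshow(K.detach().cpu().numpy(), cmap='viridis', vmin=-1, vmax=1)
    plt.colorbar(label='k(x_i, x_j)')
    plt.title('Kernel matrix (rows/cols sorted by author)')
    plt.show()
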