Recap Feb. 2021:

- Adapt everything to testing a classic neural training for AA (i.e., projector+classifier training) vs. applying
  Supervised Contrastive Learning (SCL) as a pretraining step for solving SAV, and then training a linear classifier
  with the projector network frozen. Reassess the work in terms of SAV and make connections with KTA and SVM. Maybe
  claim that SCL+SVM is the way to go. (See the sketch after this list.)
- Compare (Attribution):
  - S.Ruder's systems
  - My system (projector+classifier layer) as a reimplementation of S.Ruder's systems
  - Projector trained via SCL + Classifier layer trained alone.
  - Projector trained via SCL + SVM Classifier.
  - Projector trained via KTA + SVM Classifier.
- Compare (SAV):
  - My system (projector+binary-classifier layer)
  - Projector trained via SCL + Binary Classifier layer trained alone.
  - Projector trained via SCL + SVM Classifier.
  - Projector trained via KTA + SVM Classifier.
  - Other systems (maybe Diff-Vectors, maybe Impostors, maybe distance-based)
- Additional experiments:
  - show the kernel matrix
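
A minimal sketch of the SCL-pretraining + frozen-projector pipeline from the first item above. This is not the actual
code: supcon_loss follows the standard SupCon formulation, and projector, train_loader, emb_dim and num_authors are
placeholders for whatever the real modules and loaders are called (an SVM fitted on the frozen projections would
replace the second phase).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def supcon_loss(z, labels, tau=0.1):
        """Supervised contrastive loss over a batch of projections z (n, d) with author labels (n,)."""
        z = F.normalize(z, dim=1)
        sim = (z @ z.t()) / tau                                    # pairwise cosine similarities / temperature
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-similarity from the denominator
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_counts = pos_mask.sum(1)
        valid = pos_counts > 0                                     # anchors with at least one positive in the batch
        loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1)[valid] / pos_counts[valid]
        return loss.mean()

    # Phase 1: SCL pretraining of the projector on (document, author-id) batches.
    opt = torch.optim.Adam(projector.parameters(), lr=1e-3)
    for x, y in train_loader:
        loss = supcon_loss(projector(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: freeze the projector and train only a linear classifier on top.
    for p in projector.parameters():
        p.requires_grad = False
    classifier = nn.Linear(emb_dim, num_authors)
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    for x, y in train_loader:                                      # y: author ids as a long tensor
        with torch.no_grad():
            z = projector(x)
        loss = F.cross_entropy(classifier(z), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
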
Future:

- Test also in general TC? There are some torch datasets in torchtext that could simplify things... but that would
  blur the idea of SCL-SAV.

Code:

- redo the dataset handling in terms of pytorch's DataLoader (see the sketch below)
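
A possible shape for that refactoring; AuthorshipDataset, encoded_docs and labels are made-up names, and the collate
function pads each batch to its longest document (the padding policy itself is still an open point, see below).

    import torch
    from torch.utils.data import Dataset, DataLoader

    class AuthorshipDataset(Dataset):
        """Wraps pre-encoded documents (variable-length LongTensors) and their author ids."""
        def __init__(self, encoded_docs, labels):
            self.encoded_docs = encoded_docs
            self.labels = labels

        def __len__(self):
            return len(self.encoded_docs)

        def __getitem__(self, i):
            return self.encoded_docs[i], self.labels[i]

    def collate(batch):
        # pad every document in the batch to the longest one
        docs, labels = zip(*batch)
        padded = torch.nn.utils.rnn.pad_sequence(docs, batch_first=True, padding_value=0)
        return padded, torch.tensor(labels)

    loader = DataLoader(AuthorshipDataset(encoded_docs, labels), batch_size=64,
                        shuffle=True, collate_fn=collate)
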
---------------------
Things to clarify:

about the network:
==================
remove the .to() calls inside the Module and use self.on_cpu instead
process datasets and leave it as a generic parameter
padding could start at any random point in [0, length_i - pad_length] (see the sketch below):
- in training, pad to the shortest
- in test, pad to the largest
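
A sketch of the random-start idea, reading it as "take a window of pad_length tokens starting at a random offset";
the function name and the zero padding value are assumptions, not the current code.

    import random
    import torch

    def random_window(doc, pad_length):
        """Take a window of pad_length tokens starting at a random offset in
        [0, len(doc) - pad_length]; right-pad with zeros if the document is shorter."""
        if len(doc) <= pad_length:
            out = torch.zeros(pad_length, dtype=doc.dtype)
            out[:len(doc)] = doc
            return out
        start = random.randint(0, len(doc) - pad_length)
        return doc[start:start + pad_length]
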
about the loss and the KTA:
===========================
not clear whether we should define the loss as in "On kernel target alignment", i.e., with <K,Y>_F in the numerator
(flipping the sign so it can be minimized), or as the norm ||K-Y||_F. What about the denominator (right now the
normalization factor is n**2)? See the sketch below.
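
For reference, a sketch of the two candidate formulations, assuming K is the linear kernel of the L2-normalized batch
projections and Y is the ideal SAV kernel (+1 same author, -1 otherwise); neither is necessarily what the current
code does.

    import torch
    import torch.nn.functional as F

    def batch_kernels(z, labels):
        z = F.normalize(z, dim=1)
        K = z @ z.t()
        Y = (labels.unsqueeze(0) == labels.unsqueeze(1)).float() * 2.0 - 1.0
        return K, Y

    def neg_alignment_loss(K, Y):
        # "On kernel target alignment" style: <K,Y>_F / (||K||_F ||Y||_F), sign flipped to minimize
        return -(K * Y).sum() / (K.norm(p='fro') * Y.norm(p='fro'))

    def frobenius_loss(K, Y):
        # alternative: squared Frobenius distance, averaged over the n**2 entries
        # (matches the current n**2 normalization factor; the plain norm ||K-Y||_F is another option)
        n = K.size(0)
        return ((K - Y) ** 2).sum() / (n ** 2)
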
maybe the sav-loss is something that might make sense to impose, as a regularization, across several of the last
layers, and not only on the last one?

are the contributions of the two losses comparable, or does one contribute far more than the other?
is the TwoClassBatch the best way?
maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same
submatrices for alignment, and those may be mostly positive or mostly close to an identity?
SAV: how should the range of k(xi,xj) be interpreted? how to decide on a threshold value for returning -1 or +1?
I guess the best thing to do is to learn a simple threshold, i.e., a 1-to-1 feed-forward layer (see the sketch below).
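
A sketch of that learned threshold, as a single 1-to-1 affine map trained with binary cross-entropy on precomputed
kernel values; pair_loader and the {0,1} same-author labels are assumptions.

    import torch
    import torch.nn as nn

    thresholder = nn.Linear(1, 1)                      # learns a scale and a threshold on k(xi, xj)
    criterion = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(thresholder.parameters(), lr=1e-2)

    for k_values, same_author in pair_loader:          # k_values: (B,), same_author: (B,) in {0, 1}
        logits = thresholder(k_values.unsqueeze(1)).squeeze(1)
        loss = criterion(logits, same_author.float())
        opt.zero_grad()
        loss.backward()
        opt.step()

    def sav_decision(k_value):
        # return +1 when the learned logit is positive, -1 otherwise
        with torch.no_grad():
            return 1 if thresholder(torch.tensor([[float(k_value)]])).item() > 0 else -1
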
plot the kernel matrix with imshow, with rows/cols arranged by author, and check whether the KTA that SCL yields is
better than the one obtained with a traditional training for attribution (see the sketch below).
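
A sketch of that check; projections (n, d) and authors (n,) stand for the validation embeddings and author ids
produced by each trained model, so the same plot and alignment score can be compared across SCL and classic AA runs.

    import matplotlib.pyplot as plt
    import torch.nn.functional as F

    order = authors.argsort()                           # group rows/cols by author
    z = F.normalize(projections[order], dim=1)
    K = z @ z.t()
    Y = (authors[order].unsqueeze(0) == authors[order].unsqueeze(1)).float() * 2.0 - 1.0
    kta = (K * Y).sum() / (K.norm(p='fro') * Y.norm(p='fro'))
    print(f'alignment on this set: {kta.item():.4f}')   # compare SCL vs. classic AA training

    plt.imshow(K.detach().cpu().numpy(), cmap='viridis', vmin=-1, vmax=1)
    plt.colorbar(label='k(x_i, x_j)')
    plt.title('Kernel matrix (rows/cols sorted by author)')
    plt.show()
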