Things to clarify:
maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same submatrices for alignment, and those may be mostly positive or mostly near the identity?
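a quick way to check this could be to log, per validation batch, the fraction of same-author pairs in the batch label matrix; a minimal sketch, assuming author ids arrive as an integer tensor per batch (names are made up):

    import torch

    def positive_pair_rate(authors: torch.Tensor) -> float:
        # authors: (batch,) tensor of author ids for the documents in the batch
        # Y[i, j] = 1 if documents i and j share the author, else 0
        Y = (authors.unsqueeze(0) == authors.unsqueeze(1)).float()
        n = authors.numel()
        # exclude the diagonal, which is trivially positive (the near-identity effect)
        off_diag = Y.sum() - n
        return (off_diag / (n * n - n)).item()

    # example: a batch drawn from only two authors is heavily skewed
    print(positive_pair_rate(torch.tensor([0, 0, 0, 0, 1, 1])))  # ~0.47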
maybe the sav-loss is something that might make sense to impose, as a regularization, across several of the last layers, and not only on the last one?
process datasets and leave it as a generic parameter
padding could start at any random point in [0, length_i - pad_length] (see the sketch after this list)
- in training, pad to the shortest
- in test, pad to the largest
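a minimal sketch of that padding/cropping policy, assuming documents are lists of token ids and 0 is the pad id (function and variable names are made up):

    import random

    def pad_or_crop(docs, pad_length, pad_id=0, train=True):
        # docs: list of token-id lists; pad_length: target length
        # (the shortest in the batch during training, the largest during test)
        out = []
        for doc in docs:
            if len(doc) >= pad_length:
                # crop to pad_length; in training, start at a random offset
                # in [0, length_i - pad_length], in test start at 0
                start = random.randint(0, len(doc) - pad_length) if train else 0
                out.append(doc[start:start + pad_length])
            else:
                out.append(doc + [pad_id] * (pad_length - len(doc)))
        return out

    docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]  # dummy token ids
    train_batch = pad_or_crop(docs, pad_length=min(len(d) for d in docs), train=True)
    test_batch = pad_or_crop(docs, pad_length=max(len(d) for d in docs), train=False)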
save and restore checkpoints
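for the checkpointing item, the usual PyTorch state_dict pattern should be enough; a sketch (names are placeholders):

    import torch

    def save_checkpoint(path, model, optimizer, epoch):
        torch.save({'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'epoch': epoch}, path)

    def load_checkpoint(path, model, optimizer):
        ckpt = torch.load(path)
        model.load_state_dict(ckpt['model'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return ckpt['epoch']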
should the phi(x) be normalized? if so:
- better at the last step of phi?
- better outside phi, prior to the gram matrix computation? (see the sketch after this list)
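the second option (normalizing outside phi, right before the Gram matrix) could look roughly like this; with L2 normalization, K becomes a cosine-similarity matrix bounded in [-1, 1], which also bears on the thresholding question below (a sketch, not the actual code):

    import torch
    import torch.nn.functional as F

    def gram_matrix(phi_x: torch.Tensor, normalize: bool = True) -> torch.Tensor:
        # phi_x: (batch, dim) output of phi
        if normalize:
            # L2-normalize outside phi, just before the kernel computation,
            # so K[i, j] = <phi(xi), phi(xj)> becomes a cosine similarity in [-1, 1]
            phi_x = F.normalize(phi_x, p=2, dim=1)
        return phi_x @ phi_x.t()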
should the single-label classifier have some sort of non-linearity from the phi(x) to the labels?
SAV: how should the range of k(xi,xj) be interpreted? how to decide on a value threshold for returning -1 or +1?
I guess the best thing to do is to learn a simple threshold, one 1-to-1 feed-forward layer
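that learned threshold could be just a 1-to-1 affine map on the kernel value followed by a sigmoid; a sketch, not the actual implementation:

    import torch
    import torch.nn as nn

    class LearnedThreshold(nn.Module):
        # maps a raw kernel value k(xi, xj) to a same-author probability;
        # the learned weight/bias play the role of the decision threshold
        def __init__(self):
            super().__init__()
            self.ff = nn.Linear(1, 1)

        def forward(self, k_values: torch.Tensor) -> torch.Tensor:
            # k_values: (n_pairs,) raw kernel values
            return torch.sigmoid(self.ff(k_values.unsqueeze(1))).squeeze(1)

    # decision: +1 if the probability exceeds 0.5, else -1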
is the TwoClassBatch the best way?
are the contributions of the two losses comparable? or does one contribute far more than the other?
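one way to answer this empirically: log the raw magnitudes of both terms and keep a balance weight to tune; a sketch, assuming the two terms are the classification loss and the sav/alignment loss (alpha is a hypothetical knob):

    import torch

    def combine_losses(cls_loss: torch.Tensor, sav_loss: torch.Tensor, alpha: float = 1.0):
        # log the raw magnitudes to see whether one term dwarfs the other
        ratio = cls_loss.item() / max(sav_loss.item(), 1e-8)
        print(f'cls={cls_loss.item():.4f} sav={sav_loss.item():.4f} ratio={ratio:.2f}')
        return cls_loss + alpha * sav_loss

    # example with dummy values
    total = combine_losses(torch.tensor(0.7), torch.tensor(0.02))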
what is the best representation for inputs? char-based? ngram-based? word-based? or a multichannel one?
I think this is irrelevant for the paper
not clear whether the single-label classifier should work out a feed-forward block on top of the intermediate representation, or should instead work directly on the representations with one simple linear projection; not clear either whether the kernel should be computed on any further elaboration of the intermediate representation... the thing is that the <phi(xi),phi(xj)> is imposing unimodality (documents from the same author should point in a single direction), while working out another representation for the single-label classifier could instead relax this and attribute to the same author vectors that come from a multimodal distribution. No... this "unimodality" should exist anyway in the last layer. Indeed, I am starting to think that the optimum for any classifier should already impose something similar to the KTA criterion in the last layer... Is this redundant?
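for reference, the two head options being compared are just the following (a sketch; sizes and names are made up):

    import torch.nn as nn

    dim, n_authors = 256, 50  # made-up sizes

    # option 1: a single linear projection directly on phi(x)
    linear_head = nn.Linear(dim, n_authors)

    # option 2: a feed-forward block (with a non-linearity) on top of phi(x),
    # which would let the classifier re-shape the space the kernel is aligned on
    ff_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_authors))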
not clear whether we should define the loss as in "On kernel target alignment", i.e., with <K,Y>_F in the numerator (and change sign in order to minimize), or as the ||K-Y||_F norm. What about the denominator? (right now the normalization factor is n**2)
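the two candidate definitions, written out as a sketch (the normalization is left exactly as the open point above):

    import torch

    def kta_loss(K: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # "On kernel target alignment" style: maximize <K,Y>_F / (||K||_F * ||Y||_F),
        # hence minimize its negation
        num = (K * Y).sum()
        den = torch.linalg.norm(K) * torch.linalg.norm(Y)
        return -num / den

    def frobenius_loss(K: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # alternative: ||K - Y||_F, here divided by n**2 as in the current code;
        # whether n**2 is the right normalization is the open question
        n = K.shape[0]
        return torch.linalg.norm(K - Y) / (n ** 2)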