Things to clarify:

- maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same submatrices for alignment, and those may be mostly positive or mostly near an identity?
- maybe the sav-loss is something that makes sense to impose, as a regularization, across many of the last layers, and not only the last one?
- process datasets and leave it as a generic parameter
- padding could start at any random point in [0, length_i - pad_length] (see the padding sketch below):
  - in training, pad to the shortest
  - in test, pad to the largest
- save and restore checkpoints
- should phi(x) be normalized? if so (see the gram-matrix sketch below):
  - better at the last step of phi?
  - better outside phi, right before the gram matrix computation?
- should the single-label classifier have some sort of non-linearity from phi(x) to the labels?
- SAV: how should the range of k(xi, xj) be interpreted? how to decide the value threshold for returning -1 or +1? I guess the best thing to do is to learn a simple threshold, one 1-to-1 feed-forward (see the threshold sketch below)
- is the TwoClassBatch the best way?
- are the contributions of the two losses comparable, or does one contribute far more than the other? (see the combined-loss sketch below)
- what is the best representation for inputs? char-based? ngram-based? word-based? or a multichannel one? I think this is irrelevant for the paper
- not clear whether the single-label classifier should work out a feed-forward on top of the intermediate representation, or whether it should instead work directly on the representation with one simple linear projection; not clear either whether the kernel should be computed on some further elaboration of the intermediate representation... the thing is that the kernel loss is imposing unimodality (documents from the same author should point in a single direction), while working out another representation for the single-label classifier could instead relax this and attribute to the same author vectors that come from a multimodal distribution. No... this "unimodality" should exist anyway in the last layer. Indeed, I am starting to think that the optimum for any classifier should already impose something similar to the KTA criterion in the last layer... is this redundant?
- not clear whether we should define the loss as in "On Kernel-Target Alignment", i.e., with the Frobenius inner product <K, Y>_F in the numerator (and change sign to minimize), or as the ||K - Y||_F norm. What about the denominator? (now, the normalization factor is n**2) (see the loss sketch below)
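
A minimal sketch of the random-window padding idea, in PyTorch, assuming documents arrive as 1D tensors of token indices; `random_window`, `pad_index`, and the choice of `pad_length` (shortest document in the batch while training, longest at test time) are my own naming and are not taken from the code:

```python
import random
import torch

def random_window(doc, pad_length, pad_index=0, training=True):
    """Crop or pad a 1D tensor of token indices to exactly pad_length.

    While training, the window starts at a random offset in
    [0, len(doc) - pad_length]; at test time it starts at 0 so the
    evaluation is deterministic. Shorter documents are right-padded.
    """
    n = doc.size(0)
    if n >= pad_length:
        start = random.randint(0, n - pad_length) if training else 0
        return doc[start:start + pad_length]
    padded = torch.full((pad_length,), pad_index, dtype=doc.dtype)
    padded[:n] = doc
    return padded
```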
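
On the normalization question: one option is to L2-normalize phi(x) outside phi, right before the gram matrix, so every entry of K becomes a cosine similarity in [-1, 1] (which would also make the -1/+1 threshold easier to pick). A hedged sketch, assuming phi returns a (batch, dim) tensor; the function name is illustrative:

```python
import torch.nn.functional as F

def gram_matrix(phi_x, normalize=True):
    """Compute K = phi(x) phi(x)^T over a batch of representations.

    phi_x: (batch, dim) tensor returned by phi.
    If normalize=True, rows are L2-normalized first, so K[i, j] is the
    cosine similarity between documents i and j and lies in [-1, 1];
    otherwise the entries are unbounded dot products.
    """
    if normalize:
        phi_x = F.normalize(phi_x, p=2, dim=1)
    return phi_x @ phi_x.t()
```

Putting the normalization as the last step inside phi would instead also constrain the representation that the single-label classifier sees, which is part of the open question above.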
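
For the "learn a simple threshold" option, a 1-to-1 feed-forward on the kernel value is literally a single nn.Linear(1, 1) producing a logit; the class and variable names below are illustrative, not from the repo:

```python
import torch.nn as nn

class KernelThreshold(nn.Module):
    """Learns sign(w * k + b) as the SAV decision on a kernel value k.

    The single weight and bias replace a hand-picked threshold; training
    with nn.BCEWithLogitsLoss against same-author (1) / different-author
    (0) targets recovers the -1/+1 decision via the sign of the logit.
    """
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(1, 1)

    def forward(self, k_values):
        # k_values: (n,) tensor of kernel values k(xi, xj)
        return self.lin(k_values.unsqueeze(-1)).squeeze(-1)  # logits
```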
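
The two candidate loss definitions written out: (i) the alignment from "On Kernel-Target Alignment", <K, Y>_F / (||K||_F ||Y||_F), negated so it can be minimized; (ii) the squared Frobenius distance ||K - Y||_F**2, here divided by n**2 as in the current normalization. A sketch assuming K and Y are (n, n) tensors, with Y the ideal kernel built from the labels:

```python
import torch

def neg_alignment_loss(K, Y, eps=1e-8):
    """- <K, Y>_F / (||K||_F ||Y||_F): minimizing this maximizes alignment."""
    num = (K * Y).sum()
    den = torch.sqrt((K * K).sum()) * torch.sqrt((Y * Y).sum()) + eps
    return -num / den

def frobenius_loss(K, Y):
    """||K - Y||_F ** 2 / n ** 2: the distance-based variant, with the
    current n**2 normalization factor."""
    n = K.size(0)
    return ((K - Y) ** 2).sum() / (n * n)
```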
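
To check whether one loss dominates the other, the simplest thing is to log the two terms of the combined objective separately; alpha is a hypothetical balancing weight, and neg_alignment_loss refers to the sketch above:

```python
import torch.nn.functional as F

def combined_loss(logits, labels, K, Y, alpha=0.5):
    """Weighted sum of the single-label classification loss and the
    (negated) alignment loss; returning the two terms separately makes
    it easy to log them and see whether one dwarfs the other.
    alpha is a hypothetical balancing weight to be tuned.
    """
    loss_cls = F.cross_entropy(logits, labels)
    loss_sav = neg_alignment_loss(K, Y)  # defined in the sketch above
    return loss_cls + alpha * loss_sav, loss_cls.detach(), loss_sav.detach()
```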