Things to clarify:

- maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same submatrices for alignment, and those may be mostly positive or mostly near an identity?
- maybe the sav-loss is something that makes sense to impose, as a regularization, across many of the last layers, and not only the last one?
- process datasets and leave it as a generic parameter
- padding could start at any random point in [0, length_i - pad_length] (see the padding sketch below):
  - in training, pad to the shortest
  - in test, pad to the largest
- save and restore checkpoints
- should phi(x) be normalized? if so (see the gram-matrix sketch below):
  - better at the last step of phi?
  - better outside phi, right before the gram matrix computation?
- should the single-label classifier have some sort of non-linearity from phi(x) to the labels?
- SAV: how should the range of k(xi, xj) be interpreted? how to decide the value threshold for returning -1 or +1? I guess the best thing to do is to learn a simple threshold, one 1-to-1 feed-forward (see the threshold sketch below)
- is the TwoClassBatch the best way?
- are the contributions of the two losses comparable, or does one contribute far more than the other? (see the combined-loss sketch below)
- what is the best representation for inputs? char-based? ngram-based? word-based? or a multichannel one? I think this is irrelevant for the paper
- not clear whether the single-label classifier should work out a feed-forward on top of the intermediate representation, or whether it should instead work directly on the representation with one simple linear projection; not clear either whether the kernel should be computed on some further elaboration of the intermediate representation... the thing is that the kernel loss is imposing unimodality (documents from the same author should point in a single direction), while working out another representation for the single-label classifier could instead relax this and attribute to the same author vectors that come from a multimodal distribution. No... this "unimodality" should exist anyway in the last layer. Indeed, I am starting to think that the optimum for any classifier should already impose something similar to the KTA criterion in the last layer... is this redundant?
- not clear whether we should define the loss as in "On Kernel-Target Alignment", i.e., with the Frobenius inner product <K, Y>_F in the numerator (and change sign to minimize), or as the ||K - Y||_F norm. What about the denominator? (now, the normalization factor is n**2) (see the loss sketch below)
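
A minimal sketch of the random-window padding idea, in PyTorch, assuming documents arrive as 1D tensors of token indices; `random_window`, `pad_index`, and the choice of `pad_length` (shortest document in the batch while training, longest at test time) are my own naming and are not taken from the code:

```python
import random
import torch

def random_window(doc, pad_length, pad_index=0, training=True):
    """Crop or pad a 1D tensor of token indices to exactly pad_length.

    While training, the window starts at a random offset in
    [0, len(doc) - pad_length]; at test time it starts at 0 so the
    evaluation is deterministic. Shorter documents are right-padded.
    """
    n = doc.size(0)
    if n >= pad_length:
        start = random.randint(0, n - pad_length) if training else 0
        return doc[start:start + pad_length]
    padded = torch.full((pad_length,), pad_index, dtype=doc.dtype)
    padded[:n] = doc
    return padded
```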
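
On the normalization question: one option is to L2-normalize phi(x) outside phi, right before the gram matrix, so every entry of K becomes a cosine similarity in [-1, 1] (which would also make the -1/+1 threshold easier to pick). A hedged sketch, assuming phi returns a (batch, dim) tensor; the function name is illustrative:

```python
import torch.nn.functional as F

def gram_matrix(phi_x, normalize=True):
    """Compute K = phi(x) phi(x)^T over a batch of representations.

    phi_x: (batch, dim) tensor returned by phi.
    If normalize=True, rows are L2-normalized first, so K[i, j] is the
    cosine similarity between documents i and j and lies in [-1, 1];
    otherwise the entries are unbounded dot products.
    """
    if normalize:
        phi_x = F.normalize(phi_x, p=2, dim=1)
    return phi_x @ phi_x.t()
```

Putting the normalization as the last step inside phi would instead also constrain the representation that the single-label classifier sees, which is part of the open question above.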
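
For the "learn a simple threshold" option, a 1-to-1 feed-forward on the kernel value is literally a single nn.Linear(1, 1) producing a logit; the class and variable names below are illustrative, not from the repo:

```python
import torch.nn as nn

class KernelThreshold(nn.Module):
    """Learns sign(w * k + b) as the SAV decision on a kernel value k.

    The single weight and bias replace a hand-picked threshold; training
    with nn.BCEWithLogitsLoss against same-author (1) / different-author
    (0) targets recovers the -1/+1 decision via the sign of the logit.
    """
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(1, 1)

    def forward(self, k_values):
        # k_values: (n,) tensor of kernel values k(xi, xj)
        return self.lin(k_values.unsqueeze(-1)).squeeze(-1)  # logits
```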
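
The two candidate loss definitions written out: (i) the alignment from "On Kernel-Target Alignment", <K, Y>_F / (||K||_F ||Y||_F), negated so it can be minimized; (ii) the squared Frobenius distance ||K - Y||_F**2, here divided by n**2 as in the current normalization. A sketch assuming K and Y are (n, n) tensors, with Y the ideal kernel built from the labels:

```python
import torch

def neg_alignment_loss(K, Y, eps=1e-8):
    """- <K, Y>_F / (||K||_F ||Y||_F): minimizing this maximizes alignment."""
    num = (K * Y).sum()
    den = torch.sqrt((K * K).sum()) * torch.sqrt((Y * Y).sum()) + eps
    return -num / den

def frobenius_loss(K, Y):
    """||K - Y||_F ** 2 / n ** 2: the distance-based variant, with the
    current n**2 normalization factor."""
    n = K.size(0)
    return ((K - Y) ** 2).sum() / (n * n)
```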
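
To check whether one loss dominates the other, the simplest thing is to log the two terms of the combined objective separately; alpha is a hypothetical balancing weight, and neg_alignment_loss refers to the sketch above:

```python
import torch.nn.functional as F

def combined_loss(logits, labels, K, Y, alpha=0.5):
    """Weighted sum of the single-label classification loss and the
    (negated) alignment loss; returning the two terms separately makes
    it easy to log them and see whether one dwarfs the other.
    alpha is a hypothetical balancing weight to be tuned.
    """
    loss_cls = F.cross_entropy(logits, labels)
    loss_sav = neg_alignment_loss(K, Y)  # defined in the sketch above
    return loss_cls + alpha * loss_sav, loss_cls.detach(), loss_sav.detach()
```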