parent a594a84dab
commit 92f16e60eb

@@ -1,21 +1,3 @@
-"""
-Test with a smaller subset of languages.
-
-1. Load doc (RCV1/2)
-2. Tokenize texts via bertTokenizer (I should already have these dumps)
-3. Construct better Dataloader/Datasets. NB: I need to keep track of the languages only for
-the testing phase (but who cares actually? If I have to do it for the testing phase, I think
-it is better to deploy it in the training phase as well...)
-4. ...
-5. I have to understand if the pooled hidden state of the last layer is way worse than its averaged
-version (however, in BertForSeqClassification I guess that the pooled version is passed through
-the output linear layer in order to get the prediction scores?)
-6. At the same time, I also have to build an end-to-end model in order to fine-tune it. The previous step
-would be useful when deploying mBert as a View Generator. (Refactor gFun code with view generators?)
-7. ...
-8. Profits
-
-"""
 from dataset_builder import MultilingualDataset
 from transformers import BertTokenizer, BertForSequenceClassification, AdamW
 from torch.utils.data import Dataset, DataLoader
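
Item 3 of the docstring removed by this commit asks for a Dataset/DataLoader that keeps track of each document's language. A minimal sketch of that idea, assuming the texts were already tokenized with BertTokenizer (without return_tensors, so the fields are plain Python lists) and that labels and language ids are parallel lists; MultilingualBertDataset and its field names are hypothetical, not part of the repository:

import torch
from torch.utils.data import Dataset

class MultilingualBertDataset(Dataset):
    """Pairs tokenizer output with labels and a per-example language id,
    so predictions can be grouped by language at test time."""

    def __init__(self, encodings, labels, lang_ids):
        self.encodings = encodings  # dict from BertTokenizer(..., padding=True, truncation=True)
        self.labels = labels
        self.lang_ids = lang_ids    # e.g. one integer code per document

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        item["lang_id"] = torch.tensor(self.lang_ids[idx])  # carried through train and test alike
        return item

Carrying lang_id in every batch costs almost nothing, which supports the docstring's own conclusion that tracking languages in training as well as testing is the simpler design.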
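Item 5 asks whether the pooled hidden state is much worse than an average over the last layer, and its guess about BertForSequenceClassification is right: that model feeds the pooled output (the [CLS] hidden state passed through a tanh dense layer) into a dropout plus linear classification head to get the prediction scores. A sketch of computing the two vectors side by side, assuming a transformers version whose models return outputs with named fields (older versions return tuples):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

batch = tokenizer(["a toy sentence"], return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch)

# (a) pooled output: [CLS] hidden state through BERT's tanh pooler layer
pooled = out.pooler_output                    # shape: (batch, hidden)

# (b) average of the last layer over real (non-padding) tokens
mask = batch["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
averaged = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

Masking before averaging matters: without it, padding positions would dilute the mean and make the comparison unfair to the averaged variant.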
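Item 6 is the end-to-end fine-tuned model. Using only the file's own imports (BertForSequenceClassification and AdamW), one training step could look like the sketch below; the model name, NUM_CLASSES, and the toy batch are placeholders, and the single-label cross-entropy computed here would need to become a binary cross-entropy over topic indicators if the RCV1/RCV2 labels are multi-label:

import torch
from transformers import BertTokenizer, BertForSequenceClassification, AdamW

NUM_CLASSES = 2  # placeholder; use the real RCV1/RCV2 label count

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_CLASSES)
optimizer = AdamW(model.parameters(), lr=2e-5)

# toy stand-in for a batch coming out of the DataLoader above
enc = tokenizer(["a toy document", "another toy document"],
                padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])

model.train()
outputs = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=labels)
outputs.loss.backward()  # loss computed internally from `labels`
optimizer.step()
optimizer.zero_grad()

For the View Generator use mentioned in the same item, the frozen-feature comparison of item 5 and this fine-tuned model share everything up to the classification head, so refactoring gFun around view generators would let both paths reuse one encoder wrapper.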