"""
Test with a smaller subset of languages.

1. Load the documents (RCV1/2).
2. Tokenize the texts via BertTokenizer (I should already have these dumps); see the tokenization
sketch below.
3. Construct better DataLoader/Datasets. NB: I need to keep track of the languages only for the
testing phase (but who cares, actually? If I have to do it for the testing phase, I think it is
better to deploy it in the training phase as well...); see the Dataset sketch below.
4. ...
5. I have to understand whether the pooled hidden state of the last layer is much worse than its
averaged version (however, in BertForSequenceClassification I guess the pooled version is passed
through the output linear layer in order to get the prediction scores?); see the comparison
sketch below.
6. At the same time, I also have to build an end-to-end model in order to fine-tune it (see the
fine-tuning sketch below). The previous step would be useful when deploying mBERT as a View
Generator. (Refactor the gFun code with view generators?)
7. ...
8. Profits

"""
from dataset_builder import MultilingualDataset
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from torch.utils.data import Dataset, DataLoader
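

# --- Sketch for step 2 (hypothetical: the model name and max length are
# assumptions, not values fixed anywhere in this file; assumes a
# transformers v3+/v4-style callable tokenizer). Batch-tokenizes raw
# texts with the multilingual BERT tokenizer.
def tokenize_texts(texts, max_len=512):
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    # padding/truncation yield fixed-size tensors ready for a DataLoader
    return tokenizer(texts, padding=True, truncation=True,
                     max_length=max_len, return_tensors='pt')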
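

# --- Sketch for step 3 (class and field names are illustrative
# assumptions). A Dataset that carries the language id alongside each
# example, so the per-language breakdown is available at test time
# (and, for symmetry, during training too).
class LangAwareDataset(Dataset):
    def __init__(self, encodings, labels, langs):
        self.encodings = encodings  # dict of tensors from the tokenizer
        self.labels = labels        # tensor of class labels
        self.langs = langs          # one language id per document

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        item['lang'] = self.langs[idx]
        return item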
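

# --- Sketch for step 5 (torch and BertModel are extra imports used only
# here; the model name and the v4-style model outputs are assumptions).
# Extracts both the [CLS] pooled output and a mask-aware mean over the
# last hidden layer, so the two document representations can be compared
# downstream.
import torch
from transformers import BertModel

def pooled_vs_averaged(batch, model_name='bert-base-multilingual-cased'):
    bert = BertModel.from_pretrained(model_name)
    bert.eval()
    with torch.no_grad():
        out = bert(input_ids=batch['input_ids'],
                   attention_mask=batch['attention_mask'])
    pooled = out.pooler_output  # tanh-projected [CLS], shape (batch, hidden)
    mask = batch['attention_mask'].unsqueeze(-1).float()
    # mean over real tokens only, ignoring padding positions
    averaged = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    return pooled, averaged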
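

# --- Sketch for step 6 (learning rate, epochs and num_labels are
# placeholder assumptions; batches are expected in the format produced
# by LangAwareDataset above). A minimal end-to-end fine-tuning loop with
# BertForSequenceClassification and AdamW. NB: the built-in loss assumes
# single-label class indices; multi-label RCV1/2 targets would need a
# different problem setup.
def finetune(train_loader, num_labels, epochs=3, lr=2e-5):
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-multilingual-cased', num_labels=num_labels)
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            out = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
            out.loss.backward()  # cross-entropy computed internally
            optimizer.step()
    return model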