---
language:
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- t5
- seq2seq
inference: false
license: apache-2.0
---

# t5-v1.1-base-dutch-cased

A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) sequence-to-sequence model
pre-trained from scratch on [cleaned Dutch 🇳🇱🇧🇪 mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).

This **t5-v1.1** model has **247M** parameters.
It was pre-trained on the `mc4_nl_cleaned` dataset (config `full`) for **2** epochs, a duration of **6d6h**,
with a sequence length of **1024**, batch size **64** and **1210154** total steps.
Pre-training evaluation loss and accuracy are **0.96** and **0.78**.
After fine-tuning on 25K samples of Dutch CNN summarization, the Rouge1 score is **34.1**
(note: this evaluation model was not saved).

* Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks, so the inference widget on the right has been turned off; a minimal loading sketch is shown below the pre-training loss plot.
* For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
  the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!

Please refer to the original T5 paper and the Scale Efficiently paper (listed below) for more information about the T5 architecture
and configs. Note that this model (t5-v1.1-base-dutch-cased) is unrelated to these projects and is not an 'official' checkpoint.

* **[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)** by *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*.
* **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)** by *Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler*.



## Tokenizer

The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
and has 32003 tokens.
It was trained on Dutch mC4 with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
See [./raw/main/tokenizer.json](tokenizer.json) for details.
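
As a quick check, the tokenizer can be inspected on its own. A minimal sketch; note that the size reported by `len(tokenizer)` may include a few added special tokens on top of the SentencePiece vocabulary mentioned above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-v1.1-base-dutch-cased")

# Vocabulary size (the card lists 32003 tokens for the SentencePiece model).
print(len(tokenizer))

# Cased tokenizer: capitalisation changes the tokenization.
print(tokenizer.tokenize("Nederland"))
print(tokenizer.tokenize("nederland"))
```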
## Dataset

All models listed below are trained on
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except

* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.

The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.
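
The cleaned dataset is hosted on the Hugging Face Hub and can be streamed without a full download. A minimal sketch with the `datasets` library, using the `full` config named in this card; each example is expected to carry a `text` field, as in mC4.

```python
from datasets import load_dataset

# Stream the cleaned Dutch mC4 corpus ("full" config) instead of downloading it.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

for example in dataset:
    print(example["text"][:200])
    break
```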
## Models

Three types of models have been trained. `t5-base-dutch` is the only model with an original T5 config.
The other model types, t5-v1.1 and t5-eff, have `gated-gelu` instead of `relu` as activation function,
and are trained with a dropout of `0.0` unless training would diverge (as with `t5-v1.1-large-dutch-cased`).
The t5-eff models differ mainly in their number of layers. The table below lists the dimensions of these models.
Note that `efficient` is a misnomer for models with few layers, e.g. `t5-xl-4L-dutch-english-cased`,
which is not efficient and is one of the worst models on downstream summarization.

| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000 | 10000 | 10000 | 10000 | 10000 | 5000 | 20000 | 2500 | 1000 | 1500 | 1500 |
| eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
| eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
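
The architecture dimensions above can be read directly from each checkpoint's configuration. A minimal sketch using two of the checkpoints listed in the table; the full repository ids are assumed to live under the same `yhavinga` namespace as this model.

```python
from transformers import AutoConfig

# Compare a few of the dimensions listed in the table above.
for name in ("yhavinga/t5-base-dutch", "yhavinga/t5-v1.1-base-dutch-cased"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.d_model, cfg.d_ff, cfg.num_heads, cfg.d_kv,
          cfg.num_layers, cfg.feed_forward_proj, cfg.dropout_rate)
```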

## Evaluation on summarization

The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset.
All models were fine-tuned with the AdamW optimizer, a batch size of 128 and a constant learning rate of 1e-3 after a
warmup of 64 steps, with a label smoothing factor of 0.05.
Article and summary token lengths were set to 1024 and 142.

| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-large-8l-dutch-english-cased | mt5-base |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| rouge1 | 33.0313 | 33.8432 | 34.0906 | 33.1116 | 34.6465 | 34.376 | 30.8983 | 35.0931 | 33.9293 | 33.6466 |
| rouge2 | 12.9452 | 13.7706 | 13.6203 | 13.275 | 13.8525 | 13.8939 | 11.6005 | 14.3823 | 13.6274 | 13.1085 |
| rougeL | 23.7204 | 24.5642 | 24.7304 | 24.3561 | 24.721 | 25.2496 | 22.6536 | 25.3213 | 24.5595 | 23.909 |
| rougeLsum | 29.842 | 30.7783 | 31.1438 | 30.0548 | 31.6104 | 31.3838 | 27.8467 | 32.3526 | 30.952 | 30.5054 |
| gen_len | 90.488 | 91.832 | 92.122 | 89.583 | 98.333 | 90.442 | 92.342 | 96.832 | 95.057 | 96.312 |
| num parameters | 223M | 248M | 248M | 248M | 248M | 250M | 585M | 729M | 335M | 582M |
| samples_per_second | 3.195 | 3.039 | 3.0 | 3.216 | 2.974 | 1.594 | 2.47 | 0.623 | 3.087 | 1.201 |
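
For reference, the fine-tuning setup described above maps roughly onto `Seq2SeqTrainingArguments` as sketched below. This is a sketch of the hyper-parameters only, not the exact training script used for these evaluations; dataset preparation, the 1024-token input truncation and the `Seq2SeqTrainer` call are omitted, and the output path is hypothetical.

```python
from transformers import Seq2SeqTrainingArguments

# Summarization fine-tuning setup from the card: AdamW, batch size 128 (total;
# adjust per-device size / accumulation to your hardware), constant lr 1e-3
# after 64 warmup steps, label smoothing 0.05, summaries up to 142 tokens.
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-dutch-cnn-summarization",  # hypothetical output path
    per_device_train_batch_size=128,
    learning_rate=1e-3,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=64,
    label_smoothing_factor=0.05,
    optim="adamw_torch",
    predict_with_generate=True,
    generation_max_length=142,
)
```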

## Translation models

The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
The models whose names end in `-multi` support both directions of translation. The models are trained on CCMatrix only.
As this is a really large dataset with over 100M Dutch-English sentence pairs, the models are trained on a fraction of it;
refer to the table below for how long. Evaluation is performed on a held-out CCMatrix section, and also
on Tatoeba and Opus Books. The `_bp` columns list the *brevity penalty*. The `avg_bleu` score is the BLEU score
averaged over all three evaluation datasets.

The translation metrics are listed in the table below:

| | t5-base-36L-ccmatrix-en-nl | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|:---|:---|:---|:---|:---|:---|
| id | 0 | 14 | 15 | 16 | 20 |
| source_lang | en | en | nl | en | nl |
| target_lang | nl | nl | en | nl | en |
| source_prefix | translate English to Dutch: | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
| tatoeba_bp | 0.9898 | 0.9736 | 0.9435 | 0.9761 | 0.9407 |
| ccmatrix_bp | 0.9591 | 0.9536 | 0.9636 | 0.9518 | 0.9586 |
| opus_books_bp | 0.7478 | 0.7950 | 0.9363 | 0.7705 | 0.8871 |
| tatoeba_score | 50.63 | 46.58 | 52.82 | 46.42 | 51.68 |
| ccmatrix_score | 60.33 | 56.81 | 62.84 | 57.40 | 63.09 |
| opus_books_score | 10.41 | 13.48 | 24.93 | 12.93 | 23.42 |
| avg_bleu | 40.46 | 38.96 | 46.86 | 38.92 | 46.06 |
| total steps | 78125 | 390625 | 390625 | 390625 | 390625 |
| duration | 14h | 101h | 101h | 74h | 74h |
| num_parameters | 728928000 | 728928000 | 728928000 | 249991680 | 249991680 |
| label_smoothing_factor | 0.09 | 0.15 | 0.15 | 0.1 | 0.1 |
| learning_rate | 0.0001 | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
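
Once fine-tuned, the translation models are used like any other T5 checkpoint, with the `source_prefix` from the table prepended to the input. A minimal sketch; the repository id below assumes the `t5-small-24L-ccmatrix-multi` checkpoint is published under the `yhavinga` namespace, and the example sentence and generation settings are illustrative only.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "yhavinga/t5-small-24L-ccmatrix-multi"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The -multi checkpoints handle both directions; the task prefix selects one.
text = "translate English to Dutch: The weather is nice today."
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```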

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many
models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would
have completed this project otherwise.
The following repositories were helpful in setting up the TPU-VM
and getting an idea of sensible hyper-parameters for pre-training these models from scratch.

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)