license: apache-2.0
---

# The [BERTić](https://huggingface.co/classla/bcms-bertic)* [bert-ich] /bɜrtitʃ/ model fine-tuned for the task of named entity recognition in Bosnian, Croatian, Montenegrin and Serbian (BCMS)

\* The name reflects the facts (1) that the model was trained in Zagreb, Croatia, where diminutives ending in -ić (as in fotić, smajlić, hengić, etc.) are very popular, and (2) that most surnames in the countries where these languages are spoken end in -ić (with diminutive etymology as well).

This is a fine-tuned version of the [BERTić](https://huggingface.co/classla/bcms-bertic) model for the task of named entity recognition (PER, LOC, ORG, MISC). The fine-tuning was performed on the following datasets:

- the [hr500k](http://hdl.handle.net/11356/1183) dataset, 500 thousand tokens in size, standard Croatian
- the [SETimes.SR](http://hdl.handle.net/11356/1200) dataset, 87 thousand tokens in size, standard Serbian
- the [ReLDI-hr](http://hdl.handle.net/11356/1241) dataset, 89 thousand tokens in size, Internet (Twitter) Croatian
- the [ReLDI-sr](http://hdl.handle.net/11356/1240) dataset, 92 thousand tokens in size, Internet (Twitter) Serbian

The data was augmented with undiacritized copies (versions with the diacritics removed), and the standard-language data was additionally over-represented. The F1 score obtained on the dev data (the train and test splits were merged into the train split) is 91.38. For a more detailed per-dataset evaluation of the BERTić model on the NER task, have a look at the [main model page](https://huggingface.co/classla/bcms-bertic).
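
To illustrate the augmentation, here is a hypothetical sketch of how undiacritized copies of training sentences can be produced; the exact mapping used for this model is an assumption (mapping đ to dj is one common convention):

```python
# Hypothetical sketch of the augmentation step: each sentence is duplicated
# with the BCMS diacritics stripped, mimicking undiacritized Internet text.
STRIP_DIACRITICS = str.maketrans({
    "č": "c", "ć": "c", "ž": "z", "š": "s", "đ": "dj",
    "Č": "C", "Ć": "C", "Ž": "Z", "Š": "S", "Đ": "Dj",
})

def augment(sentences):
    """Return the original sentences plus copies without diacritics."""
    return sentences + [s.translate(STRIP_DIACRITICS) for s in sentences]

print(augment(["Šibenik je grad u Hrvatskoj."]))
# ['Šibenik je grad u Hrvatskoj.', 'Sibenik je grad u Hrvatskoj.']
```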
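
For quick experimentation, the fine-tuned model can be loaded through the `transformers` pipeline API. A minimal sketch, assuming the model is published on the Hugging Face Hub as `classla/bcms-bertic-ner` (the repository id is an assumption; use the id shown on this model page):

```python
# Minimal usage sketch; the repository id below is an assumption.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="classla/bcms-bertic-ner",   # assumed Hub id of this model
    aggregation_strategy="simple",     # merge word pieces into entity spans
)

for entity in ner("Nikola Ljubešić radi u Ljubljani."):
    # each result carries the entity group (PER, LOC, ORG, MISC),
    # the matched text span and a confidence score
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

With `aggregation_strategy="simple"`, the word pieces produced by the tokenizer are merged back into whole entity spans, so each printed result covers one full PER, LOC, ORG or MISC mention.
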
If you use this fine-tuned model, please cite the following paper:

```
@inproceedings{ljubesic-lauc-2021-bertic,
    title = "{BERTić} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
              Lauc, Davor",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    year = "2021",
}
```