---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-17m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---

# Model Card for PyMUSAS Neural English Small BEM

A fine-tuned 17 million (17M) parameter English semantic tagger based on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).

The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.

## Table of contents

- [Quick start](#quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Contact Information](#contact-information)

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package, otherwise you will get the default PyTorch build for your operating system/setup. In all cases we require `torch>=2.2,<3.0`.

``` bash
pip install wsd-torch-models
```

### Usage

``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Small-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cpu" or "cuda"
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()

    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when it is None the appropriate tokenizer
        # is downloaded automatically, but it is generally better to pass the
        # tokenizer in, as it saves checking whether it has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags:")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
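
If you only need a single tag per token, a small helper such as the one below can post-process the `predictions` returned above. This is our illustration rather than part of the `wsd_torch_models` API; it assumes each inner list of predictions is ordered most likely first (as the `top_n` argument suggests), and `label_to_definition` corresponds to `wsd_model.label_to_definition`.

``` python
from typing import Dict, List, Tuple


def best_tag_per_token(
    tokens: List[str],
    predictions: List[List[str]],
    label_to_definition: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Pair each token with its single most likely tag and that tag's definition,
    assuming each inner prediction list is ordered most likely first."""
    return [
        (token, tags[0], label_to_definition[tags[0]])
        for token, tags in zip(tokens, predictions)
    ]
```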

## Model Description

For more details about the model and how it was trained, please see the [citation/technical report](#citation), as well as the links in the [Model Sources section](#model-sources).
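
At a high level, a bi-encoder WSD model encodes each token in context with one encoder and each candidate sense gloss (here, a USAS tag definition) with another, then scores every token against every gloss, typically with a dot product, and keeps the highest-scoring tags. The sketch below is purely conceptual, using random placeholder embeddings and a placeholder tag count; it is not the library's implementation.

``` python
import torch

# Conceptual bi-encoder scoring only; in the real model both encoders are
# fine-tuned transformer encoders and the glosses are USAS tag definitions.
hidden_size = 256   # matches the 17M model's hidden size (see the architecture table)
num_tokens = 7      # tokens in the input sentence
num_tags = 232      # placeholder number of candidate USAS tags

# Context encoder output: one vector per input token.
token_embeddings = torch.randn(num_tokens, hidden_size)
# Gloss encoder output: one vector per candidate tag definition.
tag_embeddings = torch.randn(num_tags, hidden_size)

# Score each token against every tag gloss, then keep the 5 best tags per token.
scores = token_embeddings @ tag_embeddings.T        # shape: (num_tokens, num_tags)
top_5_tags = scores.topk(k=5, dim=-1).indices
print(top_5_tags.shape)                             # torch.Size([7, 5])
```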

### Model Sources

The training repository contains the code used to train this model. The inference repository contains the code used to run the model, as shown in the [Usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
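
The encoder hyper-parameters above come from the underlying base encoders. If you want to check the 17M English column yourself, a small sketch (assuming `jhu-clsp/ettin-encoder-17m` exposes the standard `transformers` configuration attributes) is:

``` python
from transformers import AutoConfig

# Assumption: the base encoder follows the standard transformers config layout.
config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-17m")
print("Layers:", config.num_hidden_layers)              # expected 7
print("Hidden size:", config.hidden_size)               # expected 256
print("Intermediate size:", config.intermediate_size)   # expected 384
print("Attention heads:", config.num_attention_heads)   # expected 4
print("Vocabulary size:", config.vocab_size)            # expected 50,368
```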

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
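
If you want to inspect this training shard yourself, a minimal sketch using the `datasets` library is shown below; the field names inside each record are not documented here, so treat the printed example as exploratory.

``` python
from datasets import load_dataset

# Load only the Wikipedia shard used to train this model. The jsonl.gz file is
# read with the generic JSON builder; the columns are whatever the dataset provides.
shard_url = (
    "https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/"
    "resolve/main/data/wikipedia_shard_0.jsonl.gz"
)
shard = load_dataset("json", data_files=shard_url, split="train")
print(shard)      # number of rows and the column names
print(shard[0])   # first record, to see the available fields
```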

## Evaluation

We have evaluated the models on 5 datasets covering 5 different languages; 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access it. Top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.

| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |

The publicly available datasets can be found on the Hugging Face Hub: [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note**: the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on non-English data.
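
As an illustration of the metric (our sketch, not the evaluation code from the training repository), top-n accuracy here counts a token as correct when its gold USAS tag appears among the model's n highest-ranked tags:

``` python
from typing import List


def top_n_accuracy(gold_tags: List[str], ranked_predictions: List[List[str]], n: int) -> float:
    """Fraction of tokens whose gold tag appears in the model's top-n ranked tags."""
    assert len(gold_tags) == len(ranked_predictions)
    correct = sum(
        gold in predicted[:n] for gold, predicted in zip(gold_tags, ranked_predictions)
    )
    return correct / len(gold_tags)


# Tiny worked example with made-up tags: top-1 accuracy = 1/3, top-5 accuracy = 2/3.
gold = ["W3", "M4", "F1"]
ranked = [["W3", "M4"], ["A1", "Z5", "M4"], ["F4", "L2"]]
print(top_n_accuracy(gold, ranked, n=1))  # 0.333...
print(top_n_accuracy(gold, ranked, n=5))  # 0.666...
```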

## Citation

A technical report is forthcoming.

## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.