---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-17m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---

# Model Card for PyMUSAS Neural English Small BEM

A fine-tuned 17 million (17M) parameter English semantic tagger built on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).

The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
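At its core, a bi-encoder WSD model scores each token against the gloss (definition) of every candidate tag. A minimal sketch of that idea with illustrative sizes (this is not the package's internal code):

``` python
import torch

# One encoder produces a vector per sentence token, another a vector per
# candidate tag gloss; a dot product scores every (token, tag) pair.
num_tokens, num_tags, hidden = 7, 200, 256  # illustrative sizes only
token_reps = torch.randn(num_tokens, hidden)
gloss_reps = torch.randn(num_tags, hidden)
scores = token_reps @ gloss_reps.T           # shape: (num_tokens, num_tags)
top5_tags = scores.topk(5, dim=-1).indices   # 5 most likely tags per token
```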

## Table of contents

- [Quick start](#quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Contact Information](#contact-information)

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. a CPU or GPU build) before installing this package, otherwise you will get the default PyTorch build for your operating system/setup. In all cases we require `torch>=2.2,<3.0`.

``` bash
pip install wsd-torch-models
```
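
If you want to control which PyTorch build is used, one approach (a sketch, assuming the CPU-only build is wanted; swap the index URL for a CUDA one on GPU machines) is to install PyTorch from its official wheel index first:

``` bash
# Install a CPU-only PyTorch build from PyTorch's CPU wheel index first,
# then install this package, which will reuse the already-installed torch.
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```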

### Usage

``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Small-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cuda" for a GPU
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()

    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when None the appropriate tokenizer
        # is downloaded automatically, but it is generally better to pass the
        # tokenizer in, as that saves checking whether it has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
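
Assuming, as the printing loop above suggests, that `predict` returns one list of tags per input token ordered from most to least likely, a minimal sketch of keeping only the single most likely tag per token (continuing inside the same block as the script above; the `best_tags` name is illustrative, not part of the package):

``` python
    # Keep only the highest-ranked tag for each token; the example output in
    # the comment is illustrative, not the model's actual prediction.
    best_tags = [semantic_tags[0] for semantic_tags in predictions]
    print(list(zip(sentence_tokens, best_tags)))  # e.g. [('The', 'Z5'), ...]
```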

## Model Description

For more details about the model and how it was trained, please see the [citation/technical report](#citation) as well as the links in the [model sources section](#model-sources).

### Model Sources

The training repository contains the code used to train this model. The inference repository contains the code used to run the model, as shown in the [usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
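The 17M column can be cross-checked against the base encoder's published configuration (a sketch, assuming `jhu-clsp/ettin-encoder-17m` exposes the standard `transformers` config fields for ModernBERT-style models):

``` python
# Print the base encoder's layer count, hidden size, attention heads and
# vocabulary size for comparison with the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(config.num_hidden_layers, config.hidden_size,
      config.num_attention_heads, config.vocab_size)
```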

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
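As a sketch of one way to inspect that shard (assuming the gzipped JSONL file loads with the generic `json` loader of the `datasets` library; the record field names are not documented here):

``` python
# Load the single training shard directly from the Hub and print its
# row count and record fields.
from datasets import load_dataset

shard_url = (
    "https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico"
    "/resolve/main/data/wikipedia_shard_0.jsonl.gz"
)
shard = load_dataset("json", data_files=shard_url, split="train")
print(shard)
```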

## Evaluation

We have evaluated the models on 5 datasets from 5 different languages. 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. The top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.

| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
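Top n accuracy counts a token as correct when the gold tag appears among the model's n most likely tags for that token. A minimal sketch of the metric (this helper is illustrative, not part of the package):

``` python
# Illustrative top-n accuracy: a token is correct if its gold tag appears
# among the n highest-ranked predicted tags for that token.
def top_n_accuracy(gold_tags: list[str], ranked_predictions: list[list[str]], n: int) -> float:
    correct = sum(
        gold in ranked[:n]
        for gold, ranked in zip(gold_tags, ranked_predictions)
    )
    return correct / len(gold_tags)
```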

The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on non-English data.

## Citation

A technical report is forthcoming.

## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.