---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-17m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---

# Model Card for PyMUSAS Neural English Small BEM

A fine-tuned 17 million (17M) parameter English semantic tagger built on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).

The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
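At its core, a bi-encoder WSD model scores each token against the gloss (definition) of every candidate tag. A minimal sketch of that idea with illustrative sizes (this is not the package's internal code):

``` python
import torch

# One encoder produces a vector per sentence token, another a vector per
# candidate tag gloss; a dot product scores every (token, tag) pair.
num_tokens, num_tags, hidden = 7, 200, 256  # illustrative sizes only
token_reps = torch.randn(num_tokens, hidden)
gloss_reps = torch.randn(num_tags, hidden)
scores = token_reps @ gloss_reps.T           # shape: (num_tokens, num_tags)
top5_tags = scores.topk(5, dim=-1).indices   # 5 most likely tags per token
```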

## Table of contents

- [Quick start](#quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Contact Information](#contact-information)

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. a CPU or GPU build) before installing this package, otherwise you will get the default PyTorch build for your operating system/setup. In all cases we require `torch>=2.2,<3.0`.

``` bash
pip install wsd-torch-models
```
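
If you want to control which PyTorch build is used, one approach (a sketch, assuming the CPU-only build is wanted; swap the index URL for a CUDA one on GPU machines) is to install PyTorch from its official wheel index first:

``` bash
# Install a CPU-only PyTorch build from PyTorch's CPU wheel index first,
# then install this package, which will reuse the already-installed torch.
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```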

### Usage

``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Small-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cuda" for a GPU
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()

    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when None the appropriate tokenizer
        # is downloaded automatically, but it is generally better to pass the
        # tokenizer in, as that saves checking whether it has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
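
Assuming, as the printing loop above suggests, that `predict` returns one list of tags per input token ordered from most to least likely, a minimal sketch of keeping only the single most likely tag per token (continuing inside the same block as the script above; the `best_tags` name is illustrative, not part of the package):

``` python
    # Keep only the highest-ranked tag for each token; the example output in
    # the comment is illustrative, not the model's actual prediction.
    best_tags = [semantic_tags[0] for semantic_tags in predictions]
    print(list(zip(sentence_tokens, best_tags)))  # e.g. [('The', 'Z5'), ...]
```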

## Model Description

For more details about the model and how it was trained, please see the [citation/technical report](#citation) as well as the links in the [model sources section](#model-sources).

### Model Sources

The training repository contains the code used to train this model. The inference repository contains the code used to run the model, as shown in the [usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
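The 17M column can be cross-checked against the base encoder's published configuration (a sketch, assuming `jhu-clsp/ettin-encoder-17m` exposes the standard `transformers` config fields for ModernBERT-style models):

``` python
# Print the base encoder's layer count, hidden size, attention heads and
# vocabulary size for comparison with the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(config.num_hidden_layers, config.hidden_size,
      config.num_attention_heads, config.vocab_size)
```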

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
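As a sketch of one way to inspect that shard (assuming the gzipped JSONL file loads with the generic `json` loader of the `datasets` library; the record field names are not documented here):

``` python
# Load the single training shard directly from the Hub and print its
# row count and record fields.
from datasets import load_dataset

shard_url = (
    "https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico"
    "/resolve/main/data/wikipedia_shard_0.jsonl.gz"
)
shard = load_dataset("json", data_files=shard_url, split="train")
print(shard)
```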

## Evaluation

We have evaluated the models on 5 datasets from 5 different languages. 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. The top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.

| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
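Top n accuracy counts a token as correct when the gold tag appears among the model's n most likely tags for that token. A minimal sketch of the metric (this helper is illustrative, not part of the package):

``` python
# Illustrative top-n accuracy: a token is correct if its gold tag appears
# among the n highest-ranked predicted tags for that token.
def top_n_accuracy(gold_tags: list[str], ranked_predictions: list[list[str]], n: int) -> float:
    correct = sum(
        gold in ranked[:n]
        for gold, ranked in zip(gold_tags, ranked_predictions)
    )
    return correct / len(gold_tags)
```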

The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on non-English data.

## Citation

A technical report is forthcoming.

## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.