Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/google/reformer-enwik8/README.md
README.md
ADDED
## Reformer language model on character level, trained on enwik8

*enwik8* is a dataset based on Wikipedia and is often used to measure a model's ability to *compress* data, *e.g.* in
the scope of the *Hutter Prize*: https://en.wikipedia.org/wiki/Hutter_Prize.

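Compression quality on *enwik8* is usually reported in bits per character (bpc); as a small reference snippet (not part of the original card), a per-character cross-entropy loss measured in nats converts to bpc as follows:

```python
import math

# convert a per-character cross-entropy loss (in nats) to bits per character (bpc);
# the loss value below is only a placeholder
loss_in_nats = 1.0
bpc = loss_in_nats / math.log(2)
print(f"{bpc:.3f} bpc")
```
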
`reformer-enwik8` was pretrained on the first 90M chars of *enwik8*, with the text chunked into batches of 65536 chars (= 2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted
to Hugging Face's PyTorch ReformerLM model `ReformerModelWithLMHead`.

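For illustration, the chunking described above amounts to roughly the following (a sketch, not the original Trax preprocessing code; the local file path is a placeholder):

```python
SEQ_LEN = 2 ** 16  # 65536 characters per training example

# read the raw enwik8 dump (placeholder path) and keep the first 90M characters
with open("enwik8", "rb") as f:
    data = f.read()[:90_000_000]

# split the text into non-overlapping chunks of SEQ_LEN characters each
chunks = [data[i:i + SEQ_LEN] for i in range(0, len(data), SEQ_LEN)]
```
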
The model is a language model that operates on characters.
Therefore, this model does not need a tokenizer. The following functions can instead be used for **encoding** and **decoding**:

```python
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        # shift each byte by 2 so that IDs 0 and 1 stay reserved (0 is the pad token)
        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform ids back to chars; IDs < 2 are simply mapped to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
```
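As a quick sanity check of these helpers (the sample string is arbitrary), encoding and decoding should round-trip:

```python
input_ids, attention_masks = encode(["Hello, enwik8!"])
print(input_ids.shape)    # torch.Size([1, 14]) -- one row per string, one column per character
print(decode(input_ids))  # ['Hello, enwik8!']
```
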

Text can be generated as follows:

```python
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```
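When batching prompts of different lengths, the attention mask returned by `encode` can also be passed along (a sketch relying on the standard `attention_mask` argument of `generate`, not shown in the original card):

```python
output_ids = model.generate(
    encoded,
    attention_mask=attention_masks,  # masks out the padded positions of shorter prompts
    do_sample=True,
    max_length=150,
)
print(decode(output_ids))
```
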

***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.