## Reformer character-level language model trained on enwik8

enwik8 is a dataset based on Wikipedia that is often used to measure a model's ability to compress data, e.g. in
the context of the Hutter Prize: https://en.wikipedia.org/wiki/Hutter_Prize.
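Compression quality on enwik8 is commonly reported in bits per character. As a minimal sketch (the function names and the example loss value are illustrative assumptions, not results for this model), an average per-character cross-entropy in nats can be converted to bits per character and to an idealized compressed size like this:

```python
import math

def nats_to_bits_per_char(loss_nats: float) -> float:
    """Convert an average per-character cross-entropy from nats to bits."""
    return loss_nats / math.log(2)

def ideal_compressed_bytes(num_chars: int, bits_per_char: float) -> float:
    """Size an ideal entropy coder would reach at the given bits per character."""
    return num_chars * bits_per_char / 8

# Hypothetical loss value, chosen only to illustrate the arithmetic.
example_loss = 0.75
bpc = nats_to_bits_per_char(example_loss)
print(f"{bpc:.3f} bits per character")
print(f"{ideal_compressed_bytes(10_000_000, bpc):,.0f} bytes for 10M characters")
```

A lower value means the model predicts the next character more confidently, which corresponds to better compression.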

`reformer-enwik8` was pretrained on the first 90M characters of enwik8, with the text chunked into batches of 65536 characters (= 2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted
to Hugging Face's PyTorch Reformer language model class `ReformerModelWithLMHead`.
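The chunking arithmetic can be sketched as follows. This is only an illustration under assumptions: `chunk_text` is a hypothetical helper, and the model card does not say how the original training pipeline ordered chunks or handled the final partial chunk.

```python
from typing import List

def chunk_text(data: bytes, chunk_size: int = 2**16) -> List[bytes]:
    """Split data into consecutive chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Small stand-in for the real corpus so the example runs instantly.
sample = bytes(range(256)) * 1000  # 256,000 bytes
chunks = chunk_text(sample)
print(len(chunks), len(chunks[0]), len(chunks[-1]))

# The same arithmetic for the sizes quoted in the text:
# how many full 65536-character chunks fit into 90M characters, and what is left over.
full_chunks, remainder = divmod(90_000_000, 2**16)
print(full_chunks, remainder)
```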

The model is a language model that operates on characters, so it does not need a tokenizer.
The following functions can be used instead for encoding and decoding:

```python
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    # make sure every string is in byte format before measuring lengths,
    # so that non-ASCII input is padded to the correct byte length
    byte_strings = [s if isinstance(s, bytes) else s.encode("utf-8") for s in list_of_strings]
    max_length = max(len(b) for b in byte_strings)

    # create empty tensors
    attention_masks = torch.zeros((len(byte_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(byte_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, byte_string in enumerate(byte_strings):
        # shift every byte value by 2 to obtain its input ID
        input_ids[idx, :len(byte_string)] = torch.tensor([x + 2 for x in byte_string])
        attention_masks[idx, :len(byte_string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform IDs back to chars; IDs < 2 are simply transformed to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
```
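To make the ID scheme concrete, here is a minimal torch-free sketch of the same mapping for a single string: each UTF-8 byte is shifted by 2, and IDs below 2 are dropped when decoding. The helper names `to_ids` and `from_ids` are illustrative and not part of `transformers`.

```python
def to_ids(text: str) -> list:
    """Map each UTF-8 byte of text to an ID by adding 2."""
    return [b + 2 for b in text.encode("utf-8")]

def from_ids(ids) -> str:
    """Invert to_ids, skipping IDs below 2 (such as the padding ID 0)."""
    return "".join(chr(i - 2) for i in ids if i > 1)

ids = to_ids("Hi!")
print(ids)                     # [74, 107, 35]
print(from_ids(ids + [0, 0]))  # Hi!
```

Because byte values range from 0 to 255, the resulting IDs range from 2 to 257.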

Text can be generated as follows:

```python
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# example output (sampling is random, so results will vary):
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```

*Note*: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.