# Malayalam to English Transliteration Model

This repository contains a model for transliterating Malayalam names into English using an LSTM with an attention mechanism.
## Dataset

The model was trained on a small subset of [Santhosh's English-Malayalam Names dataset](https://huggingface.co/datasets/santhosh/english-malayalam-names).

The code for training and testing the model, along with the subset of the dataset used, is available in the following GitHub repository:

- [GitHub Repository](https://github.com/Bajiyo2223/ml-en_trasnliteration/blob/main/ml_en_transliteration.ipynb)

The train and test splits used here are in the repository's `dataset` folder.
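If you want the full corpus rather than the GitHub subset, a minimal sketch for pulling it from the Hugging Face Hub with the `datasets` library is below; the split and column names are not documented here, so inspect the loaded object before using it:

```python
from datasets import load_dataset

# Pull the full English-Malayalam names corpus from the Hugging Face Hub.
# The small subset actually used for this model lives in the GitHub repo's `dataset` folder.
dataset = load_dataset("santhosh/english-malayalam-names")

# Inspect the available splits and columns before sampling a training subset.
print(dataset)
```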
## Model Files

- `saved_model.pb`: The trained model saved in TensorFlow's SavedModel format.
- `source_tokenizer.json`: Tokenizer for Malayalam text.
- `target_tokenizer.json`: Tokenizer for English text.
- `variables.data-00000-of-00001`: Model variables.
- `variables.index`: Index for model variables.
## Model Architecture

The model architecture consists of the following components (a minimal code sketch follows the list):

- **Embedding Layer**: Converts the input characters to dense vectors of fixed size.
- **Bidirectional LSTM Layer**: Captures the sequence dependencies in both forward and backward directions.
- **Attention Layer**: Helps the model focus on relevant parts of the input sequence when generating the output sequence.
- **Dense Layer**: Produces the final output with a softmax activation to generate character probabilities.
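The sketch below is one way to wire these layers together in Keras. The vocabulary sizes, sequence length, and layer widths are placeholders, and the exact wiring of the saved model may differ:

```python
from tensorflow.keras import layers, models

# Placeholder sizes; the saved model's actual dimensions may differ.
src_vocab_size, tgt_vocab_size, max_len, embed_dim, units = 80, 40, 100, 64, 128

inputs = layers.Input(shape=(max_len,))
# Embedding: input characters -> dense vectors of fixed size
x = layers.Embedding(src_vocab_size, embed_dim)(inputs)
# Bidirectional LSTM: forward and backward sequence dependencies
encoded = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
# Attention: each output position attends over the encoded input sequence
context = layers.Attention()([encoded, encoded])
# Dense softmax over the target character vocabulary at every time step
outputs = layers.Dense(tgt_vocab_size, activation="softmax")(context)

model = models.Model(inputs, outputs)
model.summary()
```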
## Preprocessing

- **Tokenization**: Both source (Malayalam) and target (English) texts are tokenized at the character level.
- **Padding**: Sequences are padded to ensure uniform input lengths (see the sketch below).
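A sketch of this preprocessing with the standard Keras utilities; the name lists are tiny illustrative placeholders, not data from the notebook:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tiny illustrative pairs (placeholders, not from the actual dataset)
source_names = ["അജയ്", "മീര"]
target_names = ["ajay", "meera"]

# Character-level tokenizers for the Malayalam source and English target
source_tokenizer = Tokenizer(char_level=True)
target_tokenizer = Tokenizer(char_level=True)
source_tokenizer.fit_on_texts(source_names)
target_tokenizer.fit_on_texts(target_names)

# Convert names to integer sequences and pad them to a uniform length
max_len = 100  # placeholder; choose based on the longest name
source_padded = pad_sequences(source_tokenizer.texts_to_sequences(source_names),
                              maxlen=max_len, padding='post')
target_padded = pad_sequences(target_tokenizer.texts_to_sequences(target_names),
                              maxlen=max_len, padding='post')
```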
## Training

- **Optimizer**: Adam
- **Loss Function**: Sparse categorical cross-entropy
- **Metrics**: Accuracy
- **Callbacks**: EarlyStopping and ModelCheckpoint to save the best model during training (see the sketch below).
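A hedged sketch of such a training setup, continuing from the sketches above and assuming a reasonably sized training set; the epoch count, batch size, patience, validation split, and checkpoint path are placeholders rather than values from the notebook:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Stop once validation loss stops improving and keep the best weights
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    # Write the best model to disk in SavedModel format
    ModelCheckpoint('best_model', monitor='val_loss', save_best_only=True),
]

# Targets are the padded character-index sequences; sparse categorical
# cross-entropy compares them against the per-time-step softmax outputs.
model.fit(source_padded, target_padded,
          validation_split=0.1,
          epochs=50,
          batch_size=64,
          callbacks=callbacks)
```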
## Results

The model achieved the following performance on the test set:

- **CER (Character Error Rate)**: `7`
- **WER (Word Error Rate)**: `53`
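Character and word error rates of this kind can be computed with a plain edit-distance routine; a self-contained sketch is below (the example pair at the end is illustrative, not taken from the test set):

```python
def edit_distance(a, b):
    # Levenshtein distance between two sequences, single-row dynamic programming
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits divided by reference length
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits divided by the number of reference words
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("meera", "mira"), wer("meera nair", "mira nair"))  # illustrative pair
```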
## Usage

To use the model for transliteration:

```python
import json

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Function to convert predicted index sequences back to strings
def sequence_to_text(sequence, tokenizer):
    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
    return ''.join(reverse_word_map.get(i, '') for i in sequence)

# Load the model (SavedModel directory)
model = tf.keras.models.load_model('path_to_your_model_directory')

# Load the character-level tokenizers
with open('source_tokenizer.json') as f:
    source_tokenizer_data = json.load(f)
source_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(source_tokenizer_data)

with open('target_tokenizer.json') as f:
    target_tokenizer_data = json.load(f)
target_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(target_tokenizer_data)

# Prepare the input text
input_text = "your_input_text"
input_sequence = source_tokenizer.texts_to_sequences([input_text])
input_padded = pad_sequences(input_sequence, maxlen=100, padding='post')  # Adjust maxlen to match training

# Get the prediction and decode it character by character
prediction = model.predict(input_padded)
predicted_sequence = np.argmax(prediction, axis=-1)[0]
predicted_text = sequence_to_text(predicted_sequence, target_tokenizer)

print("Transliterated Text:", predicted_text)
```
---
license: other
license_name: other
license_link: LICENSE
---