# Malayalam to English Transliteration Model

This repository contains a model for transliterating Malayalam names into English using an LSTM with an attention mechanism.
## Dataset

The model was trained on a small subset of [Santhosh's English-Malayalam Names dataset](https://huggingface.co/datasets/santhosh/english-malayalam-names).

The code for training and testing the model, along with the subset of the dataset used, is available in the following GitHub repository:

- [GitHub Repository](https://github.com/Bajiyo2223/ml-en_trasnliteration/blob/main/ml_en_transliteration.ipynb)

The train and test splits used here are in the repository's `dataset` folder.
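If you want the full corpus rather than the GitHub subset, a minimal sketch for pulling it from the Hugging Face Hub with the `datasets` library is below; the split and column names are not documented here, so inspect the loaded object before using it:

```python
from datasets import load_dataset

# Pull the full English-Malayalam names corpus from the Hugging Face Hub.
# The small subset actually used for this model lives in the GitHub repo's `dataset` folder.
dataset = load_dataset("santhosh/english-malayalam-names")

# Inspect the available splits and columns before sampling a training subset.
print(dataset)
```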
## Model Files

- `saved_model.pb`: The trained model saved in TensorFlow's SavedModel format.
- `source_tokenizer.json`: Tokenizer for Malayalam text.
- `target_tokenizer.json`: Tokenizer for English text.
- `variables.data-00000-of-00001`: Model variables.
- `variables.index`: Index for model variables.
## Model Architecture

The model architecture consists of the following components (a minimal code sketch follows the list):

- **Embedding Layer**: Converts the input characters to dense vectors of fixed size.
- **Bidirectional LSTM Layer**: Captures the sequence dependencies in both forward and backward directions.
- **Attention Layer**: Helps the model focus on relevant parts of the input sequence when generating the output sequence.
- **Dense Layer**: Produces the final output with a softmax activation to generate character probabilities.
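The sketch below is one way to wire these layers together in Keras. The vocabulary sizes, sequence length, and layer widths are placeholders, and the exact wiring of the saved model may differ:

```python
from tensorflow.keras import layers, models

# Placeholder sizes; the saved model's actual dimensions may differ.
src_vocab_size, tgt_vocab_size, max_len, embed_dim, units = 80, 40, 100, 64, 128

inputs = layers.Input(shape=(max_len,))
# Embedding: input characters -> dense vectors of fixed size
x = layers.Embedding(src_vocab_size, embed_dim)(inputs)
# Bidirectional LSTM: forward and backward sequence dependencies
encoded = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
# Attention: each output position attends over the encoded input sequence
context = layers.Attention()([encoded, encoded])
# Dense softmax over the target character vocabulary at every time step
outputs = layers.Dense(tgt_vocab_size, activation="softmax")(context)

model = models.Model(inputs, outputs)
model.summary()
```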
## Preprocessing

- **Tokenization**: Both source (Malayalam) and target (English) texts are tokenized at the character level.
- **Padding**: Sequences are padded to ensure uniform input lengths (see the sketch below).
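A sketch of this preprocessing with the standard Keras utilities; the name lists are tiny illustrative placeholders, not data from the notebook:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tiny illustrative pairs (placeholders, not from the actual dataset)
source_names = ["അജയ്", "മീര"]
target_names = ["ajay", "meera"]

# Character-level tokenizers for the Malayalam source and English target
source_tokenizer = Tokenizer(char_level=True)
target_tokenizer = Tokenizer(char_level=True)
source_tokenizer.fit_on_texts(source_names)
target_tokenizer.fit_on_texts(target_names)

# Convert names to integer sequences and pad them to a uniform length
max_len = 100  # placeholder; choose based on the longest name
source_padded = pad_sequences(source_tokenizer.texts_to_sequences(source_names),
                              maxlen=max_len, padding='post')
target_padded = pad_sequences(target_tokenizer.texts_to_sequences(target_names),
                              maxlen=max_len, padding='post')
```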
## Training

- **Optimizer**: Adam
- **Loss Function**: Sparse categorical cross-entropy
- **Metrics**: Accuracy
- **Callbacks**: EarlyStopping and ModelCheckpoint to save the best model during training (see the sketch below).
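A hedged sketch of such a training setup, continuing from the sketches above and assuming a reasonably sized training set; the epoch count, batch size, patience, validation split, and checkpoint path are placeholders rather than values from the notebook:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Stop once validation loss stops improving and keep the best weights
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    # Write the best model to disk in SavedModel format
    ModelCheckpoint('best_model', monitor='val_loss', save_best_only=True),
]

# Targets are the padded character-index sequences; sparse categorical
# cross-entropy compares them against the per-time-step softmax outputs.
model.fit(source_padded, target_padded,
          validation_split=0.1,
          epochs=50,
          batch_size=64,
          callbacks=callbacks)
```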
## Results

The model achieved the following performance on the test set:

- **CER (Character Error Rate)**: `7`
- **WER (Word Error Rate)**: `53`
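Character and word error rates of this kind can be computed with a plain edit-distance routine; a self-contained sketch is below (the example pair at the end is illustrative, not taken from the test set):

```python
def edit_distance(a, b):
    # Levenshtein distance between two sequences, single-row dynamic programming
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits divided by reference length
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits divided by the number of reference words
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("meera", "mira"), wer("meera nair", "mira nair"))  # illustrative pair
```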
## Usage

To use the model for transliteration:

```python
import json

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Function to convert predicted index sequences back to strings
def sequence_to_text(sequence, tokenizer):
    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
    return ''.join(reverse_word_map.get(i, '') for i in sequence)

# Load the model (SavedModel directory)
model = tf.keras.models.load_model('path_to_your_model_directory')

# Load the character-level tokenizers
with open('source_tokenizer.json') as f:
    source_tokenizer_data = json.load(f)
source_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(source_tokenizer_data)

with open('target_tokenizer.json') as f:
    target_tokenizer_data = json.load(f)
target_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(target_tokenizer_data)

# Prepare the input text
input_text = "your_input_text"
input_sequence = source_tokenizer.texts_to_sequences([input_text])
input_padded = pad_sequences(input_sequence, maxlen=100, padding='post')  # Adjust maxlen to match training

# Get the prediction and decode it character by character
prediction = model.predict(input_padded)
predicted_sequence = np.argmax(prediction, axis=-1)[0]
predicted_text = sequence_to_text(predicted_sequence, target_tokenizer)

print("Transliterated Text:", predicted_text)
```
---
license: other
license_name: other
license_link: LICENSE
---