Model Card for ota-sbert-kadisicilleri
This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It was finetuned on Ottoman Turkish court registers, known in Turkish as kadı sicilleri.
Model Details
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: sentence-transformer
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
This model was finetuned on 54,554 Ottoman Turkish records that are available in Istanbul Kadı Sicilleri.
For training, the method of Chen et al. (2024) was used. It builds triplets based on document titles: each triplet consists of an anchor document, a positive sample, and a negative sample. In this case, the summary of each document was vectorised using the fill-mask model enesyila/ota-sbert-kadisicilleri, and the 99th and 1st percentiles of the global similarity distribution were used as thresholds for selecting positive and negative samples, ensuring that the model learns from the most distinct semantic relationships. This process yielded 54,428 triplets for finetuning the sentence transformer. The model was trained for 3 epochs with an effective batch size of 32, 500 warmup steps, and a learning rate of 3e-5. The triplet loss used cosine distance with a margin of 0.3.
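The following is a minimal sketch of how this mining-and-training pipeline could be reproduced with the Sentence Transformers v3 trainer, under stated assumptions: the `summaries` list and the output directory are hypothetical placeholders, and the base model stands in here for the fill-mask encoder used to vectorise the summaries. Only the hyperparameters (cosine triplet loss, margin 0.3, batch size 32, 500 warmup steps, learning rate 3e-5, 3 epochs) come from this card.

```python
# Hedged sketch: percentile-based triplet mining + triplet-loss finetuning.
import numpy as np
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.util import cos_sim

base = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# 1. Embed each record's summary and build the global similarity matrix.
summaries = ["..."]  # hypothetical: one summary string per record
emb = base.encode(summaries, convert_to_numpy=True)
sim = cos_sim(emb, emb).numpy()
np.fill_diagonal(sim, np.nan)  # ignore self-similarity

# 2. Use the 99th / 1st percentiles of the global similarity distribution
#    as thresholds for positive / negative candidates.
hi, lo = np.nanpercentile(sim, 99), np.nanpercentile(sim, 1)

triplets = {"anchor": [], "positive": [], "negative": []}
for i in range(len(summaries)):
    pos = np.where(sim[i] >= hi)[0]
    neg = np.where(sim[i] <= lo)[0]
    if len(pos) and len(neg):  # the real pipeline may sample differently
        triplets["anchor"].append(summaries[i])
        triplets["positive"].append(summaries[pos[0]])
        triplets["negative"].append(summaries[neg[0]])

# 3. Finetune with cosine triplet loss (margin 0.3) for 3 epochs,
#    batch size 32, 500 warmup steps, learning rate 3e-5.
loss = losses.TripletLoss(
    model=base,
    distance_metric=losses.TripletDistanceMetric.COSINE,
    triplet_margin=0.3,
)
args = SentenceTransformerTrainingArguments(
    output_dir="ota-sbert-kadisicilleri",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    warmup_steps=500,
    learning_rate=3e-5,
)
SentenceTransformerTrainer(
    model=base,
    args=args,
    train_dataset=Dataset.from_dict(triplets),
    loss=loss,
).train()
```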
The model can accept inputs of up to 512 tokens. Since the median token count in the training data is 499, more than half of the documents can be processed in a single pass. Longer documents can be handled with chunk-based processing, as in the sketch below.
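As a hedged illustration, chunk-based processing could look like the following; the 510-token window and mean pooling over chunk embeddings are illustrative choices, not the card's prescribed method.

```python
# Sketch: tokenize, slice into fixed-size windows, encode each chunk,
# and mean-pool the chunk embeddings into one document vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

def encode_long_document(text: str, max_tokens: int = 510) -> np.ndarray:
    # Tokenize without special tokens, then slice into windows that leave
    # room for the special tokens within the 512-token limit.
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [
        model.tokenizer.decode(ids[i : i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]
    # Encode every chunk and average into a single document vector.
    return model.encode(chunks).mean(axis=0)
```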
Uses
This model is specifically finetuned for the semantic analysis of Ottoman Turkish judicial records. It is intended to be used for the tasks below; a short semantic-search example follows the list.
- Semantic Search
- Document Clustering
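As a hedged sketch of the semantic-search use case, a minimal example over a small corpus might look like this; the corpus entries (beyond the example sentence from this card) and the query are invented placeholders.

```python
# Sketch: encode a corpus and rank it by cosine similarity to a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

corpus = [
    "Zeyd nam kimesne mahkemeye gelüb Amr'dan on bin akçe taleb eyledi.",
    "...",  # further court records
]
query = "akçe talebi"  # hypothetical query ("claim for akçe")

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus documents by cosine similarity to the query.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=5)[0]:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```

Document clustering works analogously: encode the corpus the same way and feed the embeddings to any clustering algorithm, such as scikit-learn's KMeans.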
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
More details on how to use the model are available in its Google Colab notebook: click "Use this model" in the upper-right corner of the model page, then select "Google Colab". Alternatively, use the code below to get started.
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load the model and run inference:

```python
from sentence_transformers import SentenceTransformer

# 1. Load the model
model_path = 'enesyila/ota-sbert-kadisicilleri'
model = SentenceTransformer(model_path)

# 2. The input sentence ("A person named Zeyd came to the court and
#    demanded ten thousand akçe from Amr.")
sentence = "Zeyd nam kimesne mahkemeye gelüb Amr'dan on bin akçe taleb eyledi."

# 3. Encode the sentence into a 384-dimensional vector
embedding = model.encode(sentence)

# 4. Output
print(f"Sentence: {sentence}")
print(f"The size of vector: {embedding.shape}")
print(f"Vector: {embedding}")
```
Evaluation
The model was evaluated on a test set consisting of 5,443 unseen samples.
| Metric | Value | Description |
|---|---|---|
| Test Triplet Loss | 0.011216 | Lower is better |
| Accuracy (Ranking) | 96.95% | Ability to correctly rank the positive match above the negative match |
| Avg. Positive Similarity | 0.9806 | Cosine similarity for semantically related pairs |
| Avg. Negative Similarity | -0.1384 | Cosine similarity for unrelated pairs |
| Semantic Margin (P-N) | 1.1190 | The confidence gap between positive and negative matches |
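As a hedged sketch, the ranking accuracy and the average positive/negative similarities above could be computed on held-out triplets roughly as follows; the `anchors`/`positives`/`negatives` lists are hypothetical placeholders.

```python
# Sketch: triplet ranking accuracy and average cosine similarities.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

anchors, positives, negatives = ["..."], ["..."], ["..."]  # held-out triplets

a = model.encode(anchors, convert_to_tensor=True)
p = model.encode(positives, convert_to_tensor=True)
n = model.encode(negatives, convert_to_tensor=True)

pos_sim = cos_sim(a, p).diagonal()  # per-triplet anchor-positive similarity
neg_sim = cos_sim(a, n).diagonal()  # per-triplet anchor-negative similarity

accuracy = (pos_sim > neg_sim).float().mean().item()
print(f"Ranking accuracy:         {accuracy:.4f}")
print(f"Avg. positive similarity: {pos_sim.mean().item():.4f}")
print(f"Avg. negative similarity: {neg_sim.mean().item():.4f}")
```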
Citation
The article describing this model is coming soon.
Model Card Authors
Enes Yılandiloğlu
Model Card Contact
Enes Yılandiloğlu