Model Card for ota-sbert-kadisicilleri
This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It was finetuned on Ottoman Turkish court registers, known in Turkish as kadı sicilleri.
Model Details
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: sentence-transformer
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
This model was finetuned on 54,554 Ottoman Turkish records that are available in Istanbul Kadı Sicilleri.
For training, the method of Chen et al. (2024) was used. It builds triplets based on document titles: each triplet consists of an anchor document, a positive sample, and a negative sample. In this case, the summary of each document was vectorised using the fill-mask model enesyila/ota-sbert-kadisicilleri, and the 99th and 1st percentiles of the global similarity distribution were used as thresholds for selecting positive and negative samples, ensuring that the model learns from the most distinct semantic relationships. This process yielded 54,428 triplets for finetuning the sentence transformer. The model was trained for 3 epochs with an effective batch size of 32, 500 warmup steps, and a learning rate of 3e-5. The triplet loss used cosine distance with a margin of 0.3.
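The following is a minimal sketch of how this mining-and-training pipeline could be reproduced with the Sentence Transformers v3 trainer, under stated assumptions: the `summaries` list and the output directory are hypothetical placeholders, and the base model stands in here for the fill-mask encoder used to vectorise the summaries. Only the hyperparameters (cosine triplet loss, margin 0.3, batch size 32, 500 warmup steps, learning rate 3e-5, 3 epochs) come from this card.

```python
# Hedged sketch: percentile-based triplet mining + triplet-loss finetuning.
import numpy as np
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.util import cos_sim

base = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# 1. Embed each record's summary and build the global similarity matrix.
summaries = ["..."]  # hypothetical: one summary string per record
emb = base.encode(summaries, convert_to_numpy=True)
sim = cos_sim(emb, emb).numpy()
np.fill_diagonal(sim, np.nan)  # ignore self-similarity

# 2. Use the 99th / 1st percentiles of the global similarity distribution
#    as thresholds for positive / negative candidates.
hi, lo = np.nanpercentile(sim, 99), np.nanpercentile(sim, 1)

triplets = {"anchor": [], "positive": [], "negative": []}
for i in range(len(summaries)):
    pos = np.where(sim[i] >= hi)[0]
    neg = np.where(sim[i] <= lo)[0]
    if len(pos) and len(neg):  # the real pipeline may sample differently
        triplets["anchor"].append(summaries[i])
        triplets["positive"].append(summaries[pos[0]])
        triplets["negative"].append(summaries[neg[0]])

# 3. Finetune with cosine triplet loss (margin 0.3) for 3 epochs,
#    batch size 32, 500 warmup steps, learning rate 3e-5.
loss = losses.TripletLoss(
    model=base,
    distance_metric=losses.TripletDistanceMetric.COSINE,
    triplet_margin=0.3,
)
args = SentenceTransformerTrainingArguments(
    output_dir="ota-sbert-kadisicilleri",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    warmup_steps=500,
    learning_rate=3e-5,
)
SentenceTransformerTrainer(
    model=base,
    args=args,
    train_dataset=Dataset.from_dict(triplets),
    loss=loss,
).train()
```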
The model can accept inputs of up to 512 tokens. Since the median token count in the training data is 499, more than half of the documents can be processed in a single pass. Longer documents can be handled with chunk-based processing, as in the sketch below.
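As a hedged illustration, chunk-based processing could look like the following; the 510-token window and mean pooling over chunk embeddings are illustrative choices, not the card's prescribed method.

```python
# Sketch: tokenize, slice into fixed-size windows, encode each chunk,
# and mean-pool the chunk embeddings into one document vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

def encode_long_document(text: str, max_tokens: int = 510) -> np.ndarray:
    # Tokenize without special tokens, then slice into windows that leave
    # room for the special tokens within the 512-token limit.
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [
        model.tokenizer.decode(ids[i : i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]
    # Encode every chunk and average into a single document vector.
    return model.encode(chunks).mean(axis=0)
```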
Uses
This model is specifically finetuned for the semantic analysis of Ottoman Turkish judicial records. It is intended to be used for the tasks below; a short semantic-search example follows the list.
- Semantic Search
- Document Clustering
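As a hedged sketch of the semantic-search use case, a minimal example over a small corpus might look like this; the corpus entries (beyond the example sentence from this card) and the query are invented placeholders.

```python
# Sketch: encode a corpus and rank it by cosine similarity to a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

corpus = [
    "Zeyd nam kimesne mahkemeye gelüb Amr'dan on bin akçe taleb eyledi.",
    "...",  # further court records
]
query = "akçe talebi"  # hypothetical query ("claim for akçe")

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus documents by cosine similarity to the query.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=5)[0]:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```

Document clustering works analogously: encode the corpus the same way and feed the embeddings to any clustering algorithm, such as scikit-learn's KMeans.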
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
More details on how to use the model are available in its Google Colab notebook: click "Use this model" in the upper-right corner of the model page, then select "Google Colab". Alternatively, use the code below to get started.
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load the model and run inference:

```python
from sentence_transformers import SentenceTransformer

# 1. Load the model
model_path = 'enesyila/ota-sbert-kadisicilleri'
model = SentenceTransformer(model_path)

# 2. The input sentence ("A person named Zeyd came to the court and
#    demanded ten thousand akçe from Amr.")
sentence = "Zeyd nam kimesne mahkemeye gelüb Amr'dan on bin akçe taleb eyledi."

# 3. Encode the sentence into a 384-dimensional vector
embedding = model.encode(sentence)

# 4. Output
print(f"Sentence: {sentence}")
print(f"The size of vector: {embedding.shape}")
print(f"Vector: {embedding}")
```
Evaluation
The model was evaluated on a test set consisting of 5,443 unseen samples.
| Metric | Value | Description |
|---|---|---|
| Test Triplet Loss | 0.011216 | Lower is better |
| Accuracy (Ranking) | 96.95% | Ability to correctly rank the positive match above the negative match |
| Avg. Positive Similarity | 0.9806 | Cosine similarity for semantically related pairs |
| Avg. Negative Similarity | -0.1384 | Cosine similarity for unrelated pairs |
| Semantic Margin (P-N) | 1.1190 | The confidence gap between positive and negative matches |
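As a hedged sketch, the ranking accuracy and the average positive/negative similarities above could be computed on held-out triplets roughly as follows; the `anchors`/`positives`/`negatives` lists are hypothetical placeholders.

```python
# Sketch: triplet ranking accuracy and average cosine similarities.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("enesyila/ota-sbert-kadisicilleri")

anchors, positives, negatives = ["..."], ["..."], ["..."]  # held-out triplets

a = model.encode(anchors, convert_to_tensor=True)
p = model.encode(positives, convert_to_tensor=True)
n = model.encode(negatives, convert_to_tensor=True)

pos_sim = cos_sim(a, p).diagonal()  # per-triplet anchor-positive similarity
neg_sim = cos_sim(a, n).diagonal()  # per-triplet anchor-negative similarity

accuracy = (pos_sim > neg_sim).float().mean().item()
print(f"Ranking accuracy:         {accuracy:.4f}")
print(f"Avg. positive similarity: {pos_sim.mean().item():.4f}")
print(f"Avg. negative similarity: {neg_sim.mean().item():.4f}")
```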
Citation
The article describing this model is coming soon.
Model Card Authors
Enes Yılandiloğlu
Model Card Contact
Enes Yılandiloğlu