# RoBERTa-Base Model for Temporal Information Extraction

This repository hosts a fine-tuned version of RoBERTa for **temporal information extraction**, where the model identifies and extracts time-related expressions (e.g., dates, durations) from text. The pipeline includes preprocessing, fine-tuning, and inference on labeled temporal datasets.

---

## Model Details

- **Model Name:** RoBERTa-Base  
- **Model Architecture:** RoBERTa Token Classification  
- **Task:** Temporal Entity Extraction  
- **Dataset:** Custom JSON format with annotated temporal SPO triples  
- **Fine-tuning Framework:** Hugging Face Transformers  
- **Output Labels:** `B-TIMEX`, `I-TIMEX`, `O`  

---

## Usage

### Installation

```bash
pip install transformers datasets evaluate
```
### Loading the Fine-Tuned Model

```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
import torch

# Load the fine-tuned model and tokenizer
model = RobertaForTokenClassification.from_pretrained("./temporal_model")
tokenizer = RobertaTokenizerFast.from_pretrained("./temporal_model", add_prefix_space=True)
model.eval()

# Label mapping saved in the model config during fine-tuning
# (e.g., {0: "O", 1: "B-TIMEX", 2: "I-TIMEX"})
id2label = model.config.id2label

# Inference function: returns the temporal spans found in the input text
def extract_temporal_entities(text):
    tokens = text.split()
    inputs = tokenizer(tokens, return_tensors="pt", is_split_into_words=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    word_ids = inputs.word_ids()  # maps each subword back to its word index

    temporal_spans = []
    current = []
    previous_word_idx = None
    for idx, word_idx in enumerate(word_ids):
        # Skip special tokens and subword continuations
        if word_idx is None or word_idx == previous_word_idx:
            continue
        previous_word_idx = word_idx
        label = id2label[predictions[idx]]
        if label == "B-TIMEX":
            if current:
                temporal_spans.append(" ".join(current))
            current = [tokens[word_idx]]
        elif label == "I-TIMEX":
            current.append(tokens[word_idx])
        else:
            if current:
                temporal_spans.append(" ".join(current))
                current = []
    if current:
        temporal_spans.append(" ".join(current))
    return temporal_spans
```
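
A quick sanity check (the sentence and the printed spans are illustrative; actual output depends on the trained weights):

```python
spans = extract_temporal_entities("The contract was signed on 12 March 2021 and runs for six months.")
print(spans)  # e.g., ['12 March 2021', 'six months'] (illustrative)
```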





## Performance Metrics

- **Evaluation Accuracy:** ~0.76
- **F1 Score:** tracked with `seqeval` (BIO format); see the sketch below
- **Evaluation Strategy:** epoch
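
A minimal sketch of how entity-level F1 can be computed with `seqeval` through the `evaluate` package; the label sequences below are placeholders, not real evaluation data:

```python
import evaluate

seqeval = evaluate.load("seqeval")

# Placeholder BIO sequences; in practice these come from the validation set
references = [["O", "B-TIMEX", "I-TIMEX", "O"]]
predictions = [["O", "B-TIMEX", "I-TIMEX", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"], results["overall_accuracy"])
```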



## Fine-Tuning Details

### Dataset

The dataset consists of manually or script-labeled SPO-style JSON entries with the following fields:

- `text`: the raw input string
- `spo_list`: a list of subject-predicate-object relations, each including:
  - the subject and object spans
  - the type (e.g., Date, Location)

The text is tokenized, and BIO labels are applied for token classification; a labeling sketch follows below.
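
An illustrative sketch of that labeling step, assuming date-typed object spans become `TIMEX` mentions. The example entry and field names such as `object_type` are hypothetical, not the exact schema:

```python
# Hypothetical entry; the real schema and field names may differ.
example = {
    "text": "The agreement was signed on 12 March 2021 in Berlin.",
    "spo_list": [
        {"object": "12 March 2021", "object_type": "Date"},
    ],
}

def bio_labels(entry):
    """Assign word-level BIO tags for Date-typed object spans."""
    words = entry["text"].split()
    labels = ["O"] * len(words)
    for spo in entry["spo_list"]:
        if spo.get("object_type") != "Date":
            continue
        span_words = spo["object"].split()
        # Naive word-level matching of the span inside the sentence
        for i in range(len(words) - len(span_words) + 1):
            if words[i:i + len(span_words)] == span_words:
                labels[i] = "B-TIMEX"
                for j in range(i + 1, i + len(span_words)):
                    labels[j] = "I-TIMEX"
                break
    return words, labels

print(bio_labels(example))
```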



## Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Sequence Length:** 128 tokens
- **Tokenizer:** `roberta-base` with `add_prefix_space=True`
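
These hyperparameters map roughly onto Hugging Face `TrainingArguments` as follows. This is a sketch, not the exact training script; `train_ds` and `eval_ds` stand in for the tokenized dataset splits, and `model`/`tokenizer` are the objects loaded above:

```python
from transformers import (
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

args = TrainingArguments(
    output_dir="./temporal_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # called eval_strategy in recent transformers releases
)

trainer = Trainer(
    model=model,                # RobertaForTokenClassification with the 3 BIO labels
    args=args,
    train_dataset=train_ds,     # placeholder: tokenized training split
    eval_dataset=eval_ds,       # placeholder: tokenized validation split
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)

trainer.train()
```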



## Repository Structure

```
.
├── temporal_model/             # Fine-tuned model and tokenizer
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── special_tokens_map.json
├── temporal-information-extraction.ipynb
└── README.md
```



## Limitations

- The model is domain-specific; generalization to other kinds of temporal expressions (e.g., in informal text) may require additional training.
- BIO tagging may fail in overlapping or nested entity scenarios.



## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request to improve model performance or add new datasets.