---
language: vi
tags:
- hate-speech-detection
- vietnamese
- phobert
license: apache-2.0
datasets:
- visolex/ViHOS
metrics:
- precision
- recall
- f1
model-index:
- name: phobert-hsd-span
  results:
  - task:
      type: token-classification
      name: Hate Speech Span Detection
    dataset:
      name: ViHOS
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: <INSERT_PRECISION>
    - name: Recall
      type: recall
      value: <INSERT_RECALL>
    - name: F1 Score
      type: f1
      value: <INSERT_F1>
base_model:
- vinai/phobert-base
pipeline_tag: token-classification
---

# PhoBERT-HSD-Span

Fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) on [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS) for token-level hate/offensive span detection in Vietnamese.

## Model Details

* **Base Model**: [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base)
* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS)
* **Fine-tuning**: Hugging Face Transformers

### Hyperparameters

* Batch size: `16`
* Learning rate: `5e-5`
* Epochs: `100`
* Max sequence length: `128`
* Early stopping patience: `5`

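A minimal sketch of how these hyperparameters could be wired into the Transformers `Trainer` API. This is illustrative, not the authors' training script: the binary `num_labels=2` head, the `labels` column, and the dummy dataset are assumptions, and the real ViHOS preprocessing (aligning character-level span annotations to per-token labels) is omitted.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=2  # assumed in-span/out-of-span head
)

# Dummy stand-in for tokenized ViHOS: the real preprocessing tokenizes each
# comment and aligns the annotated spans to per-token 0/1 labels.
enc = tokenizer(["ví dụ"], truncation=True, max_length=128)  # max sequence length: 128
enc["labels"] = [[0] * len(enc["input_ids"][0])]
train_ds = eval_ds = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="phobert-hsd-span",
    per_device_train_batch_size=16,  # batch size: 16
    learning_rate=5e-5,              # learning rate: 5e-5
    num_train_epochs=100,            # epochs: 100 (early stopping usually ends sooner)
    eval_strategy="epoch",           # `evaluation_strategy` on older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # patience: 5
)
trainer.train()
```
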
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("visolex/phobert-hsd-span")
model = AutoModelForTokenClassification.from_pretrained("visolex/phobert-hsd-span")
model.eval()

text = "Nói cái lol . t thấy thô tục vl"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits  # [batch, seq_len, num_labels]

# Argmax over the label dimension; check model.config.id2label for the exact
# mapping. Label 1 is taken to mark tokens inside a hate/offensive span.
preds = logits.argmax(dim=-1).squeeze(0).tolist()  # [seq_len]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Keep predicted span tokens, dropping special tokens such as <s> and </s>
span_tokens = [
    token
    for token, label in zip(tokens, preds)
    if label == 1 and token not in tokenizer.all_special_tokens
]

print("Span tokens:", span_tokens)
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
```
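
The checkpoint can also be tried through the `pipeline` API. A quick sketch, using the default per-token output (PhoBERT currently ships only a slow tokenizer, so offset-based `aggregation_strategy` options are not available):

```python
from transformers import pipeline

# Token-classification pipeline; emits one prediction per subword token
detector = pipeline("token-classification", model="visolex/phobert-hsd-span")
for item in detector("Nói cái lol . t thấy thô tục vl"):
    print(item["word"], item["entity"], round(item["score"], 3))
```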