---
language: vi
tags:
- hate-speech-detection
- vietnamese
- phobert
license: apache-2.0
datasets:
- visolex/ViHOS
metrics:
- precision
- recall
- f1
model-index:
- name: phobert-hsd-span
  results:
  - task:
      type: token-classification
      name: Hate Speech Span Detection
    dataset:
      name: ViHOS
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: <INSERT_PRECISION>
    - name: Recall
      type: recall
      value: <INSERT_RECALL>
    - name: F1 Score
      type: f1
      value: <INSERT_F1>
base_model:
- vinai/phobert-base
pipeline_tag: token-classification
---

# PhoBERT-HSD-Span

Fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) on [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS) for token-level hate/offensive span detection in Vietnamese.

## Model Details

* **Base Model**: [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base)
* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS)
* **Fine-tuning**: Hugging Face Transformers

### Hyperparameters

* Batch size: `16`
* Learning rate: `5e-5`
* Epochs: `100`
* Max sequence length: `128`
* Early stopping patience: `5`

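A minimal sketch of how these hyperparameters could be wired into the Transformers `Trainer` API. This is illustrative, not the authors' training script: the binary `num_labels=2` head, the `labels` column, and the dummy dataset are assumptions, and the real ViHOS preprocessing (aligning character-level span annotations to per-token labels) is omitted.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=2  # assumed in-span/out-of-span head
)

# Dummy stand-in for tokenized ViHOS: the real preprocessing tokenizes each
# comment and aligns the annotated spans to per-token 0/1 labels.
enc = tokenizer(["ví dụ"], truncation=True, max_length=128)  # max sequence length: 128
enc["labels"] = [[0] * len(enc["input_ids"][0])]
train_ds = eval_ds = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="phobert-hsd-span",
    per_device_train_batch_size=16,  # batch size: 16
    learning_rate=5e-5,              # learning rate: 5e-5
    num_train_epochs=100,            # epochs: 100 (early stopping usually ends sooner)
    eval_strategy="epoch",           # `evaluation_strategy` on older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # patience: 5
)
trainer.train()
```
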
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("visolex/phobert-hsd-span")
model = AutoModelForTokenClassification.from_pretrained("visolex/phobert-hsd-span")
model.eval()

text = "Nói cái lol . t thấy thô tục vl"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits  # [batch, seq_len, num_labels]

# Argmax over the label dimension; check model.config.id2label for the exact
# mapping. Label 1 is taken to mark tokens inside a hate/offensive span.
preds = logits.argmax(dim=-1).squeeze(0).tolist()  # [seq_len]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Keep predicted span tokens, dropping special tokens such as <s> and </s>
span_tokens = [
    token
    for token, label in zip(tokens, preds)
    if label == 1 and token not in tokenizer.all_special_tokens
]

print("Span tokens:", span_tokens)
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
```
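
The checkpoint can also be tried through the `pipeline` API. A quick sketch, using the default per-token output (PhoBERT currently ships only a slow tokenizer, so offset-based `aggregation_strategy` options are not available):

```python
from transformers import pipeline

# Token-classification pipeline; emits one prediction per subword token
detector = pipeline("token-classification", model="visolex/phobert-hsd-span")
for item in detector("Nói cái lol . t thấy thô tục vl"):
    print(item["word"], item["entity"], round(item["score"], 3))
```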