AnnyNguyen

Create README.md

4d2c5be verified 6 months ago


metadata

language: vi
tags:
  - hate-speech-detection
  - vietnamese
  - phobert
license: apache-2.0
datasets:
  - visolex/ViHOS
metrics:
  - precision
  - recall
  - f1
model-index:
  - name: phobert-hsd-span
    results:
      - task:
          type: token-classification
          name: Hate Speech Span Detection
        dataset:
          name: ViHOS
          type: custom
        metrics:
          - name: Precision
            type: precision
            value: <INSERT_PRECISION>
          - name: Recall
            type: recall
            value: <INSERT_RECALL>
          - name: F1 Score
            type: f1
            value: <INSERT_F1>
base_model:
  - vinai/phobert-base
pipeline_tag: token-classification

PhoBERT-HSD-Span

PhoBERT-HSD-Span is vinai/phobert-base fine-tuned on the visolex/ViHOS dataset for token-level detection of hateful and offensive spans in Vietnamese text.
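The metadata lists precision, recall, and F1 as metrics but does not say how they are computed. As a rough sketch only, assuming binary per-token labels (1 = part of a hate/offensive span) and token-level scoring, they could be computed as below. The function name `token_prf` and the toy labels are hypothetical and are not taken from ViHOS or from the authors' evaluation code.

```python
def token_prf(gold, pred):
    """Token-level precision, recall and F1 for binary labels (1 = flagged token)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels invented for illustration only (not ViHOS data).
gold = [0, 1, 1, 0, 0, 1]
pred = [0, 1, 0, 0, 1, 1]
precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```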

Model Details

Base Model: vinai/phobert-base
Dataset: visolex/ViHOS
Fine-tuning framework: Hugging Face Transformers

Hyperparameters

Batch size: 16
Learning rate: 5e-5
Epochs: 100 (maximum)
Max sequence length: 128
Early stopping patience: 5
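How the epoch limit and early stopping interact can be sketched in plain Python. This is a minimal illustration, assuming that "5" is a patience measured in epochs and that the monitored quantity is a validation score where higher is better; neither is stated in the model card. `TrainConfig`, `stopping_epoch`, and the score list are hypothetical names and values, not the authors' training code.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    batch_size: int = 16
    learning_rate: float = 5e-5
    max_epochs: int = 100
    max_seq_length: int = 128
    early_stopping_patience: int = 5  # assumption: patience counted in epochs


def stopping_epoch(val_scores, config):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores[: config.max_epochs], start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= config.early_stopping_patience:
                return epoch
    # Patience never ran out: stop when the scores or the epoch budget end.
    return min(len(val_scores), config.max_epochs)


config = TrainConfig()
# Invented validation scores, for illustration only.
scores = [0.50, 0.60, 0.65, 0.64, 0.65, 0.63, 0.62, 0.61, 0.60]
stopped_at = stopping_epoch(scores, config)
print("Training would stop after epoch", stopped_at)
```

With a patience of 5, a run rarely reaches the 100-epoch ceiling, so the epoch count works as an upper bound rather than a fixed training length.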

Usage

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("visolex/phobert-hsd-span")
model = AutoModelForTokenClassification.from_pretrained("visolex/phobert-hsd-span")

text = "Nói cái lol . t thấy thô tục vl"
# max_length=128 matches the maximum sequence length used for fine-tuning.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # [batch, seq_len, num_labels]

# Sigmoid treats each label as an independent binary decision.
# For a multi-class head, use softmax + argmax instead.
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()[0]  # [seq_len, num_labels] for the single input
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# The first label column is used as the span flag (1 = token belongs to a span).
span_labels = preds[:, 0].tolist()

# Keep flagged tokens, skipping special tokens such as <s> and </s>.
special_tokens = set(tokenizer.all_special_tokens)
span_tokens = [token for token, label in zip(tokens, span_labels) if label == 1 and token not in special_tokens]

print("Span tokens:", span_tokens)
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
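The usage snippet prints all flagged tokens as a single string, which merges spans that are not adjacent in the text. As a possible post-processing step, the sketch below groups consecutive flagged tokens into separate spans. The helper `group_spans` is hypothetical, and the tokens and labels are invented for illustration: they are whitespace-split words rather than real PhoBERT subword tokens, and they are not actual model output.

```python
def group_spans(tokens, labels, special_tokens=("<s>", "</s>", "<pad>")):
    """Group consecutive tokens labelled 1 into separate spans."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1 and token not in special_tokens:
            current.append(token)
        elif current:
            # An unflagged (or special) token closes the running span.
            spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


# Invented example: these labels are NOT predictions from the model.
tokens = ["<s>", "Nói", "cái", "lol", ".", "t", "thấy", "thô", "tục", "vl", "</s>"]
labels = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0]
spans = group_spans(tokens, labels)
for span in spans:
    print(" ".join(span))
```

With real model output, each inner list could be passed to `tokenizer.convert_tokens_to_string` to merge subword pieces back into readable text.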