	---
	license: mit
	base_model: vinai/phobert-base
	tags:
	- vietnamese
	- hate-speech-detection
	- text-classification
	- offensive-language-detection
	datasets:
	- visolex/vihsd
	metrics:
	- accuracy
	- macro-f1
	- weighted-f1
	model-index:
	- name: sphobert-hsd
	  results:
	  - task:
	      type: text-classification
	      name: Hate Speech Detection
	    dataset:
	      name: ViHSD
	      type: hate-speech-detection
	    metrics:
	    - type: accuracy
	      value: 0.9143
	    - type: macro-f1
	      value: 0.7378
	    - type: weighted-f1
	      value: 0.9096
	    - type: macro-precision
	      value: 0.7897
	    - type: macro-recall
	      value: 0.7027
	---

	# SPhoBERT: Hate Speech Detection for Vietnamese Text

	This model is a fine-tuned version of [vinai/phobert-base](https://huggingface.co/vinai/phobert-base)
	trained on the ViHSD (Vietnamese Hate Speech Detection) dataset. It classifies Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE.

	## Model Details

	* Base Model: vinai/phobert-base
	* Description: SPhoBERT fine-tuned for Vietnamese hate speech classification
	* Architecture: SPhoBERT (PhoBERT with syllable-level tokenization)
	* Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
	* Fine-tuning Framework: HuggingFace Transformers + PyTorch
	* Task: Hate Speech Classification (3 classes)

	### Hyperparameters

	* Batch size: `32`
	* Learning rate: `2e-5`
	* Epochs: `100`
	* Max sequence length: `256`
	* Weight decay: `0.01`
	* Warmup steps: `500`
	* Early stopping patience: `5`
	* Optimizer: AdamW
	* Learning rate scheduler: Cosine with warmup
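
To make the "cosine with warmup" entry concrete, here is a minimal pure-Python sketch of how such a schedule is commonly shaped: linear warmup to the peak learning rate, then cosine decay to zero. This is the same shape `get_cosine_schedule_with_warmup` in Transformers produces with its default settings, but it is only an illustration and not the training script used for this model. The `total_steps` value is an assumption, since the total number of optimizer steps is not stated here.

```python
import math


def cosine_lr_with_warmup(step, base_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Learning rate at a given optimizer step.

    base_lr and warmup_steps come from the hyperparameter list above;
    total_steps is a hypothetical value chosen for illustration only.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points along the schedule.
for step in (0, 250, 500, 2750, 5000):
    print(step, cosine_lr_with_warmup(step))
```

The rate peaks at `2e-5` exactly when warmup ends (step 500) and is back to half of that at the midpoint of the decay phase.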

	## Dataset

	The model was trained on ViHSD (Vietnamese Hate Speech Detection Dataset), which contains ~10,000 Vietnamese comments from social media.

	### Label Descriptions

	* CLEAN (0): Normal content without offensive language
	* OFFENSIVE (1): Mildly offensive or inappropriate content
	* HATE (2): Hate speech, extremist language, severe threats

	## Evaluation Results

	The model was evaluated on the test set with the following metrics:

	* Accuracy: `0.9143`
	* Macro-F1: `0.7378`
	* Weighted-F1: `0.9096`
	* Macro-Precision: `0.7897`
	* Macro-Recall: `0.7027`
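
The gap between Weighted-F1 and Macro-F1 above is the usual pattern when one class dominates: macro averaging treats every class equally, while weighted averaging scales each class by its support. The sketch below shows how the two averages are computed from a confusion matrix. The matrix is made up for illustration and is not this model's confusion matrix.

```python
def per_class_f1(confusion):
    """F1 per class; confusion[i][j] counts true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def macro_and_weighted_f1(confusion):
    """Return (macro-F1, weighted-F1) for a square confusion matrix."""
    f1s = per_class_f1(confusion)
    support = [sum(row) for row in confusion]
    macro = sum(f1s) / len(f1s)
    weighted = sum(f * s for f, s in zip(f1s, support)) / sum(support)
    return macro, weighted


# Hypothetical, imbalanced counts (rows: true CLEAN, OFFENSIVE, HATE).
toy_confusion = [
    [90, 5, 5],
    [4, 4, 2],
    [3, 2, 5],
]
macro_f1, weighted_f1 = macro_and_weighted_f1(toy_confusion)
print(f"macro-F1={macro_f1:.4f}  weighted-F1={weighted_f1:.4f}")
```

In this toy case the majority class scores well and pulls the weighted average up, while the two minority classes drag the macro average down.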

	## Basic Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	model_name = "visolex/sphobert-hsd"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Classify text (truncate to the 256-token limit used during training)
	text = "Văn bản tiếng Việt cần phân loại"
	inputs = tokenizer(
	    text, return_tensors="pt", padding=True, truncation=True, max_length=256
	)

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_label = torch.argmax(predictions, dim=-1).item()

	# Label mapping
	label_names = {
	    0: "CLEAN",
	    1: "OFFENSIVE",
	    2: "HATE",
	}

	print(f"Predicted label: {label_names[predicted_label]}")
	print(f"Confidence scores: {predictions[0].tolist()}")
	```
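
The confidence scores printed by the snippet above are a plain list of three probabilities in label order. If a more readable result is useful downstream, a small helper along these lines could wrap that list. `decode_prediction` and the example scores are hypothetical and not part of the model or of Transformers.

```python
LABEL_NAMES = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}


def decode_prediction(scores):
    """Turn a list of class probabilities into a labelled result dict."""
    if len(scores) != len(LABEL_NAMES):
        raise ValueError(f"expected {len(LABEL_NAMES)} scores, got {len(scores)}")
    # Index of the highest probability.
    best = max(range(len(scores)), key=scores.__getitem__)
    return {
        "label": LABEL_NAMES[best],
        "confidence": scores[best],
        "scores": {LABEL_NAMES[i]: s for i, s in enumerate(scores)},
    }


# Hypothetical scores, e.g. what predictions[0].tolist() might look like.
example = decode_prediction([0.05, 0.15, 0.80])
print(example["label"], example["confidence"])
```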



	## Training Details

	### Training Data
	- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
	- Total samples: ~10,000 Vietnamese comments from social media
	- Training split: ~70%
	- Validation split: ~15%
	- Test split: ~15%
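
For readers who want to reproduce similar proportions on their own data, the sketch below shows one simple way to get a 70/15/15 split. How the split was actually produced for this model is not stated here (the dataset may ship with predefined splits), so the shuffling, the seed, and the `split_indices` helper are all assumptions.

```python
import random


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, about 15%
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With imbalanced labels such as these, a stratified split that preserves class proportions in each part is often preferable to a purely random one.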

	### Training Configuration
	- Framework: PyTorch + HuggingFace Transformers
	- Optimizer: AdamW
	- Learning Rate: 2e-5
	- Batch Size: 32
	- Max Length: 256 tokens
	- Epochs: 100 (with early stopping patience: 5)
	- Weight Decay: 0.01
	- Warmup Steps: 500
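
With a patience of 5, training rarely runs for all 100 epochs: it stops once the validation score has failed to improve for five consecutive evaluations. The sketch below illustrates that rule in plain Python. The monitored metric and the per-epoch scores are assumptions for illustration; in Transformers the same behaviour is typically obtained with `EarlyStoppingCallback`.

```python
class EarlyStopping:
    """Signal a stop when the score has not improved for `patience` evaluations."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = None
        self.bad_epochs = 0

    def step(self, score):
        """Record a validation score (higher is better); return True to stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            self.best_score = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation macro-F1 per epoch.
val_scores = [0.60, 0.68, 0.71, 0.73, 0.72, 0.73, 0.72, 0.71, 0.72, 0.70]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, score in enumerate(val_scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best score {stopper.best_score}")
```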


	## Contact & Support

	- GitHub: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
	- Issues: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
	- Questions: Open a discussion on the model's Hugging Face page

	## License

	This model is distributed under the MIT License.

	## Acknowledgments

	- Base model: [vinai/phobert-base](https://huggingface.co/vinai/phobert-base)
	- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
	- Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
	- ViSoLex Toolkit

	---