Update README.md
---
base_model: meta-llama/meta-Llama-3.1-8B-Instruct
library_name: peft
tags:
- llama
- llama-3.1
- llama-3.1-8b
- safety
- content-moderation
- predatory-detection
- harmful-content
license: mit
language:
- en
- es
---

# Heaven 1.1 Base - Safeguarding Against Predatory Messages

## Model Details

### Model Description

- **Developed by:** SafeCircle
- **Model type:** Llama 3.1 8B finetuned for predatory content detection
- **Language(s):** English, Spanish
- **License:** Same as base model (Llama 3.1)
- **Finetuned from model:** [meta-llama/meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/meta-Llama-3.1-8B-Instruct)

Heaven 1.1 is a specialized model designed to detect and classify potentially harmful messages in online conversations, with a particular focus on identifying grooming, solicitation, and predatory communication patterns targeting minors. Building on [safecircleai/heaven1-base](https://huggingface.co/safecircleai/heaven1-base), this model was finetuned on the `heaven_dataset_refined.csv` dataset using GRPO (Group Relative Policy Optimization) training with Unsloth.

### Training Procedure

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions designed to identify harmful content patterns (a sketch follows the list):

- XML format validation rewards
- Content assessment rewards for harmful content detection
- Correctness rewards based on labeled data
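
The exact reward functions are not published with this card. As a minimal sketch, assuming TRL's `GRPOTrainer` reward-function interface (function names, regexes, and reward values below are illustrative assumptions), the format and correctness rewards could look like this:

```python
import re

# NOTE: hypothetical sketch -- the actual reward functions used for
# Heaven 1.1 are not published; names and reward values are assumptions.
FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(harmful|safe)\s*</answer>", re.IGNORECASE)

def format_reward(completions, **kwargs):
    """Reward completions that follow the <reasoning>/<answer> XML layout."""
    texts = [completion[0]["content"] for completion in completions]
    return [0.5 if FORMAT_RE.search(text) else 0.0 for text in texts]

def correctness_reward(completions, answer, **kwargs):
    """Reward completions whose <answer> label matches the gold label."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = ANSWER_RE.search(completion[0]["content"])
        predicted = match.group(1).lower() if match else None
        rewards.append(2.0 if predicted == gold else 0.0)
    return rewards
```
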
#### Training Hyperparameters

- **Training regime:** 4-bit quantization with LoRA
- **LoRA rank:** 32
- **Learning rate:** 5e-6
- **Batch size:** 2 per device
- **Gradient accumulation steps:** 2
- **Optimizer:** AdamW 8-bit
- **Training steps:** 250
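
For concreteness, the hyperparameters above could be wired together as follows; this is a sketch assuming Unsloth's `FastLanguageModel` and TRL's `GRPOTrainer` (the actual training script is not published, and `dataset` plus the reward functions from the sketch above are assumptions):

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Base model in 4-bit, with vLLM fast inference for GRPO rollouts
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

# Attach a rank-32 LoRA adapter to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="adamw_8bit",
    max_steps=250,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward, correctness_reward],  # see sketch above
    args=training_args,
    train_dataset=dataset,  # assumed: prompts + gold labels from the training CSV
)
trainer.train()
```
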
#### Hardware

- 2x NVIDIA GeForce RTX 4090 GPUs with tensor parallelism

## Uses

### Direct Use

This model is intended for use in content moderation systems, online safety monitoring, and research into harmful content detection. It can analyze messages and determine whether they contain potentially harmful or predatory content.

### Downstream Use

- Content filtering systems for social media and chat platforms
- Educational tools for recognizing harmful communication patterns
- Safety tools to protect minors online
- Research into online predatory behavior detection

### Out-of-Scope Use

This model should not be used:
- As the sole decision-maker for content moderation without human review
- For surveillance purposes that violate privacy rights
- To analyze communications without appropriate consent and safeguards
- To profile individuals based on their communication patterns

## Bias, Risks, and Limitations

- The model may generate false positives or false negatives in content detection.
- The model's effectiveness depends on the quality and diversity of its training data.
- The model may carry cultural or contextual biases from its training data.
- The model should be regularly evaluated against evolving patterns of harmful communication.

### Recommendations

- Use this model as part of a larger content moderation system that includes human review.
- Continuously evaluate the model's performance against diverse test cases.
- Be transparent with users about automated content moderation practices.
- Provide clear appeal processes for content flagged by the model.

## How to Get Started with the Model

```python
from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load the base model with Unsloth (4-bit, with vLLM fast inference)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)

# Load the GRPO-trained LoRA adapter
lora_request = model.load_lora("path/to/heaven1.1-base/grpo_saved_lora")

SYSTEM_PROMPT = """
Analyze the following message and determine if it contains harmful or predatory content. Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
[harmful/safe]
</answer>
"""

def check_message(message: str) -> str:
    """Classify a single message as harmful or safe."""
    text = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

    return model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=lora_request,
    )[0].outputs[0].text

# Example usage
result = check_message(
    "Hey there! Can you tell me what time you finish school? "
    "My cousin is your age and I was wondering if you'd like to meet up sometime?"
)
print(result)
```
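
The raw output includes the model's reasoning; to act on just the verdict, a small helper (hypothetical, not part of the original card) can extract the `<answer>` label:

```python
import re

def extract_verdict(output: str):
    """Pull the harmful/safe label out of the model's XML-formatted reply."""
    match = re.search(r"<answer>\s*(harmful|safe)\s*</answer>", output, re.IGNORECASE)
    return match.group(1).lower() if match else None

print(extract_verdict(result))  # e.g. "harmful"
```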

## Training Details

### Training Data

The model was trained on the `heaven_dataset_v2` dataset, which contains carefully labeled examples of both harmful and normal conversational messages. This dataset is specifically designed to help the model identify patterns of grooming, solicitation, and other predatory behavior in online conversations.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model's performance was evaluated on a held-out portion of the `heaven_dataset_refined.csv` dataset.

#### Metrics

The model was evaluated on the following criteria (a scoring sketch for the first two follows the list):
- Accuracy in correctly identifying harmful vs. safe content
- Format adherence (correct output structure)
- Reasoning quality
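
As a concrete illustration (not the original evaluation script), accuracy and format adherence can be computed over a labeled held-out split like this, reusing the `check_message` and `extract_verdict` helpers defined earlier; `test_set` is an assumed list of (message, label) pairs:

```python
# Hypothetical evaluation loop; `test_set` is an assumed list of
# (message, gold_label) pairs with labels "harmful" / "safe".
test_set = [
    ("What homework do you have tonight?", "safe"),
    ("Don't tell your parents we talk. Can you send me a photo?", "harmful"),
]

correct = 0
well_formed = 0
for message, gold in test_set:
    verdict = extract_verdict(check_message(message))
    well_formed += verdict is not None  # format adherence
    correct += verdict == gold          # accuracy

print(f"accuracy: {correct / len(test_set):.2%}")
print(f"format adherence: {well_formed / len(test_set):.2%}")
```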
## Environmental Impact

- **Hardware Type:** 2x NVIDIA GeForce RTX 4090 GPUs
- **Training duration:** ~1 hour

## Model Card Authors

Tomas Palma

## Model Card Contact

For questions about this model, please send an email to contact@safecircle.tech.