tomasps committed
Commit 482d40e · verified · 1 Parent(s): 5133e26

Update README.md

Files changed (1)
  1. README.md +155 -12
README.md CHANGED
@@ -1,22 +1,165 @@
  ---
- base_model: unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit
  tags:
- - text-generation-inference
- - transformers
- - unsloth
  - llama
- - trl
- license: apache-2.0
  language:
  - en
  ---

- # Uploaded model

- - **Developed by:** safecircleai
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit

- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
+ base_model: meta-llama/meta-Llama-3.1-8B-Instruct
+ library_name: peft
  tags:
  - llama
+ - llama-3.1
+ - llama-3.1-8b
+ - safety
+ - content-moderation
+ - predatory-detection
+ - harmful-content
+ license: mit
  language:
  - en
+ - es
  ---

+ # Heaven 1.1 Base - Safeguarding Against Predatory Messages

+ ![Heaven 1.1 Banner](./heaven11.png)

+ ## Model Details

+ ### Model Description
+
+ - **Developed by:** SafeCircle
+ - **Model type:** Llama 3.1 8B finetuned for predatory content detection
+ - **Language(s):** English, Spanish
+ - **License:** Same as base model (Llama 3.1)
+ - **Finetuned from model:** [meta-llama/meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/meta-Llama-3.1-8B-Instruct)
+
+ Heaven 1.1 is a specialized model designed to detect and classify potentially harmful messages in online conversations, with a particular focus on identifying grooming, solicitation, and predatory communication patterns targeting minors. Building on [safecircleai/heaven1-base](https://huggingface.co/safecircleai/heaven1-base), this model was finetuned on the `heaven_dataset_refined.csv` dataset using GRPO (Group Relative Policy Optimization) training with Unsloth.
+
+ ### Training Procedure
+
+ The model was trained using GRPO (Group Relative Policy Optimization) with customized reward functions designed to identify harmful content patterns (a sketch of one such reward follows the list below):
+
+ - XML format validation rewards
+ - Content assessment rewards for harmful content detection
+ - Correctness rewards based on labeled data
+
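+ The reward implementations themselves are not included in this card; the sketch below shows what the format and correctness rewards could look like in the TRL/Unsloth GRPO style (function names, reward values, and the `completions`/`answer` structure are assumptions, not the exact training code):
+
+ ```python
+ import re
+
+ # Expected completion structure: a <reasoning> block followed by an <answer>
+ # block whose content is either "harmful" or "safe".
+ FORMAT_RE = re.compile(
+     r"^\s*<reasoning>\n.*?\n</reasoning>\n<answer>\n(harmful|safe)\n</answer>\s*$",
+     re.DOTALL,
+ )
+
+ def xml_format_reward(completions, **kwargs):
+     """Reward completions that follow the XML output template."""
+     responses = [c[0]["content"] for c in completions]  # chat-style completions
+     return [0.5 if FORMAT_RE.match(r) else 0.0 for r in responses]
+
+ def correctness_reward(completions, answer, **kwargs):
+     """Reward completions whose <answer> verdict matches the gold label."""
+     responses = [c[0]["content"] for c in completions]
+     extracted = [r.split("<answer>")[-1].split("</answer>")[0].strip() for r in responses]
+     return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
+ ```
+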
+ #### Training Hyperparameters
+
+ - **Training regime:** 4-bit quantization with LoRA
+ - **LoRA rank:** 32
+ - **Learning rate:** 5e-6
+ - **Batch size:** 2 per device
+ - **Gradient accumulation steps:** 2
+ - **Optimizer:** AdamW 8-bit
+ - **Training steps:** 250
+
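+ The training script is not published in this card; the following is a minimal sketch of how the listed hyperparameters could be wired into an Unsloth + TRL GRPO run (the LoRA target modules, `num_generations`, and dataset preparation are assumptions):
+
+ ```python
+ from unsloth import FastLanguageModel
+ from trl import GRPOConfig, GRPOTrainer
+
+ # Base model in 4-bit, LoRA rank 32 (values taken from the list above)
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
+     max_seq_length=1024,
+     load_in_4bit=True,
+     fast_inference=True,
+ )
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=32,
+     lora_alpha=32,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ training_args = GRPOConfig(
+     learning_rate=5e-6,
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=2,
+     optim="adamw_8bit",
+     max_steps=250,
+     num_generations=2,  # assumed; must divide the effective batch size
+ )
+
+ trainer = GRPOTrainer(
+     model=model,
+     processing_class=tokenizer,
+     reward_funcs=[xml_format_reward, correctness_reward],  # sketched above
+     args=training_args,
+     train_dataset=train_dataset,  # chat prompts + gold labels (see Training Data below)
+ )
+ trainer.train()
+ ```
+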
+ #### Hardware
+
+ - 2x NVIDIA GeForce RTX 4090 GPUs with tensor parallelism
+
+ ## Uses
+
+ ### Direct Use
+
+ This model is intended for use in content moderation systems, online safety monitoring, and research into harmful content detection. It can analyze messages and determine whether they contain potentially harmful or predatory content.
+
+ ### Downstream Use
+
+ - Content filtering systems for social media and chat platforms
+ - Educational tools for recognizing harmful communication patterns
+ - Safety tools to protect minors online
+ - Research into online predatory behavior detection
+
+ ### Out-of-Scope Use
+
+ This model should not be used:
+ - As the sole decision-maker for content moderation without human review
+ - For surveillance purposes that violate privacy rights
+ - To analyze communications without appropriate consent and safeguards
+ - To profile individuals based on their communication patterns
+
+ ## Bias, Risks, and Limitations
+
+ - The model may generate false positives or false negatives in content detection.
+ - The model's effectiveness depends on the quality and diversity of its training data.
+ - The model may have cultural or contextual biases based on its training data.
+ - The model should be regularly evaluated against evolving patterns of harmful communication.
+
+ ### Recommendations
+
+ - Use this model as part of a larger content moderation system that includes human review.
+ - Continuously evaluate the model's performance against diverse test cases.
+ - Be transparent with users about automated content moderation practices.
+ - Provide clear appeal processes for content flagged by the model.
+
+ ## How to Get Started with the Model
+
+ ```python
+ from unsloth import FastLanguageModel
+ from vllm import SamplingParams
+ import torch
+
+ # Load the base model with Unsloth (4-bit, with vLLM fast inference enabled)
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
+     max_seq_length = 1024,
+     load_in_4bit = True,
+     fast_inference = True,
+ )
+
+ # Load the GRPO-trained LoRA adapter
+ lora_request = model.load_lora('path/to/heaven1.1-base/grpo_saved_lora')
+
+ # Check whether a message is harmful
+ def check_message(message):
+     system_prompt = (
+         '\nAnalyze the following message and determine if it contains harmful or predatory content. '
+         'Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n[harmful/safe]\n</answer>\n'
+     )
+
+     text = tokenizer.apply_chat_template([
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": message}
+     ], tokenize=False, add_generation_prompt=True)
+
+     sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
+
+     output = model.fast_generate(
+         [text],
+         sampling_params=sampling_params,
+         lora_request=lora_request,
+     )[0].outputs[0].text
+
+     return output
+
+ # Example usage
+ result = check_message("Hey there! Can you tell me what time you finish school? My cousin is your age and I was wondering if you'd like to meet up sometime?")
+ print(result)
+ ```
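+
+ The model answers in the `<reasoning>`/`<answer>` format requested by the system prompt. A small helper such as the one below (an illustrative addition, not part of the original card) can extract the final verdict:
+
+ ```python
+ import re
+
+ def extract_verdict(response: str) -> str:
+     """Return 'harmful' or 'safe' from the model's <answer> block, or 'unknown' if absent."""
+     match = re.search(r"<answer>\s*(harmful|safe)\s*</answer>", response, re.IGNORECASE)
+     return match.group(1).lower() if match else "unknown"
+
+ print(extract_verdict(result))  # e.g. "harmful" for the example message above
+ ```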
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the `heaven_dataset_v2` dataset, which contains carefully labeled examples of both harmful and normal conversational messages. This dataset is specifically designed to help the model identify patterns of grooming, solicitation, and other predatory behavior in online conversations.
+
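+ The dataset schema is not documented in this card; a plausible shape, given the GRPO setup above, is one labeled message per row mapped into chat-style prompts (the column names below are hypothetical and used only for illustration):
+
+ ```python
+ from datasets import load_dataset
+
+ SYSTEM_PROMPT = (
+     "\nAnalyze the following message and determine if it contains harmful or predatory content. "
+     "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n[harmful/safe]\n</answer>\n"
+ )
+
+ # Hypothetical columns: "message" (text) and "label" ("harmful" or "safe")
+ raw = load_dataset("csv", data_files="heaven_dataset_refined.csv", split="train")
+
+ def to_grpo_example(row):
+     return {
+         "prompt": [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": row["message"]},
+         ],
+         "answer": row["label"],
+     }
+
+ train_dataset = raw.map(to_grpo_example)
+ ```
+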
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The model's performance was evaluated on a held-out portion of the `heaven_dataset_refined.csv` dataset.
+
+ #### Metrics
+
+ The model was evaluated based on:
+ - Accuracy in correctly identifying harmful vs. safe content
+ - Format adherence (correct output structure)
+ - Reasoning quality
+
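+ As a rough illustration, the first two metrics can be computed from model outputs with the `check_message` and `extract_verdict` helpers shown earlier (the held-out split itself is not published, so this is only a sketch):
+
+ ```python
+ def evaluate(examples):
+     """examples: iterable of (message, gold_label) pairs, gold_label in {"harmful", "safe"}."""
+     examples = list(examples)
+     correct = well_formatted = 0
+     for message, gold_label in examples:
+         verdict = extract_verdict(check_message(message))
+         well_formatted += verdict != "unknown"   # format adherence
+         correct += verdict == gold_label         # accuracy
+     n = len(examples)
+     return {"accuracy": correct / n, "format_adherence": well_formatted / n}
+ ```
+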
+ ## Environmental Impact
+
+ - **Hardware Type:** 2x NVIDIA GeForce RTX 4090 GPUs
+ - **Training duration:** ~1 hour
+
+ ## Model Card Authors
+
+ Tomas Palma
+
+ ## Model Card Contact
+
+ For questions about this model, please send an email to contact@safecircle.tech.