---
license: apache-2.0
language:
- en
- ko
tags:
- gemma-2
- KINS-ai
base_model:
- google/gemma-2-27b
pipeline_tag: text-generation
library_name: transformers
---

# **Introduction**

### About the Model

We introduce ATOMIS, developed by the Korea Institute of Nuclear Safety (KINS). ATOMIS is a 32-billion-parameter large language model (LLM) designed specifically for the nuclear field. It achieves state-of-the-art performance among its peers on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation results tables below for details.

## Key Features

- **Korean real-world use cases:** The model understands and generates Korean text with high accuracy, making it suitable for practical scenarios.
- **Specialized in the nuclear domain:** The model has been trained on a vast, specialized corpus of nuclear data.
- **RAG:** The model delivers accurate answers grounded in real documents, thanks to its strong RAG performance.

### Pre-Training

We created the base model by expanding the layers of gemma-2-27b with a passthrough method. We then extended the context length to 32K with RoPE and performed continual pretraining to restore the model's performance.
In particular, to teach the model specialized knowledge of the nuclear domain, we included the following data (an illustrative sketch of the layer expansion follows this list):

- Atomic Wiki (https://atomic.snu.ac.kr)
- NText (https://paperswithcode.com/dataset/ntext)
- In-house data from KINS (Korea Institute of Nuclear Safety)

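The passthrough expansion and 32K extension are described only at a high level here. The snippet below is a rough, hypothetical sketch of what such a depth up-scaling step can look like with plain `transformers`; the split point, the 32K target, and all other values are illustrative assumptions, not the actual ATOMIS recipe.

```python
# Hypothetical sketch of passthrough-style depth expansion; NOT the actual ATOMIS recipe.
import copy

import torch
from torch import nn
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-2-27b")
config.max_position_embeddings = 32768  # assumed 32K context target

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    config=config,
    torch_dtype=torch.bfloat16,
)

# Passthrough expansion: append a copy of the upper half of the decoder stack,
# then run continued pretraining to "heal" the enlarged model.
layers = model.model.layers
split = len(layers) // 2  # illustrative split point
expanded = nn.ModuleList(list(layers) + [copy.deepcopy(layer) for layer in layers[split:]])
model.model.layers = expanded
model.config.num_hidden_layers = len(expanded)
# In practice, per-layer attributes (e.g. layer_idx) and the RoPE base would also
# need to be re-initialized before continued pretraining.

print(f"decoder layers: {len(layers)} -> {len(expanded)}")
```
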
### Post-Training

The fine-tuning data includes over 1M examples from publicly available instruction datasets as well as high-quality synthetic data. We use this dataset to perform supervised fine-tuning (SFT) followed by direct preference optimization (DPO).

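As a reference for the second stage, the DPO objective can be written as a small standalone PyTorch function. This is a generic textbook implementation, not the ATOMIS training code; the beta value and the toy log-probabilities below are placeholders.

```python
# Generic sketch of the DPO loss (Rafailov et al., 2023); NOT the ATOMIS training code.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy being trained
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy being trained
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) under the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x) under the frozen reference model
    beta: float = 0.1,                    # illustrative value; the actual beta is not published
) -> torch.Tensor:
    """-log sigmoid(beta * [(chosen margin) - (rejected margin)]), averaged over the batch."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Toy usage with a batch of three sequence log-probabilities.
loss = dpo_loss(
    torch.tensor([-12.0, -9.5, -11.0]),
    torch.tensor([-14.0, -10.0, -15.5]),
    torch.tensor([-12.5, -9.8, -11.2]),
    torch.tensor([-13.0, -10.1, -14.0]),
)
print(loss)
```
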
# **How to use**

```python
# pip install transformers==4.43.4 or later
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "안녕하세요?"},  # "Hello?" in Korean
]

# Build the chat-formatted prompt; add_generation_prompt appends the assistant turn header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

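Since RAG is a headline feature, here is a small, hypothetical sketch of how retrieved passages could be placed into the chat template, continuing from the snippet above (so `tokenizer` and `model` are already loaded). The retrieval step itself, the placeholder passages, and the prompt wording are illustrative assumptions, not an official usage pattern.

```python
# Hypothetical RAG-style prompting; the passages, question, and prompt wording are placeholders.
passages = [
    "Passage 1: (text of a retrieved regulatory document goes here)",
    "Passage 2: (text of another retrieved document goes here)",
]
question = "Summarize the periodic inspection requirements described in the passages."

context = "\n\n".join(passages)
prompt = (
    "Answer the question using only the reference documents below.\n\n"
    f"[Reference documents]\n{context}\n\n"
    f"[Question]\n{question}"
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512)
# Print only the newly generated tokens, without the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
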
# **Evaluation**

### Overall

| Model | LogicKor | NuclearQA | RAGEval | Avg |
|---|---|---|---|---|
| **c4ai-command-r-08-2024** | 8.27 | 7.82 | 9.41 | 8.50 |
| **gemma-2-27b-it** | 8.66 | 8.18 | 8.97 | 8.60 |
| **Qwen2.5-32B-instruct** | 8.93 | 8.61 | 9.36 | 8.97 |
| **phi-4** | 8.62 | 8.67 | 9.55 | 8.95 |
| **Mistral-Small-24B-Instruct-2501** | 8.36 | 8.68 | 9.04 | 8.69 |
| **Llama-3.3-70b-instruct** | 7.94 | 8.42 | 9.25 | 8.54 |
| **ATOMIS** | 9.00 | 8.72 | 9.65 | **9.12** |

### LogicKor

We evaluated performance using the [LogicKor](https://github.com/instructkr/LogicKor) code. As the judge model, we employed the officially recommended GPT-4-1106-preview. These scores reflect only the default zero-shot evaluation. (A generic sketch of this judge-based scoring appears after the table.)

| Model | Math | Reasoning | Coding | Writing | Understanding | Grammar | Single-turn | Multi-turn | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **c4ai-command-r-08-2024** | 6.14 | 7.36 | 9.43 | 9.64 | 9.21 | 7.86 | 8.05 | 8.52 | 8.27 |
| **gemma-2-27b-it** | 8.93 | 8.29 | 8.43 | 9.29 | 9.43 | 7.57 | 8.43 | 8.88 | 8.66 |
| **Qwen2.5-32B-instruct** | 8.79 | 8.64 | 9.36 | 9.50 | 9.29 | 8.00 | 8.79 | 9.10 | 8.93 |
| **phi-4** | 8.79 | 9.21 | 9.86 | 9.21 | 9.00 | 5.64 | 8.50 | 8.74 | 8.62 |
| **Mistral-Small-24B-Instruct-2501** | 8.00 | 8.14 | 9.36 | 9.43 | 8.50 | 6.71 | 8.29 | 8.43 | 8.36 |
| **Llama-3.3-70b-instruct** | 7.43 | 6.50 | 8.79 | 8.43 | 8.64 | 7.86 | 8.14 | 7.74 | 7.94 |
| **ATOMIS** | 8.36 | 8.71 | 9.79 | 9.64 | 8.29 | 9.21 | 9.14 | 8.86 | **9.00** |

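For readers unfamiliar with judge-based scoring, the following is a simplified, generic sketch of how a single answer can be scored by a judge model such as GPT-4-1106-preview. It is not the LogicKor evaluation code; the prompt, the 0-10 scale handling, and the placeholder strings are assumptions for illustration only.

```python
# Generic LLM-as-judge scoring sketch; NOT the LogicKor benchmark code.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator. Score the assistant's answer from 0 to 10.

[Question]
{question}

[Assistant answer]
{answer}

Reply with the numeric score only."""


def judge_score(question: str, answer: str, judge_model: str = "gpt-4-1106-preview") -> float:
    """Ask the judge model for a single numeric score for one (question, answer) pair."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    return float(response.choices[0].message.content.strip())


# Example: score one answer produced by the model under evaluation.
print(judge_score("(benchmark question)", "(model answer)"))
```
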
### NuclearQA

We employed NuclearQA [1], a human-made benchmark of 100 questions designed by experts to evaluate language models in the nuclear domain.

We then used this question set to assess the LLM's responses in a manner similar to the LogicKor benchmark.

[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.

| Model | Easy | Medium | Hard | General | Scientific | Numerical | Num+Sci | Avg |
|---|---|---|---|---|---|---|---|---|
| **c4ai-command-r-08-2024** | 8.77 | 8.21 | 6.47 | 7.73 | 8.38 | 7.35 | 7.35 | 7.82 |
| **gemma-2-27b-it** | 8.97 | 8.24 | 7.33 | 7.92 | 8.23 | 8.12 | 8.45 | 8.18 |
| **Qwen2.5-32B-instruct** | 8.97 | 8.42 | 8.38 | 8.54 | 8.15 | 8.76 | 9.03 | 8.61 |
| **phi-4** | 8.94 | 8.97 | 8.11 | 8.46 | 8.73 | 9.00 | 8.50 | 8.67 |
| **Mistral-Small-24B-Instruct-2501** | 9.13 | 8.76 | 8.14 | 8.41 | 8.81 | 8.59 | 8.95 | 8.68 |
| **Llama-3.3-70b-instruct** | 9.29 | 8.58 | 7.44 | 8.22 | 8.62 | 8.47 | 8.35 | 8.42 |
| **ATOMIS** | 9.10 | 8.64 | 8.31 | 8.16 | 9.00 | 8.71 | 9.10 | **8.72** |

### RAGEval

We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy using three novel metrics: Completeness, Hallucination, and Irrelevance.

We evaluated performance using the [RAGEval](https://github.com/OpenBMB/RAGEval) code. As the judge model, we employed the officially recommended gpt-4o. The scores below reflect only the Completeness metric of the single-document QA evaluation, broken down by question type.

[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.

| Model | Factual | Summarization | Multi-hop Reasoning | Avg |
|---|---|---|---|---|
| **c4ai-command-r-08-2024** | 1.000 | 0.913 | 0.908 | 0.941 |
| **gemma-2-27b-it** | 0.987 | 0.890 | 0.814 | 0.897 |
| **Qwen2.5-32B-instruct** | 0.980 | 0.906 | 0.923 | 0.936 |
| **phi-4** | 1.000 | 0.931 | 0.934 | 0.955 |
| **Mistral-Small-24B-Instruct-2501** | 0.980 | 0.951 | 0.781 | 0.904 |
| **Llama-3.3-70b-instruct** | 0.977 | 0.907 | 0.893 | 0.925 |
| **ATOMIS** | 0.993 | 0.942 | 0.960 | **0.965** |