The base model is released by Microsoft Research under the MIT License:
https://huggingface.co/microsoft/Fara-7B
Instructions are provided below to convert this model into GGUF format. For demonstration purposes, Ollama inference is also shown at the end.
Quantization Process
The conversion process was carried out on Ubuntu 22.04 (Linux).
Getting Ready Step-1 - Install Essential Tools and Dependencies: The llama.cpp project requires several tools for compilation and networking.
sudo apt update
sudo apt install -y build-essential
sudo apt install -y python3 python3-pip git-lfs
sudo apt install -y cmake
sudo apt install -y libcurl4-openssl-dev
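Optional check before continuing: confirm the toolchain is available and initialize Git LFS so the model weights can be fetched in the later steps (exact versions will vary by system).
gcc --version
cmake --version
python3 --version
git lfs install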
Getting Ready Step-2 - Clone and Configure llama.cpp: Get the latest source code for the project.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
Getting Ready Step-3 - Build the llama.cpp Executables
Compile all targets (including llama-quantize, llama-cli, and llama-server)
-j$(nproc) uses all available CPU cores for faster compilation
rm -rf build
cmake -B build
cmake --build build --config Release -j$(nproc)
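Optionally, confirm that the binaries used later in this guide were built (paths assume the default build directory shown above):
ls build/bin/llama-cli build/bin/llama-server build/bin/llama-quantize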
Build Step-1 - Download and Convert the Model (HF to GGUF FP16)
This step downloads the original Hugging Face model and converts it into the initial, unquantized GGUF FP16 format. The process requires roughly 18 GB of system RAM and produces a file of about 15.2 GB.
Create a Directory for the Model Weights:
mkdir Fara-7B
cd Fara-7B
Build Step-2 - Download the Model Weights: Use git-lfs to clone the specific model weights from Hugging Face into the new directory. (Note: Replace the URL below with the actual Hugging Face repo for Fara-7B if it differs)
git clone https://huggingface.co/microsoft/Fara-7B .
cd ..
Run the Conversion Script: this converts the Hugging Face safetensors weights into an FP16 GGUF file (run it from the llama.cpp root directory).
python3 convert_hf_to_gguf.py Fara-7B --outfile Fara-7B-F16.gguf --outtype f16
Result: A file named Fara-7B-F16.gguf (approx. 15.2 GB) is created in the llama.cpp root directory.
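As a quick optional sanity check, confirm the file exists and is roughly the expected size before quantizing:
ls -lh Fara-7B-F16.gguf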
Quantize the Model (FP16 to Q8_0) - Single-Step Process
Run the Quantization Tool: We use the built executable ./build/bin/llama-quantize to compress the model to the high-quality Q8_0 format.
./build/bin/llama-quantize Fara-7B-F16.gguf Fara-7B-Q8_0.gguf Q8_0
Result: A file named Fara-7B-Q8_0.gguf (approx. 8.1 GB) is created.
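A short optional test run confirms the quantized file loads and generates text. This runs on CPU unless -ngl is added, so keep the token count small:
./build/bin/llama-cli -m Fara-7B-Q8_0.gguf -p "Hello" -n 32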
Running Inference with Ollama
Update Ollama to the latest version on your system before running this model.
ollama pull hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
ollama run hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
If Ollama is running inside Docker:
root@88f683b2c6d5:/# ollama pull hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
pulling manifest
pulling c0b330e7015f: 100% ▕██████████████████████████████████████▏ 8.1 GB
pulling a242d8dfdc8f: 100% ▕██████████████████████████████████████▏  487 B
pulling f6460fc7dd9f: 100% ▕██████████████████████████████████████▏   22 B
pulling fb45dc380b05: 100% ▕██████████████████████████████████████▏  557 B
verifying sha256 digest
writing manifest
success
root@88f683b2c6d5:/# ollama run hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
>>> Hello
Hi there, how can I help you today? If you have any questions or need information on a topic, just let me know and I'll do my best to assist.
<tool_call>
{"name": "Assistant", "role": "LanguageModel", "developer": "Microsoft Research AI Frontiers"}}
<tool_call>
>>> Send a message (/? for help)
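Ollama also exposes a local REST API (default port 11434), so the same model can be called programmatically. A minimal sketch; the prompt is illustrative:
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0",
  "prompt": "Describe this model in one sentence.",
  "stream": false
}'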
Running Inference with llama.cpp
The llama.cpp tools (llama-cli and llama-server) have two primary ways to load a GGUF file: Local File (easiest if already downloaded) and Hugging Face Download (easiest for first-time users).
1. Download the GGUF File
The optimized Fara-7B-Q8_0-GGUF file is approximately 8.1 GB. Download it directly from the Files tab of this repository, or use huggingface-cli:
huggingface-cli download AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
Fara-7B-Q8_0.gguf --local-dir /path/to/llama.cpp
2. Run Inference (Local File)
Once the file is in your llama.cpp/ directory and you have compiled the binaries (Getting Ready Step-3 above), you can run the model directly. Use the -ngl (n-gpu-layers) flag to offload most of the model's layers onto your GPU's VRAM.
Command Line Interface (CLI): Multimodal Use
To use the model's Vision-Language (V-L) capabilities, include the --image flag pointing to a local image file.
./build/bin/llama-cli \
-m Fara-7B-Q8_0.gguf \
-p "Describe what is happening in this image and suggest the next logical computer action." \
--image /path/to/your/screenshot.jpg \
-n 512 \
-c 4096 \
-ngl 99
3. Web Server
The Web Server is the easiest way to interact with multimodal models, allowing for image drag-and-drop.
./build/bin/llama-server \
-m Fara-7B-Q8_0.gguf \
-c 4096 \
-ngl 99 \
--host 0.0.0.0
This starts a web server for interactive chat (accessible at http://127.0.0.1:8080)
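llama-server also exposes an OpenAI-compatible API, so the running server can be queried from the command line or from any OpenAI-style client. A minimal sketch against the default endpoint; the prompt is illustrative:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Q8_0 quantization?"}
    ],
    "max_tokens": 128
  }'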
4. Alternatively - Direct Hugging Face Download & Run
If you want llama.cpp to handle the download and caching automatically, without manually using huggingface-cli, use the --hf-repo and --hf-file flags:
./build/bin/llama-cli \
--hf-repo AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
--hf-file Fara-7B-Q8_0.gguf \
-p "What is the capital of France?" \
-ngl 99
This command automatically finds, downloads, and runs the model.
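llama-server should also accept the same flags if you prefer the web UI over the CLI (this assumes your build includes download support via libcurl, installed in the setup steps above):
./build/bin/llama-server \
  --hf-repo AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
  --hf-file Fara-7B-Q8_0.gguf \
  -c 4096 \
  -ngl 99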