The base model is released by Microsoft Research under the MIT License:
https://huggingface.co/microsoft/Fara-7B
Instructions are provided below to convert this model into GGUF format. For demonstration purposes, Ollama inference is also shown at the end.
Quantization Process
The conversion process was carried out on Ubuntu 22.04 (Linux).
Getting Ready Step-1 - Install Essential Tools and Dependencies: The llama.cpp project requires several tools for compilation and networking.
sudo apt update
sudo apt install -y build-essential
sudo apt install -y python3 python3-pip git-lfs
sudo apt install -y cmake
sudo apt install -y libcurl4-openssl-dev
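Optional check before continuing: confirm the toolchain is available and initialize Git LFS so the model weights can be fetched in the later steps (exact versions will vary by system).
gcc --version
cmake --version
python3 --version
git lfs install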
Getting Ready Step-2 - Clone and Configure llama.cpp: Get the latest source code for the project.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
Getting Ready Step-3 - Build the llama.cpp Executables
Compile all targets (including llama-quantize, llama-cli, and llama-server)
-j$(nproc) uses all available CPU cores for faster compilation
rm -rf build
cmake -B build
cmake --build build --config Release -j$(nproc)
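Optionally, confirm that the binaries used later in this guide were built (paths assume the default build directory shown above):
ls build/bin/llama-cli build/bin/llama-server build/bin/llama-quantize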
Build Step-1 - Download and Convert the Model (HF to GGUF FP16)
This step downloads the original Hugging Face model and converts it into the initial, unquantized GGUF FP16 format. The process requires roughly 18 GB of system RAM and produces a file of about 15.2 GB.
Create a Directory for the Model Weights:
mkdir Fara-7B
cd Fara-7B
Build Step-2 - Download the Model Weights: Use git-lfs to clone the specific model weights from Hugging Face into the new directory. (Note: Replace the URL below with the actual Hugging Face repo for Fara-7B if it differs)
git clone https://huggingface.co/microsoft/Fara-7B .
cd ..
Run the Conversion Script: this converts the Hugging Face safetensors weights into an FP16 GGUF file (run it from the llama.cpp root directory).
python3 convert_hf_to_gguf.py Fara-7B --outfile Fara-7B-F16.gguf --outtype f16
Result: A file named Fara-7B-F16.gguf (approx. 15.2 GB) is created in the llama.cpp root directory.
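As a quick optional sanity check, confirm the file exists and is roughly the expected size before quantizing:
ls -lh Fara-7B-F16.gguf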
Quantize the Model (FP16 to Q8_0) - Single-Step Process
Run the Quantization Tool: We use the built executable ./build/bin/llama-quantize to compress the model to the high-quality Q8_0 format.
./build/bin/llama-quantize Fara-7B-F16.gguf Fara-7B-Q8_0.gguf Q8_0
Result: A file named Fara-7B-Q8_0.gguf (approx. 8.1 GB) is created.
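A short optional test run confirms the quantized file loads and generates text. This runs on CPU unless -ngl is added, so keep the token count small:
./build/bin/llama-cli -m Fara-7B-Q8_0.gguf -p "Hello" -n 32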
Running Inference with Ollama
Update Ollama to the latest version on your system before running this model.
ollama pull hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
ollama run hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
If Ollama is running inside Docker:
root@88f683b2c6d5:/# ollama pull hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
pulling manifest
pulling c0b330e7015f: 100% ▕██████████████████████████████████████▏ 8.1 GB
pulling a242d8dfdc8f: 100% ▕██████████████████████████████████████▏  487 B
pulling f6460fc7dd9f: 100% ▕██████████████████████████████████████▏   22 B
pulling fb45dc380b05: 100% ▕██████████████████████████████████████▏  557 B
verifying sha256 digest
writing manifest
success
root@88f683b2c6d5:/# ollama run hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0
>>> Hello
Hi there, how can I help you today? If you have any questions or need information on a topic, just let me know and I'll do my best to assist.
<tool_call>
{"name": "Assistant", "role": "LanguageModel", "developer": "Microsoft Research AI Frontiers"}}
<tool_call>
>>> Send a message (/? for help)
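Ollama also exposes a local REST API (default port 11434), so the same model can be called programmatically. A minimal sketch; the prompt is illustrative:
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF:Q8_0",
  "prompt": "Describe this model in one sentence.",
  "stream": false
}'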
Running Inference with llama.cpp
The llama.cpp tools (llama-cli and llama-server) have two primary ways to load a GGUF file: Local File (easiest if already downloaded) and Hugging Face Download (easiest for first-time users).
1. Download the GGUF File
The optimized Fara-7B-Q8_0-GGUF file is approximately 8.1 GB. Download it directly from the Files tab of this repository, or use huggingface-cli:
huggingface-cli download AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
Fara-7B-Q8_0.gguf --local-dir /path/to/llama.cpp
2. Run Inference (Local File)
Once the file is in your llama.cpp/ directory and you have compiled the binaries (Getting Ready Step-3 above), you can run the model directly. Use the -ngl (n-gpu-layers) flag to offload most of the model's layers onto your GPU's VRAM.
Command Line Interface (CLI): Multimodal Use
To use the model's Vision-Language (V-L) capabilities, include the --image flag pointing to a local image file.
./build/bin/llama-cli \
-m Fara-7B-Q8_0.gguf \
-p "Describe what is happening in this image and suggest the next logical computer action." \
--image /path/to/your/screenshot.jpg \
-n 512 \
-c 4096 \
-ngl 99
3. Web Server
The Web Server is the easiest way to interact with multimodal models, allowing for image drag-and-drop.
./build/bin/llama-server \
-m Fara-7B-Q8_0.gguf \
-c 4096 \
-ngl 99 \
--host 0.0.0.0
This starts a web server for interactive chat (accessible at http://127.0.0.1:8080)
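llama-server also exposes an OpenAI-compatible API, so the running server can be queried from the command line or from any OpenAI-style client. A minimal sketch against the default endpoint; the prompt is illustrative:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Q8_0 quantization?"}
    ],
    "max_tokens": 128
  }'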
4. Alternatively - Direct Hugging Face Download & Run
If you want llama.cpp to handle the download and caching automatically, without manually using huggingface-cli, use the --hf-repo and --hf-file flags:
./build/bin/llama-cli \
--hf-repo AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
--hf-file Fara-7B-Q8_0.gguf \
-p "What is the capital of France?" \
-ngl 99
This command automatically finds, downloads, and runs the model.
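llama-server should also accept the same flags if you prefer the web UI over the CLI (this assumes your build includes download support via libcurl, installed in the setup steps above):
./build/bin/llama-server \
  --hf-repo AXONVERTEX-AI-RESEARCH/Fara-7B-Q8-GGUF \
  --hf-file Fara-7B-Q8_0.gguf \
  -c 4096 \
  -ngl 99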