Performance report on RTX 4090D with 48GB VRAM: 90 t/s
#1 by SlavikF
I'm running this model on vLLM + OpenWebUI.
GPU: Nvidia RTX 4090D with 48GB VRAM
Running on Ubuntu 24 with this Docker Compose file:
```yaml
services:
  qwen3vl:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3vl-30b-4090D
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0'] # 4090D
    ports:
      - "36000:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
      - "--max-model-len"
      - "139268"
      - "--served-model-name"
      - "local-qwen3vl-30b"
      - "--dtype"
      - "float16"
      - "--gpu-memory-utilization"
      - "0.98"
      - "--max-num-seqs"
      - "2"
      - "--reasoning-parser"
      - "deepseek_r1"
```
It takes 3-4 minutes to start.
nvtop shows 45.2 GB of VRAM in use.
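If you'd rather check VRAM programmatically than via nvtop, a small sketch (assumes the `nvidia-ml-py` package, which provides the `pynvml` module):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device_ids: ['0'] above
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 1024**3:.1f} GiB of {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```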
Prompt Processing: 4700+ t/s
Token Generation:
- 90 t/s for small context
- 60 t/s for 40k context
- 35 t/s for 128k context
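If anyone wants to reproduce the generation numbers, here's a rough sketch of timing a streamed request (not a rigorous benchmark; it counts streamed chunks, which is only an approximation of token count):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:36000/v1", api_key="none")

start = time.perf_counter()
n_tokens = 0
stream = client.chat.completions.create(
    model="local-qwen3vl-30b",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # With --reasoning-parser, thinking tokens stream as
    # reasoning_content, so count both fields.
    if delta.content or getattr(delta, "reasoning_content", None):
        n_tokens += 1  # vLLM streams roughly one token per chunk

elapsed = time.perf_counter() - start
print(f"~{n_tokens / elapsed:.1f} t/s over {n_tokens} chunks")
```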