Performance report on RTX 4090D with 48GB VRAM: 90 t/s

#1
by SlavikF - opened

I'm running this model with vLLM + Open WebUI.
GPU: Nvidia RTX 4090D 48GB VRAM

Running on Ubuntu 24 with this Docker Compose file:

services:
  qwen3vl:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3vl-30b-4090D
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']  # 4090D
    ports:
      - "36000:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
      - "--max-model-len"
      - "139268"
      - "--served-model-name"
      - "local-qwen3vl-30b"
      - "--dtype"
      - "float16"
      - "--gpu-memory-utilization"
      - "0.98"
      - "--max-num-seqs"
      - "2"
      - "--reasoning-parser"
      - "deepseek_r1"

Takes 3-4 minutes to start.
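Since startup is slow, it can help to poll the OpenAI-compatible `/v1/models` endpoint before pointing Open WebUI at it. A minimal sketch, assuming the port 36000 and served model name from the compose file above; the helper names and timeout are illustrative, not part of vLLM:

```python
import json
import time
import urllib.error
import urllib.request

def served_models(payload: dict) -> list:
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def wait_for_server(base_url: str, timeout_s: float = 300.0) -> list:
    """Poll /v1/models until vLLM answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/v1/models", timeout=5) as resp:
                return served_models(json.load(resp))
        except (urllib.error.URLError, OSError):
            time.sleep(5)  # container is probably still loading weights
    raise TimeoutError("vLLM did not come up in time")

# usage, once the compose stack above is running:
# print(wait_for_server("http://localhost:36000"))  # expect ['local-qwen3vl-30b']
```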

nvtop shows 45.2 GB of VRAM in use.

Prompt Processing: 4700+ t/s

Token Generation:

  • 90 t/s for small context
  • 60 t/s for 40k context
  • 35 t/s for 128k context
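For anyone who wants to reproduce these numbers, a rough t/s estimate can be derived from the `usage` block of a single chat completion. A sketch under the same assumptions (endpoint and model name from the compose file; note this timing includes prompt processing, so it will read slightly lower than a streaming measurement):

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock time."""
    return completion_tokens / elapsed_s

def measure(base_url: str, model: str, prompt: str) -> float:
    """Time one non-streaming chat completion and return rough t/s."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)

# usage:
# print(measure("http://localhost:36000", "local-qwen3vl-30b", "Hello"))
```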
