open source · free · GPU-native

The simplest way
to serve vLLM.

Save per-model settings as profiles.
Pick one in the TUI and hit Enter to spin it up or down.

$ git clone https://github.com/Bae-ChangHyun/vllm-compose.git
Docker · vLLM · NVIDIA · Python

Every ML engineer hits these walls

01

Config chaos

Different models need different GPU settings, memory limits, and serving params. You end up copy-pasting docker commands every time.

$ docker run --gpus ... -e ... vllm/vllm ...

02

Switching models is painful

Testing Qwen, then Llama, then DeepSeek? Each switch means stopping, reconfiguring, and restarting manually.

$ docker stop && docker rm && docker run ...

03

No visibility

Which container is running? How much GPU memory is left? You're constantly running docker ps and nvidia-smi.

$ docker ps && nvidia-smi

04

Multi-model headaches

Running multiple models simultaneously means juggling ports, GPU assignments, and compose files by hand.

$ vim docker-compose.yaml # again...

Three steps. That's it.

1

Clone & configure

Set your HuggingFace token and cache path. One-time setup.

$ git clone ... && cat > .env.common
2

Launch the TUI

Quick Setup auto-generates a profile from just a model name. Or create one manually with full control.

$ uv run vllm-compose
3

Select & deploy

Pick a profile, press Enter, choose Start. Your model is serving on the OpenAI-compatible API.

Enter → Start → Serving at :8000

See it in action

vLLM Compose Demo

Quick Start in your terminal

$ git clone https://github.com/Bae-ChangHyun/vllm-compose.git
$ cd vllm-compose
 
$ cat > .env.common << 'EOF'
HF_TOKEN=hf_your_token_here
HF_CACHE_PATH=/home/user/.cache/huggingface
EOF
 
$ uv run vllm-compose
 
TUI launched — press w for Quick Setup
Enter model name: Qwen/Qwen3-30B-A3B
Profile created — config auto-generated
Container started — serving at http://localhost:8000/v1
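Once the container is up, the endpoint speaks the standard OpenAI chat completions API, so any OpenAI-compatible client works against it. A quick smoke test with curl (using the model name from the demo above) might look like:

```shell
# Query the OpenAI-compatible endpoint started above.
# The model name must match the profile you deployed.
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```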

From config hell to one-click deploy

manual workflow

Switch models: docker stop/rm/run
Manage configs: remember CLI args
Multi-model: edit compose files
Monitor GPUs: nvidia-smi loop
Memory check: trial & error
Setup time: ~15 min / model

with vllm compose

Switch models: select + Enter
Manage configs: YAML + autocomplete
Multi-model: independent profiles
Monitor GPUs: real-time dashboard
Memory check: built-in estimator
Setup time: 30 seconds

Everything you need, nothing you don't

Interactive TUI

Start, stop, view logs, edit configs — all from one terminal screen with keyboard shortcuts.

Model Profiles

Save per-model settings independently. Switch between Qwen, Llama, DeepSeek instantly.

GPU Monitor

Real-time GPU usage bars on the dashboard. Auto-refresh every 5 seconds. No more nvidia-smi.

Memory Estimator

Estimate GPU memory before deploying. Know if your model fits before wasting time on OOM errors.

Source Build

Auto-detect GPU architecture. Fast Build in 10-30 min. Support for forks and custom versions.

LoRA Adapters

Multi-adapter loading with automatic path mapping. Serve fine-tuned models alongside base models.

Common questions

How do I get started?
Clone the repo, set your HuggingFace token in .env.common, and run uv run vllm-compose. Quick Setup auto-generates a profile from just a model name — you'll be serving in 30 seconds.
What GPU do I need?
Any NVIDIA GPU supported by vLLM. The built-in Memory Estimator helps you check if a model fits your GPU before deploying. Tensor parallelism lets you split large models across multiple GPUs.
Can I run multiple models at once?
Yes. Each profile runs independently with its own container, port, and GPU assignment. Start as many as your hardware supports — the dashboard shows all running containers.
What are the requirements?
Docker with NVIDIA Container Toolkit, Python 3.10+, and an NVIDIA GPU. We recommend uv for Python package management but pip works too.
Can I use my own vLLM fork?
Yes. The Source Build feature supports custom forks: ./run.sh build main --repo <your-fork-url>. It auto-detects your GPU architecture for optimized builds.