open source · free · GPU-native

The simplest way
to serve vLLM.

Save per-model settings as profiles.
Pick one in the TUI and hit Enter to spin it up or down.

$ git clone https://github.com/Bae-ChangHyun/vllm-compose.git
Docker · vLLM · NVIDIA · Python

Every ML engineer hits these walls

01

Config chaos

Different models need different GPU settings, memory limits, and serving params. You end up copy-pasting docker commands every time.

$ docker run --gpus ... -e ... vllm/vllm ...

02

Switching models is painful

Testing Qwen, then Llama, then DeepSeek? Each switch means stopping, reconfiguring, and restarting manually.

$ docker stop && docker rm && docker run ...

03

No visibility

Which container is running? How much GPU memory is left? You're constantly running docker ps and nvidia-smi.

$ docker ps && nvidia-smi

04

Multi-model headaches

Running multiple models simultaneously means juggling ports, GPU assignments, and compose files by hand.

$ vim docker-compose.yaml # again...

Three steps. That's it.

1

Clone & configure

Set your HuggingFace token and cache path. One-time setup.

$ git clone ... && cat > .env.common
2

Launch the TUI

Quick Setup auto-generates a profile from just a model name. Or create one manually with full control.

$ uv run vllm-compose
3

Select & deploy

Pick a profile, press Enter, choose Start. Your model is serving on the OpenAI-compatible API.

Enter → Start → Serving at :8000

See it in action

vLLM Compose Demo

Quick Start in your terminal

$ git clone https://github.com/Bae-ChangHyun/vllm-compose.git
$ cd vllm-compose
 
$ cat > .env.common << 'EOF'
HF_TOKEN=hf_your_token_here
HF_CACHE_PATH=/home/user/.cache/huggingface
EOF
 
$ uv run vllm-compose
 
TUI launched — press w for Quick Setup
Enter model name: Qwen/Qwen3-30B-A3B
Profile created — config auto-generated
Container started — serving at http://localhost:8000/v1
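Once the container is up, the endpoint speaks the standard OpenAI chat completions API, so any OpenAI-compatible client works against it. A quick smoke test with curl (using the model name from the demo above) might look like:

```shell
# Query the OpenAI-compatible endpoint started above.
# The model name must match the profile you deployed.
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```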

From config hell to one-click deploy

manual workflow

Switch models: docker stop/rm/run
Manage configs: remember CLI args
Multi-model: edit compose files
Monitor GPUs: nvidia-smi loop
Memory check: trial & error
Setup time: ~15 min / model

with vllm compose

Switch models: select + Enter
Manage configs: YAML + autocomplete
Multi-model: independent profiles
Monitor GPUs: real-time dashboard
Memory check: built-in estimator
Setup time: 30 seconds

Everything you need, nothing you don't

Interactive TUI

Start, stop, view logs, edit configs — all from one terminal screen with keyboard shortcuts.

Model Profiles

Save per-model settings independently. Switch between Qwen, Llama, DeepSeek instantly.

GPU Monitor

Real-time GPU usage bars on the dashboard. Auto-refresh every 5 seconds. No more nvidia-smi.

Memory Estimator

Estimate GPU memory before deploying. Know if your model fits before wasting time on OOM errors.

Source Build

Auto-detect GPU architecture. Fast Build in 10-30 min. Support for forks and custom versions.

LoRA Adapters

Multi-adapter loading with automatic path mapping. Serve fine-tuned models alongside base models.

Common questions

How do I get started?
Clone the repo, set your HuggingFace token in .env.common, and run uv run vllm-compose. Quick Setup auto-generates a profile from just a model name — you'll be serving in 30 seconds.
What GPU do I need?
Any NVIDIA GPU supported by vLLM. The built-in Memory Estimator helps you check if a model fits your GPU before deploying. Tensor parallelism lets you split large models across multiple GPUs.
Can I run multiple models at once?
Yes. Each profile runs independently with its own container, port, and GPU assignment. Start as many as your hardware supports — the dashboard shows all running containers.
What are the requirements?
Docker with NVIDIA Container Toolkit, Python 3.10+, and an NVIDIA GPU. We recommend uv for Python package management but pip works too.
Can I use my own vLLM fork?
Yes. The Source Build feature supports custom forks: ./run.sh build main --repo <your-fork-url>. It auto-detects your GPU architecture for optimized builds.