Dockerizing SGLang + vLLM on local RTX 3090 [ABRIDGED]
Mission 1: Foundations
Let's discover the basics of running fast local inference jobs!
⚠️ Warning
This ABRIDGED version assumes prior knowledge about Linux administration and LLM inference.
If you are new to this, or want more information, please check out the full version of this article!
Intro
Table of contents
| | Heading | Topics |
|---|---|---|
| 1. | Host setup | NVIDIA drivers, Docker install |
| 2. | Server \| 🅰️ | Configuration: ⚓ docker-compose.yml |
| 3. | Logs | Observe: server boot, readiness |
| 4. | Client \| 🅱️ | Tests: Bash curl, 🖼️ test_stream.py, 📗 long_ctx.py |
| 5. | Logs, continued | Observe: HTTP requests, inference jobs |
| 6. | Hardware monitoring | nvidia-smi, btop |
1. Host setup
ℹ️ Note
Skip to 2. Server if you already have the latest versions of NVIDIA drivers and Docker.
NVIDIA drivers
Make sure your host machine has the latest Nvidia drivers for your GPU.
Use your distribution's preferred method to install.
- On Gnome, it's in the app Software Sources > Additional Drivers.
- On KDE you can open that from Settings > Driver Manager.
- On Ubuntu CLI, `sudo ubuntu-drivers autoinstall` should work.
After reboot, check that it's fine by running nvidia-smi.
nvidia-smi
Sat Apr 18 22:05:40 2026
+----------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+------------------------------+------------------------+--------------+
Docker
Install Docker, following the steps for your Linux distribution.
docker --version
Docker version 29.4.0, build 9d7ad9f
⚠️ Caution
Make sure to add your user to the `docker` group to avoid having to `sudo` all the time.
sudo usermod -aG docker $USER
2. Server | 🅰️
⚓ docker-compose.yml
Use as a base template; add more services & profiles for each model.
Our Docker Compose file can run either:
- SGLang (port 8001), profile `sglang`
- or vLLM (port 8002), profile `vllm`
📄 docker-compose.yml
%%yaml
services:
  qwen35-4b-sglang:
    image: lmsysorg/sglang:latest
    container_name: qwen35-4b-sglang
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    shm_size: 32g
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      sglang serve
      --model-path Qwen/Qwen3.5-4B
      --port 8000
      --host 0.0.0.0
      --tp-size 1
      --mem-fraction-static 0.83
      --context-length 262144
      --kv-cache-dtype fp8_e4m3
      --reasoning-parser qwen3
    restart: no
    profiles: ["sglang"]
    environment:
      - HF_TOKEN=${HF_TOKEN}
  qwen35-4b-vllm:
    image: vllm/vllm-openai:latest
    container_name: qwen35-4b-vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    shm_size: 32g
    ports:
      - "8002:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      Qwen/Qwen3.5-4B
      --served-model-name Qwen/Qwen3.5-4B
      --port 8000
      --host 0.0.0.0
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.78
      --max-model-len 262144
      --kv-cache-dtype fp8_e4m3
      --reasoning-parser qwen3
      --enable-prefix-caching
      --enable-chunked-prefill
      --max-num-seqs 64
    restart: no
    profiles: ["vllm"]
    environment:
      - HF_TOKEN=${HF_TOKEN}
The LLM demonstrated here is Qwen3.5-4B (full 16-bit precision, most straightforward to run), with FP8 KV cache to fit Qwen3.5's full context (262,144 tokens) on a single RTX 3090 or any 24 GB-class GPU.
About this configuration
Three things you may want to adjust now: HF_TOKEN (get one or remove those lines), context size, and model choice.
- 🤗 `HF_TOKEN` (optional: remove those two lines in the YAML if you won't have one). A Hugging Face (HF) token to accelerate downloads.
  Sign up for a free HF account, then create a new Access Token (Read). Then add it to your environment (where you'll run `docker` commands).
  export HF_TOKEN=hf_... # paste your token. Consider adding this line to .bashrc for persistence.
  Alternatively, use a `.env` file (same dir as `docker-compose.yml`, or at a path set by `--env-file`).
- 🥵 If your GPU lacks VRAM for the above configuration, there are two low-hanging fruits: reduce the context size, or pick a smaller model.
- 👻 You may also rent a cloud GPU with ≥24 GB VRAM, but then you're on your own for the docker setup.
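To get a feel for why a 24 GB card is the target here, the arithmetic can be sketched roughly as below. This is a back-of-the-envelope estimate, not a measurement: it assumes exactly 4e9 parameters at 2 bytes each, a nominal 24 GB of VRAM, and that the 0.83 fraction from the SGLang service covers both weights and KV cache. It ignores activations, CUDA graphs, and other runtime overhead. The function name `vram_budget_gb` is made up for this illustration.

```python
# Rough VRAM budget. Every input below is an assumption, not a measurement.

def vram_budget_gb(total_gb: float, mem_fraction: float,
                   n_params: float, bytes_per_param: int) -> dict:
    """Split the engine's memory pool into model weights and what is left over."""
    pool_gb = total_gb * mem_fraction                # share of VRAM the engine may claim
    weights_gb = n_params * bytes_per_param / 1e9    # 16-bit weights -> 2 bytes per parameter
    return {
        "pool_gb": pool_gb,
        "weights_gb": weights_gb,
        "left_for_kv_cache_gb": pool_gb - weights_gb,
    }

# Assumed: nominal 24 GB card, 0.83 fraction, ~4e9 parameters in 16-bit precision.
budget = vram_budget_gb(total_gb=24, mem_fraction=0.83, n_params=4e9, bytes_per_param=2)
print({name: round(value, 2) for name, value in budget.items()})
```

With these assumed numbers, roughly 12 GB remain for the KV cache. Lowering the context size or choosing a smaller model changes that balance.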
Deploy
Create the configuration file
Clone the repo 1iis/m01 (Mission 1) with all files.
%%bash
git clone https://github.com/1iis/m01.git
cd m01
You may `git init` for convenience.
Build the containers
Either way, from the same directory, launch the service/profile with:
%%bash
# Pick one:
export COMPOSE_PROFILES=sglang
export COMPOSE_PROFILES=vllm
# Build it
docker compose up -d # will build the above choice
# Kill it
docker compose down # kill the container entirely
Change the value of the environment variable COMPOSE_PROFILES to select the other engine.
ℹ️ Important
Make sure you always `down` the one running before switching profile to `up` the other one.
Otherwise, the GPU may get OOM (Out Of Memory) and the build will silently fail.
3. Logs
ℹ️ Note
You may skip to 4. Client | 🅱️ now that you have a `docker-compose.yml` file ready and the server started.
Check server logs.
# Either by profile
docker compose --profile sglang logs -f
docker compose --profile vllm logs -f
# or container name
docker compose logs -f qwen35-4b-sglang
docker compose logs -f qwen35-4b-vllm
When the server is ready, a log entry tells you so.
- In SGLang: `The server is fired up and ready to roll!`
  INFO: 127.0.0.1:47826 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  The server is fired up and ready to roll!
- In vLLM: `INFO: Application startup complete.`
  INFO: Started server process [1]
  INFO: Waiting for application startup.
  INFO: Application startup complete.
4. Client | 🅱️
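Before pointing a client at the server, it can help to check for the readiness markers from section 3 programmatically. The sketch below scans log lines for the two marker strings quoted there. The names `READY_MARKERS` and `is_ready` are invented for this illustration, and the exact marker wording may differ between engine versions.

```python
# Readiness markers as quoted from the server logs in section 3.
READY_MARKERS = {
    "sglang": "The server is fired up and ready to roll!",
    "vllm": "Application startup complete.",
}

def is_ready(engine: str, log_lines) -> bool:
    """Return True if any log line contains the engine's readiness marker."""
    marker = READY_MARKERS[engine]
    return any(marker in line for line in log_lines)

# Sample lines copied from the logs shown in section 3.
vllm_boot = [
    "INFO: Started server process [1]",
    "INFO: Waiting for application startup.",
    "INFO: Application startup complete.",
]
print(is_ready("vllm", vllm_boot))       # marker is on the last line
print(is_ready("vllm", vllm_boot[:2]))   # still booting
```

One possible use is to feed it the lines produced by `docker compose logs`, for example by reading them from standard input.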
Bash
Run a basic curl. Change localhost port to 8002 for vLLM.
%%bash
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-4B",
"messages": [{"role": "user", "content": "What is the meaning of life, the universe, and everything?"}],
"temperature": 0.7,
"max_tokens": 4096
}' | jq .
👇 This returns a JSON object with "content" and "reasoning_content" fields that you may inspect.
OpenAI chat template, JSON payload.
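As a small illustration of inspecting those two fields, the sketch below parses a response body and separates the reasoning from the final answer. The JSON is a hypothetical, heavily trimmed payload, and the helper name `split_answer` is made up. It assumes both fields sit on the first choice's `message`, which may vary with the engine version.

```python
import json

# Hypothetical, heavily trimmed response body; real responses carry more fields.
raw = """
{
  "model": "Qwen/Qwen3.5-4B",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user quotes Douglas Adams.",
        "content": "42."
      },
      "finish_reason": "stop"
    }
  ]
}
"""

def split_answer(response: dict) -> tuple:
    """Return (reasoning, answer) from the first choice; missing fields become ''."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""

reasoning, answer = split_answer(json.loads(raw))
print("Reasoning:", reasoning)
print("Answer:", answer)
```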
Environment variables
For the Python scripts, we use the OpenAI library which automatically sources the following two environment variables.
This lets us keep our script generic, no hardcoded port or URL.
%%bash
export OPENAI_API_KEY="EMPTY"
# Pick one:
export OPENAI_BASE_URL="http://localhost:8001/v1" # SGLang
export OPENAI_BASE_URL="http://localhost:8002/v1" # vLLM
🖼️ Text + Vision input → Streaming output
test_stream.py sends:
- an image of a real-world location (Comuna 13 in Medellín, Colombia)
- a question in text: "Where is this?"
📄 test_stream.py
from openai import OpenAI
# Configured by environment variables: OPENAI_API_KEY and OPENAI_BASE_URL
client = OpenAI()
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]
stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
    stream=True,
    stream_options={"include_usage": True}
)
def stream_and_print(response_stream):
    usage = None
    model_name = None
    for chunk in response_stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
        if chunk.usage is not None:
            usage = chunk.usage
            model_name = chunk.model
    print()
    print("\n=== Metadata ===")
    print(f"Model: {model_name}")
    print(f"Tokens: {usage}")
stream_and_print(stream)
Run it.
python test_stream.py
👇
Example output:
This is **Bogotá, Colombia** — specifically, the view from a rooftop or elevated location overlooking the city’s dense hillside neighborhoods.
The large statue in the foreground is part of the **“Origen”** project — a cultural and artistic initiative that includes sculptures, installations, and events celebrating indigenous heritage and identity. The word “Origen” (meaning “origin” or “source”) appears on the railing below the sculpture.
### Key features visible:
- **Dense urban sprawl across hillsides**, typical of Bogotá’s topography.
- **Residential buildings climbing up mountains**, showing how the city has grown vertically.
- **Modern high-rises mixed with older structures**, reflecting Bogotá’s evolving urban landscape.
- **Greenery and palm trees** near the sculpture, indicating some green spaces even within the city.
- The **“LIVE” badge** suggests this photo was taken during a live stream or social media post.
### Context:
The “Origen” project is located in the **Chapinero Alto** neighborhood, one of Bogotá’s most vibrant and culturally rich areas. It often hosts festivals, art exhibitions, and community gatherings.
So, while the exact spot isn’t named in the image, you’re looking at **Bogotá from above**, likely near the Chapinero Alto area where the Origen installation resides.
📍 *Location: Bogotá, Colombia – near Chapinero Alto / Origen Project*
=== Metadata ===
Model: Qwen/Qwen3.5-4B
Tokens: CompletionUsage(completion_tokens=304, prompt_tokens=2470, total_tokens=2774, completion_tokens_details=None, prompt_tokens_details=None)
📗 Book-long input → Long output
long_ctx.py sends a whole book to stress-test context length.
I've used the awesome Project Gutenberg to retrieve plain text (UTF-8) books.
They're in the books/ dir in the repo.
Select books whose token count is below your declared context window length in docker-compose.yml.
For instance:
- Frankenstein ~99k tokens: good for a 131k context;
- Dracula ~216k tokens: good for a 262k context.
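The selection rule above boils down to simple arithmetic: whatever the prompt does not use is what remains for the answer. The sketch below uses the rounded token counts from the list and assumes "131k" means 131,072 tokens. The helper name `output_budget` is made up, and real token counts depend on the tokenizer and chat template.

```python
def output_budget(context_len: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt is in the context window."""
    return max(context_len - prompt_tokens, 0)

# Rounded estimates from the list above; context sizes are powers of two (assumed).
books = {
    "Frankenstein": (99_000, 131_072),
    "Dracula": (216_000, 262_144),
}
for title, (prompt_tokens, context_len) in books.items():
    left = output_budget(context_len, prompt_tokens)
    print(f"{title}: ~{left:,} tokens left for the answer")
```

With these rounded figures, Frankenstein leaves about 32k tokens, which is close to the `max_tokens=32768` used in the scripts, so headroom at a 131k context is tight.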
This script:
- sends the book (shell argument);
- asks the LLM to write an essay and then a sequel chapter.
📄 long_ctx.py
import sys
from openai import OpenAI
client = OpenAI()
if len(sys.argv) < 2:
    print("Usage: python long_ctx.py <path_to_txt_file>")
    print("Example: python long_ctx.py frankenstein.txt")
    sys.exit(1)
txt_path = sys.argv[1]
print(f"📖 Loading book from: {txt_path}")
with open(txt_path, 'r', encoding='utf-8') as f:
    book_text = f.read()
print(f"✅ Loaded {len(book_text):,} characters (~{len(book_text.split()):,} words) — ready for 100k+ token test\n")
messages = [
    {
        "role": "user",
        "content": f"""Here is the complete text of a novel:
{book_text}
Now, using the entire book above, write an extremely long and detailed response (use as many tokens as possible up to the limit):
1. Provide a comprehensive literary analysis essay (aim for maximum depth and length) covering all major themes, full character arcs, narrative structure (frame story), key symbols, and historical/biographical context.
2. After the essay, write an original sequel chapter that continues directly from the end of the novel. Make the sequel rich, emotionally intense, and at least 4,000 words long.
Be extremely thorough, quote specific passages from the book, and expand on every point. This is a long-context stress test: use everything you read."""
    }
]
stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
    stream=True,
    stream_options={"include_usage": True}
)
def stream_and_print(response_stream):
    usage = None
    model_name = None
    for chunk in response_stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
        if chunk.usage is not None:
            usage = chunk.usage
            model_name = chunk.model
    print()
    print("\n=== Metadata ===")
    print(f"Model: {model_name}")
    print(f"Tokens: {usage}")
stream_and_print(stream)
Run it, with the text file containing the book as argument. E.g.:
%%bash
python long_ctx.py frankenstein.txt
python long_ctx.py dracula.txt
The LLM takes a few minutes to load the massive input (SGLang has great logs if you want to monitor that, see below 5. Logs, continued).
5. Logs, continued
Live inspection
The logs on the server will show activity during or after the task.
qwen35-4b-vllm | (APIServer pid=1) INFO: 172.18.0.1:33818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen35-4b-vllm | (APIServer pid=1) INFO 04-18 22:48:24 [loggers.py:259] Engine 000: Avg prompt throughput: 247.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
qwen35-4b-vllm | (APIServer pid=1) INFO 04-18 22:48:34 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
qwen35-4b-vllm | (APIServer pid=1) INFO: 172.18.0.1:40806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen35-4b-vllm | (APIServer pid=1) INFO 04-18 22:50:24 [loggers.py:259] Engine 000: Avg prompt throughput: 35.8 tokens/s, Avg generation throughput: 29.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 42.8%, MM cache hit rate: 50.0%
SGLang is a bit more verbose by default.
qwen35-4b-sglang | [2026-04-18 23:02:14] Prefill batch, #new-seq: 1, #new-token: 2048, #cached-token: 0, full token usage: 0.36, mamba usage: 0.03, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 2589.96
qwen35-4b-sglang | [2026-04-18 23:02:15] Prefill batch, #new-seq: 1, #new-token: 973, #cached-token: 0, full token usage: 0.36, mamba usage: 0.03, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 4467.39
qwen35-4b-sglang | [2026-04-18 23:02:15] INFO: 172.18.0.1:45748 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen35-4b-sglang | [2026-04-18 23:02:15] Decode batch, #running-req: 1, #full token: 95186, full token usage: 0.36, mamba num: 2, mamba usage: 0.03, cuda graph: True, gen throughput (token/s): 0.14, #queue-req: 0
qwen35-4b-sglang | [2026-04-18 23:02:15] Decode batch, #running-req: 1, #full token: 95226, full token usage: 0.36, mamba num: 2, mamba usage: 0.03, cuda graph: True, gen throughput (token/s): 55.54, #queue-req: 0
Worth checking out
- HTTP requests: client address (IP:port), METHOD, API endpoint requested, response code (same grammar for both engines)
  INFO: 172.18.0.1:45748 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- Engine ops: lines beginning with
  qwen35-4b-sglang | [2026-04-20 21:08:24]
  qwen35-4b-vllm | (APIServer pid=1) INFO 04-18 22:48:24 [loggers.py:259] Engine 000:
  - tokens/second in
    SGLang: input throughput (token/s): 2632.62
    vLLM: Avg prompt throughput: 247.0 tokens/s
  - tokens/second out
    SGLang: gen throughput (token/s): 109.54
    vLLM: Avg generation throughput: 30.4 tokens/s
  - KV cache usage (how much of your context is used on the GPU, in tokens)
    SGLang: #full token: 140003, full token usage: 0.71(%)
    vLLM: GPU KV cache usage: 38.6%
  - Cache hits (how much of your prompt was already in cache)
    vLLM: Prefix cache hit rate: 42.8%, MM cache hit rate: 50.0%
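The throughput figures listed above can also be pulled out of the logs with a few regular expressions. The sketch below is written against the sample lines shown in this section only. Log formats can change between engine versions, and the names `PATTERNS` and `parse_throughput` are invented for this illustration.

```python
import re

# One pattern per engine for each metric, based on the sample log lines above.
PATTERNS = {
    "prompt_tps": [
        r"Avg prompt throughput: ([\d.]+) tokens/s",      # vLLM
        r"input throughput \(token/s\): ([\d.]+)",        # SGLang
    ],
    "gen_tps": [
        r"Avg generation throughput: ([\d.]+) tokens/s",  # vLLM
        r"gen throughput \(token/s\): ([\d.]+)",          # SGLang
    ],
}

def parse_throughput(line: str) -> dict:
    """Extract tokens/second in and out from one log line, when present."""
    found = {}
    for key, patterns in PATTERNS.items():
        for pattern in patterns:
            match = re.search(pattern, line)
            if match:
                found[key] = float(match.group(1))
                break
    return found

vllm_line = ("Engine 000: Avg prompt throughput: 247.0 tokens/s, "
             "Avg generation throughput: 30.4 tokens/s, Running: 0 reqs")
sglang_line = "Decode batch, #running-req: 1, gen throughput (token/s): 55.54, #queue-req: 0"
print(parse_throughput(vllm_line))
print(parse_throughput(sglang_line))
```

Lines that carry neither metric, such as the HTTP request entries, simply yield an empty dictionary.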
Re-run the scripts
Run both scripts, and multiple requests, simultaneously! Check that everything works fine, and that the cache works: generation starts almost instantly when you re-send the book prompt over and over again!
Look/grep for "cache":
- SGLang: `#cached-token: ...`
- vLLM: `[Prefix|MM] cache hit rate: ...%`
Prefill batch, #new-seq: 1, #new-token: 973, #cached-token: 94208, ...
6. Hardware monitoring
nvidia-smi
The first thing is to watch nvidia-smi to monitor your GPU temperature, power draw, and VRAM usage.
nvidia-smi -l 1 # Refresh every 1 second
Process Type G: Graphics; Type C: Compute (CUDA).
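For scripted monitoring, the same readings can be collected from Python through nvidia-smi's CSV query mode. The sketch below is one possible approach: the sample line is hypothetical, the function names are made up, and some GPUs report `[N/A]` for certain fields, which this simple parser does not handle.

```python
import subprocess

# Fields requested from nvidia-smi: temperature (°C), power (W), VRAM used/total (MiB).
QUERY = "temperature.gpu,power.draw,memory.used,memory.total"

def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one 'csv,noheader,nounits' line into named numeric readings."""
    temp, power, used, total = (float(field) for field in csv_line.split(","))
    return {"temp_c": temp, "power_w": power,
            "vram_used_mib": used, "vram_total_mib": total}

def read_gpu_stats() -> list:
    """Query every GPU once; requires nvidia-smi on the PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_stats(line) for line in out.splitlines() if line.strip()]

# Hypothetical sample line, so the parser can be tried without a GPU.
sample = "61, 298.45, 21504, 24576"
stats = parse_gpu_stats(sample)
print(stats)
print(f"VRAM used: {stats['vram_used_mib'] / stats['vram_total_mib']:.1%}")
```

On a machine with the driver installed, calling `read_gpu_stats()` in a loop gives roughly what `nvidia-smi -l 1` shows, in a form that is easier to log or plot.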
btop
A favorite of mine for quick system monitoring.
sudo apt install btop
btop
We get this pretty GPU box, whose dashboard is fairly complete and more compact than nvidia-smi.
btop is for top beauty
To see the above GPU box, you may need to add gpu0 to shown boxes (pic below).
Press Esc for Options.
btop Settings > Shown boxes: "gpu0" added after "cpu"
Managing temperature
If temperature is above 65-ish °C (the GPU will throttle), you have two major levers to tweak.
Fan speed
Check and enforce fan speed in the Nvidia X Server Settings app (should come with drivers).
- Check the box Enable GPU Fan Settings (if unavailable: search "nvidia coolbits" for your Xorg/Wayland config)
- Set desired % value
- Click Apply button
Try to keep temperature well below 80°C for best performance.
Power Draw
⚡ You may want to soft-limit power to a more reasonable value than your card's default, as this generally has negligible performance impact but makes it run cooler and lowers your electricity bill.
Right now, you can check Current Power Limit with:
nvidia-smi -q -d POWER -l 2
GPU Power Readings
Average Power Draw : 46.28 W
Instantaneous Power Draw : 54.28 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 420.00 W
Min Power Limit : 100.00 W
Max Power Limit : 450.00 W
Power Samples
Duration : 6.32 sec
Number of Samples : 119
Max : 67.46 W
Min : 42.01 W
Avg : 46.63 W
You may issue a new Requested Power Limit.
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 300 # value in Watts
nvidia-smi -q -d POWER # check effects
This should now read as your Current Power Limit.
This setting won't survive a reboot (persisting it requires a systemd service or a desktop-environment startup script).
🚨 Caution
NEVER mess with power settings beyond the Min/Max Power Limit range. For our purposes, about ¼ to ⅓ below the Default Power Limit is usually where we want to live, for most GPUs. In my case, running at 300 W instead of the default 420 W yields about -10°C, a roughly 30% cheaper bill, and about the same performance.
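The rule of thumb above is easy to turn into numbers. The sketch below computes the ¼-to-⅓-below-default window and clamps it to the card's Min/Max range, using the values from the power readings shown earlier. The function name `suggested_power_limits` is made up, and the result is only a starting point to try, not a guarantee for any particular card.

```python
def suggested_power_limits(default_w: float, min_w: float, max_w: float) -> tuple:
    """Return (low, high) watts: 1/3 and 1/4 below default, clamped to [min_w, max_w]."""
    def clamp(watts: float) -> float:
        return min(max(watts, min_w), max_w)
    low = clamp(default_w * 2 / 3)    # one third below the default
    high = clamp(default_w * 3 / 4)   # one quarter below the default
    return low, high

# Values from the nvidia-smi power readings above: default 420 W, range 100-450 W.
low, high = suggested_power_limits(default_w=420, min_w=100, max_w=450)
print(f"Try a power limit between {low:.0f} W and {high:.0f} W")
```

For the readings above this gives a window of 280 W to 315 W, which contains the 300 W setting used in this article.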
Parting words
This concludes our first dive into local inference. If you've made it to the end, thank you for reading, and congratulations! 🙌
There is a ton more to cover. I hope you like the Docker approach and our local scripts being endpoint-agnostic, letting us orchestrate and scale entirely from the shell.
Coming Soon™
I will also soon begin tests on a 3060 12 GB to see what can effectively be run on the famed -60 class of GPUs.
Future articles, as hinted throughout this first one, will cover many more use cases and features, adding tools to our inference kit as we go. Let me know what you seek!
Until we meet again,
Happy prompting!