Understanding Layers, Threads & GPU Memory on JupyterHub¶
This notebook is inspired by the rh-aiservices-bu GPU availability check notebook.
We extend it to explore how llama.cpp parameters affect GPU utilization:
n_gpu_layers — how many transformer layers to offload to the GPU
n_threads — how many CPU threads handle the rest
n_ctx — context window size and its VRAM cost
n_batch — prompt batch size
We use llama-cpp-python with a small quantized model (TinyLlama or similar) so this works on modest GPUs.
🔧 Section 1: Environment Setup¶
Install dependencies. Run this once.
# Install llama-cpp-python with CUDA support
# The extra index serves prebuilt CUDA 12.1 wheels, so no local CMake/CUBLAS build is needed
import subprocess, sys
result = subprocess.run(
[sys.executable, "-m", "pip", "install",
"llama-cpp-python",
"--extra-index-url", "https://abetlen.github.io/llama-cpp-python/whl/cu121",
"huggingface_hub", "psutil", "matplotlib", "ipywidgets", "tqdm"],
capture_output=True, text=True
)
print(result.stdout[-2000:] if result.stdout else "")
print(result.stderr[-500:] if result.returncode != 0 else "✅ Install complete")
🖥️ Section 2: Check Your GPU (just like the rh-aiservices-bu notebook)¶
Before touching llama.cpp, let’s confirm CUDA is available and inspect the hardware.
import torch
print("=" * 50)
print("GPU AVAILABILITY CHECK")
print("=" * 50)
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")
if cuda_available:
n_gpus = torch.cuda.device_count()
print(f"Number of GPUs: {n_gpus}")
for i in range(n_gpus):
props = torch.cuda.get_device_properties(i)
vram_gb = props.total_memory / 1e9
print(f"\nGPU {i}: {props.name}")
print(f" VRAM: {vram_gb:.1f} GB")
print(f" CUDA Capability: {props.major}.{props.minor}")
print(f" Multiprocessors: {props.multi_processor_count}")
else:
print("\n⚠️ No GPU found. llama.cpp will run on CPU only.")
print(" n_gpu_layers will be ignored.")import subprocess
def get_gpu_memory():
"""Returns (used_mb, total_mb) for GPU 0 via nvidia-smi."""
try:
out = subprocess.check_output(
["nvidia-smi", "--query-gpu=memory.used,memory.total",
"--format=csv,noheader,nounits"],
text=True
).strip().split("\n")[0]
used, total = map(int, out.split(","))
return used, total
except Exception:
return None, None
used, total = get_gpu_memory()
if total:
free = total - used
pct = used / total * 100
print(f"GPU Memory — Used: {used} MB | Free: {free} MB | Total: {total} MB ({pct:.1f}% used)")
print(f"\n💡 Baseline (before loading any model): {used} MB used")
else:
print("nvidia-smi not available — memory tracking will be skipped.")📥 Section 3: Download a Small Test Model¶
We use TinyLlama-1.1B Q4_K_M (~670 MB) — small enough to fit on most GPUs but real enough to show meaningful GPU usage patterns.
It has 22 transformer layers, which gives us a nice range to experiment with (0 → 22).
import os
from huggingface_hub import hf_hub_download

# Download TinyLlama-1.1B-Chat Q4_K_M (~670 MB); swap repo_id/filename (or point model_path
# at a local .gguf) if you prefer a different model.
model_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
)
TOTAL_LAYERS = 22  # TinyLlama-1.1B has 22 transformer blocks
print(f"Model: {model_path}")
print(f"Size: {os.path.getsize(model_path) / 1e6:.0f} MB")
print(f"Layers: {TOTAL_LAYERS}")Section 4: Key Parameter Explainer¶
Before we experiment, here’s what each parameter means:
| Parameter | What it does | Tradeoff |
|---|---|---|
| n_gpu_layers | # of transformer layers offloaded to GPU VRAM | Higher → faster inference, more VRAM used |
| n_threads | CPU threads for layers NOT on GPU | More threads → faster CPU processing (up to physical core count) |
| n_ctx | Context window (tokens in memory at once) | Larger → bigger KV cache, more VRAM |
| n_batch | Tokens processed together in prompt phase | Larger → faster prompt ingestion, more VRAM |
The key mental model:
Total model = GPU layers (VRAM, fast) + CPU layers (RAM, slower)
              ↑ controlled by n_gpu_layers
Setting n_gpu_layers=999 offloads everything (if it fits in VRAM).
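A quick back-of-envelope check of that tradeoff (a sketch only: it spreads the GGUF file size evenly across layers and ignores the KV cache and CUDA scratch buffers, so real usage will be higher):
# Rough estimate: assume the quantized weight bytes are split evenly across layers;
# this ignores KV cache, embeddings, and CUDA workspace, so treat it as a lower bound.
import os
model_size_mb = os.path.getsize(model_path) / 1e6
per_layer_mb = model_size_mb / TOTAL_LAYERS
for ngl in (0, 11, 22):
    print(f"n_gpu_layers={ngl:>3} → roughly {ngl * per_layer_mb:.0f} MB of weights in VRAM")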
Section 5: Interactive Parameter Explorer¶
✏️ Edit the parameters below, then run the cell. Each run loads the model fresh so you get clean memory readings.
Try these experiments:
n_gpu_layers = 0 → all on CPU, watch GPU memory stay flat
n_gpu_layers = 11 → half on GPU
n_gpu_layers = 22 → fully on GPU
n_gpu_layers = 999 → same as 22 (llama.cpp caps it)
# ============================================================
# ✏️ EDIT THESE PARAMETERS AND RE-RUN THIS CELL
# ============================================================
n_gpu_layers = 22 # 0 = CPU only, 22 = full GPU, 999 = all available
n_threads = 4 # CPU threads for non-GPU layers
n_ctx = 512 # Context window size in tokens
n_batch = 512 # Batch size for prompt processing
# ============================================================
from llama_cpp import Llama
import time, gc
# --- Memory before loading ---
used_before, total_vram = get_gpu_memory()
used_before = used_before or 0
print(f"⚙️ Parameters:")
print(f" n_gpu_layers = {n_gpu_layers} (TinyLlama has {TOTAL_LAYERS} layers)")
print(f" n_threads = {n_threads}")
print(f" n_ctx = {n_ctx}")
print(f" n_batch = {n_batch}")
print(f"\n📊 VRAM before loading: {used_before} MB")
print("\nLoading model...")
t0 = time.time()
llm = Llama(
model_path = model_path,
n_gpu_layers = n_gpu_layers,
n_threads = n_threads,
n_ctx = n_ctx,
n_batch = n_batch,
verbose = False,
)
load_time = time.time() - t0
used_after, _ = get_gpu_memory()
used_after = used_after or 0
vram_delta = used_after - used_before
# Figure out actual layers on GPU (capped at model max)
actual_gpu_layers = min(n_gpu_layers, TOTAL_LAYERS)
cpu_layers = TOTAL_LAYERS - actual_gpu_layers
print(f"\n✅ Model loaded in {load_time:.1f}s")
print(f"\n{'='*45}")
print(f" LAYER SPLIT")
print(f" GPU layers : {actual_gpu_layers:>3} / {TOTAL_LAYERS} ({actual_gpu_layers/TOTAL_LAYERS*100:.0f}%)")
print(f" CPU layers : {cpu_layers:>3} / {TOTAL_LAYERS} ({cpu_layers/TOTAL_LAYERS*100:.0f}%)")
print(f"{'='*45}")
print(f" GPU VRAM USED BY MODEL")
if total_vram:
print(f" Before : {used_before:>6} MB")
print(f" After : {used_after:>6} MB")
print(f" Delta : {vram_delta:>+6} MB ← model cost")
print(f" Total avail: {total_vram:>6} MB")
else:
print(" (nvidia-smi not available)")
print(f"{'='*45}")Section 6: Inference Speed Test¶
Now run a prompt and measure tokens/second — the real payoff of GPU offloading.
PROMPT = "Explain what a GPU is to a five year old in 3 sentences."
MAX_TOKENS = 80
print(f"📝 Prompt: '{PROMPT}'")
print(f" max_tokens = {MAX_TOKENS}\n")
t0 = time.time()
output = llm(
PROMPT,
max_tokens = MAX_TOKENS,
temperature = 0.7,
echo = False,
)
elapsed = time.time() - t0
response_text = output["choices"][0]["text"].strip()
tokens_generated = output["usage"]["completion_tokens"]
tok_per_sec = tokens_generated / elapsed
print(f"💬 Response:\n{response_text}")
print(f"\n{'='*45}")
print(f" INFERENCE STATS")
print(f" Tokens generated : {tokens_generated}")
print(f" Time elapsed : {elapsed:.2f}s")
print(f" Speed : {tok_per_sec:.1f} tokens/sec")
print(f" GPU layers used : {actual_gpu_layers} / {TOTAL_LAYERS}")
print(f"{'='*45}")📊 Section 7: Benchmark Sweep — Layers vs Speed vs VRAM¶
This cell runs a full sweep across different n_gpu_layers values and plots:
Tokens/second vs GPU layers
VRAM used vs GPU layers
⚠️ This takes a few minutes — each n_gpu_layers value reloads the model from scratch so the memory readings stay clean.
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import gc
# Which layer counts to sweep
LAYER_VALUES = [0, 4, 8, 12, 16, 20, 22]
SWEEP_THREADS = 4 # fixed for sweep
SWEEP_CTX = 512
SWEEP_PROMPT = "What is machine learning?"
SWEEP_TOKENS = 50
results = []
for ngl in LAYER_VALUES:
print(f"Testing n_gpu_layers={ngl} ...", end=" ", flush=True)
# Clean up previous model
try:
del llm
gc.collect()
torch.cuda.empty_cache()
except NameError:
pass
time.sleep(1)
used_before, total_vram = get_gpu_memory()
used_before = used_before or 0
llm = Llama(
model_path = model_path,
n_gpu_layers = ngl,
n_threads = SWEEP_THREADS,
n_ctx = SWEEP_CTX,
n_batch = 512,
verbose = False,
)
used_after, _ = get_gpu_memory()
used_after = used_after or 0
vram_delta = max(0, used_after - used_before)
t0 = time.time()
out = llm(SWEEP_PROMPT, max_tokens=SWEEP_TOKENS, temperature=0.0, echo=False)
elapsed = time.time() - t0
toks = out["usage"]["completion_tokens"]
tps = toks / elapsed if elapsed > 0 else 0
actual = min(ngl, TOTAL_LAYERS)
results.append({"ngl": ngl, "actual": actual, "tps": tps, "vram": vram_delta})
print(f"{tps:.1f} tok/s | VRAM delta: +{vram_delta} MB")
print("\n✅ Sweep complete!")
# --- Plot ---
fig = plt.figure(figsize=(13, 5))
gs = gridspec.GridSpec(1, 2, figure=fig)
x_labels = [str(r["actual"]) for r in results]
tps_vals = [r["tps"] for r in results]
vram_vals = [r["vram"] for r in results]
# Speed chart
ax1 = fig.add_subplot(gs[0])
bars1 = ax1.bar(x_labels, tps_vals, color="steelblue", edgecolor="white")
ax1.set_title("Inference Speed vs GPU Layers", fontsize=13, fontweight="bold")
ax1.set_xlabel(f"GPU Layers (of {TOTAL_LAYERS} total)")
ax1.set_ylabel("Tokens / Second")
for bar, val in zip(bars1, tps_vals):
ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
f"{val:.1f}", ha="center", va="bottom", fontsize=9)
# VRAM chart
ax2 = fig.add_subplot(gs[1])
bars2 = ax2.bar(x_labels, vram_vals, color="darkorange", edgecolor="white")
ax2.set_title("VRAM Usage vs GPU Layers", fontsize=13, fontweight="bold")
ax2.set_xlabel(f"GPU Layers (of {TOTAL_LAYERS} total)")
ax2.set_ylabel("VRAM Used by Model (MB)")
for bar, val in zip(bars2, vram_vals):
ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
f"{val} MB", ha="center", va="bottom", fontsize=9)
plt.suptitle("llama.cpp Parameter Explorer — TinyLlama 1.1B Q4_K_M",
fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.savefig("gpu_sweep_results.png", dpi=150, bbox_inches="tight")
plt.show()
print("📊 Chart saved as gpu_sweep_results.png")🧵 Section 8: Thread Sweep¶
Now hold GPU layers fixed and sweep CPU thread count — to see the effect when GPU isn’t doing all the work.
import psutil
cpu_count = psutil.cpu_count(logical=False) or psutil.cpu_count(logical=True)  # physical count can be None in containers
THREAD_VALUES = [t for t in [1, 2, 4, 8, 16] if t <= cpu_count * 2]
THREAD_NGL = 0 # keep all on CPU so threads actually matter
print(f"Physical CPU cores: {cpu_count}")
print(f"Testing with n_gpu_layers={THREAD_NGL} (CPU-only mode so threads matter)")
print(f"Thread counts: {THREAD_VALUES}\n")
thread_results = []
for nt in THREAD_VALUES:
print(f"Testing n_threads={nt} ...", end=" ", flush=True)
try:
del llm
gc.collect()
except NameError:
pass
llm = Llama(
model_path = model_path,
n_gpu_layers = THREAD_NGL,
n_threads = nt,
n_ctx = 512,
n_batch = 512,
verbose = False,
)
t0 = time.time()
out = llm(SWEEP_PROMPT, max_tokens=SWEEP_TOKENS, temperature=0.0, echo=False)
elapsed = time.time() - t0
toks = out["usage"]["completion_tokens"]
tps = toks / elapsed
thread_results.append({"threads": nt, "tps": tps})
print(f"{tps:.1f} tok/s")
# Plot
fig, ax = plt.subplots(figsize=(7, 4))
x = [r["threads"] for r in thread_results]
y = [r["tps"] for r in thread_results]
ax.plot(x, y, marker="o", linewidth=2.5, color="forestgreen", markersize=8)
ax.set_title("CPU Thread Count vs Inference Speed (GPU layers = 0)",
fontsize=12, fontweight="bold")
ax.set_xlabel("n_threads")
ax.set_ylabel("Tokens / Second")
ax.axvline(x=cpu_count, color="red", linestyle="--", alpha=0.6,
label=f"Physical cores = {cpu_count}")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("thread_sweep_results.png", dpi=150, bbox_inches="tight")
plt.show()
print("📊 Chart saved as thread_sweep_results.png")
print(f"\n💡 Tip: Performance usually peaks around physical core count ({cpu_count}). Beyond that, hyperthreading adds little.")Section 9: Context Window (n_ctx) vs VRAM¶
A larger context window means more KV cache stored in VRAM — this can be significant with larger models.
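As a rough worked example of why (a sketch assuming TinyLlama-1.1B's attention geometry: 22 layers, 4 KV heads of size 64, fp16 cache; check your model's GGUF metadata for the real values):
# KV cache bytes ≈ 2 (K and V) × n_layers × n_ctx × n_kv_heads × head_dim × bytes_per_value
# Assumed TinyLlama-1.1B geometry: 22 layers, 4 KV heads (GQA), head_dim 64, fp16 cache.
n_layers, n_kv_heads, head_dim, fp16_bytes = 22, 4, 64, 2
for ctx in (512, 2048, 4096):
    kv_mb = 2 * n_layers * ctx * n_kv_heads * head_dim * fp16_bytes / 1e6
    print(f"n_ctx={ctx:>5} → KV cache ≈ {kv_mb:.0f} MB")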
CTX_VALUES = [128, 256, 512, 1024, 2048, 4096]
CTX_NGL = 22 # full GPU so VRAM cost is visible
print(f"Sweeping context sizes with n_gpu_layers={CTX_NGL}...\n")
ctx_results = []
for ctx in CTX_VALUES:
try:
del llm
gc.collect()
torch.cuda.empty_cache()
except NameError:
pass
time.sleep(0.5)
used_before, _ = get_gpu_memory()
used_before = used_before or 0
try:
llm = Llama(
model_path = model_path,
n_gpu_layers = CTX_NGL,
n_threads = 4,
n_ctx = ctx,
verbose = False,
)
used_after, _ = get_gpu_memory()
used_after = used_after or 0
delta = max(0, used_after - used_before)
ctx_results.append({"ctx": ctx, "vram": delta})
print(f" n_ctx={ctx:>5}: +{delta} MB VRAM")
except Exception as e:
print(f" n_ctx={ctx:>5}: ❌ OOM or error — {e}")
ctx_results.append({"ctx": ctx, "vram": None})
# Plot
valid = [(r["ctx"], r["vram"]) for r in ctx_results if r["vram"] is not None]
if valid:
xs, ys = zip(*valid)
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(xs, ys, marker="s", linewidth=2.5, color="purple", markersize=8)
ax.set_title("Context Window Size vs VRAM Used", fontsize=12, fontweight="bold")
ax.set_xlabel("n_ctx (context window tokens)")
ax.set_ylabel("VRAM Used by Model (MB)")
ax.set_xscale("log", base=2)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("ctx_sweep_results.png", dpi=150, bbox_inches="tight")
plt.show()
print("📊 Chart saved as ctx_sweep_results.png")Section 10: Summary & Cheat Sheet¶
After running the sweeps, here’s your personalized cheat sheet:
used_now, total_vram = get_gpu_memory()
print("=" * 55)
print(" llama.cpp PARAMETER CHEAT SHEET FOR THIS MACHINE")
print("=" * 55)
print(f"\n GPU: {'Available (' + str(total_vram) + ' MB VRAM)' if total_vram else 'Not detected'}")
print(f" CPU cores: {psutil.cpu_count(logical=False)} physical / {psutil.cpu_count(logical=True)} logical")
print(f" Model tested: TinyLlama 1.1B Q4_K_M ({TOTAL_LAYERS} layers)")
print("\n RECOMMENDED SETTINGS:")
print(f" Full GPU offload : -ngl {TOTAL_LAYERS} (or -ngl 999)")
print(f" Half GPU : -ngl {TOTAL_LAYERS // 2}")
print(f" CPU only : -ngl 0")
print(f" Threads : -t {psutil.cpu_count(logical=False)} (match physical cores)")
if results:
best = max(results, key=lambda r: r["tps"])
print(f"\n ⚡ FASTEST CONFIG IN SWEEP:")
print(f" n_gpu_layers = {best['actual']} → {best['tps']:.1f} tokens/sec")
print("\n EXAMPLE llama-cli COMMAND:")
print(f" llama-cli -m model.gguf \\")
print(f" -ngl {TOTAL_LAYERS} \\")
print(f" -t {psutil.cpu_count(logical=False)} \\")
print(f" --ctx-size 2048 \\")
print(f" -p 'Your prompt here'")
print("=" * 55)