llama.cpp GPU Explorer

Understanding Layers, Threads & GPU Memory on JupyterHub

This notebook is inspired by the rh-aiservices-bu/getting-started-with-gpus series.

We extend it to explore how llama.cpp parameters affect GPU utilization:

  • n_gpu_layers — how many transformer layers to offload to GPU

  • n_threads — how many CPU threads handle the rest

  • n_ctx — context window size and its VRAM cost

  • n_batch — prompt batch size

We use llama-cpp-python with a small quantized model (TinyLlama or similar) so this works on modest GPUs.

🔧 Section 1: Environment Setup

Install dependencies. Run this once.

# Install llama-cpp-python with CUDA support.
# The extra index serves prebuilt CUDA 12.1 wheels, so no local
# source build via CMAKE_ARGS is required.
import subprocess, sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "llama-cpp-python",
     "--extra-index-url", "https://abetlen.github.io/llama-cpp-python/whl/cu121",
     "huggingface_hub", "psutil", "matplotlib", "ipywidgets", "tqdm"],
    capture_output=True, text=True
)
print(result.stdout[-2000:] if result.stdout else "")
print(result.stderr[-500:] if result.returncode != 0 else "✅ Install complete")
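
Optional sanity check: recent llama-cpp-python builds expose llama_supports_gpu_offload(), which reports whether the installed wheel was compiled with a GPU backend. The try/except guards against older versions that don't export it.

import llama_cpp

# Returns True only if the installed wheel was built with GPU support.
try:
    print("GPU offload compiled in:", llama_cpp.llama_supports_gpu_offload())
except AttributeError:
    print("This llama-cpp-python version doesn't expose llama_supports_gpu_offload()")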

🖥️ Section 2: Check Your GPU (just like the rh-aiservices-bu notebook)

Before touching llama.cpp, let’s confirm CUDA is available and inspect the hardware.

import torch

print("=" * 50)
print("GPU AVAILABILITY CHECK")
print("=" * 50)

cuda_available = torch.cuda.is_available()
print(f"CUDA available:        {cuda_available}")

if cuda_available:
    n_gpus = torch.cuda.device_count()
    print(f"Number of GPUs:        {n_gpus}")
    for i in range(n_gpus):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1e9
        print(f"\nGPU {i}: {props.name}")
        print(f"  VRAM:              {vram_gb:.1f} GB")
        print(f"  CUDA Capability:   {props.major}.{props.minor}")
        print(f"  Multiprocessors:   {props.multi_processor_count}")
else:
    print("\n⚠️  No GPU found. llama.cpp will run on CPU only.")
    print("   n_gpu_layers will be ignored.")
import subprocess

def get_gpu_memory():
    """Returns (used_mb, total_mb) for GPU 0 via nvidia-smi."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True
        ).strip().split("\n")[0]
        used, total = map(int, out.split(","))
        return used, total
    except Exception:
        return None, None

used, total = get_gpu_memory()
if total:
    free = total - used
    pct  = used / total * 100
    print(f"GPU Memory — Used: {used} MB | Free: {free} MB | Total: {total} MB ({pct:.1f}% used)")
    print(f"\n💡 Baseline (before loading any model): {used} MB used")
else:
    print("nvidia-smi not available — memory tracking will be skipped.")

📥 Section 3: Download a Small Test Model

We use TinyLlama-1.1B Q4_K_M (~670 MB) — small enough to fit on most GPUs but real enough to show meaningful GPU usage patterns.

It has 22 transformer layers, which gives us a nice range to experiment with (0 → 22).

import os
from huggingface_hub import hf_hub_download

# Download TinyLlama-1.1B-Chat Q4_K_M (~670 MB); hf_hub_download caches
# the file, so re-running this cell is cheap.
model_path = hf_hub_download(
    repo_id  = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
)
TOTAL_LAYERS = 22   # TinyLlama-1.1B has 22 transformer blocks

print(f"Model:  {model_path}")
print(f"Size:   {os.path.getsize(model_path) / 1e6:.0f} MB")
print(f"Layers: {TOTAL_LAYERS}")

Section 4: Key Parameter Explainer

Before we experiment, here’s what each parameter means:

Parameter      What it does                                    Tradeoff
n_gpu_layers   # of transformer layers offloaded to GPU VRAM   Higher → faster inference, more VRAM used
n_threads      CPU threads for layers NOT on GPU               More threads → faster CPU processing
n_ctx          Context window (tokens in memory at once)       Larger → bigger KV cache, more VRAM
n_batch        Tokens processed together in prompt phase       Larger → faster prompt ingestion, more VRAM

The key mental model:

Total model = GPU layers (VRAM, fast) + CPU layers (RAM, slower)
                   ↑ controlled by n_gpu_layers

Setting n_gpu_layers=999 offloads everything (if it fits in VRAM).
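
A quick back-of-envelope check on that mental model: if the quantized weights dominate the file and are spread roughly evenly across layers, file size divided by layer count approximates the per-layer offload cost. This is a rough sketch only: it ignores the KV cache, the CUDA context, and non-repeating tensors such as embeddings.

import os

def estimate_offload_mb(gguf_path, total_layers, n_gpu_layers):
    """Approximate VRAM cost of offloading n_gpu_layers, assuming the
    quantized weights are spread evenly across transformer layers."""
    per_layer_mb = os.path.getsize(gguf_path) / 1e6 / total_layers
    return per_layer_mb * min(n_gpu_layers, total_layers)

# e.g., offloading half of TinyLlama's 22 layers
print(f"~{estimate_offload_mb(model_path, TOTAL_LAYERS, 11):.0f} MB estimated")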

Section 5: Interactive Parameter Explorer

✏️ Edit the parameters below, then run the cell. Each run loads the model fresh so you get clean memory readings.

Try these experiments:

  • n_gpu_layers = 0 → all on CPU, watch GPU memory stay flat

  • n_gpu_layers = 11 → half on GPU

  • n_gpu_layers = 22 → fully on GPU

  • n_gpu_layers = 999 → same as 22 (llama.cpp caps it)

# ============================================================
#  ✏️  EDIT THESE PARAMETERS AND RE-RUN THIS CELL
# ============================================================
n_gpu_layers = 22    # 0 = CPU only, 22 = full GPU, 999 = all available
n_threads    = 4     # CPU threads for non-GPU layers
n_ctx        = 512   # Context window size in tokens
n_batch      = 512   # Batch size for prompt processing
# ============================================================

from llama_cpp import Llama
import time, gc

# --- Memory before loading ---
used_before, total_vram = get_gpu_memory()
used_before = used_before or 0

print(f"⚙️  Parameters:")
print(f"   n_gpu_layers = {n_gpu_layers}  (TinyLlama has {TOTAL_LAYERS} layers)")
print(f"   n_threads    = {n_threads}")
print(f"   n_ctx        = {n_ctx}")
print(f"   n_batch      = {n_batch}")
print(f"\n📊 VRAM before loading: {used_before} MB")
print("\nLoading model...")

t0 = time.time()
llm = Llama(
    model_path   = model_path,
    n_gpu_layers = n_gpu_layers,
    n_threads    = n_threads,
    n_ctx        = n_ctx,
    n_batch      = n_batch,
    verbose      = False,
)
load_time = time.time() - t0

used_after, _ = get_gpu_memory()
used_after = used_after or 0
vram_delta = used_after - used_before

# Figure out actual layers on GPU (capped at model max)
actual_gpu_layers = min(n_gpu_layers, TOTAL_LAYERS)
cpu_layers = TOTAL_LAYERS - actual_gpu_layers

print(f"\n✅ Model loaded in {load_time:.1f}s")
print(f"\n{'='*45}")
print(f"  LAYER SPLIT")
print(f"  GPU layers : {actual_gpu_layers:>3} / {TOTAL_LAYERS}  ({actual_gpu_layers/TOTAL_LAYERS*100:.0f}%)")
print(f"  CPU layers : {cpu_layers:>3} / {TOTAL_LAYERS}  ({cpu_layers/TOTAL_LAYERS*100:.0f}%)")
print(f"{'='*45}")
print(f"  GPU VRAM USED BY MODEL")
if total_vram:
    print(f"  Before     : {used_before:>6} MB")
    print(f"  After      : {used_after:>6} MB")
    print(f"  Delta      : {vram_delta:>+6} MB  ← model cost")
    print(f"  Total avail: {total_vram:>6} MB")
else:
    print("  (nvidia-smi not available)")
print(f"{'='*45}")

Section 6: Inference Speed Test

Now run a prompt and measure tokens/second — the real payoff of GPU offloading.

PROMPT = "Explain what a GPU is to a five year old in 3 sentences."
MAX_TOKENS = 80

print(f"📝 Prompt: '{PROMPT}'")
print(f"   max_tokens = {MAX_TOKENS}\n")

t0 = time.time()
output = llm(
    PROMPT,
    max_tokens   = MAX_TOKENS,
    temperature  = 0.7,
    echo         = False,
)
elapsed = time.time() - t0

response_text   = output["choices"][0]["text"].strip()
tokens_generated = output["usage"]["completion_tokens"]
tok_per_sec     = tokens_generated / elapsed

print(f"💬 Response:\n{response_text}")
print(f"\n{'='*45}")
print(f"  INFERENCE STATS")
print(f"  Tokens generated : {tokens_generated}")
print(f"  Time elapsed     : {elapsed:.2f}s")
print(f"  Speed            : {tok_per_sec:.1f} tokens/sec")
print(f"  GPU layers used  : {actual_gpu_layers} / {TOTAL_LAYERS}")
print(f"{'='*45}")

📊 Section 7: Benchmark Sweep — Layers vs Speed vs VRAM

This cell runs a full sweep across different n_gpu_layers values and plots:

  • Tokens/second vs GPU layers

  • VRAM used vs GPU layers

⚠️ This takes a few minutes — each row reloads the model fresh.

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import gc

# Which layer counts to sweep
LAYER_VALUES   = [0, 4, 8, 12, 16, 20, 22]
SWEEP_THREADS  = 4    # fixed for sweep
SWEEP_CTX      = 512
SWEEP_PROMPT   = "What is machine learning?"
SWEEP_TOKENS   = 50

results = []

for ngl in LAYER_VALUES:
    print(f"Testing n_gpu_layers={ngl} ...", end=" ", flush=True)

    # Free the previous model: llama.cpp releases its VRAM when the object
    # is garbage-collected; empty_cache() only clears torch's own pool.
    try:
        del llm
        gc.collect()
        torch.cuda.empty_cache()
    except NameError:
        pass
    time.sleep(1)   # let the nvidia-smi reading settle

    used_before, total_vram = get_gpu_memory()
    used_before = used_before or 0

    llm = Llama(
        model_path   = model_path,
        n_gpu_layers = ngl,
        n_threads    = SWEEP_THREADS,
        n_ctx        = SWEEP_CTX,
        n_batch      = 512,
        verbose      = False,
    )

    used_after, _ = get_gpu_memory()
    used_after = used_after or 0
    vram_delta = max(0, used_after - used_before)

    t0 = time.time()
    out = llm(SWEEP_PROMPT, max_tokens=SWEEP_TOKENS, temperature=0.0, echo=False)
    elapsed = time.time() - t0
    toks = out["usage"]["completion_tokens"]
    tps  = toks / elapsed if elapsed > 0 else 0

    actual = min(ngl, TOTAL_LAYERS)
    results.append({"ngl": ngl, "actual": actual, "tps": tps, "vram": vram_delta})
    print(f"{tps:.1f} tok/s | VRAM delta: +{vram_delta} MB")

print("\n✅ Sweep complete!")

# --- Plot ---
fig = plt.figure(figsize=(13, 5))
gs  = gridspec.GridSpec(1, 2, figure=fig)

x_labels = [str(r["actual"]) for r in results]
tps_vals  = [r["tps"]  for r in results]
vram_vals = [r["vram"] for r in results]

# Speed chart
ax1 = fig.add_subplot(gs[0])
bars1 = ax1.bar(x_labels, tps_vals, color="steelblue", edgecolor="white")
ax1.set_title("Inference Speed vs GPU Layers", fontsize=13, fontweight="bold")
ax1.set_xlabel(f"GPU Layers (of {TOTAL_LAYERS} total)")
ax1.set_ylabel("Tokens / Second")
for bar, val in zip(bars1, tps_vals):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f"{val:.1f}", ha="center", va="bottom", fontsize=9)

# VRAM chart
ax2 = fig.add_subplot(gs[1])
bars2 = ax2.bar(x_labels, vram_vals, color="darkorange", edgecolor="white")
ax2.set_title("VRAM Usage vs GPU Layers", fontsize=13, fontweight="bold")
ax2.set_xlabel(f"GPU Layers (of {TOTAL_LAYERS} total)")
ax2.set_ylabel("VRAM Used by Model (MB)")
for bar, val in zip(bars2, vram_vals):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             f"{val} MB", ha="center", va="bottom", fontsize=9)

plt.suptitle("llama.cpp Parameter Explorer — TinyLlama 1.1B Q4_K_M",
             fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.savefig("gpu_sweep_results.png", dpi=150, bbox_inches="tight")
plt.show()
print("📊 Chart saved as gpu_sweep_results.png")

🧵 Section 8: Thread Sweep

Now hold GPU layers fixed and sweep CPU thread count — to see the effect when GPU isn’t doing all the work.

import psutil

# logical=False can return None on some platforms; fall back to logical count
cpu_count     = psutil.cpu_count(logical=False) or psutil.cpu_count(logical=True)
THREAD_VALUES = [t for t in [1, 2, 4, 8, 16] if t <= cpu_count * 2]
THREAD_NGL    = 0   # keep all on CPU so threads actually matter

print(f"Physical CPU cores: {cpu_count}")
print(f"Testing with n_gpu_layers={THREAD_NGL} (CPU-only mode so threads matter)")
print(f"Thread counts: {THREAD_VALUES}\n")

thread_results = []

for nt in THREAD_VALUES:
    print(f"Testing n_threads={nt} ...", end=" ", flush=True)

    try:
        del llm
        gc.collect()
    except NameError:
        pass

    llm = Llama(
        model_path   = model_path,
        n_gpu_layers = THREAD_NGL,
        n_threads    = nt,
        n_ctx        = 512,
        n_batch      = 512,
        verbose      = False,
    )
    t0  = time.time()
    out = llm(SWEEP_PROMPT, max_tokens=SWEEP_TOKENS, temperature=0.0, echo=False)
    elapsed = time.time() - t0
    toks = out["usage"]["completion_tokens"]
    tps  = toks / elapsed
    thread_results.append({"threads": nt, "tps": tps})
    print(f"{tps:.1f} tok/s")

# Plot
fig, ax = plt.subplots(figsize=(7, 4))
x = [r["threads"] for r in thread_results]
y = [r["tps"]     for r in thread_results]
ax.plot(x, y, marker="o", linewidth=2.5, color="forestgreen", markersize=8)
ax.set_title("CPU Thread Count vs Inference Speed (GPU layers = 0)",
             fontsize=12, fontweight="bold")
ax.set_xlabel("n_threads")
ax.set_ylabel("Tokens / Second")
ax.axvline(x=cpu_count, color="red", linestyle="--", alpha=0.6,
           label=f"Physical cores = {cpu_count}")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("thread_sweep_results.png", dpi=150, bbox_inches="tight")
plt.show()
print("📊 Chart saved as thread_sweep_results.png")
print(f"\n💡 Tip: Performance usually peaks around physical core count ({cpu_count}). Beyond that, hyperthreading adds little.")

Section 9: Context Window (n_ctx) vs VRAM

A larger context window means more KV cache stored in VRAM — this can be significant with larger models.
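
The KV cache grows linearly with n_ctx: for each token, every layer stores one key and one value vector per KV head. A minimal arithmetic sketch; the TinyLlama numbers used as defaults (22 layers, 4 KV heads of dimension 64, fp16 cache) are assumptions, so check your model's metadata:

def kv_cache_mb(n_ctx, n_layers=22, n_kv_heads=4, head_dim=64, bytes_per_el=2):
    """2 tensors (K and V) x layers x tokens x KV heads x head_dim x bytes."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el / 1e6

for ctx in [512, 2048, 4096]:
    print(f"n_ctx={ctx:>5}: ~{kv_cache_mb(ctx):.0f} MB KV cache (estimated)")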

CTX_VALUES = [128, 256, 512, 1024, 2048, 4096]
CTX_NGL    = 22   # full GPU so VRAM cost is visible

print(f"Sweeping context sizes with n_gpu_layers={CTX_NGL}...\n")

ctx_results = []

for ctx in CTX_VALUES:
    try:
        del llm
        gc.collect()
        torch.cuda.empty_cache()
    except NameError:
        pass
    time.sleep(0.5)

    used_before, _ = get_gpu_memory()
    used_before = used_before or 0

    try:
        llm = Llama(
            model_path   = model_path,
            n_gpu_layers = CTX_NGL,
            n_threads    = 4,
            n_ctx        = ctx,
            verbose      = False,
        )
        used_after, _ = get_gpu_memory()
        used_after = used_after or 0
        delta = max(0, used_after - used_before)
        ctx_results.append({"ctx": ctx, "vram": delta})
        print(f"  n_ctx={ctx:>5}: +{delta} MB VRAM")
    except Exception as e:
        print(f"  n_ctx={ctx:>5}: ❌ OOM or error — {e}")
        ctx_results.append({"ctx": ctx, "vram": None})

# Plot
valid = [(r["ctx"], r["vram"]) for r in ctx_results if r["vram"] is not None]
if valid:
    xs, ys = zip(*valid)
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(xs, ys, marker="s", linewidth=2.5, color="purple", markersize=8)
    ax.set_title("Context Window Size vs VRAM Used", fontsize=12, fontweight="bold")
    ax.set_xlabel("n_ctx (context window tokens)")
    ax.set_ylabel("VRAM Used by Model (MB)")
    ax.set_xscale("log", base=2)
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig("ctx_sweep_results.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("📊 Chart saved as ctx_sweep_results.png")

Section 10: Summary & Cheat Sheet

After running the sweeps, here’s your personalized cheat sheet:

used_now, total_vram = get_gpu_memory()

print("=" * 55)
print("  llama.cpp PARAMETER CHEAT SHEET FOR THIS MACHINE")
print("=" * 55)

print(f"\n  GPU: {'Available (' + str(total_vram) + ' MB VRAM)' if total_vram else 'Not detected'}")
print(f"  CPU cores: {psutil.cpu_count(logical=False)} physical / {psutil.cpu_count(logical=True)} logical")
print(f"  Model tested: TinyLlama 1.1B Q4_K_M ({TOTAL_LAYERS} layers)")

print("\n  RECOMMENDED SETTINGS:")
print(f"    Full GPU offload : -ngl {TOTAL_LAYERS} (or -ngl 999)")
print(f"    Half GPU         : -ngl {TOTAL_LAYERS // 2}")
print(f"    CPU only         : -ngl 0")
print(f"    Threads          : -t {psutil.cpu_count(logical=False)} (match physical cores)")

if "results" in globals() and results:   # guard in case the sweep cell wasn't run
    best = max(results, key=lambda r: r["tps"])
    print(f"\n  ⚡ FASTEST CONFIG IN SWEEP:")
    print(f"    n_gpu_layers = {best['actual']} → {best['tps']:.1f} tokens/sec")

print("\n  EXAMPLE llama-cli COMMAND:")
print(f"    llama-cli -m model.gguf \\")
print(f"      -ngl {TOTAL_LAYERS} \\")
print(f"      -t {psutil.cpu_count(logical=False)} \\")
print(f"      --ctx-size 2048 \\")
print(f"      -p 'Your prompt here'")
print("=" * 55)