
llama.cpp GPU Explorer — JupyterHub Edition

Learning Objective

Students will understand how GPU parameters (n_gpu_layers, n_threads, n_ctx) affect model loading, memory usage, and inference speed when running a small language model with llama-cpp-python on a shared JupyterHub environment.

Section 1: Imports

We import all the libraries we need at the top of the notebook.

A library is a collection of pre-written code that gives us extra tools to use. We wrap the llama-cpp-python import in try/except so that, if the package is not installed, the cell prints a helpful message instead of crashing with an ImportError; ask your instructor for help if you see it.

try:
    from llama_cpp import Llama  # the llama-cpp-python library for running small language models
except ImportError:
    print('Error: llama-cpp-python is not installed. Please ask your instructor.')

import torch                 # PyTorch — used here to check if a GPU is available
import subprocess            # lets us run shell commands from Python (used to call nvidia-smi)
import os                    # tools for working with file paths and the operating system
import time                  # used to measure how long inference takes
import gc                    # garbage collector — helps free memory between model loads
import matplotlib.pyplot as plt         # used to draw charts
import matplotlib.gridspec as gridspec  # used to arrange multiple charts side by side
import psutil                # reads CPU and system memory information
print('All imports successful.')

Section 2: Check Your GPU

Before we load any model, we check whether a GPU (Graphics Processing Unit) is available.

A GPU is a chip designed to do many math operations at once. Language models run much faster on a GPU than on a CPU because they involve huge numbers of matrix multiplications.

CUDA is NVIDIA’s software layer that lets Python code talk to an NVIDIA GPU.
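If you want to see this difference directly, the optional sketch below times one large matrix multiplication on the CPU and, if one is present, on the GPU. Exact numbers vary widely by machine; this is just to give a feel for the gap.

# Optional micro-benchmark: one 2048x2048 matrix multiplication on CPU
# vs GPU. torch and time are imported in Section 1; repeated here so the
# cell stands alone.
import time
import torch

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

start = time.time()
_ = a @ b
print(f'CPU matmul: {(time.time() - start) * 1000:.1f} ms')

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu            # warm-up: the first CUDA call pays setup cost
    torch.cuda.synchronize()     # wait for the GPU before starting the timer
    start = time.time()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()     # GPU ops are asynchronous; wait before stopping
    print(f'GPU matmul: {(time.time() - start) * 1000:.1f} ms')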

print('=' * 50)
print('GPU AVAILABILITY CHECK')
print('=' * 50)

cuda_available = torch.cuda.is_available()
print(f'CUDA available: {cuda_available}')

if cuda_available:
    number_of_gpus = torch.cuda.device_count()
    print(f'Number of GPUs: {number_of_gpus}')
    for gpu_index in range(number_of_gpus):
        gpu_properties = torch.cuda.get_device_properties(gpu_index)
        vram_gb = gpu_properties.total_memory / 1e9
        print(f'\nGPU {gpu_index}: {gpu_properties.name}')
        print(f'  VRAM:            {vram_gb:.1f} GB')
        print(f'  CUDA Capability: {gpu_properties.major}.{gpu_properties.minor}')
        print(f'  Multiprocessors: {gpu_properties.multi_processor_count}')
else:
    print('\n  No GPU found. llama.cpp will run on CPU only.')
    print('  The n_gpu_layers parameter will be ignored.')

Reading GPU Memory

We will track VRAM (Video RAM — the GPU’s memory) as we load models.

We use the nvidia-smi command-line tool to ask the GPU how much memory is in use. The function below runs that command and returns the numbers we need.

We will call this function before and after loading a model to see how much VRAM each load uses.

def get_gpu_memory():
    # Runs nvidia-smi and returns (used_mb, total_mb) for GPU 0.
    # Returns (None, None) if nvidia-smi is not available.
    try:
        raw_output = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total',
             '--format=csv,noheader,nounits'],
            text=True
        ).strip().split('\n')[0]
        used_mb, total_mb = map(int, raw_output.split(','))
        return used_mb, total_mb
    except Exception:
        return None, None

used_mb, total_mb = get_gpu_memory()
if total_mb:
    free_mb = total_mb - used_mb
    used_pct = used_mb / total_mb * 100
    print(f'GPU Memory — Used: {used_mb} MB | Free: {free_mb} MB | Total: {total_mb} MB ({used_pct:.1f}% used)')
    print(f'Baseline before loading any model: {used_mb} MB used')
else:
    print('nvidia-smi not available — memory tracking will be skipped.')

Section 3: Find the Shared Model

On this JupyterHub, the instructor has placed model files in a shared folder. We do not need to download anything.

The shared folder is at /home/jovyan/shared/. Run the cell below to see what model files are available.

# List the files in the shared model folder
!ls /home/jovyan/shared/

Set the Model Directory and Model File Name

Set model_directory to the path of the shared folder. Set model_name to the name of the .gguf file you saw above.

A .gguf file is a quantized (compressed) version of a language model. It is smaller and faster to run than the full-precision version.

We also set TOTAL_LAYERS — the number of transformer blocks in this model. We use this number throughout the notebook to track how many layers are on the GPU.
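As a rough sanity check on the file size you will see below, here is some back-of-the-envelope quantization arithmetic. It is approximate: GGUF files also store metadata, and q4_0 keeps per-block scale factors alongside the 4-bit weights, so the real file is somewhat larger than the pure 4-bit figure.

# Rough quantization arithmetic (approximate: ignores GGUF metadata and
# the per-block scale factors that q4_0 stores with the 4-bit weights)
num_params = 1.5e9  # Qwen2-1.5B has roughly 1.5 billion parameters
print(f'fp16 (2 bytes per weight)   : ~{num_params * 2.0 / 1e9:.1f} GB')
print(f'q4_0 (~0.5 bytes per weight): ~{num_params * 0.5 / 1e9:.1f} GB')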

# The shared folder where the instructor placed the model files
model_directory = '/home/jovyan/shared/'

# The name of the quantized model file we will use
model_name = 'qwen2-1_5b-instruct-q4_0.gguf'

# Build the full path to the model file
model_path = os.path.join(model_directory, model_name)

# Qwen2-1.5B has 28 transformer layers
TOTAL_LAYERS = 28

print(f'Model path: {model_path}')
print(f'File exists: {os.path.isfile(model_path)}')
print(f'File size: {os.path.getsize(model_path) / 1e6:.0f} MB')
print(f'Total transformer layers: {TOTAL_LAYERS}')

Section 4: Key Parameters Explained

llama-cpp-python lets you control exactly how the model uses your hardware. Here are the four parameters we will explore in this notebook:

| Parameter    | What it does                                              | Trade-off                                                                  |
| ------------ | --------------------------------------------------------- | -------------------------------------------------------------------------- |
| n_gpu_layers | How many transformer layers are sent to the GPU           | More layers → faster inference, more VRAM used                              |
| n_threads    | How many CPU cores handle the layers NOT on the GPU       | More threads → faster CPU processing (up to about the physical core count)  |
| n_ctx        | Context window: how many tokens the model can see at once | Larger window → more VRAM (the KV cache grows with it)                      |
| n_batch      | Tokens processed together during the prompt phase         | Larger batch → faster prompt ingestion, more VRAM                           |

Mental model:

Total model = GPU layers (VRAM, fast) + CPU layers (RAM, slower)
                   ↑ controlled by n_gpu_layers

Setting n_gpu_layers=999 tells llama.cpp to put everything on the GPU (if it fits in VRAM).
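The tiny sketch below works through that clamping rule for a few values; it is the same min() logic the explorer cell in Section 5 uses to report the layer split.

# How n_gpu_layers splits this model: requests above the model's layer
# count are clamped, which is why 999 behaves the same as 28 here
for requested in [0, 14, 28, 999]:
    on_gpu = min(requested, TOTAL_LAYERS)
    print(f'n_gpu_layers={requested:>3} -> {on_gpu:>2} layers on GPU, {TOTAL_LAYERS - on_gpu:>2} on CPU')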

Section 5: Interactive Parameter Explorer

Edit the four parameters below, then run the cell. Each run loads the model fresh so you get a clean VRAM reading.

Try these experiments:

  • n_gpu_layers = 0 → all on CPU, watch GPU memory stay flat

  • n_gpu_layers = 14 → roughly half on GPU

  • n_gpu_layers = 28 → fully on GPU

  • n_gpu_layers = 999 → same as 28 (llama.cpp caps it at the model maximum)

# ============================================================
#  Edit these parameters and re-run this cell
# ============================================================
n_gpu_layers = 28    # 0 = CPU only, 28 = full GPU, 999 = all available
n_threads    = 4     # CPU threads for the layers NOT on the GPU
n_ctx        = 512   # Context window size in tokens
n_batch      = 512   # Batch size for prompt processing
# ============================================================

# Read VRAM before loading
used_before, total_vram = get_gpu_memory()
used_before = used_before or 0

print(f'Parameters:')
print(f'  n_gpu_layers = {n_gpu_layers}  (model has {TOTAL_LAYERS} layers)')
print(f'  n_threads    = {n_threads}')
print(f'  n_ctx        = {n_ctx}')
print(f'  n_batch      = {n_batch}')
print(f'\nVRAM before loading: {used_before} MB')
print('Loading model...')

load_start_time = time.time()
model = Llama(
    model_path   = model_path,
    n_gpu_layers = n_gpu_layers,
    n_threads    = n_threads,
    n_ctx        = n_ctx,
    n_batch      = n_batch,
    verbose      = False,
)
load_time = time.time() - load_start_time

# Read VRAM after loading
used_after, _ = get_gpu_memory()
used_after = used_after or 0
vram_delta = used_after - used_before

# Figure out the actual number of layers on the GPU (capped at model max)
actual_gpu_layers = min(n_gpu_layers, TOTAL_LAYERS)
cpu_layers = TOTAL_LAYERS - actual_gpu_layers

print(f'\nModel loaded in {load_time:.1f}s')
print(f'\n{"="*45}')
print(f'  LAYER SPLIT')
print(f'  GPU layers : {actual_gpu_layers:>3} / {TOTAL_LAYERS}  ({actual_gpu_layers/TOTAL_LAYERS*100:.0f}%)')
print(f'  CPU layers : {cpu_layers:>3} / {TOTAL_LAYERS}  ({cpu_layers/TOTAL_LAYERS*100:.0f}%)')
print(f'{"="*45}')
print(f'  GPU VRAM USED BY MODEL')
if total_vram:
    print(f'  Before     : {used_before:>6} MB')
    print(f'  After      : {used_after:>6} MB')
    print(f'  Delta      : {vram_delta:>+6} MB  (model cost)')
    print(f'  Total avail: {total_vram:>6} MB')
else:
    print('  (nvidia-smi not available)')
print(f'{"="*45}')

Section 6: Inference Speed Test

Now we run a prompt and measure tokens per second — the real benefit of GPU offloading.

A token is roughly one word (or part of a word). Language models generate one token at a time. More tokens per second means faster responses.

GPU offloading speeds up inference because the GPU can run many computations in parallel.
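To see what tokens actually look like, the short sketch below tokenizes a sample sentence with the model loaded in Section 5. Llama.tokenize() takes bytes and returns a list of integer token IDs (it also prepends a beginning-of-sequence token by default, so the count can be one higher than you expect).

# Peek at tokenization using the model loaded in Section 5
sample_text = 'GPUs are great at matrix multiplication.'
token_ids = model.tokenize(sample_text.encode('utf-8'))
print(f'Words : {len(sample_text.split())}')
print(f'Tokens: {len(token_ids)}')
print(f'IDs   : {token_ids}')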

test_prompt = 'Explain what a GPU is to a five year old in 3 sentences.'
max_response_tokens = 80

print(f'Prompt: "{test_prompt}"')
print(f'max_tokens = {max_response_tokens}\n')

inference_start_time = time.time()
raw_output = model(
    test_prompt,
    max_tokens  = max_response_tokens,
    temperature = 0.7,
    echo        = False,
)
inference_elapsed = time.time() - inference_start_time

response_text     = raw_output['choices'][0]['text'].strip()
tokens_generated  = raw_output['usage']['completion_tokens']
tokens_per_second = tokens_generated / inference_elapsed

print(f'Response:\n{response_text}')
print(f'\n{"="*45}')
print(f'  INFERENCE STATS')
print(f'  Tokens generated : {tokens_generated}')
print(f'  Time elapsed     : {inference_elapsed:.2f}s')
print(f'  Speed            : {tokens_per_second:.1f} tokens/sec')
print(f'  GPU layers used  : {actual_gpu_layers} / {TOTAL_LAYERS}')
print(f'{"="*45}')

Section 7: Benchmark Sweep — Layers vs Speed vs VRAM

This section runs a full sweep across different n_gpu_layers values and plots two charts:

  • Tokens per second vs number of GPU layers

  • VRAM used vs number of GPU layers

This takes a few minutes because each step reloads the model fresh.

A benchmark means running the same task multiple times under different conditions so you can compare the results fairly.

# Layer values to test (spread evenly across 0 to TOTAL_LAYERS)
layer_values_to_test = [0, 4, 8, 14, 20, 24, 28]
sweep_threads = 4
sweep_ctx     = 512
sweep_prompt  = 'What is machine learning?'
sweep_tokens  = 50

sweep_results = []

for number_of_gpu_layers in layer_values_to_test:
    print(f'Testing n_gpu_layers={number_of_gpu_layers} ...', end=' ', flush=True)

    # Clean up whichever model is still loaded (Section 5's, or the one
    # from the previous loop iteration) so each run starts from a clean
    # VRAM baseline; gc.collect() must run even when nothing was deleted
    try:
        del model          # loaded in Section 5
    except NameError:
        pass
    try:
        del sweep_model    # loaded in the previous loop iteration
    except NameError:
        pass
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    used_before_sweep, total_vram_sweep = get_gpu_memory()
    used_before_sweep = used_before_sweep or 0

    sweep_model = Llama(
        model_path   = model_path,
        n_gpu_layers = number_of_gpu_layers,
        n_threads    = sweep_threads,
        n_ctx        = sweep_ctx,
        n_batch      = 512,
        verbose      = False,
    )

    used_after_sweep, _ = get_gpu_memory()
    used_after_sweep = used_after_sweep or 0
    vram_cost = used_after_sweep - used_before_sweep
    if vram_cost < 0:
        vram_cost = 0

    sweep_start = time.time()
    sweep_output = sweep_model(sweep_prompt, max_tokens=sweep_tokens, temperature=0.0, echo=False)
    sweep_elapsed = time.time() - sweep_start
    sweep_tokens_count = sweep_output['usage']['completion_tokens']
    tokens_per_second_sweep = sweep_tokens_count / sweep_elapsed if sweep_elapsed > 0 else 0

    actual_gpu_count = min(number_of_gpu_layers, TOTAL_LAYERS)
    sweep_results.append({
        'ngl': number_of_gpu_layers,
        'actual': actual_gpu_count,
        'tps': tokens_per_second_sweep,
        'vram': vram_cost
    })
    print(f'{tokens_per_second_sweep:.1f} tok/s | VRAM delta: +{vram_cost} MB')

print('\nSweep complete!')

# Plot the results
fig = plt.figure(figsize=(13, 5))
gs  = gridspec.GridSpec(1, 2, figure=fig)

x_labels  = []
tps_vals  = []
vram_vals = []
for result in sweep_results:
    x_labels.append(str(result['actual']))
    tps_vals.append(result['tps'])
    vram_vals.append(result['vram'])

ax1 = fig.add_subplot(gs[0])
speed_bars = ax1.bar(x_labels, tps_vals, color='steelblue', edgecolor='white')
ax1.set_title('Inference Speed vs GPU Layers', fontsize=13, fontweight='bold')
ax1.set_xlabel(f'GPU Layers (of {TOTAL_LAYERS} total)')
ax1.set_ylabel('Tokens / Second')
for bar, val in zip(speed_bars, tps_vals):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f'{val:.1f}', ha='center', va='bottom', fontsize=9)

ax2 = fig.add_subplot(gs[1])
vram_bars = ax2.bar(x_labels, vram_vals, color='darkorange', edgecolor='white')
ax2.set_title('VRAM Usage vs GPU Layers', fontsize=13, fontweight='bold')
ax2.set_xlabel(f'GPU Layers (of {TOTAL_LAYERS} total)')
ax2.set_ylabel('VRAM Used by Model (MB)')
for bar, val in zip(vram_bars, vram_vals):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             f'{val} MB', ha='center', va='bottom', fontsize=9)

plt.suptitle(f'llama.cpp Parameter Explorer — {model_name}',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Section 8: Thread Sweep

Now we hold GPU layers fixed at 0 (CPU only) and sweep the number of CPU threads.

A thread is a single stream of instructions running on one CPU core. Using more threads allows multiple calculations to happen at the same time.

When n_gpu_layers = 0, all the model’s math runs on the CPU. In this mode, the number of threads has the biggest impact on speed.

# cpu_count(logical=False) can return None on some platforms, so fall back
physical_cpu_cores = psutil.cpu_count(logical=False) or psutil.cpu_count() or 1
max_useful_threads = physical_cpu_cores * 2

all_candidate_threads = [1, 2, 4, 8, 16]
thread_values_to_test = []
for candidate in all_candidate_threads:
    if candidate <= max_useful_threads:
        thread_values_to_test.append(candidate)

thread_sweep_ngl = 0  # CPU only so that thread count actually matters

print(f'Physical CPU cores: {physical_cpu_cores}')
print(f'Testing with n_gpu_layers={thread_sweep_ngl} (CPU-only mode so threads matter)')
print(f'Thread counts to test: {thread_values_to_test}\n')

thread_results = []

for number_of_threads in thread_values_to_test:
    print(f'Testing n_threads={number_of_threads} ...', end=' ', flush=True)

    # Free whichever model is still loaded before creating the next one
    try:
        del sweep_model    # left over from the Section 7 sweep
    except NameError:
        pass
    try:
        del thread_model   # from the previous loop iteration
    except NameError:
        pass
    gc.collect()

    thread_model = Llama(
        model_path   = model_path,
        n_gpu_layers = thread_sweep_ngl,
        n_threads    = number_of_threads,
        n_ctx        = 512,
        n_batch      = 512,
        verbose      = False,
    )
    thread_start = time.time()
    thread_output = thread_model(sweep_prompt, max_tokens=sweep_tokens, temperature=0.0, echo=False)
    thread_elapsed = time.time() - thread_start
    thread_toks = thread_output['usage']['completion_tokens']
    thread_tps = thread_toks / thread_elapsed if thread_elapsed > 0 else 0
    thread_results.append({'threads': number_of_threads, 'tps': thread_tps})
    print(f'{thread_tps:.1f} tok/s')

# Plot the results
thread_x = []
thread_y = []
for result in thread_results:
    thread_x.append(result['threads'])
    thread_y.append(result['tps'])

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(thread_x, thread_y, marker='o', linewidth=2.5, color='forestgreen', markersize=8)
ax.set_title('CPU Thread Count vs Inference Speed (GPU layers = 0)',
             fontsize=12, fontweight='bold')
ax.set_xlabel('n_threads')
ax.set_ylabel('Tokens / Second')
ax.axvline(x=physical_cpu_cores, color='red', linestyle='--', alpha=0.6,
           label=f'Physical cores = {physical_cpu_cores}')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f'Tip: Speed usually peaks near the physical core count ({physical_cpu_cores}). Beyond that, extra threads add little.')

Section 9: Context Window (n_ctx) vs VRAM

A larger context window means the model can remember more text at once. But it also costs more VRAM, because the model must store a KV cache for every token in the window.

A KV cache (key-value cache) is a table of numbers the model keeps so it does not have to recompute attention scores for tokens it has already seen.

In this section we load the model at different context sizes and measure the VRAM cost each time.
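Before running the sweep, here is a back-of-the-envelope estimate of KV-cache size. The kv_width value below is a made-up illustrative number, not this model's real attention width (modern models often use grouped-query attention, which shrinks the cache considerably); the point is that the cost scales linearly with n_ctx.

# Illustrative KV-cache sizing: 2 tensors (K and V) per layer, one entry
# per context token. kv_width is hypothetical; only the scaling is real.
n_layers  = TOTAL_LAYERS  # 28 for this model
kv_width  = 256           # assumed per-layer width of K (and of V)
elem_size = 2             # bytes per value (fp16)
for ctx in [128, 512, 2048, 4096]:
    cache_mb = 2 * n_layers * ctx * kv_width * elem_size / 1e6
    print(f'n_ctx={ctx:>5}: ~{cache_mb:6.1f} MB KV cache')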

context_sizes_to_test = [128, 256, 512, 1024, 2048, 4096]
ctx_sweep_ngl = 28  # full GPU so the VRAM cost is visible

print(f'Sweeping context sizes with n_gpu_layers={ctx_sweep_ngl}...\n')

ctx_results = []

for ctx_size in context_sizes_to_test:
    # Free whichever model is still loaded before the next measurement
    try:
        del thread_model   # left over from the Section 8 sweep
    except NameError:
        pass
    try:
        del ctx_model      # from the previous loop iteration
    except NameError:
        pass
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(0.5)

    used_before_ctx, _ = get_gpu_memory()
    used_before_ctx = used_before_ctx or 0

    try:
        ctx_model = Llama(
            model_path   = model_path,
            n_gpu_layers = ctx_sweep_ngl,
            n_threads    = 4,
            n_ctx        = ctx_size,
            verbose      = False,
        )
        used_after_ctx, _ = get_gpu_memory()
        used_after_ctx = used_after_ctx or 0
        ctx_vram_delta = used_after_ctx - used_before_ctx
        if ctx_vram_delta < 0:
            ctx_vram_delta = 0
        ctx_results.append({'ctx': ctx_size, 'vram': ctx_vram_delta})
        print(f'  n_ctx={ctx_size:>5}: +{ctx_vram_delta} MB VRAM')
    except Exception as ctx_error:
        print(f'  n_ctx={ctx_size:>5}: Error (possibly out of VRAM) — {ctx_error}')
        ctx_results.append({'ctx': ctx_size, 'vram': None})

# Build separate lists for valid (non-None) results
ctx_x_values = []
ctx_y_values = []
for result in ctx_results:
    if result['vram'] is not None:
        ctx_x_values.append(result['ctx'])
        ctx_y_values.append(result['vram'])

if ctx_x_values:
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(ctx_x_values, ctx_y_values, marker='s', linewidth=2.5, color='purple', markersize=8)
    ax.set_title('Context Window Size vs VRAM Used', fontsize=12, fontweight='bold')
    ax.set_xlabel('n_ctx (context window tokens)')
    ax.set_ylabel('VRAM Used by Model (MB)')
    ax.set_xscale('log', base=2)
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

Section 10: Summary and Cheat Sheet

After running all the sweeps above, this cell prints a personalized cheat sheet for your specific machine and model.

current_used_mb, current_total_vram = get_gpu_memory()

print('=' * 55)
print('  llama.cpp PARAMETER CHEAT SHEET FOR THIS MACHINE')
print('=' * 55)

if current_total_vram:
    gpu_summary = f'Available ({current_total_vram} MB VRAM)'
else:
    gpu_summary = 'Not detected'
print(f'\n  GPU: {gpu_summary}')
print(f'  CPU cores: {psutil.cpu_count(logical=False)} physical / {psutil.cpu_count(logical=True)} logical')
print(f'  Model tested: {model_name} ({TOTAL_LAYERS} layers)')

print('\n  RECOMMENDED SETTINGS:')
print(f'    Full GPU offload : n_gpu_layers={TOTAL_LAYERS} (or 999)')
print(f'    Half GPU         : n_gpu_layers={TOTAL_LAYERS // 2}')
print(f'    CPU only         : n_gpu_layers=0')
print(f'    Threads          : n_threads={psutil.cpu_count(logical=False)} (match physical cores)')

if sweep_results:
    best_result = sweep_results[0]
    for result in sweep_results:
        if result['tps'] > best_result['tps']:
            best_result = result
    print(f'\n  FASTEST CONFIG IN SWEEP:')
    print(f'    n_gpu_layers = {best_result["actual"]} -> {best_result["tps"]:.1f} tokens/sec')

print('\n  EXAMPLE llama-cli COMMAND:')
print(f'    llama-cli -m {model_name} \\')
print(f'      -ngl {TOTAL_LAYERS} \\')
print(f'      -t {psutil.cpu_count(logical=False)} \\')
print(f'      --ctx-size 2048 \\')
print(f'      -p \'Your prompt here\'')
print('=' * 55)

What You Learned

In this notebook you:

  1. Imported all needed libraries using try/except — no installation required

  2. Checked GPU availability using PyTorch and nvidia-smi

  3. Found the shared model in the JupyterHub shared folder

  4. Loaded a model using llama-cpp-python and controlled how many layers go to the GPU

  5. Measured inference speed in tokens per second

  6. Ran three benchmark sweeps: GPU layers, CPU threads, and context window size

  7. Read charts showing how each parameter affects speed and memory

The key takeaway: more GPU layers → faster inference, but more VRAM used. Finding the right balance for your hardware is what this notebook is all about.