Using llama-cpp-python and the GGUF Format¶
This notebook pulls back the curtain on how a language model actually works with numbers. We’ll explore:
Tokens: How text is broken into numeric pieces before the model ever sees it
Tokens going in and out: The integer IDs that flow through a model
Text chunks: What a chunk of text looks like vs. its tokenized form
GGUF files: The binary container that stores a model on disk
Model weights: The actual floating-point numbers that give a model its “knowledge”
How concepts map to numbers: Token embedding vectors as coordinates in meaning-space
Throughout, we use economics examples — tariffs, inflation, GDP, supply and demand — as the text to analyze, since these are rich, domain-specific phrases that illustrate how tokenizers handle specialized vocabulary.
What You’ll Need¶
A .gguf model file already downloaded (see GPT4All_Download_gguf.ipynb)
Python packages: llama-cpp-python, gguf, numpy
Resources¶
The Illustrated Word2Vec — great visual intro to embeddings
Tokenizer visualization — explore GPT tokenization online
Attribution¶
Notebook developed by Eric Van Dusen and contributors for the ds-modules/SmallLM-FA25 project.
1. Environment Setup¶
We need two packages beyond the standard library:
| Package | Purpose |
|---|---|
| llama-cpp-python | Load GGUF models, run inference, access the built-in tokenizer |
| gguf | Read GGUF file metadata and weight tensors directly (no model load needed) |
| numpy | Display weight arrays as readable tables |
The gguf package is the official Python reader published by the llama.cpp project.
# Install llama-cpp-python if not already present
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama

# Install the gguf reader package
try:
    from gguf import GGUFReader
except ImportError:
    %pip install gguf
    from gguf import GGUFReader

import numpy as np
print("All packages loaded successfully!")1.1 Locate the Model File¶
Set model_path and model_name to match your environment. Common locations:
Shared JupyterHub: /home/jovyan/shared/
Local machine: your own path (e.g. ~/models/)
# ── Set these two variables to match your setup ─────────────────────────────
model_path = "/home/jovyan/shared/" # directory containing .gguf files
model_name = "qwen2-1_5b-instruct-q4_0.gguf" # filename of the model
# ─────────────────────────────────────────────────────────────────────────────
# For local use, uncomment and adjust:
# model_path = "shared-rw/"
#model_path = "/Users/ericvandusen/SmallLM/Models/"
import os
full_model_path = os.path.join(model_path, model_name)
print(f"Looking for: {full_model_path}")
print(f"File exists: {os.path.exists(full_model_path)}")# See what .gguf files are available
gguf_files = [f for f in os.listdir(model_path) if f.endswith(".gguf")]
print(f"Available .gguf models in {model_path}:")
for f in gguf_files:
    size_mb = os.path.getsize(os.path.join(model_path, f)) / (1024**2)
    print(f" {f:60s} {size_mb:7.1f} MB")

2. Motivation: Everything Is a Number¶
Before we load the model, let’s build intuition. Consider this economics sentence:
“When a country imposes a tariff on imported steel, domestic producers gain while consumers pay higher prices.”
A language model never sees letters. It sees only:
Token IDs — integers that index a vocabulary table
Embedding vectors — lists of floating-point numbers that represent each token
Weight matrices — huge arrays of floats that transform those vectors step by step
The diagram below summarizes the pipeline:
Raw text → Tokenizer → Token IDs → Embedding lookup → Transformer layers → Logits → Next token
(str)       (list[int])   (matrix of floats)   (weight math)        (floats)  (int → str)

Every step in this pipeline is pure arithmetic on numbers. This section walks through each step using economic text.
3. Tokens: Text Broken Into Pieces¶
What is a Token?¶
A token is the atomic unit of text that the model processes. Tokens are not the same as words:
A common short word (e.g. the) is usually one token.
A longer or rarer word may be split into sub-word pieces (e.g. inflation → [inflation], but hyperinflation → [hyper, inflation]).
Spaces, punctuation, and capitalization all affect tokenization.
Numbers can be tokenized digit-by-digit or as whole numbers depending on the tokenizer.
Modern LLMs typically use Byte-Pair Encoding (BPE) or SentencePiece tokenizers. Both learn a vocabulary of common sub-word units from a large training corpus.
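To make the merge idea concrete, here is a toy BPE-style loop — a minimal sketch on a tiny, made-up three-word corpus, not the model's actual tokenizer or merge table:

```python
# Toy BPE illustration: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

corpus = ["inflation", "inflate", "deflation"]
words = [list(w) for w in corpus]  # start from individual characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
    return pairs.most_common(1)[0][0] if pairs else None

for step in range(6):  # perform a few merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merged = "".join(pair)
    for w_idx, w in enumerate(words):
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        words[w_idx] = out
    print(f"Merge {step + 1}: {pair} → '{merged}'")

print("Segmented corpus:", [" ".join(w) for w in words])
```

After a handful of merges, frequent character runs shared by the corpus (like flat in inflation/inflate/deflation) become single sub-word units — exactly the mechanism that gives real tokenizers their vocabulary.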
Why Does This Matter for Economics?¶
Economics has technical vocabulary that may be rare in general text: amortization, monopsony, heteroscedasticity. A tokenizer trained on general web data may split these into many small pieces, giving the model less efficient representations of economic concepts.
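You can check this directly once the model is loaded in section 3.1 below. A small sketch (the terms are chosen purely for illustration):

```python
# Count how many sub-word pieces each technical term becomes.
# Runnable after `model` is created in section 3.1.
for word in ["tax", "tariff", "monopsony", "heteroscedasticity"]:
    ids = model.tokenize(word.encode("utf-8"))
    print(f"{word:22} → {len(ids)} token(s): {ids}")
```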
3.1 Load the Model and Access Its Tokenizer¶
We load the model with verbose=False to suppress the C++ startup messages. Once loaded, the Llama object exposes a tokenize() method that runs the same tokenizer used during training.
# Load the model
# verbose=False silences the C++ loading messages
model = Llama(
model_path=full_model_path,
n_ctx=2048,
verbose=False,
# n_threads=1, #CPU settings
# n_gpu_layers=-1 #GPU Settings
)
print(f"Model loaded: {model_name}")
print(f"Vocabulary size: {model.n_vocab()} tokens")
print(f"Context window: {model.n_ctx()} tokens")3.2 Tokenizing Economics Text¶
Let’s tokenize a few economics sentences and see what comes out. The tokenize() method returns a Python list of integers — the token IDs.
# Economics sentences to tokenize
economics_sentences = [
"Comparative advantage explains why countries specialize in producing goods they can make at a lower opportunity cost.",
"When a central bank raises interest rates, borrowing becomes more expensive and inflation tends to fall.",
"GDP measures the total monetary value of all finished goods and services produced within a country.",
"A tariff is a tax on imported goods that raises their price for domestic consumers.",
"Supply and demand determine the equilibrium price in a competitive market.",
]
# Tokenize each sentence and display the result
for sentence in economics_sentences:
# encode the text to bytes first (llama_cpp expects bytes)
token_ids = model.tokenize(sentence.encode("utf-8"))
print(f"Text : {sentence[:70]}..." if len(sentence) > 70 else f"Text : {sentence}")
print(f"Tokens: {token_ids}")
print(f"Count : {len(token_ids)} tokens for {len(sentence)} characters "
f"(ratio: {len(sentence)/len(token_ids):.1f} chars/token)")
    print()

3.3 Tokens Going In: The Integer Stream¶
When you call a model with a prompt, the model receives a list of integers — not text. Let’s look closely at one sentence and see exactly which token IDs correspond to which pieces of text.
# Pick one sentence to examine closely
example_text = "A tariff is a tax on imported goods that raises their price for domestic consumers."
token_ids = model.tokenize(example_text.encode("utf-8"))
print("=" * 60)
print("TOKENS GOING IN")
print("=" * 60)
print(f"Input text: '{example_text}'")
print(f"\nToken IDs (the integers the model actually sees):")
print(token_ids)
print(f"\nTotal: {len(token_ids)} tokens")# Now decode each token ID back to text to see what each integer represents
print(f"{'Token ID':>10} {'Text piece':30}")
print("-" * 45)
for token_id in token_ids:
# detokenize a single token
piece_bytes = model.detokenize([token_id])
piece_str = piece_bytes.decode("utf-8", errors="replace")
print(f"{token_id:>10} {repr(piece_str):30}")Notice a few things:
Spaces are usually attached to the beginning of the next word token (you’ll see ' tariff', not 'tariff').
Punctuation gets its own token.
Common short words (a, is, on) are single tokens with small IDs (frequent words get low IDs in BPE).
Words like imported may be one token because they’re common enough, while rare words would split.
3.4 Tokens Going Out: Generating New Token IDs¶
When the model generates text, it also produces token IDs first, then decodes them to text. Let’s capture the output token IDs directly.
# Run inference and capture token-level output
prompt = "Define comparative advantage in one sentence:"
print("=" * 60)
print("TOKENS GOING OUT")
print("=" * 60)
print(f"Prompt: '{prompt}'")
print()
# Use the low-level generate() method to get token IDs
prompt_tokens = model.tokenize(prompt.encode("utf-8"))
print(f"Prompt token IDs ({len(prompt_tokens)} tokens): {prompt_tokens}")
print()
# Generate up to 60 output tokens
output_tokens = []
output_text_pieces = []
for token_id in model.generate(prompt_tokens, top_k=1): # top_k=1: greedy decoding — always pick the single highest-probability token
output_tokens.append(token_id)
piece = model.detokenize([token_id]).decode("utf-8", errors="replace")
output_text_pieces.append(piece)
# Stop at EOS or after 60 tokens
if token_id == model.token_eos() or len(output_tokens) >= 60:
break
print(f"Output token IDs ({len(output_tokens)} tokens):")
print(output_tokens)
print()
print("Decoded output:")
print("".join(output_text_pieces))# Show output token-by-token in a table
print(f"{'Step':>6} {'Token ID':>10} {'Text piece'}")
print("-" * 40)
for i, (tid, piece) in enumerate(zip(output_tokens, output_text_pieces)):
print(f"{i+1:>6} {tid:>10} {repr(piece)}")Key insight: The model generates one integer at a time. Each output integer is a token ID from its vocabulary. Only at the end does the application decode those IDs back to readable text. The LLM is, at its core, a machine that predicts the next integer in a sequence.
4. What Does a Chunk Look Like?¶
Text Chunks¶
In real applications (like Retrieval-Augmented Generation, or RAG), long documents are chunked — split into overlapping windows — before being fed to a model. A chunk is just a substring of the source text, small enough to fit in the model’s context window.
Below we chunk an economics paragraph by character count (a simple strategy) and then tokenize each chunk.
# A longer economics passage to chunk
economics_passage = """
Comparative advantage is one of the most important concepts in international trade theory,
first articulated by David Ricardo in 1817. The principle states that even if one country is
more efficient at producing all goods than another, both countries can still benefit from
specialization and trade. The key insight is opportunity cost: a country should specialize
in producing goods where its relative efficiency advantage is greatest, or its relative
inefficiency is smallest. For example, if the United States can produce both wheat and
semiconductors more efficiently than Vietnam, but its advantage in semiconductors is
proportionally larger, then the US should specialize in semiconductors and import wheat
from Vietnam. Both countries end up with more of both goods than if each tried to produce
everything domestically. This principle underpins the argument for free trade and explains
much of the pattern of international specialization we observe in the global economy.
""".strip()
print(f"Passage length: {len(economics_passage)} characters")
print()
print(economics_passage)

def chunk_text(text, chunk_size=200, overlap=50):
"""
Split text into overlapping chunks of roughly chunk_size characters.
overlap: how many characters from the end of one chunk appear at the start of the next.
"""
chunks = []
start = 0
while start < len(text):
end = min(start + chunk_size, len(text))
chunks.append(text[start:end])
if end == len(text):
break
start += chunk_size - overlap
return chunks
chunks = chunk_text(economics_passage, chunk_size=200, overlap=40)
for i, chunk in enumerate(chunks):
print(f"── Chunk {i+1} ({len(chunk)} chars) ──────────────────────────")
print(chunk)
    print()

4.1 What Does a Tokenized Chunk Look Like?¶
Now let’s take the same chunks and tokenize them. This shows exactly what the model receives as input — not sentences or words, but lists of integers.
for i, chunk in enumerate(chunks):
token_ids = model.tokenize(chunk.encode("utf-8"))
print(f"── Chunk {i+1} ────────────────────────────────────────")
print(f" Text ({len(chunk):3d} chars): {chunk[:80]}..." if len(chunk) > 80 else
f" Text ({len(chunk):3d} chars): {chunk}")
print(f" Token IDs ({len(token_ids):2d} tokens): {token_ids}")
print(f" Chars-per-token ratio: {len(chunk)/len(token_ids):.2f}")
    print()

# Side-by-side comparison: raw chunk vs tokenized chunk (first chunk only)
chunk = chunks[0]
token_ids = model.tokenize(chunk.encode("utf-8"))
print("RAW TEXT CHUNK:")
print("-" * 60)
print(chunk)
print()
print("TOKENIZED CHUNK (token IDs):")
print("-" * 60)
print(token_ids)
print()
print("TOKEN → TEXT MAPPING (first 20 tokens):")
print("-" * 60)
print(f"{'ID':>8} {'Text piece'}")
for tid in token_ids[:20]:
piece = model.detokenize([tid]).decode("utf-8", errors="replace")
print(f"{tid:>8} {repr(piece)}")
if len(token_ids) > 20:
print(f" ... ({len(token_ids)-20} more tokens)")5. Inside a GGUF File¶
What is GGUF?¶
GGUF (GGML Universal File Format) is the binary file format used to store quantized models. A single .gguf file contains everything needed to run a model:
File header — magic bytes, version number
Key-value metadata — model architecture, hyperparameters, tokenizer vocabulary
Tensor data — the actual weight matrices, stored as quantized floats
Think of it as a self-describing archive: you can learn almost everything about a model just by reading its GGUF file, without loading it into a neural-network framework.
The Python gguf package (from the llama.cpp project) lets us read this file directly.
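Before reaching for the reader, you can peek at the header with nothing but the standard library. A minimal sketch, assuming the documented GGUF header layout (4 magic bytes, then little-endian uint32 version, uint64 tensor count, uint64 metadata key-value count):

```python
import struct

# Read just the fixed-size GGUF header fields.
with open(full_model_path, "rb") as f:
    magic = f.read(4)  # should be b'GGUF'
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))

print(f"Magic bytes : {magic}")
print(f"Version     : {version}")
print(f"Tensors     : {n_tensors}")
print(f"Metadata KVs: {n_kv}")
```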
5.1 Reading GGUF Metadata¶
The metadata section stores configuration values — things like the number of attention heads, the hidden dimension size, the context length, and the vocabulary.
from gguf import GGUFReader
# Open the GGUF file without loading it into GPU/CPU as a model
reader = GGUFReader(full_model_path)
print(f"GGUF file: {full_model_path}")
print(f"File size: {os.path.getsize(full_model_path) / (1024**2):.1f} MB")
print()
print(f"Number of metadata key-value pairs: {len(reader.fields)}")
print(f"Number of weight tensors: {len(reader.tensors)}") from gguf import GGUFValueType
# Print all metadata fields
print("=" * 70)
print("GGUF METADATA (key-value pairs)")
print("=" * 70)
for name, field in reader.fields.items():
    # field.parts holds the raw data; field.data indexes the usable values
    try:
        if field.types[0] == GGUFValueType.STRING:
            value = str(bytes(field.parts[-1]), encoding='utf-8')
        elif len(field.data) == 1:
            value = field.parts[field.data[0]][0]
        else:
            value = f"[array of {len(field.data)} items]"
    except Exception as e:
        value = f"<unreadable: {e}>"
    # Skip huge tokenizer arrays for now (we'll look at them separately)
    if "token" in name.lower() and isinstance(value, str) and value.startswith("["):
        value = "[vocabulary array — shown below]"
    print(f" {name:<50} = {str(value)[:60]}")

# Highlight the architecture-defining fields
architecture_keys = [
"general.architecture",
"general.name",
"general.parameter_count",
"general.quantization_version",
]
# Collect keys that mention important dimensions
dimension_keywords = ["hidden", "head", "layer", "context", "embed", "ff", "feed",
"attention", "block", "n_head", "n_layer", "n_ctx", "n_embd"]
print("=" * 70)
print("KEY ARCHITECTURAL PARAMETERS")
print("=" * 70)
for name, field in reader.fields.items():
if any(kw in name.lower() for kw in dimension_keywords) or name in architecture_keys:
try:
# field.parts[field.data[0]] gets the actual value
if field.types[0] == GGUFValueType.STRING:
value = str(bytes(field.parts[-1]), encoding='utf-8')
elif len(field.data) == 1:
value = field.parts[field.data[0]][0]
else:
value = f"[array of {len(field.data)} items]"
except Exception:
value = "<unreadable>"
print(f" {name:<50} = {value}")What those numbers mean:
| Metadata key | Meaning |
|---|---|
| *.embedding_length | Size of each token’s embedding vector (e.g. 1536 = 1536 floats per token) |
| *.block_count | Number of transformer layers stacked on top of each other |
| *.attention.head_count | Number of parallel attention heads per layer |
| *.context_length | Maximum number of tokens the model can see at once |
| *.feed_forward_length | Size of the intermediate layer inside each transformer block |
These numbers determine the capacity (and the size) of the model.
5.2 The Tokenizer Vocabulary Inside the GGUF File¶
The vocabulary — all the text pieces the tokenizer knows — is stored directly in the GGUF file. Let’s look at the first and last few entries to see what the token table looks like.
# Find the tokenizer vocabulary field
vocab_field = None
for name, field in reader.fields.items():
if "tokens" in name.lower() and "tokenizer" in name.lower():
vocab_field = (name, field)
break
if vocab_field is None:
print("Vocabulary field not found — trying alternative names...")
for name, field in reader.fields.items():
if "vocab" in name.lower() or "token" in name.lower():
print(f" Found: {name} ({len(field.data)} entries)")
else:
name, field = vocab_field
print(f"Vocabulary field: '{name}'")
print(f"Total vocabulary size: {len(field.data)} tokens")
print()
# Show first 20 tokens
print("First 20 tokens (low IDs = most common in training data):")
print(f"{'Token ID':>10} {'Token text'}")
print("-" * 35)
for i in range(min(20, len(field.data))):
token_bytes = field.data[i]
if isinstance(token_bytes, (bytes, bytearray, memoryview)):
token_text = bytes(token_bytes).decode("utf-8", errors="replace")
else:
token_text = str(token_bytes)
print(f"{i:>10} {repr(token_text)}")
print()
print("Last 10 tokens (high IDs = rare or special tokens):")
print(f"{'Token ID':>10} {'Token text'}")
print("-" * 35)
total = len(field.data)
for i in range(total - 10, total):
token_bytes = field.data[i]
if isinstance(token_bytes, (bytes, bytearray, memoryview)):
token_text = bytes(token_bytes).decode("utf-8", errors="replace")
else:
token_text = str(token_bytes)
print(f"{i:>10} {repr(token_text)}")5.3 Listing All Weight Tensors¶
Now let’s look at all the weight tensors stored in the GGUF file. Each tensor has:
A name (tells you which layer and what type of weight)
A shape (dimensions of the matrix)
A type (quantization format: Q4_0, Q8_0, F16, etc.)
A size (bytes on disk — much smaller than the raw float32 equivalent)
print("=" * 80)
print("WEIGHT TENSORS IN THE GGUF FILE")
print("=" * 80)
print(f"{'Tensor name':<50} {'Shape':<25} {'Type':<10}")
print("-" * 85)
for tensor in reader.tensors:
shape_str = str(list(tensor.shape))
print(f"{tensor.name:<50} {shape_str:<25} {str(tensor.tensor_type.name):<10}")
print()
print(f"Total: {len(reader.tensors)} tensors")Reading the tensor names — the naming convention follows a pattern:
| Name pattern | Meaning |
|---|---|
| token_embd.weight | The embedding matrix — one row per vocabulary token |
| blk.N.attn_q.weight | Query weight matrix for attention in layer N |
| blk.N.attn_k.weight | Key weight matrix for attention in layer N |
| blk.N.attn_v.weight | Value weight matrix for attention in layer N |
| blk.N.attn_output.weight | Attention output projection in layer N |
| blk.N.ffn_up.weight | Feed-forward network up-projection in layer N |
| blk.N.ffn_down.weight | Feed-forward network down-projection in layer N |
| blk.N.ffn_gate.weight | Feed-forward gate (SwiGLU activation) in layer N |
| blk.N.attn_norm.weight | Layer normalization scale in attention block N |
| output.weight | Final output (lm head) matrix that maps to vocabulary logits |
The model processes tokens by running them through all N blocks in sequence.
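As a quick cross-check, summing the element counts of every tensor should land near the parameter count advertised in the metadata (roughly 1.5B for Qwen2-1.5B). A small sketch using the reader from section 5.1:

```python
# Total number of weight elements across all tensors in the file.
total_params = sum(int(np.prod(list(t.shape))) for t in reader.tensors)
print(f"Total weight elements: {total_params:,}")
```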
# Count tensor types to understand the model structure
from collections import Counter
# Categorize tensors by their role
categories = {
"embedding": [],
"attention_q": [],
"attention_k": [],
"attention_v": [],
"attention_output": [],
"feed_forward": [],
"normalization": [],
"output": [],
"other": []
}
for tensor in reader.tensors:
n = tensor.name
if "token_embd" in n:
categories["embedding"].append(n)
elif "attn_q" in n:
categories["attention_q"].append(n)
elif "attn_k" in n:
categories["attention_k"].append(n)
elif "attn_v" in n:
categories["attention_v"].append(n)
elif "attn_output" in n or "attn_out" in n:
categories["attention_output"].append(n)
elif "ffn" in n:
categories["feed_forward"].append(n)
elif "norm" in n:
categories["normalization"].append(n)
elif n in ("output.weight", "output_norm.weight"):
categories["output"].append(n)
else:
categories["other"].append(n)
print("Tensor breakdown by role:")
print(f"{'Role':<25} {'Count':>6}")
print("-" * 33)
for role, tensors in categories.items():
if tensors:
print(f"{role:<25} {len(tensors):>6}")
print("-" * 33)
print(f"{'TOTAL':<25} {len(reader.tensors):>6}")6. Model Weights as Numbers¶
Now let’s actually look at the numbers inside the weight tensors.
6.1 The Token Embedding Matrix¶
The embedding matrix is the most conceptually important tensor:
It has one row per token in the vocabulary
Each row is a vector of floats (the “embedding”) for that token
This is how abstract token IDs get turned into rich numeric representations
Shape: [vocab_size × embedding_dim]
Example: [151936 × 1536] for Qwen2-1.5B
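Conceptually, the lookup is nothing more than row indexing. A toy sketch with a made-up 10-token vocabulary (not the real matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4)).astype(np.float32)  # toy table: 10 tokens, 4 dims

token_ids = [3, 7, 3]   # a made-up "sentence" of token IDs
vectors = E[token_ids]  # shape (3, 4): one embedding row per input token
print(vectors)
```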
# Find the token embedding tensor
embd_tensor = None
for tensor in reader.tensors:
if tensor.name == "token_embd.weight":
embd_tensor = tensor
break
if embd_tensor is not None:
print(f"Tensor name: {embd_tensor.name}")
print(f"Shape: {list(embd_tensor.shape)}")
print(f" → {embd_tensor.shape[0]} tokens in vocabulary")
print(f" → {embd_tensor.shape[1]} floats per token embedding")
print(f"Storage type: {embd_tensor.tensor_type.name} (quantized to save space)")
vocab_size = embd_tensor.shape[0]
embd_dim = embd_tensor.shape[1]
full_fp32_mb = vocab_size * embd_dim * 4 / (1024**2)
print(f"\nIf stored as full float32: {full_fp32_mb:.1f} MB")
print(f"(Quantization compresses this significantly)")
else:
print("Embedding tensor not found — check tensor names above.")# Read the raw quantized data from the embedding tensor
# GGUFReader stores the raw bytes; we dequantize to float32 for display
if embd_tensor is not None:
# Convert to float32 numpy array
raw_data = embd_tensor.data # numpy array (possibly quantized uint8)
print(f"Raw data shape (quantized storage): {raw_data.shape}")
print(f"Raw data dtype: {raw_data.dtype}")
print()
print("First 20 raw quantized bytes:")
print(raw_data.flat[:20])
print()
print("(These bytes encode groups of 32 float values in a compact quantized format.)")6.2 Dequantizing to See Actual Float Values¶
GGUF stores weights in a quantized format to save space. Q4_0 quantization packs 32 float values into 18 bytes (instead of 128 bytes for float32) by storing them as 4-bit integers with a shared scaling factor.
To see the actual float values, we need to dequantize. The llama_cpp library does this automatically when running inference, but we can also do it manually using the gguf package utilities.
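The arithmetic behind that claim, as a quick sanity check:

```python
# One Q4_0 block covers 32 weights.
bytes_q4_0 = 2 + 16   # fp16 scale + 16 bytes of packed 4-bit values
bytes_fp32 = 32 * 4   # the same 32 weights stored as float32
print(f"Q4_0: {bytes_q4_0} bytes vs float32: {bytes_fp32} bytes "
      f"({bytes_q4_0 / bytes_fp32:.1%} of the size)")
print(f"Effective bits per weight: {bytes_q4_0 * 8 / 32:.1f}")
```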
# Attempt to dequantize using numpy (manual Q4_0 dequantization)
# Q4_0 format: each block of 32 weights = 1 fp16 scale + 16 bytes of 4-bit ints
# Block size in bytes: 2 (scale) + 16 (4-bit data) = 18 bytes per 32 weights
if embd_tensor is not None and str(embd_tensor.tensor_type.name) == "Q4_0":
raw = embd_tensor.data # shape: (n_bytes,) uint8
# Each Q4_0 block = 18 bytes = 2 bytes (fp16 scale) + 16 bytes (4-bit * 32 weights)
block_size_bytes = 18
weights_per_block = 32
n_blocks = len(raw) // block_size_bytes
print(f"Dequantizing Q4_0 embedding matrix...")
print(f" Total blocks: {n_blocks}")
print(f" Total weights: {n_blocks * weights_per_block:,}")
print()
# Dequantize first few blocks so we can display the actual float values
blocks_to_show = min(4, n_blocks) # just first 4 blocks = 128 weights
decoded_weights = []
for b in range(blocks_to_show):
block = raw[b * block_size_bytes : (b + 1) * block_size_bytes]
# Extract fp16 scale (first 2 bytes)
scale = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
# Extract 4-bit quantized values (remaining 16 bytes = 32 values)
quant_bytes = block[2:]
lo = (quant_bytes & 0x0F).astype(np.int8)
hi = (quant_bytes >> 4).astype(np.int8)
        # Values are stored as signed 4-bit (range -8..7). In llama.cpp's Q4_0
        # layout the low nibbles hold weights 0-15 of the block and the high
        # nibbles hold weights 16-31 (not interleaved).
        nibbles = np.empty(32, dtype=np.int8)
        nibbles[:16] = lo - 8
        nibbles[16:] = hi - 8
dequant = nibbles.astype(np.float32) * scale
decoded_weights.append((scale, dequant))
# Display block 0 (first 32 weights of the first token's embedding)
scale0, weights0 = decoded_weights[0]
print(f"First Q4_0 block of the embedding matrix:")
print(f" Scale factor: {scale0:.6f}")
print(f" 32 dequantized float values:")
print(np.round(weights0, 5))
elif embd_tensor is not None:
ttype = str(embd_tensor.tensor_type.name)
print(f"Tensor is stored as {ttype} — showing raw data sample:")
    print(embd_tensor.data[:64])

6.3 Looking at Other Weight Tensors¶
Let’s look at a normalization layer weight — these are stored as full floats (not quantized) and are the easiest to inspect directly.
# Find a normalization tensor (usually stored as F32 or F16 — easy to read)
norm_tensor = None
for tensor in reader.tensors:
if "norm" in tensor.name and "blk.0" in tensor.name:
norm_tensor = tensor
break
if norm_tensor is not None:
print(f"Tensor: {norm_tensor.name}")
print(f"Shape: {list(norm_tensor.shape)}")
print(f"Type: {norm_tensor.tensor_type.name}")
print()
norm_data = norm_tensor.data
print(f"Data dtype: {norm_data.dtype}")
# Convert to float32 for display
floats = norm_data.astype(np.float32)
print(f"First 20 weight values:")
print(np.round(floats[:20], 5))
print()
print(f"Statistics: min={floats.min():.5f}, max={floats.max():.5f}, "
f"mean={floats.mean():.5f}, std={floats.std():.5f}")
print()
print("Note: normalization weights are usually close to 1.0 (they start at 1.0 during training)")
else:
# Fallback: show any tensor with F32 data
for tensor in reader.tensors:
if "norm" in tensor.name:
print(f"Found norm tensor: {tensor.name}")
print(f" Shape: {list(tensor.shape)}, Type: {tensor.tensor_type.name}")
data = tensor.data.astype(np.float32)
print(f" First 10 values: {np.round(data[:10], 5)}")
            break

7. How Concepts Map to Numbers: Token Embeddings¶
The Embedding Space¶
Each token maps to a dense vector — a list of ~1000–4000 floating-point numbers. These vectors aren’t arbitrary: after training, tokens with similar meanings end up with similar vectors.
Famous analogy (from Word2Vec, 2013):
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
For economics, we’d hope something like:
vector("inflation") - vector("price") + vector("quantity") ≈ vector("output")
We can extract embedding vectors for specific tokens using llama_cpp’s embedding mode and visualize them.
Measuring Token Similarity with Cosine Distance¶
The standard way to compare embedding vectors is cosine similarity: values range from -1 (opposite) to +1 (identical direction in embedding space).
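A toy example with 3-D vectors fixes the intuition before we use real embeddings below:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a → similarity +1.0
c = np.array([-1.0, -2.0, 0.0])  # opposite direction → similarity -1.0

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(f"cos(a, b) = {cos(a, b):.2f}")
print(f"cos(a, c) = {cos(a, c):.2f}")
```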
# Load the model in embedding mode to extract token vectors
# Note: we need embedding=True to get per-token vectors
embed_model = Llama(
model_path=full_model_path,
embedding=True,
n_ctx=128,
verbose=False
)
print("Embedding model ready.")# Economics words to compare
economics_words = [
"inflation",
"deflation",
"tariff",
"trade",
"GDP",
"recession",
"growth",
"supply",
"demand",
"price",
]
def collapse_embedding(vec: np.ndarray) -> np.ndarray:
"""Return a single vector even if llama-cpp gives per-token rows."""
if vec.ndim == 1:
return vec
if vec.ndim == 2:
return vec.mean(axis=0)
raise ValueError(f"Unexpected embedding shape: {vec.shape}")
# Get embeddings for each word
word_embeddings = {}
for word in economics_words:
emb = embed_model.embed(word)
emb_array = np.array(emb, dtype=np.float32)
word_embeddings[word] = collapse_embedding(emb_array)
# Print the shape of one embedding
sample_word = economics_words[0]
print(f"Embedding for '{sample_word}':")
print(f" Shape: {word_embeddings[sample_word].shape}")
print(f" First 10 values: {np.round(word_embeddings[sample_word][:10], 4)}")
print(f" Min: {word_embeddings[sample_word].min():.4f}")
print(f" Max: {word_embeddings[sample_word].max():.4f}")
print(f" Norm (length): {np.linalg.norm(word_embeddings[sample_word]):.4f}")def cosine_similarity(v1, v2):
"""Cosine similarity between two vectors. Range: -1 (opposite) to +1 (same direction)."""
n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
if n1 == 0 or n2 == 0:
return 0.0
return float(np.dot(v1, v2) / (n1 * n2))
# Build a similarity matrix
words = economics_words
n = len(words)
sim_matrix = np.zeros((n, n))
for i, w1 in enumerate(words):
for j, w2 in enumerate(words):
sim_matrix[i, j] = cosine_similarity(word_embeddings[w1], word_embeddings[w2])
# Print the similarity matrix
print("Cosine Similarity Matrix for Economics Words")
print("(1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite)")
print()
print(f"{'':15}", end="")
for w in words:
print(f"{w:>10}", end="")
print()
print("-" * (15 + 10 * n))
for i, w1 in enumerate(words):
print(f"{w1:<15}", end="")
for j in range(n):
v = sim_matrix[i, j]
print(f"{v:>10.3f}", end="")
    print()

# Find the most similar pairs (excluding self-similarity)
print("Most Similar Word Pairs:")
print("-" * 40)
pairs = []
for i in range(n):
for j in range(i + 1, n):
pairs.append((sim_matrix[i, j], words[i], words[j]))
pairs.sort(reverse=True)
for sim, w1, w2 in pairs[:8]:
print(f" {w1:12} ↔ {w2:12} similarity = {sim:.4f}")
print()
print("Least Similar Word Pairs:")
print("-" * 40)
for sim, w1, w2 in pairs[-5:]:
print(f" {w1:12} ↔ {w2:12} similarity = {sim:.4f}")7.1 Embedding Arithmetic: Concepts as Directions¶
One of the most remarkable properties of trained embedding spaces is that relationships between concepts are encoded as directions in vector space. Let’s test some economic analogies.
# Economics analogy: which word completes the relationship?
# Example: inflation is to price as recession is to ???
def find_closest(query_vec, candidates, word_embeddings, exclude=None):
"""Find the word in candidates whose embedding is closest to query_vec."""
exclude = exclude or []
best_sim, best_word = -np.inf, None
for word in candidates:
if word in exclude:
continue
sim = cosine_similarity(query_vec, word_embeddings[word])
if sim > best_sim:
best_sim, best_word = sim, word
return best_word, best_sim
# Extended word list for analogy search
analogy_words = [
"inflation", "deflation", "tariff", "trade", "GDP", "recession",
"growth", "supply", "demand", "price", "interest", "bank",
"export", "import", "tax", "subsidy", "unemployment", "wages"
]
# Get embeddings for the extended list
for word in analogy_words:
if word not in word_embeddings:
emb = embed_model.embed(word)
        # use the same collapse helper as above so multi-token words average cleanly
        word_embeddings[word] = collapse_embedding(np.array(emb, dtype=np.float32))
# Test: "export" - "trade" + "tax" ≈ ??? (export is to trade as tax is to ?)
analogies = [
("export", "trade", "tariff", "import"), # A:B as C:D
("inflation", "price", "recession", "growth"),
("supply", "producer", "demand", "consumer"),
]
print("Embedding Arithmetic: A is to B as C is to ???")
print("Formula: vec(A) - vec(B) + vec(C) → find closest word")
print("-" * 60)
for A, B, C, expected in analogies:
if all(w in word_embeddings for w in [A, B, C]):
query = word_embeddings[A] - word_embeddings[B] + word_embeddings[C]
result, sim = find_closest(query, analogy_words, word_embeddings, exclude=[A, B, C])
print(f" {A:12} - {B:12} + {C:12} ≈ {result:12} (similarity: {sim:.3f})")
print(f" Expected: {expected}")
print()
else:
missing = [w for w in [A, B, C] if w not in word_embeddings]
print(f" Skipped (missing embeddings for: {missing})")Note: Embedding arithmetic works best in larger models with richer training data. Small 1B-parameter models may not perfectly recover every analogy, but the nearest neighbors should still be semantically reasonable. The point is that relationships between words are encoded as geometric relationships between vectors.
8. Putting It All Together: The Full Pipeline¶
Let’s trace one economics sentence all the way through the pipeline to see every numeric transformation.
input_text = "Lower interest rates stimulate investment and economic growth."
print("=" * 70)
print("STEP 1: RAW TEXT")
print("=" * 70)
print(f" '{input_text}'")
print(f" Length: {len(input_text)} characters")
print()
print("=" * 70)
print("STEP 2: TOKENIZATION (text → integers)")
print("=" * 70)
token_ids = model.tokenize(input_text.encode("utf-8"))
print(f" Token IDs: {token_ids}")
print(f" Count: {len(token_ids)} tokens")
print()
print("=" * 70)
print("STEP 3: TOKEN → TEXT MAPPING")
print("=" * 70)
pieces = [model.detokenize([tid]).decode("utf-8", errors="replace") for tid in token_ids]
print(f" {'ID':>8} {'Text piece'}")
for tid, piece in zip(token_ids, pieces):
print(f" {tid:>8} {repr(piece)}")
print()
print("=" * 70)
print("STEP 4: EMBEDDING LOOKUP (each integer → vector of floats)")
print("=" * 70)
print(" (Getting embeddings for each token piece...)")
for tid, piece in zip(token_ids[:5], pieces[:5]): # show first 5 to keep output manageable
emb = embed_model.embed(piece.strip() or piece)
emb_arr = np.array(emb, dtype=np.float32)
print(f" Token {tid:>6} {repr(piece):>20} → vector shape {emb_arr.shape}")
print(f" First 8 values: {np.round(emb_arr[:8], 4)}")
print(f" Norm: {np.linalg.norm(emb_arr):.4f}")
if len(token_ids) > 5:
print(f" ... ({len(token_ids) - 5} more tokens)")
print()
print("=" * 70)
print("STEP 5: TRANSFORMER LAYERS")
print("=" * 70)
n_layers = sum(1 for t in reader.tensors if "blk." in t.name and "attn_q.weight" in t.name)
print(f" The model runs {n_layers} transformer blocks in sequence.")
print(f" Each block applies: attention (Q/K/V) + feed-forward network + layer norm")
print(f" This is {n_layers * 4}+ matrix multiplications per token per forward pass.")
print()
print("=" * 70)
print("STEP 6: GENERATE NEXT TOKEN")
print("=" * 70)
prompt_tokens = model.tokenize(input_text.encode("utf-8"))
next_tok = next(model.generate(prompt_tokens, top_k=1))
next_text = model.detokenize([next_tok]).decode("utf-8", errors="replace")
print(f" Most probable next token ID: {next_tok}")
print(f" Decoded: {repr(next_text)}")
print(f" Continued sentence: '{input_text}{next_text}...'") 9. Summary: Numbers All the Way Down¶
In this notebook, you explored how a language model is entirely a numeric computation:
| Stage | What It Looks Like | The Numbers |
|---|---|---|
| Text input | "A tariff is a tax on..." | A Python string |
| After tokenization | Tokens going in | [362, 287, 31954, ...] (list of ints) |
| Embedding lookup | Each token → vector | [-0.0123, 0.0412, ...] (1000s of floats) |
| Transformer forward pass | Weight matrices × embedding vectors | Billions of float multiplications |
| Output logits | Score for every vocab token | One float per vocab entry (~150k floats) |
| Sampling / decoding | Pick next token ID | One int → decoded back to text |
Key Takeaways¶
A vocabulary is just a lookup table: token IDs are indices into a big table mapping integers to text pieces.
Embeddings are coordinates in meaning-space: semantically similar words end up with geometrically close vectors after training.
GGUF is a self-describing file: metadata + weight tensors in one binary file. You can inspect a model’s architecture, vocabulary, and weights without ever running it.
Quantization trades precision for space: Q4_0 stores 32 weights in 18 bytes instead of 128 bytes, with only a small quality penalty.
Generation is sequential: the model produces one token ID at a time, each time doing a full forward pass through all transformer layers.
Next Steps¶
Inside_Small_Model.ipynb — visualize token probabilities and decoding strategies
LlamaCpp_SmallLM_Demo.ipynb — build a multi-turn economic policy chatbot
The Illustrated Transformer — visual walkthrough of attention and transformer math
GGUF spec — full binary format documentation
# Clean up: close model handles to free memory
del model
del embed_model
del reader
print("Models and file handles released.")