
Model Weights and Tokens: Numbers All the Way Down

Using llama-cpp-python and the GGUF Format

This notebook pulls back the curtain on how a language model actually works with numbers. We’ll explore:

  • Tokens: How text is broken into numeric pieces before the model ever sees it

  • Tokens going in and out: The integer IDs that flow through a model

  • Text chunks: What a chunk of text looks like vs. its tokenized form

  • GGUF files: The binary container that stores a model on disk

  • Model weights: The actual floating-point numbers that give a model its “knowledge”

  • How concepts map to numbers: Token embedding vectors as coordinates in meaning-space

Throughout, we use economics examples — tariffs, inflation, GDP, supply and demand — as the text to analyze, since these are rich, domain-specific phrases that illustrate how tokenizers handle specialized vocabulary.

What You’ll Need

  • A .gguf model file already downloaded (see GPT4All_Download_gguf.ipynb)

  • Python packages: llama-cpp-python, gguf, numpy

Attribution

Notebook developed by Eric Van Dusen and contributors for the ds-modules/SmallLM-FA25 project.

1. Environment Setup

We need three packages beyond the standard library:

Package             Purpose
llama-cpp-python    Load GGUF models, run inference, access the built-in tokenizer
gguf                Read GGUF file metadata and weight tensors directly (no model load needed)
numpy               Display weight arrays as readable tables

The gguf package is the official Python reader published by the llama.cpp project.

# Install llama-cpp-python if not already present
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama

# Install the gguf reader package
try:
    from gguf import GGUFReader
except ImportError:
    %pip install gguf
    from gguf import GGUFReader

import numpy as np
print("All packages loaded successfully!")

1.1 Locate the Model File

Set model_path and model_name to match your environment. Common locations:

  • Shared JupyterHub: /home/jovyan/shared/

  • Local machine: your own path (e.g. ~/models/)

# ── Set these two variables to match your setup ─────────────────────────────
model_path = "/home/jovyan/shared/"           # directory containing .gguf files
model_name = "qwen2-1_5b-instruct-q4_0.gguf" # filename of the model
# ─────────────────────────────────────────────────────────────────────────────

# For local use, uncomment and adjust:
# model_path = "shared-rw/"
#model_path = "/Users/ericvandusen/SmallLM/Models/"

import os
full_model_path = os.path.join(model_path, model_name)
print(f"Looking for: {full_model_path}")
print(f"File exists: {os.path.exists(full_model_path)}")
# See what .gguf files are available
gguf_files = [f for f in os.listdir(model_path) if f.endswith(".gguf")]
print(f"Available .gguf models in {model_path}:")
for f in gguf_files:
    size_mb = os.path.getsize(os.path.join(model_path, f)) / (1024**2)
    print(f"  {f:60s}  {size_mb:7.1f} MB")

2. Motivation: Everything Is a Number

Before we load the model, let’s build intuition. Consider this economics sentence:

“When a country imposes a tariff on imported steel, domestic producers gain while consumers pay higher prices.”

A language model never sees letters. It sees only:

  1. Token IDs — integers that index a vocabulary table

  2. Embedding vectors — lists of floating-point numbers that represent each token

  3. Weight matrices — huge arrays of floats that transform those vectors step by step

The diagram below summarizes the pipeline:

Raw text  →  Tokenizer  →  Token IDs  →  Embedding lookup  →  Transformer layers  →  Logits  →  Next token
 (str)                    (list[int])     (matrix of floats)     (weight math)        (floats)    (int → str)

Every step in this pipeline is pure arithmetic on numbers. This section walks through each step using economic text.
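To make "pure arithmetic" concrete, here is a toy numeric sketch of the pipeline, with a single random matrix standing in for all the transformer layers (the sizes and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embd_dim = 10, 4   # toy sizes (a real model: ~150,000 x 1536)

# Embedding table: one row of floats per token ID
embedding = rng.normal(size=(vocab_size, embd_dim)).astype(np.float32)
# Stand-in for the entire stack of transformer layers
W = rng.normal(size=(embd_dim, vocab_size)).astype(np.float32)

token_ids = [3, 7, 2]           # what the "tokenizer" produced
x = embedding[token_ids]        # embedding lookup: shape (3, 4)
logits = x[-1] @ W              # transform the last position into vocab scores
next_id = int(np.argmax(logits))  # greedy pick of the next token ID
print("next token id:", next_id)
```

A real model replaces the single matrix multiply with dozens of attention and feed-forward layers, but the shape of the computation (integers in, floats in the middle, one integer out) is exactly this.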

3. Tokens: Text Broken Into Pieces

What is a Token?

A token is the atomic unit of text that the model processes. Tokens are not the same as words:

  • A common short word (e.g. the) is usually one token.

  • A longer or rarer word may be split into sub-word pieces (e.g. inflation → [inflation], but hyperinflation → [hyper, inflation]).

  • Spaces, punctuation, and capitalization all affect tokenization.

  • Numbers can be tokenized digit-by-digit or as whole numbers depending on the tokenizer.

Modern LLMs typically use Byte-Pair Encoding (BPE) or SentencePiece tokenizers. Both learn a vocabulary of common sub-word units from a large training corpus.
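To see the merging idea behind BPE, here is a from-scratch toy learner on a three-word corpus. This is a sketch of the algorithm, not the model's actual tokenizer:

```python
from collections import Counter

def bpe_merges(corpus_words, n_merges):
    """Learn n_merges BPE merge rules from a tiny corpus (toy illustration)."""
    words = [list(w) for w in corpus_words]   # start from single characters
    merges = []
    for _ in range(n_merges):
        # Count every adjacent symbol pair across all words
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the new merge rule everywhere
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges, words

merges, tokenized = bpe_merges(["inflation", "hyperinflation", "inflate"], 8)
print("Learned merges:  ", merges)
print("Tokenized words: ", tokenized)
```

Frequent character sequences like "in" get merged first; after enough merges a common word becomes a single token while rarer relatives stay split into pieces.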

Why Does This Matter for Economics?

Economics has technical vocabulary that may be rare in general text: amortization, monopsony, heteroscedasticity. A tokenizer trained on general web data may split these into many small pieces, giving the model less efficient representations of economic concepts.

3.1 Load the Model and Access Its Tokenizer

We load the model with verbose=False to suppress the C++ startup messages. Once loaded, the Llama object exposes a tokenize() method that runs the same tokenizer used during training.

# Load the model
# verbose=False silences the C++ loading messages
model = Llama(
    model_path=full_model_path,
    n_ctx=2048,
    verbose=False,
    # n_threads=1,      # CPU settings
    # n_gpu_layers=-1,  # GPU settings
)
print(f"Model loaded: {model_name}")
print(f"Vocabulary size: {model.n_vocab()} tokens")
print(f"Context window:  {model.n_ctx()} tokens")

3.2 Tokenizing Economics Text

Let’s tokenize a few economics sentences and see what comes out. The tokenize() method returns a Python list of integers — the token IDs.

# Economics sentences to tokenize
economics_sentences = [
    "Comparative advantage explains why countries specialize in producing goods they can make at a lower opportunity cost.",
    "When a central bank raises interest rates, borrowing becomes more expensive and inflation tends to fall.",
    "GDP measures the total monetary value of all finished goods and services produced within a country.",
    "A tariff is a tax on imported goods that raises their price for domestic consumers.",
    "Supply and demand determine the equilibrium price in a competitive market.",
]

# Tokenize each sentence and display the result
for sentence in economics_sentences:
    # encode the text to bytes first (llama_cpp expects bytes)
    token_ids = model.tokenize(sentence.encode("utf-8"))
    print(f"Text  : {sentence[:70]}..." if len(sentence) > 70 else f"Text  : {sentence}")
    print(f"Tokens: {token_ids}")
    print(f"Count : {len(token_ids)} tokens for {len(sentence)} characters  "
          f"(ratio: {len(sentence)/len(token_ids):.1f} chars/token)")
    print()

3.3 Tokens Going In: The Integer Stream

When you call a model with a prompt, the model receives a list of integers — not text. Let’s look closely at one sentence and see exactly which token IDs correspond to which pieces of text.

# Pick one sentence to examine closely
example_text = "A tariff is a tax on imported goods that raises their price for domestic consumers."

token_ids = model.tokenize(example_text.encode("utf-8"))

print("=" * 60)
print("TOKENS GOING IN")
print("=" * 60)
print(f"Input text: '{example_text}'")
print(f"\nToken IDs (the integers the model actually sees):")
print(token_ids)
print(f"\nTotal: {len(token_ids)} tokens")
# Now decode each token ID back to text to see what each integer represents
print(f"{'Token ID':>10}  {'Text piece':30}")
print("-" * 45)
for token_id in token_ids:
    # detokenize a single token
    piece_bytes = model.detokenize([token_id])
    piece_str = piece_bytes.decode("utf-8", errors="replace")
    print(f"{token_id:>10}  {repr(piece_str):30}")

Notice a few things:

  • Spaces are usually attached to the beginning of the next word token (you’ll see ' tariff' not 'tariff').

  • Punctuation gets its own token.

  • Common short words (a, is, on) are single tokens with small IDs (frequent words get low IDs in BPE).

  • Words like imported may be one token because they’re common enough, while rare words would split.

3.4 Tokens Going Out: Generating New Token IDs

When the model generates text, it also produces token IDs first, then decodes them to text. Let’s capture the output token IDs directly.

# Run inference and capture token-level output
prompt = "Define comparative advantage in one sentence:"

print("=" * 60)
print("TOKENS GOING OUT")
print("=" * 60)
print(f"Prompt: '{prompt}'")
print()

# Use the low-level generate() method to get token IDs
prompt_tokens = model.tokenize(prompt.encode("utf-8"))
print(f"Prompt token IDs ({len(prompt_tokens)} tokens): {prompt_tokens}")
print()

# Generate up to 60 output tokens
output_tokens = []
output_text_pieces = []

for token_id in model.generate(prompt_tokens, top_k=1):  # top_k=1: greedy decoding — always pick the single highest-probability token
    output_tokens.append(token_id)
    piece = model.detokenize([token_id]).decode("utf-8", errors="replace")
    output_text_pieces.append(piece)
    
    # Stop at EOS or after 60 tokens
    if token_id == model.token_eos() or len(output_tokens) >= 60:
        break

print(f"Output token IDs ({len(output_tokens)} tokens):")
print(output_tokens)
print()
print("Decoded output:")
print("".join(output_text_pieces))
# Show output token-by-token in a table
print(f"{'Step':>6}  {'Token ID':>10}  {'Text piece'}")
print("-" * 40)
for i, (tid, piece) in enumerate(zip(output_tokens, output_text_pieces)):
    print(f"{i+1:>6}  {tid:>10}  {repr(piece)}")

Key insight: The model generates one integer at a time. Each output integer is a token ID from its vocabulary. Only at the end does the application decode those IDs back to readable text. The LLM is, at its core, a machine that predicts the next integer in a sequence.

4. What Does a Chunk Look Like?

Text Chunks

In real applications (like Retrieval-Augmented Generation, or RAG), long documents are chunked — split into overlapping windows — before being fed to a model. A chunk is just a substring of the source text, small enough to fit in the model’s context window.

Below we chunk an economics paragraph by character count (a simple strategy) and then tokenize each chunk.

# A longer economics passage to chunk
economics_passage = """
Comparative advantage is one of the most important concepts in international trade theory, 
first articulated by David Ricardo in 1817. The principle states that even if one country is 
more efficient at producing all goods than another, both countries can still benefit from 
specialization and trade. The key insight is opportunity cost: a country should specialize 
in producing goods where its relative efficiency advantage is greatest, or its relative 
inefficiency is smallest. For example, if the United States can produce both wheat and 
semiconductors more efficiently than Vietnam, but its advantage in semiconductors is 
proportionally larger, then the US should specialize in semiconductors and import wheat 
from Vietnam. Both countries end up with more of both goods than if each tried to produce 
everything domestically. This principle underpins the argument for free trade and explains 
much of the pattern of international specialization we observe in the global economy.
""".strip()

print(f"Passage length: {len(economics_passage)} characters")
print()
print(economics_passage)
def chunk_text(text, chunk_size=200, overlap=50):
    """
    Split text into overlapping chunks of roughly chunk_size characters.
    overlap: how many characters from the end of one chunk appear at the start of the next.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start += chunk_size - overlap
    return chunks

chunks = chunk_text(economics_passage, chunk_size=200, overlap=40)

for i, chunk in enumerate(chunks):
    print(f"── Chunk {i+1} ({len(chunk)} chars) ──────────────────────────")
    print(chunk)
    print()

4.1 What Does a Tokenized Chunk Look Like?

Now let’s take the same chunks and tokenize them. This shows exactly what the model receives as input — not sentences or words, but lists of integers.

for i, chunk in enumerate(chunks):
    token_ids = model.tokenize(chunk.encode("utf-8"))
    print(f"── Chunk {i+1} ────────────────────────────────────────")
    print(f"  Text  ({len(chunk):3d} chars):  {chunk[:80]}..." if len(chunk) > 80 else 
          f"  Text  ({len(chunk):3d} chars):  {chunk}")
    print(f"  Token IDs ({len(token_ids):2d} tokens): {token_ids}")
    print(f"  Chars-per-token ratio: {len(chunk)/len(token_ids):.2f}")
    print()
# Side-by-side comparison: raw chunk vs tokenized chunk (first chunk only)
chunk = chunks[0]
token_ids = model.tokenize(chunk.encode("utf-8"))

print("RAW TEXT CHUNK:")
print("-" * 60)
print(chunk)
print()

print("TOKENIZED CHUNK (token IDs):")
print("-" * 60)
print(token_ids)
print()

print("TOKEN → TEXT MAPPING (first 20 tokens):")
print("-" * 60)
print(f"{'ID':>8}  {'Text piece'}")
for tid in token_ids[:20]:
    piece = model.detokenize([tid]).decode("utf-8", errors="replace")
    print(f"{tid:>8}  {repr(piece)}")
if len(token_ids) > 20:
    print(f"    ... ({len(token_ids)-20} more tokens)")

5. Inside a GGUF File

What is GGUF?

GGUF (GGML Universal File Format) is the binary file format used to store quantized models. A single .gguf file contains everything needed to run a model:

  1. File header — magic bytes, version number

  2. Key-value metadata — model architecture, hyperparameters, tokenizer vocabulary

  3. Tensor data — the actual weight matrices, stored as quantized floats

Think of it as a self-describing archive: you can learn almost everything about a model just by reading its GGUF file, without loading it into a neural-network framework.

The Python gguf package (from the llama.cpp project) lets us read this file directly.
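Before using the reader, we can sketch what the first bytes of a GGUF file encode. The snippet below builds a minimal fake GGUF v3 header in memory (the tensor and key-value counts here are made-up example values) and parses it back with struct:

```python
import io
import struct

# GGUF header layout: magic "GGUF", uint32 version,
# uint64 tensor count, uint64 metadata key-value count (all little-endian).
# The counts below are invented for illustration.
fake_header = (
    b"GGUF"
    + struct.pack("<I", 3)     # version
    + struct.pack("<Q", 339)   # number of weight tensors (example value)
    + struct.pack("<Q", 26)    # number of metadata key-value pairs (example value)
)

buf = io.BytesIO(fake_header)
magic = buf.read(4)
version, = struct.unpack("<I", buf.read(4))
tensor_count, = struct.unpack("<Q", buf.read(8))
kv_count, = struct.unpack("<Q", buf.read(8))

print(f"magic={magic!r}  version={version}  tensors={tensor_count}  kv pairs={kv_count}")
```

You could run the same four reads against the start of your real .gguf file; GGUFReader does this (and the rest of the parsing) for us.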

5.1 Reading GGUF Metadata

The metadata section stores configuration values — things like the number of attention heads, the hidden dimension size, the context length, and the vocabulary.

from gguf import GGUFReader

# Open the GGUF file without loading it into GPU/CPU as a model
reader = GGUFReader(full_model_path)

print(f"GGUF file: {full_model_path}")
print(f"File size: {os.path.getsize(full_model_path) / (1024**2):.1f} MB")
print()
print(f"Number of metadata key-value pairs: {len(reader.fields)}")
print(f"Number of weight tensors:           {len(reader.tensors)}")
from gguf import GGUFValueType

# Print all metadata fields
print("=" * 70)
print("GGUF METADATA (key-value pairs)")
print("=" * 70)

for name, field in reader.fields.items():
    # field.parts holds the raw data; field.data indexes the usable values
    try:
        if field.types[0] == GGUFValueType.STRING:
            value = str(bytes(field.parts[-1]), encoding="utf-8")
        elif len(field.data) == 1:
            # field.parts[field.data[0]] gets the actual value
            value = field.parts[field.data[0]][0]
        else:
            value = f"[array of {len(field.data)} items]"
    except Exception as e:
        value = f"<unreadable: {e}>"

    # Skip huge tokenizer arrays for now (we'll look at them separately)
    if "token" in name.lower() and isinstance(value, str) and value.startswith("["):
        value = "[vocabulary array — shown below]"

    print(f"  {name:<50} = {str(value)[:60]}")
# Highlight the architecture-defining fields
architecture_keys = [
    "general.architecture",
    "general.name",
    "general.parameter_count",
    "general.quantization_version",
]

# Collect keys that mention important dimensions
dimension_keywords = ["hidden", "head", "layer", "context", "embed", "ff", "feed",
                      "attention", "block", "n_head", "n_layer", "n_ctx", "n_embd"]

print("=" * 70)
print("KEY ARCHITECTURAL PARAMETERS")
print("=" * 70)

for name, field in reader.fields.items():
    if any(kw in name.lower() for kw in dimension_keywords) or name in architecture_keys:
        try:
            # field.parts[field.data[0]] gets the actual value
            if field.types[0] == GGUFValueType.STRING:
                value = str(bytes(field.parts[-1]), encoding='utf-8')
            elif len(field.data) == 1:
                value = field.parts[field.data[0]][0]
            else:
                value = f"[array of {len(field.data)} items]"
        except Exception:
            value = "<unreadable>"
        print(f"  {name:<50} = {value}")

What those numbers mean:

Metadata key              Meaning
*.embedding_length        Size of each token’s embedding vector (e.g. 1536 = 1536 floats per token)
*.block_count             Number of transformer layers stacked on top of each other
*.attention.head_count    Number of parallel attention heads per layer
*.context_length          Maximum number of tokens the model can see at once
*.feed_forward_length     Size of the intermediate layer inside each transformer block

These numbers determine the capacity (and the size) of the model.
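As a sanity check on "capacity," we can turn metadata-style numbers into a back-of-envelope parameter count. The dimensions below are illustrative values for a Qwen2-1.5B-class model with grouped-query attention; read the real ones from the metadata printed above:

```python
# Illustrative dimensions (check against your model's actual metadata)
vocab_size = 151_936
embd_dim   = 1_536
n_layers   = 28
ffn_dim    = 8_960
n_kv_heads, head_dim = 2, 128   # grouped-query attention: small K/V projections

embed_params = vocab_size * embd_dim          # one embedding row per token
kv_dim = n_kv_heads * head_dim

per_layer = (
    embd_dim * embd_dim      # attn_q projection
    + embd_dim * kv_dim      # attn_k projection
    + embd_dim * kv_dim      # attn_v projection
    + embd_dim * embd_dim    # attn_output projection
    + 3 * embd_dim * ffn_dim # ffn gate + up + down projections
)

# Assumes the output head shares weights with the embedding matrix
total = embed_params + n_layers * per_layer
print(f"approx. {total/1e9:.2f} billion parameters")
```

The rough total lands near 1.5 billion, which is where the "1.5B" in the model name comes from; biases and small norm vectors are ignored here.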

5.2 The Tokenizer Vocabulary Inside the GGUF File

The vocabulary — all the text pieces the tokenizer knows — is stored directly in the GGUF file. Let’s look at the first and last few entries to see what the token table looks like.

# Find the tokenizer vocabulary field
vocab_field = None
for name, field in reader.fields.items():
    if "tokens" in name.lower() and "tokenizer" in name.lower():
        vocab_field = (name, field)
        break

if vocab_field is None:
    print("Vocabulary field not found — trying alternative names...")
    for name, field in reader.fields.items():
        if "vocab" in name.lower() or "token" in name.lower():
            print(f"  Found: {name} ({len(field.data)} entries)")
else:
    name, field = vocab_field
    print(f"Vocabulary field: '{name}'")
    print(f"Total vocabulary size: {len(field.data)} tokens")
    print()
    
    # Show first 20 tokens
    print("First 20 tokens (low IDs = most common in training data):")
    print(f"{'Token ID':>10}  {'Token text'}")
    print("-" * 35)
    for i in range(min(20, len(field.data))):
        token_bytes = field.data[i]
        if isinstance(token_bytes, (bytes, bytearray, memoryview)):
            token_text = bytes(token_bytes).decode("utf-8", errors="replace")
        else:
            token_text = str(token_bytes)
        print(f"{i:>10}  {repr(token_text)}")
    
    print()
    print("Last 10 tokens (high IDs = rare or special tokens):")
    print(f"{'Token ID':>10}  {'Token text'}")
    print("-" * 35)
    total = len(field.data)
    for i in range(total - 10, total):
        token_bytes = field.data[i]
        if isinstance(token_bytes, (bytes, bytearray, memoryview)):
            token_text = bytes(token_bytes).decode("utf-8", errors="replace")
        else:
            token_text = str(token_bytes)
        print(f"{i:>10}  {repr(token_text)}")

5.3 Listing All Weight Tensors

Now let’s look at all the weight tensors stored in the GGUF file. Each tensor has:

  • A name (tells you which layer and what type of weight)

  • A shape (dimensions of the matrix)

  • A type (quantization format: Q4_0, Q8_0, F16, etc.)

  • A size (bytes on disk — much smaller than the raw float32 equivalent)

print("=" * 80)
print("WEIGHT TENSORS IN THE GGUF FILE")
print("=" * 80)
print(f"{'Tensor name':<50} {'Shape':<25} {'Type':<10}")
print("-" * 85)

for tensor in reader.tensors:
    shape_str = str(list(tensor.shape))
    print(f"{tensor.name:<50} {shape_str:<25} {str(tensor.tensor_type.name):<10}")

print()
print(f"Total: {len(reader.tensors)} tensors")

Reading the tensor names — the naming convention follows a pattern:

Name pattern               Meaning
token_embd.weight          The embedding matrix — one row per vocabulary token
blk.N.attn_q.weight        Query weight matrix for attention in layer N
blk.N.attn_k.weight        Key weight matrix for attention in layer N
blk.N.attn_v.weight        Value weight matrix for attention in layer N
blk.N.attn_output.weight   Attention output projection in layer N
blk.N.ffn_up.weight        Feed-forward network up-projection in layer N
blk.N.ffn_down.weight      Feed-forward network down-projection in layer N
blk.N.ffn_gate.weight      Feed-forward gate (SwiGLU activation) in layer N
blk.N.attn_norm.weight     Layer normalization scale in attention block N
output.weight              Final output (lm head) matrix that maps to vocabulary logits

The model processes tokens by running them through all N blocks in sequence.

# Count tensor types to understand the model structure
from collections import Counter

# Categorize tensors by their role
categories = {
    "embedding": [],
    "attention_q": [],
    "attention_k": [],
    "attention_v": [],
    "attention_output": [],
    "feed_forward": [],
    "normalization": [],
    "output": [],
    "other": []
}

for tensor in reader.tensors:
    n = tensor.name
    if "token_embd" in n:
        categories["embedding"].append(n)
    elif "attn_q" in n:
        categories["attention_q"].append(n)
    elif "attn_k" in n:
        categories["attention_k"].append(n)
    elif "attn_v" in n:
        categories["attention_v"].append(n)
    elif "attn_output" in n or "attn_out" in n:
        categories["attention_output"].append(n)
    elif "ffn" in n:
        categories["feed_forward"].append(n)
    # Check the output tensors before the generic "norm" match,
    # so output_norm.weight counts as output rather than normalization
    elif n in ("output.weight", "output_norm.weight"):
        categories["output"].append(n)
    elif "norm" in n:
        categories["normalization"].append(n)
    else:
        categories["other"].append(n)

print("Tensor breakdown by role:")
print(f"{'Role':<25} {'Count':>6}")
print("-" * 33)
for role, tensors in categories.items():
    if tensors:
        print(f"{role:<25} {len(tensors):>6}")
print("-" * 33)
print(f"{'TOTAL':<25} {len(reader.tensors):>6}")

6. Model Weights as Numbers

Now let’s actually look at the numbers inside the weight tensors.

6.1 The Token Embedding Matrix

The embedding matrix is the most conceptually important tensor:

  • It has one row per token in the vocabulary

  • Each row is a vector of floats (the “embedding”) for that token

  • This is how abstract token IDs get turned into rich numeric representations

Shape: [vocab_size × embedding_dim]
Example: [151936 × 1536] for Qwen2-1.5B

# Find the token embedding tensor
embd_tensor = None
for tensor in reader.tensors:
    if tensor.name == "token_embd.weight":
        embd_tensor = tensor
        break

if embd_tensor is not None:
    print(f"Tensor name:  {embd_tensor.name}")
    print(f"Shape:        {list(embd_tensor.shape)}")
    print(f"  → {embd_tensor.shape[0]} tokens in vocabulary")
    print(f"  → {embd_tensor.shape[1]} floats per token embedding")
    print(f"Storage type: {embd_tensor.tensor_type.name}  (quantized to save space)")
    
    vocab_size   = embd_tensor.shape[0]
    embd_dim     = embd_tensor.shape[1]
    full_fp32_mb = vocab_size * embd_dim * 4 / (1024**2)
    print(f"\nIf stored as full float32: {full_fp32_mb:.1f} MB")
    print(f"(Quantization compresses this significantly)")
else:
    print("Embedding tensor not found — check tensor names above.")
# Read the raw quantized data from the embedding tensor
# GGUFReader stores the raw bytes; we dequantize to float32 for display
if embd_tensor is not None:
    # Convert to float32 numpy array
    raw_data = embd_tensor.data          # numpy array (possibly quantized uint8)
    
    print(f"Raw data shape (quantized storage): {raw_data.shape}")
    print(f"Raw data dtype: {raw_data.dtype}")
    print()
    print("First 20 raw quantized bytes:")
    print(raw_data.flat[:20])
    print()
    print("(These bytes encode groups of 32 float values in a compact quantized format.)")

6.2 Dequantizing to See Actual Float Values

GGUF stores weights in a quantized format to save space. Q4_0 quantization packs 32 float values into 18 bytes (instead of 128 bytes for float32) by storing them as 4-bit integers with a shared scaling factor.

To see the actual float values, we need to dequantize. The llama_cpp library does this automatically when running inference, but we can also do it manually using the gguf package utilities.
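To build intuition before touching real bytes, here is a simplified Q4_0-style round trip on one block of 32 synthetic floats. llama.cpp's exact scale and rounding rules differ slightly; this only shows the shared-scale idea:

```python
import numpy as np

rng = np.random.default_rng(1)
block = rng.normal(scale=0.02, size=32).astype(np.float32)  # fake weights

# One shared scale per block; map every value into the signed 4-bit range
scale = np.abs(block).max() / 7.0
quant = np.clip(np.round(block / scale), -8, 7).astype(np.int8)

# Dequantize: integer times scale
restored = quant.astype(np.float32) * scale

max_err = float(np.abs(block - restored).max())
print(f"scale = {scale:.6f}")
print(f"max reconstruction error = {max_err:.6f}")
```

The reconstruction error is bounded by half the scale factor per weight; that loss of precision is the price paid for shrinking 4 bytes per weight down to roughly half a byte.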

# Attempt to dequantize using numpy (manual Q4_0 dequantization)
# Q4_0 format: each block of 32 weights = 1 fp16 scale + 16 bytes of 4-bit ints
# Block size in bytes: 2 (scale) + 16 (4-bit data) = 18 bytes per 32 weights

if embd_tensor is not None and str(embd_tensor.tensor_type.name) == "Q4_0":
    raw = embd_tensor.data.reshape(-1)   # flatten quantized storage to a 1-D uint8 byte stream
    
    # Each Q4_0 block = 18 bytes = 2 bytes (fp16 scale) + 16 bytes (4-bit * 32 weights)
    block_size_bytes = 18
    weights_per_block = 32
    n_blocks = len(raw) // block_size_bytes
    
    print(f"Dequantizing Q4_0 embedding matrix...")
    print(f"  Total blocks: {n_blocks}")
    print(f"  Total weights: {n_blocks * weights_per_block:,}")
    print()
    
    # Dequantize first few blocks so we can display the actual float values
    blocks_to_show = min(4, n_blocks)  # just first 4 blocks = 128 weights
    decoded_weights = []
    
    for b in range(blocks_to_show):
        block = raw[b * block_size_bytes : (b + 1) * block_size_bytes]
        # Extract fp16 scale (first 2 bytes)
        scale = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
        # Extract 4-bit quantized values (remaining 16 bytes = 32 values)
        quant_bytes = block[2:]
        lo = (quant_bytes & 0x0F).astype(np.int8)
        hi = (quant_bytes >> 4).astype(np.int8)
        # Values are stored as signed 4-bit (range -8..7). In the Q4_0 layout,
        # weights 0-15 come from the low nibbles and weights 16-31 from the high nibbles
        nibbles = np.empty(32, dtype=np.int8)
        nibbles[:16] = lo - 8
        nibbles[16:] = hi - 8
        dequant = nibbles.astype(np.float32) * scale
        decoded_weights.append((scale, dequant))
    
    # Display block 0 (first 32 weights of the first token's embedding)
    scale0, weights0 = decoded_weights[0]
    print(f"First Q4_0 block of the embedding matrix:")
    print(f"  Scale factor: {scale0:.6f}")
    print(f"  32 dequantized float values:")
    print(np.round(weights0, 5))

elif embd_tensor is not None:
    ttype = str(embd_tensor.tensor_type.name)
    print(f"Tensor is stored as {ttype} — showing raw data sample:")
    print(embd_tensor.data[:64])

6.3 Looking at Other Weight Tensors

Let’s look at a normalization layer weight — these are stored as full floats (not quantized) and are the easiest to inspect directly.

# Find a normalization tensor (usually stored as F32 or F16 — easy to read)
norm_tensor = None
for tensor in reader.tensors:
    if "norm" in tensor.name and "blk.0" in tensor.name:
        norm_tensor = tensor
        break

if norm_tensor is not None:
    print(f"Tensor: {norm_tensor.name}")
    print(f"Shape:  {list(norm_tensor.shape)}")
    print(f"Type:   {norm_tensor.tensor_type.name}")
    print()
    
    norm_data = norm_tensor.data
    print(f"Data dtype: {norm_data.dtype}")
    
    # Convert to float32 for display
    floats = norm_data.astype(np.float32)
    print(f"First 20 weight values:")
    print(np.round(floats[:20], 5))
    print()
    print(f"Statistics: min={floats.min():.5f}, max={floats.max():.5f}, "
          f"mean={floats.mean():.5f}, std={floats.std():.5f}")
    print()
    print("Note: normalization weights are usually close to 1.0 (they start at 1.0 during training)")
else:
    # Fallback: show any tensor with F32 data
    for tensor in reader.tensors:
        if "norm" in tensor.name:
            print(f"Found norm tensor: {tensor.name}")
            print(f"  Shape: {list(tensor.shape)}, Type: {tensor.tensor_type.name}")
            data = tensor.data.astype(np.float32)
            print(f"  First 10 values: {np.round(data[:10], 5)}")
            break

7. How Concepts Map to Numbers: Token Embeddings

The Embedding Space

Each token maps to a dense vector — a list of ~1000–4000 floating-point numbers. These vectors aren’t arbitrary: after training, tokens with similar meanings end up with similar vectors.

Famous analogy (from Word2Vec, 2013):

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

For economics, we’d hope something like:

vector("inflation") - vector("price") + vector("quantity") ≈ vector("output")

We can extract embedding vectors for specific tokens using llama_cpp’s embedding mode and visualize them.

Measuring Token Similarity with Cosine Distance

The standard way to compare embedding vectors is cosine similarity: values range from -1 (opposite) to +1 (identical direction in embedding space).
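As a tiny worked example of the formula (dot product divided by the product of the vector lengths), before we compute it on real embeddings:

```python
import numpy as np

# Two 2-D vectors that are 45 degrees apart
a = np.array([1.0, 0.0])
b = np.array([0.7, 0.7])

# cosine similarity = (a · b) / (|a| * |b|)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))  # cos(45°) ≈ 0.707
```

Real embedding vectors have hundreds or thousands of dimensions, but the geometry is identical: the similarity depends only on the angle between the vectors, not their lengths.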

# Load the model in embedding mode to extract token vectors
# Note: we need embedding=True to get per-token vectors
embed_model = Llama(
    model_path=full_model_path,
    embedding=True,
    n_ctx=128,
    verbose=False
)
print("Embedding model ready.")
# Economics words to compare
economics_words = [
    "inflation",
    "deflation",
    "tariff",
    "trade",
    "GDP",
    "recession",
    "growth",
    "supply",
    "demand",
    "price",
]


def collapse_embedding(vec: np.ndarray) -> np.ndarray:
    """Return a single vector even if llama-cpp gives per-token rows."""
    if vec.ndim == 1:
        return vec
    if vec.ndim == 2:
        return vec.mean(axis=0)
    raise ValueError(f"Unexpected embedding shape: {vec.shape}")


# Get embeddings for each word
word_embeddings = {}
for word in economics_words:
    emb = embed_model.embed(word)
    emb_array = np.array(emb, dtype=np.float32)
    word_embeddings[word] = collapse_embedding(emb_array)

# Print the shape of one embedding
sample_word = economics_words[0]
print(f"Embedding for '{sample_word}':")
print(f"  Shape: {word_embeddings[sample_word].shape}")
print(f"  First 10 values: {np.round(word_embeddings[sample_word][:10], 4)}")
print(f"  Min: {word_embeddings[sample_word].min():.4f}")
print(f"  Max: {word_embeddings[sample_word].max():.4f}")
print(f"  Norm (length): {np.linalg.norm(word_embeddings[sample_word]):.4f}")
def cosine_similarity(v1, v2):
    """Cosine similarity between two vectors. Range: -1 (opposite) to +1 (same direction)."""
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    return float(np.dot(v1, v2) / (n1 * n2))

# Build a similarity matrix
words = economics_words
n = len(words)
sim_matrix = np.zeros((n, n))

for i, w1 in enumerate(words):
    for j, w2 in enumerate(words):
        sim_matrix[i, j] = cosine_similarity(word_embeddings[w1], word_embeddings[w2])

# Print the similarity matrix
print("Cosine Similarity Matrix for Economics Words")
print("(1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite)")
print()
print(f"{'':15}", end="")
for w in words:
    print(f"{w:>10}", end="")
print()
print("-" * (15 + 10 * n))
for i, w1 in enumerate(words):
    print(f"{w1:<15}", end="")
    for j in range(n):
        v = sim_matrix[i, j]
        print(f"{v:>10.3f}", end="")
    print()
# Find the most similar pairs (excluding self-similarity)
print("Most Similar Word Pairs:")
print("-" * 40)

pairs = []
for i in range(n):
    for j in range(i + 1, n):
        pairs.append((sim_matrix[i, j], words[i], words[j]))

pairs.sort(reverse=True)
for sim, w1, w2 in pairs[:8]:
    print(f"  {w1:12} ↔ {w2:12}  similarity = {sim:.4f}")

print()
print("Least Similar Word Pairs:")
print("-" * 40)
for sim, w1, w2 in pairs[-5:]:
    print(f"  {w1:12} ↔ {w2:12}  similarity = {sim:.4f}")

7.1 Embedding Arithmetic: Concepts as Directions

One of the most remarkable properties of trained embedding spaces is that relationships between concepts are encoded as directions in vector space. Let’s test some economic analogies.

# Economics analogy: which word completes the relationship?
# Example: inflation is to price as recession is to ???

def find_closest(query_vec, candidates, word_embeddings, exclude=None):
    """Find the word in candidates whose embedding is closest to query_vec."""
    exclude = exclude or []
    best_sim, best_word = -np.inf, None
    for word in candidates:
        if word in exclude:
            continue
        sim = cosine_similarity(query_vec, word_embeddings[word])
        if sim > best_sim:
            best_sim, best_word = sim, word
    return best_word, best_sim

# Extended word list for analogy search
analogy_words = [
    "inflation", "deflation", "tariff", "trade", "GDP", "recession",
    "growth", "supply", "demand", "price", "interest", "bank",
    "export", "import", "tax", "subsidy", "unemployment", "wages",
    "producer", "consumer",  # needed so the supply/demand analogy below can run
]

# Get embeddings for the extended list
for word in analogy_words:
    if word not in word_embeddings:
        emb = embed_model.embed(word)
        word_embeddings[word] = np.array(emb, dtype=np.float32).flatten()

# Test: "export" - "trade" + "tariff" ≈ ???  (export is to trade as tariff is to ?)
analogies = [
    ("export",    "trade",    "tariff",   "import"),     # A:B as C:D
    ("inflation", "price",    "recession", "growth"),
    ("supply",    "producer", "demand",   "consumer"),
]

print("Embedding Arithmetic: A is to B as C is to ???")
print("Formula: vec(A) - vec(B) + vec(C) → find closest word")
print("-" * 60)

for A, B, C, expected in analogies:
    if all(w in word_embeddings for w in [A, B, C]):
        query = word_embeddings[A] - word_embeddings[B] + word_embeddings[C]
        result, sim = find_closest(query, analogy_words, word_embeddings, exclude=[A, B, C])
        print(f"  {A:12} - {B:12} + {C:12} ≈ {result:12}  (similarity: {sim:.3f})")
        print(f"     Expected: {expected}")
        print()
    else:
        missing = [w for w in [A, B, C] if w not in word_embeddings]
        print(f"  Skipped (missing embeddings for: {missing})")

Note: Embedding arithmetic works best in larger models with richer training data. Small 1B-parameter models may not perfectly recover every analogy, but the nearest neighbors should still be semantically reasonable. The point is that relationships between words are encoded as geometric relationships between vectors.
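One common refinement, worth trying if the raw `A - B + C` query misses, is to normalize each vector to unit length before adding, so that words with unusually large embedding norms don't dominate the sum. Here is a minimal sketch of that normalized variant using toy 2-D vectors (illustrative only, not real model embeddings):

```python
import numpy as np

def normalized_analogy(a, b, c, vocab):
    """Find d maximizing cosine similarity to unit(a) - unit(b) + unit(c)."""
    def unit(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    query = unit(vocab[a]) - unit(vocab[b]) + unit(vocab[c])
    best_word, best_sim = None, -np.inf
    for word, vec in vocab.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        sim = float(np.dot(unit(query), unit(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Toy 2-D "vocabulary" constructed so the analogy holds by design
toy = {
    "export": np.array([1.0, 1.0]),
    "trade":  np.array([1.0, 0.0]),
    "tariff": np.array([0.0, 1.0]),
    "import": np.array([0.1, 2.0]),
    "GDP":    np.array([-1.0, 0.2]),
}
word, sim = normalized_analogy("export", "trade", "tariff", toy)
print(word, round(sim, 3))   # "import" wins in this constructed example
```

With real embeddings, swap the toy dictionary for `word_embeddings`; normalization often (though not always) improves analogy recovery in small models.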


8. Putting It All Together: The Full Pipeline

Let’s trace one economics sentence all the way through the pipeline to see every numeric transformation.

input_text = "Lower interest rates stimulate investment and economic growth."

print("=" * 70)
print("STEP 1: RAW TEXT")
print("=" * 70)
print(f"  '{input_text}'")
print(f"  Length: {len(input_text)} characters")

print()
print("=" * 70)
print("STEP 2: TOKENIZATION (text → integers)")
print("=" * 70)
token_ids = model.tokenize(input_text.encode("utf-8"))
print(f"  Token IDs: {token_ids}")
print(f"  Count: {len(token_ids)} tokens")

print()
print("=" * 70)
print("STEP 3: TOKEN → TEXT MAPPING")
print("=" * 70)
pieces = [model.detokenize([tid]).decode("utf-8", errors="replace") for tid in token_ids]
print(f"  {'ID':>8}  {'Text piece'}")
for tid, piece in zip(token_ids, pieces):
    print(f"  {tid:>8}  {repr(piece)}")

print()
print("=" * 70)
print("STEP 4: EMBEDDING LOOKUP (each integer → vector of floats)")
print("=" * 70)
print("  (Getting embeddings for each token piece...)")
for tid, piece in zip(token_ids[:5], pieces[:5]):  # show first 5 to keep output manageable
    emb = embed_model.embed(piece.strip() or piece)
    emb_arr = np.array(emb, dtype=np.float32)
    print(f"  Token {tid:>6} {repr(piece):>20}  →  vector shape {emb_arr.shape}")
    print(f"    First 8 values: {np.round(emb_arr[:8], 4)}")
    print(f"    Norm: {np.linalg.norm(emb_arr):.4f}")
if len(token_ids) > 5:
    print(f"  ... ({len(token_ids) - 5} more tokens)")

print()
print("=" * 70)
print("STEP 5: TRANSFORMER LAYERS")
print("=" * 70)
n_layers = sum(1 for t in reader.tensors if "blk." in t.name and "attn_q.weight" in t.name)
print(f"  The model runs {n_layers} transformer blocks in sequence.")
print(f"  Each block applies: attention (Q/K/V) + feed-forward network + layer norm")
print(f"  This is {n_layers * 4}+ matrix multiplications per token per forward pass.")

print()
print("=" * 70)
print("STEP 6: GENERATE NEXT TOKEN")
print("=" * 70)
prompt_tokens = model.tokenize(input_text.encode("utf-8"))
next_tok = next(model.generate(prompt_tokens, top_k=1))
next_text = model.detokenize([next_tok]).decode("utf-8", errors="replace")
print(f"  Most probable next token ID: {next_tok}")
print(f"  Decoded: {repr(next_text)}")
print(f"  Continued sentence: '{input_text}{next_text}...'") 
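Step 5 above only describes the transformer blocks in words. A minimal single-head block in NumPy shows the shape of the computation; the dimensions are toy-sized and the weights random stand-ins, not the model's real tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4   # toy sizes; real models use 2048+ / 8192+

# Random stand-ins for the trained weight tensors stored in the GGUF file
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1   # feed-forward up-projection
W2 = rng.standard_normal((d_ff, d_model)) * 0.1   # feed-forward down-projection

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x):
    # Attention sub-layer: pre-norm, single-head causal attention, residual add
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores[mask] = -np.inf                  # each token attends only to the past
    x = x + softmax(scores) @ v @ Wo
    # Feed-forward sub-layer: pre-norm, ReLU MLP, residual add
    x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2
    return x

x = rng.standard_normal((seq_len, d_model))   # 4 toy token embeddings
out = transformer_block(x)
print(out.shape)   # (4, 8): same shape in, same shape out
```

A real block in this model repeats exactly this pattern (with RMSNorm, multiple heads, and a gated activation), and the model stacks dozens of such blocks, which is where the "matrix multiplications per token" count in Step 5 comes from.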

9. Summary: Numbers All the Way Down

In this notebook, you explored how a language model is entirely a numeric computation:

| Stage | What It Looks Like | The Numbers |
| --- | --- | --- |
| Text input | `"A tariff is a tax on..."` | A Python string |
| After tokenization | Tokens going in | `[362, 287, 31954, ...]` (list of ints) |
| Embedding lookup | Each token → vector | `[-0.0123, 0.0412, ...]` (1000s of floats) |
| Transformer forward pass | Weight matrices × embedding vectors | Billions of float multiplications |
| Output logits | Score for every vocab token | One float per vocab entry (~150k floats) |
| Sampling / decoding | Pick next token ID | One int → decoded back to text |
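The last two stages, logits to next token, amount to a softmax followed by a pick. A toy sketch with a hypothetical 5-token vocabulary (the logit values are made up for illustration):

```python
import numpy as np

vocab = ["growth", "tariff", ".", " the", " demand"]   # toy 5-token vocabulary
logits = np.array([2.1, 0.3, -1.0, 1.4, 0.2])          # one float per vocab entry

# Softmax turns raw scores into probabilities that sum to 1
# (subtracting the max first is a standard numerical-stability trick)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_id = int(np.argmax(probs))   # greedy decoding, i.e. top_k=1 as in Step 6
print(vocab[next_id], round(float(probs[next_id]), 3))
```

A real model does the same thing, just with ~150k logits instead of 5, and samplers like top-k or temperature modify `probs` before the pick.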

Key Takeaways

  1. A vocabulary is just a lookup table: token IDs are indices into a big table mapping integers to text pieces.

  2. Embeddings are coordinates in meaning-space: semantically similar words end up with geometrically close vectors after training.

  3. GGUF is a self-describing file: metadata + weight tensors in one binary file. You can inspect a model’s architecture, vocabulary, and weights without ever running it.

  4. Quantization trades precision for space: Q4_0 stores 32 weights in 18 bytes instead of 128 bytes, with only a small quality penalty.

  5. Generation is sequential: the model produces one token ID at a time, each time doing a full forward pass through all transformer layers.
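The Q4_0 trade-off in takeaway 4 is easy to see in code. Below is a simplified quantize-then-dequantize roundtrip for one 32-weight block (llama.cpp's actual Q4_0 rounding and layout differ in detail; this is a sketch of the idea):

```python
import numpy as np

def q4_0_roundtrip(block):
    """Quantize one 32-weight block Q4_0-style, then dequantize it.

    The real format stores one fp16 scale (2 bytes) plus 32 signed
    4-bit codes (16 bytes) = 18 bytes per 32 weights, vs 128 bytes
    for float32.
    """
    assert block.shape == (32,)
    scale = np.float16(np.abs(block).max() / 7.0)   # one shared scale per block
    if scale == 0:
        return np.zeros_like(block)
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)  # 4-bit codes
    return q.astype(np.float32) * np.float32(scale)              # dequantize

rng = np.random.default_rng(1)
w = rng.standard_normal(32).astype(np.float32) * 0.05   # fake weight block
w_hat = q4_0_roundtrip(w)
err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error: {err:.5f}")
```

The reconstruction error is bounded by roughly half the block's scale, which is why quantization hurts least when weights within a block have similar magnitudes.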

Next Steps

  • Inside_Small_Model.ipynb — visualize token probabilities and decoding strategies

  • LlamaCpp_SmallLM_Demo.ipynb — build a multi-turn economic policy chatbot

  • The Illustrated Transformer — visual walkthrough of attention and transformer math

  • GGUF spec — full binary format documentation

# Clean up: close model handles to free memory
del model
del embed_model
del reader
print("Models and file handles released.")