
🧠 LLM Context Management and Dynamic System Prompts


Notebook developed by SzuLun Huang szuh@berkeley.edu
Under the guidance of Eric Van Dusen ericvd@berkeley.edu
UC Berkeley, Data Science


📖 The Story

Prof. Eric teaches Data 8 — UC Berkeley’s intro data science course. This semester he has 300 students, each at a different skill level. He pulls Zoe aside after class:

“Zoe, I want to build an AI study assistant for my students. But here’s the problem — a Data 8 beginner and a Data 100 student need completely different explanations for the same question. Can you make it adapt automatically?”

Zoe says yes. She has no idea what she’s gotten herself into.

But here’s the thing — a standard LLM doesn’t know who it’s talking to. Ask it “what is a function?” and it gives the same answer to a complete beginner and an experienced programmer alike. It has no memory of past conversations, no sense of the user’s background, and no way to adjust its tone on its own.

So how do you make it adapt? The answer lies in two key ideas: context management and dynamic system prompts.

This notebook is Zoe’s journey — every Part is a new problem she runs into, and a new concept she learns to solve it. By the end, she has a production-ready design and Prof. Eric has his assistant.


🗺️ What You’ll Learn

| Concept | What it solves |
| --- | --- |
| System Prompts | Tell the LLM who it is and how to behave |
| Dynamic Prompts | Adapt the LLM's behavior based on the user's level |
| Context Management | Give the LLM memory within a conversation |

🚀 How to Start

  1. Click Kernel in the top menu

  2. Select Restart Kernel and Run All Cells

  3. Wait about 1–2 minutes for the model to load ⏳

  4. Then explore the interactive sections below! ✅

⚠️ You only need to do this once per session — that is, each time you open the notebook.

📦 Step 0: Setup — run once (~1-2 min)

The four cells below import packages, configure the model, verify the model file exists, and load it into memory. Run them in order once per session — everything else depends on them.

# ── All imports ───────────────────────────────────────────────────────
# ── Standard Library ──────────────────────────────────────────────────
import os
import json
import re
import time
import threading
import warnings
from datetime import datetime
warnings.filterwarnings('ignore')

# ── Notebook Display ──────────────────────────────────────────────────
# Tools for rendering interactive UI elements inside Jupyter
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

# ── LLM Backend ───────────────────────────────────────────────────────
# llama_cpp loads and runs our local language model
from llama_cpp import Llama

# ── Custom Utilities (utils.py) ───────────────────────────────────────
# All heavy logic is abstracted here to keep the notebook clean and readable
from utils import (
    build_system_message, build_course_prompt,   # Prompt construction
    count_tokens, estimated_wait,                 # Token management
    summarize_full_text, extract_key_entities,    # Compression strategies
    compress_semantic, detect_profile_changes,    # Context optimization
    ChatAssistant,                                # Core chat engine
    visualize_context_window,                     # Visual diagnostics
    retrieve, rag_chat,                           # RAG pipeline
    bubble, token_pill,                           # UI components
)

print('✅ Imports done.')
✅ Imports done.
# ── Model Configuration ────────────────────────────────────────────────
# We use a quantized (Q4_K_M) version of Llama 3.2 1B to reduce memory
# usage while keeping reasonable performance on a shared server.

model_filename = 'Llama-3.2-1B-Instruct-Q4_K_M.gguf'
model_path     = f'/home/jovyan/shared/{model_filename}'

# n_ctx: how many tokens the model can "see" at once (its short-term memory)
# 4096 tokens ≈ 3,000 words — enough for a full tutoring conversation
n_ctx     = 4096

# n_threads: number of CPU cores used for inference
# 4 is a safe default for a shared JupyterHub environment
n_threads = 4

print(f'Model    : {model_filename}')
print(f'Path     : {model_path}')
print(f'Context  : {n_ctx} tokens')
print(f'Threads  : {n_threads}')
Model    : Llama-3.2-1B-Instruct-Q4_K_M.gguf
Path     : /home/jovyan/shared/Llama-3.2-1B-Instruct-Q4_K_M.gguf
Context  : 4096 tokens
Threads  : 4
# ── Verify the model file exists before loading ────────────────────────
# Loading a missing model causes a cryptic error deep in llama_cpp.
# We catch it early here and give a clear, actionable message instead.

if os.path.exists(model_path):
    size_gb = os.path.getsize(model_path) / (1024**3)
    print(f'✅ Found  {model_filename}  ({size_gb:.2f} GB)')
else:
    print('❌ Model not found — ask your teacher to check the shared folder.')
    raise FileNotFoundError(f'Model not found at {model_path}')
✅ Found  Llama-3.2-1B-Instruct-Q4_K_M.gguf  (0.75 GB)
# ── Load model into memory ─────────────────────────────────────────────
# This is the most time-consuming step — the model file is read from disk
# and loaded into RAM. It only happens once per kernel session.

print('⏳ Loading model…  (this may take 1-2 minutes)\n')

model = Llama(
    model_path = model_path,
    n_ctx      = n_ctx,
    n_threads  = n_threads,
    verbose    = False,  # suppress llama_cpp's internal logs (very noisy)
)

# Clear the noisy llama_cpp output and replace with a clean summary
clear_output(wait=True)
print('=' * 50)
print('✅ Model loaded and ready!')
print('=' * 50)
print(f'  Model   : {model_filename}')
print(f'  Context : {n_ctx} tokens')
print(f'  Threads : {n_threads}')
==================================================
✅ Model loaded and ready!
==================================================
  Model   : Llama-3.2-1B-Instruct-Q4_K_M.gguf
  Context : 4096 tokens
  Threads : 4

🧠 Part 1: “Why Doesn’t It Remember Anything?”

Zoe’s first prototype is just a few lines: she types a question, the AI answers. It works. She’s excited.

She sends Prof. Eric a demo. He replies: “Great! Can I ask a follow-up question?”

She tries it — and the AI has no idea what was just discussed.

After some digging through the docs, she finds the answer: the model is completely stateless. Every API call starts fresh. It only knows what you pass in right now:

response = model.create_chat_completion(messages=messages)  # the model only sees what you pass in right now

If Zoe wants her assistant to remember the conversation, she has to keep track of it and pass it back every time. The messages list is the model’s entire world.

What goes in a messages list?

Each message has a role and content:

| Role | Who it's from | When to use |
| --- | --- | --- |
| `system` | Zoe (as the developer) | Set behaviour, tone, or rules |
| `user` | The student using the assistant | What they typed |
| `assistant` | The AI | Previous AI responses |

💡 Right now Zoe is testing the assistant herself, so she’s playing both roles. Later, Prof. Eric’s students will be the user.
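
Concretely, a short exchange is just a list of these dicts — here is a minimal illustration (contents invented for the example):

```python
messages = [
    {"role": "system",    "content": "You are a friendly Data 8 tutor."},    # Zoe sets the rules
    {"role": "user",      "content": "What is a function?"},                 # the student's question
    {"role": "assistant", "content": "A function is a named, reusable recipe."},  # a previous AI reply
    {"role": "user",      "content": "Can you show me an example?"},         # the follow-up
]
```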

Her solution: wrap the model in a simple class that stores the conversation history and automatically appends each new message before every call. That’s what you’ll build in the challenges below.
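
Here is a minimal sketch of that idea. The notebook's production version is the `ChatAssistant` class imported from `utils.py`, which may differ in detail — this is just the shape of the pattern:

```python
class SimpleChat:
    """Toy wrapper: the messages list IS the memory."""

    def __init__(self, model, system_prompt):
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text, max_tokens=80):
        # Append the new question, send the FULL history, store the reply —
        # forget either append and the "memory" silently disappears.
        self.messages.append({"role": "user", "content": user_text})
        resp = self.model.create_chat_completion(
            messages=self.messages, max_tokens=max_tokens)
        reply = resp["choices"][0]["message"]["content"].strip()
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```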

⭐ Try it yourself

Below is an interactive demo. You’ll go through 2 challenges that let you feel how AI memory works — and breaks.

| Challenge | What you'll do | What you'll learn |
| --- | --- | --- |
| 1 of 2 🧠 | Introduce yourself to the AI, then ask if it remembers you | The AI remembers because you passed it the memory |
| 2 of 2 💀 | Watch that memory get erased, then ask the same question | Without the messages list, the AI has no idea who you are |

⏳ Each time you hit Send, the model will take 5–15 seconds to respond. A progress bar will appear while it’s thinking — just wait for it to finish!

# Part 1 — Interactive widget: students experience how the messages list works as the model's only memory


display(HTML('<style>.jp-OutputArea-output { overflow: visible !important; max-height: none !important; }</style>'))

display(HTML("""
<style>
@keyframes fadeOut {
  from { opacity:1; transform:translateY(0);   max-height:120px; margin:5px 0; }
  to   { opacity:0; transform:translateY(-6px); max-height:0;    margin:0;     }
}
@keyframes fadeIn {
  from { opacity:0; transform:translateY(8px); }
  to   { opacity:1; transform:translateY(0);   }
}
@keyframes shake {
  0%,100%{transform:translateX(0)}
  20%{transform:translateX(-8px)} 40%{transform:translateX(8px)}
  60%{transform:translateX(-5px)} 80%{transform:translateX(5px)}
}
.bubble-wrap  { animation: fadeIn 0.35s ease both; }
.bubble-dying { animation: fadeOut 0.5s ease forwards; overflow:hidden; }
.amnesiac     { animation: shake 0.5s ease; }
.token-pill {
  display:inline-block; background:#313244; border-radius:20px;
  padding:2px 10px; font-size:0.76em; color:#a6adc8; margin-left:8px;
}
.token-pill.warn { background:#f9e2af22; color:#f9e2af; }
.token-pill.crit { background:#f38ba822; color:#f38ba8; }
</style>
"""))

# ── Progress bar (shown while the model is generating) ────────────────
def show_progress(output_widget, estimated_seconds):
    """
    Display an animated progress bar while the model is thinking.
    Uses estimated_wait() from utils.py to set the duration.
    """
    bar = widgets.IntProgress(
        value=0, min=0, max=100,
        bar_style='info',
        layout=widgets.Layout(width='100%', height='12px'),
    )
    label = widgets.HTML(
        f'<span style="color:#a6adc8;font-size:0.82em">'
        f'🧠 Model is thinking… (~{estimated_seconds:.0f}s)</span>'
    )
    with output_widget:
        clear_output(wait=True)
        display(widgets.VBox([label, bar],
                layout=widgets.Layout(padding='10px 0')))

    # Animate in a background thread so the model call that follows can
    # start immediately instead of blocking behind this sleep loop.
    def _animate():
        steps = 30  # number of increments — more steps = smoother animation
        for i in range(steps + 1):
            bar.value = int(i / steps * 100)
            time.sleep(estimated_seconds / steps)

    threading.Thread(target=_animate, daemon=True).start()

# ── Shared state ──────────────────────────────────────────────────────
# Dicts instead of plain variables so nested functions can mutate them
state        = {"index": 0}   # which challenge is currently active
student_name = {"v": ""}      # saved so Challenge 2 can personalise its message
_c1_history  = {"msgs": []}   # full messages list from Challenge 1, reused in Challenge 2

# ── System prompt ─────────────────────────────────────────────────────
# Intentionally strict about name recall so the contrast between
# Challenge 1 (remembers) and Challenge 2 (forgets) is unmistakable
SYSTEM_PROMPT = (
    "You are an AI tutor. A student is talking to you. "
    "The student's name and background will appear in their first message. "
    "IMPORTANT: When the student introduces themselves, acknowledge them by name in your reply. "
    "When asked who the student is, state their name and background exactly as they told you. "
    "For example: 'The student I am talking to is Zoey, a Data 100 student learning machine learning.' "
    "Never say you don't know the student's name if they have already introduced themselves. "
    "Keep replies to 1-2 sentences."
)

# ── Widgets ───────────────────────────────────────────────────────────
banner_out  = widgets.Output()
preview_out = widgets.Output()
output_area = widgets.Output()

c1_name = widgets.Text(placeholder="e.g. your name",
                       layout=widgets.Layout(width="280px"))
c1_like = widgets.Text(placeholder="e.g. Data 8 student, just starting with the datascience library",
                       layout=widgets.Layout(width="480px"))
c1_form = widgets.VBox([
    widgets.HTML('<div style="color:#a6adc8;font-size:0.84em;margin:10px 0 4px 0">'
                 '👤 Your name:</div>'),
    c1_name,
    widgets.HTML('<div style="color:#a6adc8;font-size:0.84em;margin:8px 0 4px 0">'
                 '📚 Something about yourself (course, background, what you\'re learning):</div>'),
    c1_like,
], layout=widgets.Layout(display="none", padding="0 0 10px 0"))

send_btn = widgets.Button(description="▶ Send", button_style="primary",
                          layout=widgets.Layout(width="110px", margin="10px 6px 0 0"))
next_btn = widgets.Button(description="Next →", button_style="info",
                          layout=widgets.Layout(width="110px", margin="10px 0 0 0"))

# ── Challenge definitions ─────────────────────────────────────────────
CHALLENGES = [
    {
        "id": "c1", "label": "Challenge 1 of 2",
        "title": "The AI remembers — because YOU gave it the memory",
        "color": "#89b4fa", "icon": "🧠",
        "instruction": "Enter your name and something about yourself, then hit <strong>Send</strong>.",
        "hint": "Every message in the list is part of the AI's memory. It can see all of them.",
        "form": "c1",
    },
    {
        "id": "c2", "label": "Challenge 2 of 2",
        "title": "Watch your memory disappear.",
        "color": "#f38ba8", "icon": "💀",
        "instruction": "Watch the messages vanish — then hit <strong>Send</strong> to confirm the AI forgot everything.",
        "hint": "The AI only sees what's in the list right now. No list = no memory.",
        "form": None,
    },
]

# ── Banner & layout ───────────────────────────────────────────────────
def load_challenge(idx):
    c = CHALLENGES[idx]
    with banner_out:
        clear_output(wait=True)
        display(HTML(f"""
        <div style="background:#1e1e2e;border:2px solid {c['color']};
                    border-radius:12px;padding:16px 20px;margin:10px 0">
          <div style="display:flex;align-items:center;gap:10px;margin-bottom:8px">
            <span style="font-size:1.6em">{c['icon']}</span>
            <div>
              <div style="color:{c['color']};font-weight:bold;font-size:0.85em">{c['label']}</div>
              <div style="color:#cdd6f4;font-weight:bold;font-size:1em">{c['title']}</div>
            </div>
          </div>
          <div style="color:#cdd6f4;font-size:0.9em;margin-bottom:6px">{c['instruction']}</div>
          <div style="color:#a6adc8;font-size:0.8em">💡 {c['hint']}</div>
        </div>"""))

    c1_form.layout.display = "" if c["form"] == "c1" else "none"

    if c["id"] == "c1":
        refresh_c1_preview()
    else:
        saved = _c1_history.get("msgs")
        render_preview(saved if saved else [], label="📨 This was your memory from Challenge 1…")

    next_btn.disabled    = (idx == len(CHALLENGES) - 1)
    next_btn.description = "✅ Done" if idx == len(CHALLENGES) - 1 else "Next →"
    with output_area:
        clear_output()

# ── Preview panel ─────────────────────────────────────────────────────
def render_preview(messages, label=None):
    label = label or f'📨 Messages the model will receive {token_pill(messages, model)}'
    bubs  = "".join(bubble(m["role"], m["content"]) for m in messages)
    with preview_out:
        clear_output(wait=True)
        display(HTML(
            f'<div style="background:#1e1e2e;border-radius:10px;padding:14px">'
            f'<div style="color:#585b70;font-size:0.78em;margin-bottom:6px">{label}</div>'
            f'{bubs}</div>'))

def refresh_c1_preview(*_):
    name = c1_name.value.strip() or "you"
    like = c1_like.value.strip() or "…"
    msgs = [
        {"role": "system",    "content": SYSTEM_PROMPT},
        {"role": "user",      "content": f"Hi! My name is {name}. {like}."},
        {"role": "assistant", "content": "( AI will reply here after Turn 1 )"},
        {"role": "user",      "content": "What is my name, and what is my background?"},
    ]
    render_preview(msgs)

c1_name.observe(refresh_c1_preview, names="value")
c1_like.observe(refresh_c1_preview, names="value")

# ── Amnesia animation (Challenge 2) ───────────────────────────────────
def play_amnesia_then_send():
    full_msgs = _c1_history.get("msgs", [])
    if not full_msgs:
        with output_area:
            clear_output()
            display(HTML('<p style="color:#f38ba8">⚠️ Please complete Challenge 1 first!</p>'))
        return

    for i in range(len(full_msgs) - 1):
        surviving = [
            bubble(m["role"], m["content"],
                   extra_class="bubble-dying" if j == i else "")
            for j, m in enumerate(full_msgs)
        ]
        with preview_out:
            clear_output(wait=True)
            display(HTML(
                '<div style="background:#1e1e2e;border-radius:10px;padding:14px">'
                '<div style="color:#f38ba8;font-size:0.78em;margin-bottom:6px">'
                '🗑️ Erasing memory…</div>'
                + "".join(surviving) + "</div>"))
        time.sleep(0.6)

    lone = [{"role": "user", "content": "What is the name of the student you are talking to, and what is their background?"}]
    with preview_out:
        clear_output(wait=True)
        display(HTML(
            '<div style="background:#1e1e2e;border-radius:10px;padding:14px">'
            f'<div style="color:#f38ba8;font-size:0.78em;margin-bottom:6px">'
            f'📨 All that remains {token_pill(lone, model)}</div>'
            + bubble(lone[0]["role"], lone[0]["content"]) + "</div>"))

    # Show progress bar while model generates
    wait = estimated_wait(count_tokens(lone, model))
    show_progress(output_area, wait)

    resp  = model.create_chat_completion(messages=lone, max_tokens=60, temperature=0.7)
    reply = resp["choices"][0]["message"]["content"].strip()
    name  = student_name["v"] or "you"

    with output_area:
        clear_output(wait=True)
        display(HTML(f"""
        <div class="amnesiac" style="background:#1e1e2e;border-radius:10px;
                    padding:14px;margin-top:8px">
          <div style="color:#585b70;font-size:0.78em;margin-bottom:8px">🤖 Model reply</div>
          {bubble("assistant", reply)}
          <div style="background:#f38ba822;border:2px solid #f38ba8;border-radius:10px;
               padding:12px 16px;margin-top:12px;text-align:center">
            <div style="font-size:1.4em;margin-bottom:4px">🫥</div>
            <div style="color:#f38ba8;font-weight:bold">Complete amnesia.</div>
            <div style="color:#a6adc8;font-size:0.82em;margin-top:6px;line-height:1.7">
              The model received <strong style="color:#f38ba8">1 message</strong> — no history, no name, nothing.<br>
              It has no idea who <strong style="color:#cdd6f4">{name}</strong> is.<br><br>
              <strong style="color:#cdd6f4">This is what Zoe's first broken prototype felt like.</strong><br>
              <span style="color:#a6adc8">The fix? Pass the full messages list every time. That's what Part 2 builds.</span>
            </div>
          </div>
        </div>"""))

# ── Send handler ──────────────────────────────────────────────────────
def on_send(btn):
    c = CHALLENGES[state["index"]]
    send_btn.disabled    = True
    send_btn.description = "⏳ …"

    if c["id"] == "c2":
        with output_area:
            clear_output()
        play_amnesia_then_send()

    else:  # Challenge 1
        name = c1_name.value.strip()
        like = c1_like.value.strip()
        if not name or not like:
            with output_area:
                clear_output()
                display(HTML('<p style="color:#f38ba8">⚠️ Enter your name and something about yourself first!</p>'))
            send_btn.disabled    = False
            send_btn.description = "▶ Send"
            return
        student_name["v"] = name

        # Turn 1: student introduces themselves
        turn1 = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": f"Hi! My name is {name}. {like}."},
        ]
        wait = estimated_wait(count_tokens(turn1, model))
        show_progress(output_area, wait)
        resp1    = model.create_chat_completion(messages=turn1, max_tokens=60, temperature=0.7)
        ai_reply = resp1["choices"][0]["message"]["content"].strip()

        # Turn 2: ask if the model remembers — the key demonstration
        turn2 = turn1 + [
            {"role": "assistant", "content": ai_reply},
            {"role": "user",      "content": "What is the name of the student you are talking to, and what is their background?"},
        ]
        _c1_history["msgs"] = turn2  # save for Challenge 2 to reuse

        wait = estimated_wait(count_tokens(turn2, model))
        show_progress(output_area, wait)
        resp2 = model.create_chat_completion(messages=turn2, max_tokens=80, temperature=0.7)
        final = resp2["choices"][0]["message"]["content"].strip()

        bubs = "".join(bubble(m["role"], m["content"]) for m in turn2)
        with output_area:
            clear_output(wait=True)
            display(HTML(f"""
            <div style="background:#1e1e2e;border-radius:10px;padding:14px;margin-top:8px">
              <div style="color:#585b70;font-size:0.78em;margin-bottom:8px">
                📨 What the model received {token_pill(turn2, model)}</div>
              {bubs}
              <div style="border-top:1px solid #313244;margin:10px 0"></div>
              <div style="color:#585b70;font-size:0.78em;margin-bottom:6px">🤖 Model reply</div>
              {bubble("assistant", final)}
              <div style="background:#89b4fa22;border:1px solid #89b4fa44;border-radius:8px;
                   padding:10px 14px;margin-top:10px;font-size:0.82em;color:#a6adc8">
                ✅ The AI knows your name because <strong style="color:#cdd6f4">you gave it the memory</strong>.
                Hit <strong style="color:#89b4fa">Next →</strong> to see what happens when that memory disappears.
              </div>
            </div>"""))

    send_btn.disabled    = False
    send_btn.description = "▶ Send"

def on_next(btn):
    nxt = state["index"] + 1
    if nxt < len(CHALLENGES):
        state["index"] = nxt
        load_challenge(nxt)

send_btn.on_click(on_send)
next_btn.on_click(on_next)

# ── Initial render ────────────────────────────────────────────────────
display(widgets.HTML("""
<div style="background:#1e1e2e;padding:16px 20px;border-radius:12px;margin-bottom:4px">
  <h3 style="color:#cdd6f4;margin:0 0 6px 0">🧠 Part 1: The messages List is the AI's Entire Memory</h3>
  <p style="color:#a6adc8;margin:0;font-size:0.88em">
    Two challenges. Each one lets you <em>feel</em> how AI memory works — and breaks.
  </p>
</div>
"""))
load_challenge(0)
display(banner_out, c1_form, preview_out,
        widgets.HBox([send_btn, next_btn]), output_area)

👤 Part 2: “300 Students, 300 Different Needs”

Zoe shows the working prototype to Prof. Eric. He’s happy — but immediately has a new request:

“This is great. But my Data 8 students are complete beginners — they need simple language and analogies. My Data 100 students are much more advanced — they’d find that patronising. Can the same assistant handle both?”

Zoe’s first instinct: write two different system prompts. But Prof. Eric has 300 students. She can’t hardcode a prompt for each one.

Her solution: store the student’s info as a dict, and let code write the prompt automatically.

profile = {"name": "Student", "expertise": "beginner", ...}
system_prompt = build_system_message(profile)  # prompt writes itself

Change one field in the profile → the whole prompt regenerates. One function, any student.

This also means the prompt stays consistent — no typos, no forgotten fields, no two students accidentally getting different formats. The profile is the single source of truth.
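
The real `build_system_message` lives in `utils.py` and isn't shown here, but the core pattern is just a template function over the profile dict. A simplified sketch, with field names and wording assumed for illustration:

```python
def build_system_message_sketch(profile):
    # Hypothetical stand-in for utils.build_system_message —
    # the real function's fields and wording may differ.
    style = {
        "beginner": "Use simple language and everyday analogies.",
        "advanced": "Be concise and technical; assume programming experience.",
    }[profile["expertise"]]
    return (f"You are a study assistant for {profile['course']}. "
            f"The student's name is {profile['name']}. {style}")

print(build_system_message_sketch(
    {"name": "Ana", "expertise": "beginner", "course": "Data 8"}))
```

Change `"expertise"` from `"beginner"` to `"advanced"` and the entire instruction block regenerates — no hand-editing, no drift between students.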

🎓 Demo: Data 8 vs Data 100

Same question, two very different students:

| Course | Student type | What they need |
| --- | --- | --- |
| Data 8 | Complete beginner | Simple language, analogies, `datascience` library |
| Data 100 | Experienced | Concise, technical, pandas / industry tools |

What to watch: Ask the AI “How do I read a CSV file?” — but change the course field.

🔍 The model is identical. The question is identical. Only the system prompt changes — but the answer is completely different. This is exactly what Zoe builds for Prof. Eric.

# Demo: same question sent to Data 8 and Data 100 prompts side-by-side to show how system prompts change behaviour

# ── Data 8 vs Data 100 System Prompt Builder ─────────────────────────
question = "How do I read a CSV file?"

COURSE_COLORS = {
    "Data 8":   {"border": "#a6e3a1", "label": "#a6e3a1", "keyword": "datascience", "kw_color": "#a6e3a1"},
    "Data 100": {"border": "#89b4fa", "label": "#89b4fa", "keyword": "pandas",      "kw_color": "#89b4fa"},
}

def highlight(text, keyword, color):
    return text.replace(
        keyword,
        f'<span style="color:{color};font-weight:bold;background:#1e1e2e;'
        f'padding:1px 5px;border-radius:3px">{keyword}</span>'
    )

output_area = widgets.Output()

# ── Progress indicator (animation runs in a background thread — non-blocking) ──
def show_progress(estimated_seconds):
    bar = widgets.IntProgress(
        value=0, min=0, max=100,
        bar_style='info',
        layout=widgets.Layout(width='100%', height='12px'),
    )
    label = widgets.HTML(
        f'<span style="color:#a6adc8;font-size:0.82em">'
        f'🧠 Querying both courses… (~{estimated_seconds:.0f}s)</span>'
    )
    with output_area:
        clear_output(wait=True)
        display(widgets.VBox([label, bar],
                layout=widgets.Layout(padding='10px 0')))

   
    def _animate():
        steps = 30
        for i in range(steps + 1):
            bar.value = int(i / steps * 100)
            time.sleep(estimated_seconds / steps)

    threading.Thread(target=_animate, daemon=True).start()

# ── Run comparison ────────────────────────────────────────────────────
def run_comparison(_=None):
    run_btn.disabled    = True
    run_btn.description = "⏳ Running..."

    sample_msgs = [
        {"role": "system", "content": build_course_prompt("Data 8")},
        {"role": "user",   "content": question},
    ]
    # rough estimate: ×2 because we query two courses, ×2 extra headroom
    wait = estimated_wait(count_tokens(sample_msgs, model)) * 2 * 2
    show_progress(wait)

    results = {}
    prompts = {}
    for course, cfg in COURSE_COLORS.items():
        prompts[course] = build_course_prompt(course)
        resp = model.create_chat_completion(
            messages=[
                {"role": "system", "content": prompts[course]},
                {"role": "user",   "content": question},
            ],
            max_tokens=150,
            temperature=0.7,
        )
        results[course] = resp["choices"][0]["message"]["content"].strip()

    with output_area:
        clear_output()
        html_parts = ['<div style="display:flex;gap:14px;margin-top:10px">']
        for course, cfg in COURSE_COLORS.items():
            reply_html = highlight(results[course], cfg["keyword"], cfg["kw_color"])
            reply_html = reply_html.replace(
                "Table.read_table",
                f'<span style="color:{cfg["kw_color"]};font-weight:bold">Table.read_table</span>'
            ).replace(
                "pd.read_csv",
                f'<span style="color:{cfg["kw_color"]};font-weight:bold">pd.read_csv</span>'
            )

            html_parts.append(f"""
            <div style="flex:1;background:#1e1e2e;border:2px solid {cfg['border']};
                        border-radius:10px;padding:14px">

              <div style="color:{cfg['label']};font-weight:bold;font-size:1em;margin-bottom:10px">
                🎓 {course}
              </div>

              <!-- System prompt section -->
              <div style="color:#a6adc8;font-size:0.75em;font-weight:bold;
                          text-transform:uppercase;letter-spacing:0.05em;margin-bottom:4px">
                ⚙️ System Prompt
              </div>
              <div style="background:#181825;border:1px solid {cfg['border']}44;
                          border-radius:6px;padding:8px 10px;
                          color:#a6adc8;font-size:0.78em;line-height:1.6;
                          max-height:90px;overflow-y:auto;
                          white-space:pre-wrap;margin-bottom:10px">{prompts[course]}</div>

              <!-- Arrow -->
              <div style="text-align:center;color:#7c7f93;font-size:1.2em;margin-bottom:8px">↓</div>

              <!-- AI response section -->
              <div style="color:#a6adc8;font-size:0.75em;font-weight:bold;
                          text-transform:uppercase;letter-spacing:0.05em;margin-bottom:4px">
                🤖 AI Response
              </div>
              <div style="background:#313244;padding:10px;border-radius:6px;
                          color:#cdd6f4;font-size:0.87em;line-height:1.6;
                          white-space:pre-wrap">{reply_html}</div>

            </div>""")

        html_parts.append('</div>')
        html_parts.append("""
        <div style="margin-top:12px;background:#313244;padding:10px 14px;border-radius:8px;
                    font-size:0.85em;color:#a6adc8">
          ✨ <strong style="color:#fab387">Key observation:</strong>
          Same model, same question — only the system prompt changed.<br>
          The highlighted library name shows exactly where the behaviour diverged.
        </div>""")
        display(HTML("".join(html_parts)))

    run_btn.disabled    = False
    run_btn.description = "▶ Run Again"

run_btn = widgets.Button(
    description="▶ Run Comparison",
    button_style="primary",
    layout=widgets.Layout(width="160px", margin="0 0 10px 0")
)
run_btn.on_click(run_comparison)

display(widgets.HTML("""
<div style="background:#1e1e2e;padding:14px 18px;border-radius:10px;margin-bottom:10px">
  <h4 style="color:#cdd6f4;margin:0 0 6px 0">🎓 Data 8 vs Data 100 — Side-by-Side</h4>
  <p style="color:#a6adc8;margin:0;font-size:0.88em">
    Question: <em>"How do I read a CSV file?"</em><br>
    Notice how the <strong style="color:#fab387">system prompt</strong> changes the library the AI recommends.
  </p>
</div>
"""), run_btn, output_area)

🪙 Token Counter: The Hidden Cost Zoe Didn’t Expect

The assistant is working. Prof. Eric is happy. But a week into the semester, Zoe notices something: the assistant is getting slower.

Every message — system prompt, chat history, new question — consumes tokens from the model’s context window (4,096 tokens for our local model). More tokens in = longer wait.

| What gets sent | Approx. tokens |
| --- | --- |
| Simple question only | ~8 tokens |
| Question + system prompt | ~100 tokens |
| Question + system prompt + 20–30 turns of history | ~1,000–3,000 tokens |

Run the cell below to see the exact numbers for each scenario.

💡 This is why history compression (Part 4) is not optional — especially on a local 1B model where every extra token adds real waiting time for students.
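
The arithmetic behind the table is worth making explicit: prompt processing time scales roughly linearly with token count. A back-of-envelope version (the real `estimated_wait` in `utils.py` may weigh things differently):

```python
TOKENS_PER_SECOND = 25   # conservative CPU throughput assumed for a 1B model

def rough_wait(prompt_tokens, speed=TOKENS_PER_SECOND):
    # Linear model: every prompt token must be processed before the first
    # output token appears. Ignores generation time entirely.
    return prompt_tokens / speed

for tokens in (8, 100, 3000):
    print(f"{tokens:>5} tokens → ~{rough_wait(tokens):.1f}s before the reply starts")
```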

# Part 2 — Token counter: shows how system prompts and history multiply token usage and wait time

# ── Token Counter Demonstration ──────────────────────────────────────
simple_question  = "How do I read a CSV file in Python?"
data100_prompt   = build_course_prompt("Data 100")

msgs_simple = [
    {"role": "user", "content": simple_question}
]
msgs_with_prompt = [
    {"role": "system", "content": data100_prompt},
    {"role": "user",   "content": simple_question},
]
msgs_with_history = [
    {"role": "system", "content": data100_prompt},
    {"role": "user",      "content": "Hi! I'm Zoe, a student from Taiwan studying AI at UC Berkeley."},
    {"role": "assistant", "content": "Welcome Zoe! That's exciting. What are you working on?"},
    {"role": "user",      "content": "I'm learning about context management in LLMs for a class project."},
    {"role": "assistant", "content": "Great topic! Context management is key for building real AI apps."},
    {"role": "user",      "content": "I'm also learning pandas and scikit-learn for the data side."},
    {"role": "assistant", "content": "Nice combo. Are you using Jupyter or a local environment?"},
    {"role": "user",      "content": simple_question},
]

t1 = count_tokens(msgs_simple)
t2 = count_tokens(msgs_with_prompt)
t3 = count_tokens(msgs_with_history)

SPEED = 25  # conservative tokens/sec estimate for llama-cpp on a shared CPU

rows = [
    ("Simple question only",     t1),
    ("+ System prompt",          t2),
    ("+ 6-turn history preview", t3),
]

html = ['<div style="background:#1e1e2e;border-radius:10px;padding:16px 20px;margin-top:8px">']
html.append('<h4 style="color:#cdd6f4;margin:0 0 12px 0">🪙 Token Count & Estimated Wait Time</h4>')
html.append('<table style="width:100%;border-collapse:collapse;font-size:0.88em">')
html.append('''<tr style="background:#2a2a3d;border-bottom:1px solid #45475a">
  <th style="text-align:left;padding:6px 10px;color:#cdd6f4">Scenario</th>
  <th style="text-align:right;padding:6px 10px;color:#cdd6f4">Tokens</th>
  <th style="text-align:right;padding:6px 10px;color:#cdd6f4">% of 4096</th>
  <th style="text-align:right;padding:6px 10px;color:#cdd6f4">Est. wait (@25 tok/s)</th>
</tr>''')

bar_colors = ["#a6e3a1", "#89b4fa", "#f38ba8"]
for (label, t), color in zip(rows, bar_colors):
    pct   = t / 4096 * 100
    wait  = estimated_wait(t, SPEED)
    bar_w = max(4, int(pct * 1.5))
    html.append(f'''<tr style="background:#1e1e2e;border-bottom:1px solid #313244">
  <td style="padding:8px 10px;color:#cdd6f4">{label}</td>
  <td style="text-align:right;padding:8px 10px;color:{color};font-weight:bold">{t}</td>
  <td style="text-align:right;padding:8px 10px">
    <span style="display:inline-block;width:{bar_w}px;height:10px;
                 background:{color};border-radius:3px;vertical-align:middle"></span>
    <span style="color:#cdd6f4;margin-left:6px">{pct:.1f}%</span>
  </td>
  <td style="text-align:right;padding:8px 10px;color:#cdd6f4">~{wait:.1f} s</td>
</tr>''')

html.append('</table>')
html.append(f'''<div style="margin-top:12px;padding:10px 14px;background:#313244;border-radius:8px;
                            font-size:0.85em;line-height:1.8">
  💡 <strong style="color:#fab387">Tokens cost time:</strong>
  <span style="color:#cdd6f4"> Adding a system prompt multiplies token count by ~{t2/t1:.0f}×.</span><br>
  <span style="color:#cdd6f4">Adding 6 turns of history multiplies it by ~{t3/t1:.0f}× vs. the bare question.</span><br>
  <span style="color:#cdd6f4">With 25 full turns (Zoe\'s story below), expect </span>
  <strong style="color:#f38ba8">1,000+ tokens</strong>
  <span style="color:#cdd6f4"> and ~{1000/SPEED:.0f}+ seconds of wait time on this local 1B model.</span><br>
  👉 <strong style="color:#cdd6f4">This is exactly why history compression (Part 4) matters.</strong>
</div>
</div>''')

display(HTML("".join(html)))

📖 Part 3: How the Assistant Learns Who Zoe Is

Prof. Eric asks: “Can the assistant learn a student’s background just from conversation — without making them fill out a form?”

Yes. Instead of asking “What’s your skill level?” upfront, the assistant can infer the profile from what the student says across multiple turns.

Below is a simulated 25-turn conversation with Zoe herself as the student. Her background — Taiwan, AI focus, pandas experience — emerges naturally through the chat.

What to observe:

  • Early turns: Zoe is just asking questions, profile is mostly empty

  • Middle turns: the AI starts tailoring responses to her background

  • Later turns: the AI knows her skills, goals, and style without ever being told explicitly

Run the cell and watch how the system prompt grows with the history.
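
The notebook's `detect_profile_changes` in `utils.py` does the real work; here is a deliberately tiny, keyword-based sketch of the idea (field names and rules are invented for illustration — a real system would use the LLM itself or NER, not hand-written patterns):

```python
import re

def infer_profile(history, profile=None):
    # Toy inference: scan the user's turns for self-descriptions and
    # accumulate them into a profile dict across the conversation.
    profile = dict(profile or {})
    for msg in history:
        if msg["role"] != "user":
            continue
        text = msg["content"]
        if m := re.search(r"[Mm]y name is (\w+)", text):
            profile["name"] = m.group(1)
        if "pandas" in text.lower():
            profile.setdefault("tools", []).append("pandas")
        if re.search(r"stat(s|istics)\b", text, re.I):
            profile["background"] = "statistics"
    return profile

print(infer_profile([{"role": "user",
                      "content": "Hi! My name is Zoe. I'm learning pandas."}]))
# → {'name': 'Zoe', 'tools': ['pandas']}
```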

# Part 3 — Dataset: Zoe's 25-turn conversation, used to show how the assistant infers a student profile over time

# ── DATASET: Zoe's Storyline History (25 turns) ──────────────────────
ZOE_HISTORY = [
    {"role": "user",      "content": "Hi! I'm Zoe. I'm from Taiwan and I'm studying at UC Berkeley."},
    {"role": "assistant", "content": "Welcome Zoe! Great to meet you. What are you studying?"},
    {"role": "user",      "content": "I'm focusing on AI and machine learning. It's my first semester here."},
    {"role": "assistant", "content": "That's exciting! Berkeley has a great CS program. What draws you to AI?"},
    {"role": "user",      "content": "I want to understand how language models work — especially memory and context."},
    {"role": "assistant", "content": "Context management is a fascinating area. Are you working on a project?"},
    {"role": "user",      "content": "Yes, I'm building a teaching notebook about LLM context management for a class."},
    {"role": "assistant", "content": "That sounds like a great project. What tools are you using?"},
    {"role": "user",      "content": "Python, Jupyter, and llama-cpp-python with a local Llama model."},
    {"role": "assistant", "content": "Nice setup! Running locally avoids API costs. How's the performance?"},
    {"role": "user",      "content": "It's a bit slow — the 1B model takes a few seconds per response."},
    {"role": "assistant", "content": "That's expected. Reducing token count helps. Have you tried history compression?"},
    {"role": "user",      "content": "Not yet. That's actually one of the topics I want to teach in the notebook."},
    {"role": "assistant", "content": "Perfect timing then. Summarization and entity extraction are two common approaches."},
    {"role": "user",      "content": "I also want to show students the difference between Data 8 and Data 100 workflows."},
    {"role": "assistant", "content": "Good idea. The datascience vs pandas distinction is a classic Berkeley contrast."},
    {"role": "user",      "content": "Exactly! I think system prompts are the key to making that demo work."},
    {"role": "assistant", "content": "Right — change one field in the profile, the whole prompt regenerates."},
    {"role": "user",      "content": "I'm also learning about token budgets. Every token = more wait time locally."},
    {"role": "assistant", "content": "That's a great insight to teach. Students often ignore token costs until they feel it."},
    {"role": "user",      "content": "My background is more stats than CS, so I'm still getting comfortable with Python."},
    {"role": "assistant", "content": "Stats is a great foundation. pandas and numpy will feel natural to you."},
    {"role": "user",      "content": "I also want a fun mode where the AI speaks in a playful, encouraging tone."},
    {"role": "assistant", "content": "Easy — just add that as a style option in the system prompt."},
    {"role": "user",      "content": "Cool, I think that will make the notebook more fun. Thanks for helping me plan this!"},
]

def display_dataset_professional(history):
    """Render the conversation history as a styled HTML table."""
    # Use count_tokens from utils.py — no need to reimplement here
    total_tok = count_tokens(history)

    rows = ""
    for m in history:
        is_user    = m["role"] == "user"
        role_color = "#1a5fb4" if is_user else "#26a269"
        label      = m["role"].upper()
        rows += f"""
        <tr style="border-bottom:1px solid #eee">
          <td style="padding:12px 14px; vertical-align:top; width:110px;
                     font-weight:700; color:{role_color}; font-size:0.88em; text-align:left;">{label}</td>
          <td style="padding:12px 14px; color:#000000; line-height:1.6; font-size:0.9em; text-align:left;">{m['content']}</td>
        </tr>"""

    html = f"""
    <div style="font-family: sans-serif; margin:10px 0">
      <div style="background:#1e2d40; border:1px solid #3b5268; border-radius:10px 10px 0 0;
                  padding:12px 16px; position:relative;">
        <div style="color:#f0f6ff; font-weight:700; font-size:1em; text-align:center;">
          📖 Full Conversation History — Zoe
        </div>
        <div style="position:absolute; right:16px; top:14px; font-size:0.75em; color:#7ea8c9; display:flex; gap:10px;">
          <span>{len(history)} turns</span>
          <span style="color:#3b5268">|</span>
          <span style="color:#38bdf8; font-weight:700">{total_tok} tokens</span>
        </div>
      </div>

      <div style="max-height:300px; overflow-y:auto;
                  border:1px solid #3b5268; border-top:none;
                  border-radius:0 0 10px 10px; background:#ffffff">
        <table style="width:100%; border-collapse:collapse; table-layout:fixed;">
          <thead>
            <tr style="background:#f8f9fa; border-bottom:2px solid #dee2e6; position:sticky; top:0; z-index:10;">
              <th style="padding:10px 14px; width:110px; text-align:center;
                         color:#444444; font-size:0.75em; text-transform:uppercase;
                         letter-spacing:0.1em; font-weight:bold;">Role</th>
              <th style="padding:10px 14px; text-align:center;
                         color:#444444; font-size:0.75em; text-transform:uppercase;
                         letter-spacing:0.1em; font-weight:bold;">Message</th>
            </tr>
          </thead>
          <tbody>{rows}</tbody>
        </table>
      </div>
    </div>"""

    display(HTML(html))

display_dataset_professional(ZOE_HISTORY)
# Demo: full 25-turn history injected into context to generate a personalized learning roadmap for Zoe
import threading

# ── Setup ─────────────────────────────────────────────────────────────
simple_question = "Based on my background and what I'm working on, what should I focus on to get better at AI?"

msgs_full_zoe = [
    {
        "role": "system",
        "content": (
            "You are a specialized AI mentor for Data Science students at UC Berkeley. "
            "The user is Zoe. You MUST tailor your advice to her specific background: "
            "Taiwanese, Statistics background, UC Berkeley student, building an LLM context notebook. "
            "FORMAT REQUIREMENT: Provide 5 specific focus areas. Each point must start with a "
            "BOLD KEYWORD followed by a very brief one-sentence explanation. "
            "Example format: '**Keyword**: Brief explanation.' "
            "Do NOT give generic advice. Keep the bolded parts strictly to the keywords."
        )
    },
    *ZOE_HISTORY,
    {"role": "user", "content": simple_question},
]

# ── Bold fix: replace **text** with proper <b>text</b> ───────────────
def render_bold(text):
    text = re.sub(r'\*\*(.+?)\*\*', r'<strong style="color:#f0f6ff">\1</strong>', text)
    text = text.replace("\n", "<br>")
    return text

# ── Widgets ───────────────────────────────────────────────────────────
progress_out = widgets.Output()
result_out   = widgets.Output()

run_btn = widgets.Button(
    description="▶ Generate Roadmap",
    button_style="success",
    layout=widgets.Layout(width="200px", margin="10px 0")
)

# ── Input panel ───────────────────────────────────────────────────────
display(HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#0d1420;border:1px solid #3b5268;
            border-radius:12px;padding:18px 20px;margin:10px 0;color:#e2e8f0">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:8px">Input Context</div>
  <h3 style="color:#f0f6ff;margin:0 0 14px;font-size:1.0em">🔍 Student Profile</h3>

  <div style="display:flex;flex-direction:column;gap:6px;
              font-size:0.8em;color:#94b8d4;line-height:1.7;margin-bottom:14px">
    <div><span style="color:#7a9bb5">Name / Origin  </span> Zoe (Taiwan)</div>
    <div><span style="color:#7a9bb5">Affiliation    </span> UC Berkeley — Statistics background</div>
    <div><span style="color:#7a9bb5">Current project</span> LLM Context Management Notebook</div>
  </div>

  <div style="background:#0f2336;border:1px solid #38bdf855;border-radius:8px;
              padding:10px 14px;margin-bottom:14px;font-size:0.78em;color:#7dd3fc;
              display:flex;align-items:flex-start;gap:8px">
    <span style="font-size:1.1em;margin-top:1px">💬</span>
    <span>
      The roadmap below is generated from <strong style="color:#38bdf8">Zoe's full 25-turn conversation history</strong>
      injected into the model's context — not a generic profile. Every recommendation is grounded
      in what she actually said across the entire session.
    </span>
  </div>

  <div style="background:#020408;border:1px solid #38bdf844;border-left:3px solid #38bdf8;
              border-radius:0 8px 8px 0;padding:12px 14px;font-size:0.82em;
              color:#cbd5e1;line-height:1.6;font-style:italic">
    "{simple_question}"
  </div>

</div>
"""))

display(run_btn, progress_out, result_out)

# ── Button handler ────────────────────────────────────────────────────
def on_run(_):
    run_btn.disabled    = True
    run_btn.description = "⏳ Running..."

    # Clear previous results
    with progress_out:
        clear_output()
    with result_out:
        clear_output()

    # Show progress bar in background thread
    wait = 50
    bar = widgets.IntProgress(
        value=0, min=0, max=100,
        bar_style='info',
        layout=widgets.Layout(width='100%', height='12px'),
    )
    label = widgets.HTML(
        f'<span style="color:#a6adc8;font-size:0.82em">'
        f'🧠 Generating personalized roadmap from conversation history... (~{wait}s)</span>'
    )
    with progress_out:
        display(widgets.VBox([label, bar], layout=widgets.Layout(padding='10px 0')))

    def _animate():
        steps = 30
        for i in range(steps + 1):
            bar.value = int(i / steps * 100)
            time.sleep(wait / steps)

    threading.Thread(target=_animate, daemon=True).start()

    # Inference
    try:
        resp = model.create_chat_completion(
            messages=msgs_full_zoe,
            max_tokens=250,
            temperature=0.4,
        )
        final_reply = resp["choices"][0]["message"]["content"].strip()
    except Exception as e:
        final_reply = f"Error: {e}"

    with progress_out:
        clear_output()

    with result_out:
        display(HTML(f"""
        <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                    background:#0d1420;border:2px solid #34d399;
                    border-radius:12px;padding:18px 20px;margin-top:10px;color:#e2e8f0">

          <div style="font-size:0.63em;color:#34d399;text-transform:uppercase;
                      letter-spacing:0.15em;margin-bottom:8px">Output</div>
          <h3 style="color:#f0f6ff;margin:0 0 14px;font-size:1.0em">🎯 Personalized AI Learning Roadmap</h3>

          <div style="background:#0d2b1a;border:1px solid #34d39944;border-radius:6px;
                      padding:8px 12px;margin-bottom:12px;font-size:0.75em;color:#6ee7b7;
                      display:flex;align-items:center;gap:6px">
            <span>📜</span>
            <span>Based on <strong style="color:#34d399">25 turns of conversation history</strong>
            — this roadmap reflects Zoe's actual questions, struggles, and progress during the session.</span>
          </div>

          <div style="background:#020408;border:1px solid #34d39922;border-radius:8px;
                      padding:14px 16px;font-size:0.82em;color:#cbd5e1;line-height:1.9">
            {render_bold(final_reply)}
          </div>

          <div style="margin-top:12px;font-size:0.72em;color:#7a9bb5;font-style:italic">
            Focus areas are prioritized based on Zoe's statistical background and local LLM development context.
          </div>

        </div>
        """))

    run_btn.disabled    = False
    run_btn.description = "▶ Run Again"

run_btn.on_click(on_run)

🎨 Bonus: Change the AI’s Style with One Line

System prompts don’t just control content — they control tone and style too.
Here we swap a single instruction to show how the same answer sounds completely different depending on who’s “speaking”.

💡 What to notice: Same question, same model, same facts — only the style instruction changes.

# Bonus — Interactive widget: same question, five different style instructions, instant side-by-side comparison

# ── Style options ─────────────────────────────────────────────────────
STYLES = {
    "👤 Friendly Senior":   "You are a helpful senior student. Answer in 1-2 sentences only, like you're explaining to a friend.",
    "🎉 Encouraging Coach": "Respond in an upbeat tone. Answer in 1-2 sentences only, then add one short encouragement.",
    "🎓 Professor":         "Respond formally. Answer in 1-2 sentences only, using precise terminology.",
    "🎤 Rap / Freestyle":   "Respond ONLY in rap or freestyle rhyme. Every sentence must rhyme. Keep it fun and educational.",
    "🇹🇼 中文回覆":          "請只用繁體中文回覆。無論問題是什麼語言,一律用繁體中文回答。不超過2-3句話。",
}
TOPIC    = "What is a context window in LLMs, and why should I care?"
TOPIC_ZH = "什麼是 LLM 的 context window?為什麼我需要了解它?"

# ── Widgets ───────────────────────────────────────────────────────────
style_dropdown = widgets.Dropdown(
    options=list(STYLES.keys()),
    description="Style:",
    layout=widgets.Layout(width="280px"),
)
send_btn = widgets.Button(
    description="▶ Try this style",
    button_style="primary",
    layout=widgets.Layout(width="160px", margin="0 0 0 10px"),
)
output_area = widgets.Output()

display(widgets.HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#1e1e2e;padding:16px 18px;border-radius:10px;margin-bottom:10px;
            border:1px solid #45475a">

  <div style="font-size:0.63em;color:#9399b2;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:6px">Bonus Demo</div>
  <h4 style="color:#cdd6f4;margin:0 0 10px 0">🎨 Same Question, Different Style</h4>

  <!-- What this demo shows -->
  <div style="background:#181825;border:1px solid #cba6f755;border-left:3px solid #cba6f7;
              border-radius:0 8px 8px 0;padding:10px 14px;margin-bottom:12px;
              font-size:0.8em;color:#a6adc8;line-height:1.7">
    System prompts don't just control <strong style="color:#89b4fa">content</strong> —
    they control <strong style="color:#89b4fa">tone and style</strong> too.<br>
    We swap a <strong style="color:#fab387">single system instruction</strong> to show how
    the same answer sounds completely different depending on who's "speaking".
  </div>

  <!-- The fixed question -->
  <div style="background:#181825;border:1px solid #a6e3a133;border-radius:8px;
              padding:10px 14px;font-size:0.82em;color:#a6e3a1;margin-bottom:6px">
    <span style="color:#6c7086;font-size:0.85em;display:block;margin-bottom:4px">
      💬 Fixed question (same for all styles):
    </span>
    <em>"{TOPIC}"</em>
  </div>

  <p style="color:#6c7086;margin:8px 0 0;font-size:0.78em">
    💡 Pick a style → hit <strong style="color:#cdd6f4">Try this style</strong> →
    compare outputs. Same model, same facts — only the instruction changes.
  </p>
</div>
"""))
display(widgets.HBox([style_dropdown, send_btn]))
display(output_area)

# ── Progress bar ──────────────────────────────────────────────────────

def show_progress(estimated_seconds):
    bar = widgets.IntProgress(
        value=0, min=0, max=100,
        bar_style='info',
        layout=widgets.Layout(width='100%', height='12px'),
    )
    label = widgets.HTML(
        f'<span style="color:#a6adc8;font-size:0.82em">'
        f'🧠 Model is thinking… (~{estimated_seconds:.0f}s)</span>'
    )
    with output_area:
        clear_output(wait=True)
        display(widgets.VBox([label, bar],
                layout=widgets.Layout(padding='10px 0')))

    def _animate():
        steps = 30
        for i in range(steps + 1):
            bar.value = int(i / steps * 100)
            time.sleep(estimated_seconds / steps)

    threading.Thread(target=_animate, daemon=True).start()

# ── Send handler ──────────────────────────────────────────────────────
def on_send(btn):
    send_btn.disabled    = True
    send_btn.description = "⏳ …"

    style_name  = style_dropdown.value
    instruction = STYLES[style_name]
    user_topic  = TOPIC_ZH if style_name == "🇹🇼 中文回覆" else TOPIC

    msgs = [
        {"role": "system", "content": instruction},
        {"role": "user",   "content": user_topic},
    ]

    # Show progress bar while model generates
    wait = estimated_wait(count_tokens(msgs, model))
    show_progress(wait)

    resp = model.create_chat_completion(
        messages=msgs,
        max_tokens=80,   # style demos are short by design — we want snappy, comparable outputs
        temperature=0.8,
    )
    reply = resp["choices"][0]["message"]["content"].strip()

    with output_area:
        clear_output(wait=True)
        display(HTML(f"""
        <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                    background:#1e1e2e;border:2px solid #cba6f7;
                    border-radius:10px;padding:16px;margin-top:4px">

          <!-- Style badge -->
          <div style="display:flex;align-items:center;gap:8px;margin-bottom:10px">
            <span style="background:#cba6f722;border:1px solid #cba6f7;border-radius:6px;
                         padding:3px 10px;color:#cba6f7;font-size:0.78em">
              Style: <strong style="color:#cdd6f4">{style_name}</strong>
            </span>
            <span style="color:#6c7086;font-size:0.75em">← only this line changed</span>
          </div>

          <!-- System prompt used -->
          <div style="background:#181825;border:1px solid #45475a;border-radius:6px;
                      padding:8px 12px;margin-bottom:10px;font-size:0.75em;color:#6c7086;
                      line-height:1.6">
            <span style="color:#9399b2;display:block;margin-bottom:2px">System prompt:</span>
            <em style="color:#a6adc8">{instruction}</em>
          </div>

          <!-- Model reply -->
          <div style="background:#313244;padding:12px 14px;border-radius:8px;
                      color:#cdd6f4;font-size:0.9em;line-height:1.7">
            {reply}
          </div>

          <div style="margin-top:10px;color:#9399b2;font-size:0.78em">
            💡 Now pick a different style and compare —
            <strong style="color:#a6e3a1">same question, same model</strong>, completely different voice.
          </div>
        </div>
        """))

    send_btn.disabled    = False
    send_btn.description = "▶ Try this style"

send_btn.on_click(on_send)

🗜️ Part 4: “The Assistant Is Getting Slower Every Day”

Two weeks into the semester, Prof. Eric messages Zoe:

“Students are complaining the assistant takes forever to respond. What’s going on?”

Zoe checks the logs. Some students have been chatting for 40+ turns. The context window is nearly full — and every call has to process the entire history from scratch.

She needs a way to shrink old history without losing important information.

Part 4a: Three Approaches to History Compression

Zoe’s first idea is simple: just delete old messages. But that causes amnesia — like a doctor who shreds a patient’s records before every visit. The model forgets everything the student said earlier.

Her solution: compress the old turns into a shorter summary, then keep that summary in context instead of the raw messages.

There are three ways to do this:

| Strategy | What it does | Trade-off |
| --- | --- | --- |
| Full Text Summarization | Rewrites old turns as 1-2 sentences | Simple, but may lose specific details |
| Key Entity Extraction | Pulls out names, tools, decisions as bullet points | Token-efficient, but misses narrative flow |
| Semantic Compression | Combines both — one summary + key bullets | Best quality, but output depends on summarizer quality |
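
Whichever strategy produces the compressed text, the splice back into context is the same move: keep the system prompt, replace the old turns with a single summary message, and keep the last few turns verbatim. A minimal sketch, where `summarize` stands in for any of the three `utils.py` strategies (assuming each takes a message list and returns a string):

```python
def compress_history(messages, summarize, keep_last=4):
    # messages[0] is the system prompt; the most recent turns stay verbatim.
    system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
    if not old:
        return messages          # history still short — nothing to compress
    summary_msg = {"role": "assistant",
                   "content": f"[Summary of {len(old)} earlier messages] {summarize(old)}"}
    return [system, summary_msg, *recent]
```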

👇 Pick a strategy below and see what it produces from Zoe’s actual 25-turn conversation history — the same one we’ve been building throughout this notebook.

Truncation Strategy Comparison

So Zoe has three strategies on paper. But which one actually works best for her students?

She runs a live experiment — feeding the same 25-turn conversation history through all three approaches, then compares what the model actually sees after each one.

🎯 Teaching goal: There is no single best strategy.
Each one trades off speed, information retention, and token efficiency differently. In production, you often combine all three.

The visualization below shows:

  • ❌ Dropped messages — what the model will never see again

  • ✅ Kept messages — what survives into the next call

  • 📦 Token savings — how much context window you reclaim

👇 Read each panel left → right: BEFORE shows the full history, AFTER shows what the model receives.

# Truncation Strategy Comparison: BEFORE/AFTER + what AI actually receives + token bar chart

# ═══════════════════════════════════════════════════════════
#  Truncation Strategy Comparison
#  Three strategies, three completely different AFTER states:
#
#  🪟 Sliding Window      → last N raw messages only (hard cutoff)
#  🔍 Entity Extraction   → ALL history → 1 structured facts block
#  🗜️ Summary Compression → ALL history → 1 natural-language summary
#
#  Visualizes per strategy:
#    1. BEFORE/AFTER bubbles  (❌ deleted · 🗜️ compressed · ✅ kept)
#    2. Exact context the AI receives (rendered verbatim)
#    3. Token bar + summary chart across all three
#
#  PREREQUISITE: model loaded (Step 0) + ZOE_HISTORY from Part 2b
# ═══════════════════════════════════════════════════════════

from IPython.display import display, HTML
import re

DEMO_HISTORY = ZOE_HISTORY

# ── Strategy 1: Sliding Window ───────────────────────────────
# Keeps only the last N messages. Everything before is deleted.
# The AI has NO knowledge of anything said before the cutoff.
def strategy_sliding_window(history, n=4):
    kept    = history[-n:]
    deleted = history[:-n]
    note    = f"Hard cutoff: only the last {n} messages survive. {len(deleted)} messages permanently deleted."
    # deleted msgs → state "dropped" (gone forever, no recovery)
    return kept, [], deleted, note

# ── Strategy 2: Entity Extraction ────────────────────────────
# Scans the ENTIRE history, pulls out structured facts.
# ALL original messages are replaced by 1 system message.
# No "keep last N" — the whole point is to compress everything.
def strategy_entity_extraction(history):
    all_text = " ".join(m["content"] for m in history)

    facts = {"Name": "Zoe"}
    m_origin  = re.search(r'from (\w+)', all_text, re.I)
    m_school  = re.search(r'studying (?:at )?(.+?)[.,]', all_text, re.I)
    m_project = re.search(r'building (?:a )?(.+?)[.,]', all_text, re.I)
    topics    = list(set(re.findall(r'learning (\w+)', all_text, re.I)))

    if m_origin:  facts["Origin"]  = m_origin.group(1)
    if m_school:  facts["School"]  = m_school.group(1)
    if m_project: facts["Project"] = m_project.group(1)
    if topics:    facts["Topics"]  = ", ".join(topics)

    summary_lines = "\n".join(f"  • {k}: {v}" for k, v in facts.items())
    summary = f"[ENTITY MEMORY — extracted from {len(history)} messages]\n{summary_lines}"

    kept       = [{"role": "system", "content": summary}]
    compressed = history   # ALL original messages → compressed into the 1 system msg
    note       = f"Scanned all {len(history)} messages → extracted {len(facts)} facts → replaced with 1 system message. The AI sees only structured facts, no conversation."
    return kept, compressed, [], note

# ── Strategy 3: Summary Compression ──────────────────────────
# Rewrites the ENTIRE history as natural-language prose.
# ALL original messages are replaced by 1 summary message.
# Feels more human than entity extraction — narrative is preserved.
def strategy_summary_compression(history):
    topics = set()
    for m in history:
        c = m["content"].lower()
        if "taiwan"   in c: topics.add("comes from Taiwan")
        if "berkeley" in c: topics.add("is studying at UC Berkeley")
        if "notebook" in c: topics.add("is building a teaching notebook about LLMs")
        if "llama"    in c: topics.add("is using llama-cpp-python for local inference")
        if "token"    in c: topics.add("is learning about token budgets and context windows")
        if "pandas"   in c: topics.add("has a stats background and is learning pandas")
        if "rag"      in c: topics.add("is exploring RAG for memory management")

    prose = (
        f"[SUMMARY — rewrote {len(history)} messages as natural language]\n"
        f"  Zoe is a student who " + ", ".join(topics) + ".\n"
        f"  This summary replaces the full conversation history to save tokens.\n"
        f"  The most recent context is captured here — no raw messages are kept."
    )

    kept       = [{"role": "assistant", "content": prose}]
    compressed = history   # ALL original messages → compressed into the 1 summary
    note       = f"Rewrote all {len(history)} messages as 1 natural-language summary. The AI reads prose, not raw dialogue — closest to how a human would recall a conversation."
    return kept, compressed, [], note

# ── Strategy registry ─────────────────────────────────────────
STRATEGIES = {
    "🪟 Sliding Window": {
        "fn":    strategy_sliding_window,
        "color": "#89b4fa",   # 🔵 technical term
        "explanation": (
            "<strong style='color:#f38ba8'>What the AI loses:</strong> "
            "every message before the cutoff — deleted forever, no summary, no facts.<br>"
            "<strong style='color:#a6e3a1'>What the AI keeps:</strong> "
            "the last 4 raw messages, exactly as written.<br>"
            "<strong style='color:#89b4fa'>The AI's memory:</strong> "
            "a short slice of recent chat — it has no idea who Zoe is."
        ),
    },
    "🔍 Entity Extraction": {
        "fn":    strategy_entity_extraction,
        "color": "#a6e3a1",   # 🟢 solution
        "explanation": (
            "<strong style='color:#f38ba8'>What the AI loses:</strong> "
            "all conversational flow — no back-and-forth, just extracted data.<br>"
            "<strong style='color:#a6e3a1'>What the AI keeps:</strong> "
            "structured facts (name, school, project, topics) in 1 system message.<br>"
            "<strong style='color:#89b4fa'>The AI's memory:</strong> "
            "like reading a student's file instead of their actual conversation."
        ),
    },
    "🗜️ Summary Compression": {
        "fn":    strategy_summary_compression,
        "color": "#fab387",   # 🟠 strategy / emphasis
        "explanation": (
            "<strong style='color:#f38ba8'>What the AI loses:</strong> "
            "exact wording and fine-grained details from older turns.<br>"
            "<strong style='color:#a6e3a1'>What the AI keeps:</strong> "
            "a natural-language retelling of the whole conversation in 1 message.<br>"
            "<strong style='color:#89b4fa'>The AI's memory:</strong> "
            "like a colleague catching you up — narrative intact, details softened."
        ),
    },
}

# ── Token counter ─────────────────────────────────────────────
def count_tokens_list(msgs):
    return sum(len(model.tokenize(m["content"].encode("utf-8"))) for m in msgs)

# ── Role colours ──────────────────────────────────────────────
ROLE_COLORS = {
    "user":      {"border": "#89b4fa", "label": "#89b4fa", "bg": "#89b4fa18"},
    "assistant": {"border": "#a6e3a1", "label": "#a6e3a1", "bg": "#a6e3a118"},
    "system":    {"border": "#f9e2af", "label": "#f9e2af", "bg": "#f9e2af18"},
}

def make_bubble(m, state="kept"):
    """
    state:
      'kept'       ✅  full opacity  — AI still sees this verbatim
      'dropped'    ❌  very faded    — permanently deleted (Sliding Window)
      'compressed' 🗜️  orange-tinted — turned into summary (Entity / Summary)
    """
    cfg = ROLE_COLORS.get(m["role"], {"border": "#7a9bb5", "label": "#7a9bb5", "bg": "#7a9bb518"})
    if state == "dropped":
        opacity, icon, border, bg = "0.18", "❌", cfg["border"] + "33", cfg["bg"]
    elif state == "compressed":
        opacity, icon, border, bg = "0.42", "🗜️", "#fab38799", "#fab38712"
    else:
        opacity, icon, border, bg = "1",    "✅", cfg["border"] + "88", cfg["bg"]

    return f"""
    <div style="margin-bottom:5px;opacity:{opacity}">
      <span style="color:{cfg['label']};font-size:0.68em;font-weight:bold">
        {icon} {m['role'].upper()}
      </span>
      <div style="background:{bg};border:1px solid {border};padding:5px 9px;
                  border-radius:5px;color:#cdd6f4;font-size:0.76em;
                  line-height:1.5;margin-top:2px;word-break:break-word">
        {m['content'][:90]}{"…" if len(m['content']) > 90 else ""}
      </div>
    </div>"""

def make_ai_context_box(kept, color):
    """
    Shows exactly what the AI receives — the full content of every message,
    especially the summary/facts block which is the whole point of compression.
    """
    rows = ""
    for m in kept:
        cfg = ROLE_COLORS.get(m["role"], {"border": "#7a9bb5", "label": "#7a9bb5", "bg": "#7a9bb518"})
        # Show full content — this IS what the AI reads
        rows += f"""
        <div style="margin-bottom:7px">
          <span style="color:{cfg['label']};font-size:0.68em;font-weight:bold">
            {m['role'].upper()}
          </span>
          <div style="background:{cfg['bg']};border:1px solid {cfg['border']}88;
                      padding:8px 11px;border-radius:5px;color:#cdd6f4;
                      font-size:0.76em;line-height:1.7;margin-top:2px;
                      white-space:pre-wrap;word-break:break-word">{m['content']}</div>
        </div>"""
    return f"""
    <div style="background:#0d0d1a;border:1px solid {color}55;border-radius:8px;
                padding:10px;max-height:340px;overflow-y:auto">
      {rows}
    </div>"""

def make_token_bar(result_tokens, original_tokens, color):
    pct     = result_tokens / original_tokens * 100
    savings = original_tokens - result_tokens
    return f"""
    <div style="margin-top:12px">
      <div style="display:flex;justify-content:space-between;
                  font-size:0.75em;color:#a6adc8;margin-bottom:4px">
        <span>Token usage after compression</span>
        <span>
          <strong style="color:{color}">{result_tokens}</strong>
          <span style="color:#585b70"> / {original_tokens} tokens · </span>
          <span style="color:#a6e3a1">saved {savings} ({100-pct:.0f}% reduction)</span>
        </span>
      </div>
      <div style="background:#313244;border-radius:4px;height:10px;overflow:hidden">
        <div style="width:{pct:.1f}%;background:{color};height:100%;border-radius:4px"></div>
      </div>
      <div style="display:flex;justify-content:space-between;
                  font-size:0.68em;color:#585b70;margin-top:3px">
        <span>0</span><span>{original_tokens} tokens (original)</span>
      </div>
    </div>"""

def render_strategy_panel(label, cfg, original, kept, compressed, deleted, note, original_tokens, result_tokens):
    compressed_set = set(m["content"] for m in compressed)
    deleted_set    = set(m["content"] for m in deleted)

    left_html = ""
    for m in original:
        if m["content"] in deleted_set:
            left_html += make_bubble(m, state="dropped")
        elif m["content"] in compressed_set:
            left_html += make_bubble(m, state="compressed")
        else:
            left_html += make_bubble(m, state="kept")

    color = cfg["color"]

    return f"""
    <div style="background:#1e1e2e;border:2px solid {color}55;border-radius:12px;
                padding:18px;margin-bottom:22px">

      <div style="color:{color};font-weight:bold;font-size:1.05em;margin-bottom:3px">{label}</div>
      <div style="color:#a6adc8;font-size:0.79em;margin-bottom:12px">{note}</div>

      <div style="background:{color}10;border-left:3px solid {color};border-radius:6px;
                  padding:10px 14px;margin-bottom:14px;color:#cdd6f4;
                  font-size:0.83em;line-height:1.9">
        {cfg['explanation']}
      </div>

      <div style="display:grid;grid-template-columns:1fr 1fr;gap:14px">

        <div>
          <div style="color:#585b70;font-size:0.70em;font-weight:bold;
                      text-transform:uppercase;letter-spacing:0.05em;margin-bottom:6px">
            BEFORE — {len(original)} messages
            <span style="font-weight:normal;margin-left:6px">
              ❌ deleted &nbsp;🗜️ compressed &nbsp;✅ kept
            </span>
          </div>
          <div style="background:#0d0d1a;border-radius:8px;padding:10px;
                      max-height:340px;overflow-y:auto">
            {left_html}
          </div>
        </div>

        <div>
          <div style="color:#585b70;font-size:0.70em;font-weight:bold;
                      text-transform:uppercase;letter-spacing:0.05em;margin-bottom:6px">
            WHAT THE AI ACTUALLY RECEIVES
            <span style="font-weight:normal;margin-left:6px">— its entire memory</span>
          </div>
          {make_ai_context_box(kept, color)}
        </div>

      </div>

      {make_token_bar(result_tokens, original_tokens, color)}
    </div>"""

def make_summary_chart(chart_data, original_tokens):
    rows = ""
    for label, d in chart_data.items():
        pct   = d["kept"] / original_tokens * 100
        color = d["color"]
        rows += f"""
        <div style="margin-bottom:12px">
          <div style="display:flex;justify-content:space-between;
                      font-size:0.80em;color:#cdd6f4;margin-bottom:4px">
            <span>{label}</span>
            <span>
              <strong style="color:{color}">{d['kept']} tokens</strong>
              <span style="color:#a6e3a1"> · saved {original_tokens - d['kept']} ({100-pct:.0f}%)</span>
            </span>
          </div>
          <div style="background:#313244;border-radius:4px;height:14px;overflow:hidden">
            <div style="width:{pct:.1f}%;background:{color};height:100%;border-radius:4px"></div>
          </div>
        </div>"""

    return f"""
    <div style="background:#1e1e2e;border:1px solid #45475a;border-radius:10px;
                padding:18px;margin-top:6px">
      <div style="color:#cdd6f4;font-weight:bold;font-size:0.95em;margin-bottom:4px">
        📊 Token Usage Comparison — All Three Strategies
      </div>
      <div style="color:#a6adc8;font-size:0.78em;margin-bottom:14px">
        Original history: <strong style="color:#f9e2af">{original_tokens} tokens</strong>
        &nbsp;·&nbsp; bars show what's left after each strategy
      </div>
      {rows}
      <div style="margin-top:14px;padding-top:12px;border-top:1px solid #313244;
                  font-size:0.82em;color:#a6adc8;line-height:1.8">
        <strong style="color:#cdd6f4">🧠 Key takeaway for Zoe's students:</strong><br>
        <span style="color:#89b4fa">■ Sliding Window</span>
          — fastest · lowest token cost · AI has no idea who Zoe is.<br>
        <span style="color:#a6e3a1">■ Entity Extraction</span>
          — AI knows the facts · but has never "heard" Zoe's voice.<br>
        <span style="color:#fab387">■ Summary Compression</span>
          — AI understands the narrative · costs one extra model call.<br>
        <strong>In production, tutoring assistants often combine all three.</strong>
      </div>
    </div>"""

# ── Main render ───────────────────────────────────────────────
original_tokens = count_tokens_list(DEMO_HISTORY)

header_html = f"""
<div style="background:#313244;border-left:4px solid #f9e2af;padding:12px 16px;
            border-radius:6px;margin-bottom:18px;font-size:0.88em">
  <strong style="color:#f9e2af">Zoe's full conversation history:</strong>
  <span style="color:#cdd6f4"> {len(DEMO_HISTORY)} messages · {original_tokens} tokens</span>
  <span style="color:#585b70"> · {original_tokens/4096*100:.1f}% of the 4096-token context window</span><br>
  <span style="color:#a6adc8;font-size:0.9em">
    Each strategy below shows a <em>completely different</em> picture of what the AI will receive. 👇
  </span>
</div>"""

panels_html = ""
chart_data  = {}

for label, cfg in STRATEGIES.items():
    kept, compressed, deleted, note = cfg["fn"](DEMO_HISTORY)
    t = count_tokens_list(kept)
    panels_html += render_strategy_panel(
        label, cfg, DEMO_HISTORY, kept, compressed, deleted, note, original_tokens, t
    )
    chart_data[label] = {"kept": t, "color": cfg["color"]}

display(HTML(header_html + panels_html + make_summary_chart(chart_data, original_tokens)))

Part 4b: Side-by-Side Comparison of Three Compression Strategies

# Strategies applied to Zoe's 25-turn history


conv_text   = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in ZOE_HISTORY)
orig_tokens = count_tokens(ZOE_HISTORY)  # ← uses utils count_tokens, same as rest of notebook

# ── Strategy definitions ─────────────────────────────────────────────
STRATEGIES = {
    "📝 Summarization": {
        "accent": "#38bdf8",
        "result": (
            "Zoe, a student from Taiwan, is studying AI and machine learning at UC Berkeley. "
            "She is focusing on language models and wants to understand context management, "
            "particularly memory and context. She is building a teaching notebook on LLM context "
            "management for a class and is using Python, Jupyter, and llama-cpp-python with a "
            "local Llama model. Zoe is looking for ways to improve the performance of her model, "
            "which is currently slow, and is considering using history compression and a fun mode "
            "with a playful tone."
        ),
        "prompt_lines": [
            ("You are a conversation summarizer.", False, ""),
            ("Summarize the following conversation into a single coherent paragraph.", True,  "→ forces ONE paragraph: context collapses into prose, detail is traded for flow"),
            ("Preserve the main topic, key decisions, and overall flow.", True,  "→ 'overall flow' keeps narrative arc, not just facts"),
            ("Be concise but complete. Output only the summary, no preamble.", False, ""),
        ],
    },
    "🏷️ Entity Extraction": {
        "accent": "#34d399",
        "result": (
            '{\n'
            '  "user": {\n'
            '    "name": "Zoe",\n'
            '    "origin": "Taiwan",\n'
            '    "school": "UC Berkeley",\n'
            '    "program": {\n'
            '      "name": "AI and Machine Learning",\n'
            '      "tools": ["Python", "Jupyter", "llama-cpp-python"]\n'
            '    },\n'
            '    "topics_to_teach": [\n'
            '      "Context management in LLMs",\n'
            '      "Token count and performance optimization"\n'
            '    ],\n'
            '    "decisions": [\n'
            '      "Use local Llama models to reduce API costs",\n'
            '      "Use history compression to improve performance"\n'
            '    ],\n'
            '    "background": "The 1B model is slow... [repeated sentences]",\n'
            '    "fun_mode": "Add a fun mode where the AI speaks in a playful tone."\n'
            '  }\n'
            '}'
        ),
        "prompt_lines": [
            ("You are a structured data extractor.", False, ""),
            ("Extract key entities from the conversation as compact JSON.", True,  "→ 'compact JSON' signals a strict format — but small models often invent their own structure anyway"),
            ("Required fields: person, project, topics_to_teach, decisions, background.", True,  "→ 📌 Teaching note: the 1B model ignored this schema entirely, invented a 'user' wrapper, repeated keys, and produced invalid JSON. We ran this twice with increasingly strict constraints — same result. Larger models (7B+) or models with JSON mode / function calling handle schema compliance much more reliably."),
            ("Output ONLY valid JSON. No preamble, no markdown fences.", True,  "→ the format constraint was partially followed (no preamble) — but schema compliance is a separate, harder problem for small models"),
        ],
    },
    "🔀 Semantic Compression": {
        "accent": "#a78bfa",
        "result": (
            "Zoe, a Taiwanese student at UC Berkeley, is studying AI and machine learning. "
            "She is excited to explore the context management of language models, particularly "
            "memory and context. Zoe is building a teaching notebook on LLM context management "
            "for a class and is interested in using Python, Jupyter, and llama-cpp-python with "
            "a local Llama model. She finds that running locally helps reduce API costs, but her "
            "model is slow, taking a few seconds per response. To improve performance, Zoe is "
            "considering history compression and wants to teach students about the difference "
            "between data workflows (Data 8 and Data 100) and system prompts.\n\n"
            "💡 Notice: this output looks very similar to Summarization above. The 1B model\n"
            "   does not meaningfully distinguish 'semantic compression' from general summarization.\n"
            "   This is a model capability limit, not a prompt problem."
        ),
        "prompt_lines": [
            ("You are an expert at semantic compression.", False, ""),
            ("Compress this conversation to its essential meaning.", False, ""),
            ("1. Keep all decisions AND their reasoning.", True,  "→ 'AND their reasoning' is what separates this from summarization — causality is preserved"),
            ("2. Merge redundant exchanges into single statements.", True,  "→ small-talk and back-and-forth collapses; only net information survives"),
            ("3. Preserve technical specifics (tool names, constraints, proper nouns).", True,  "→ named entities stay verbatim — 'llama-cpp-python' not 'a local model library'"),
            ("4. Write in third-person summary style.", False, ""),
            ("5. Target 30–40% of original token count.", True,  "→ explicit compression ratio gives the model a measurable target, not just 'be concise'"),
            ("6. Line 1 = person + core project. Then compress thematically.", False, ""),
        ],
    },
}

# ── HTML helpers ─────────────────────────────────────────────────────
def token_badge(label, count, accent):
    return (
        f'<span style="display:inline-flex;align-items:center;gap:4px;'
        f'background:#ffffff08;border:1px solid {accent}44;border-radius:4px;'
        f'padding:1px 7px;font-size:0.68em;color:{accent};font-family:monospace">'
        f'<span style="color:#7a9bb5">{label}</span>'
        f'<strong>{count}</strong>'
        f'<span style="color:#7a9bb5">tok</span></span>'
    )

def render_prompt_with_highlights(lines, accent):
    rows = ""
    for (line, is_key, reason) in lines:
        safe_line   = line.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
        safe_reason = reason.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
        if is_key:
            rows += f"""
            <div style="background:{accent}18;border-left:3px solid {accent};
                        border-radius:0 6px 6px 0;padding:5px 10px;margin:3px 0">
              <div style="color:{accent};font-family:'Fira Code',monospace;
                          font-size:0.76em;font-weight:600">{safe_line}</div>
              <div style="color:#94a3b8;font-size:0.69em;margin-top:3px;
                          font-style:italic">{safe_reason}</div>
            </div>"""
        else:
            rows += f"""
            <div style="padding:4px 10px;margin:2px 0;border-left:3px solid #1e293b">
              <div style="color:#6b7fa8;font-family:'Fira Code',monospace;
                          font-size:0.76em">{safe_line}</div>
            </div>"""
    return f"""
    <div style="background:#020408;border:1px solid #1e293b;border-radius:8px;
                padding:10px 8px;margin-top:4px">
      {rows}
    </div>"""

def render_panel(name):
    cfg         = STRATEGIES[name]
    accent      = cfg["accent"]
    res_tokens  = count_tokens([{"role": "assistant", "content": cfg["result"]}])
    saved       = orig_tokens - res_tokens
    pct         = round((saved / orig_tokens) * 100)
    result_safe = cfg["result"].replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
    prompt_html = render_prompt_with_highlights(cfg["prompt_lines"], accent)
    key_count   = sum(1 for _, is_key, _ in cfg["prompt_lines"] if is_key)

    # ── Entity Extraction teaching note (pulled out of annotation) ────
    entity_warning = ""
    if name == "🏷️ Entity Extraction":
        entity_warning = f"""
        <div style="background:#1a1200;border:1px solid #f59e0baa;border-left:3px solid #f59e0b;
                    border-radius:0 8px 8px 0;padding:10px 14px;margin-top:10px;
                    font-size:0.78em;color:#fde68a;line-height:1.7">
          <strong style="color:#f59e0b">⚠️ Teaching note:</strong>
          The 1B model ignored the required schema entirely — it invented its own <code>user</code> wrapper,
          repeated keys, and produced invalid JSON. We ran this twice with stricter constraints: same result.<br>
          <span style="color:#fbbf24">Larger models (7B+) or models with JSON mode handle schema compliance
          much more reliably.</span> This is a model capability limit, not a prompt problem.
        </div>"""

    return f"""
    <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                background:#080c12;border-radius:12px;padding:18px;color:#e2e8f0">

      <!-- Story context -->
      <div style="background:#0a1628;border:1px solid #3b5268;border-left:3px solid {accent};
                  border-radius:0 8px 8px 0;padding:8px 14px;margin-bottom:14px;
                  font-size:0.78em;color:#94b8d4;line-height:1.6">
        Zoe tries all three strategies on her own conversation history to decide
        which one to use in production. Results below were
        <strong style="color:#f0f6ff">pre-generated</strong> — the model is local and slow,
        so we cached the output for this side-by-side comparison.
      </div>

      <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px">

        <!-- LEFT: compressed result -->
        <div style="display:flex;flex-direction:column;gap:12px">

          <div style="background:#0d1420;border:2px solid {accent};
                      border-radius:10px;padding:16px">
            <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:12px">
              <span style="font-size:0.63em;font-weight:700;text-transform:uppercase;
                           letter-spacing:0.1em;color:{accent}">🗜️ Compressed Output</span>
              {token_badge("output", res_tokens, accent)}
            </div>
            <div style="background:#020408;border:1px solid {accent}22;border-radius:8px;
                        padding:12px 14px;color:#cbd5e1;font-size:0.78em;
                        line-height:1.75;white-space:pre-wrap">{result_safe}</div>
          </div>

          {entity_warning}

          <!-- token savings -->
          <div style="background:#0d1420;border:1px solid #1e293b;border-radius:10px;padding:14px">
            <div style="display:flex;justify-content:space-between;margin-bottom:10px">
              <span style="font-size:0.63em;color:#7a9bb5;text-transform:uppercase;letter-spacing:0.1em">📊 Token savings</span>
            </div>
            {"".join(f'''
            <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:7px">
              <span style="color:#7a9bb5;font-size:0.73em">{lbl}</span>
              <div style="display:flex;align-items:center;gap:8px">
                <div style="width:90px;height:5px;background:#1e293b;border-radius:3px">
                  <div style="width:{min(100,round(val/orig_tokens*100))}%;height:5px;background:{col};border-radius:3px"></div>
                </div>
                <strong style="font-size:0.73em;min-width:32px;text-align:right;color:{col}">{val}</strong>
              </div>
            </div>''' for lbl, val, col in [
                ("Original (25 turns)", orig_tokens, "#6b7fa8"),
                ("Compressed output",   res_tokens,  accent),
            ])}
            <div style="margin-top:10px;background:#020408;border-radius:6px;
                        padding:8px 12px;display:flex;justify-content:space-between;align-items:center">
              <span style="font-size:0.71em;color:#7a9bb5">💾 tokens saved</span>
              <span style="font-size:0.85em;color:#4ade80;font-weight:700">−{saved} ({pct}% reduction)</span>
            </div>
          </div>

        </div>

        <!-- RIGHT: system prompt with highlights -->
        <div style="background:#0d1420;border:1px solid {accent}33;border-radius:10px;padding:16px">
          <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px">
            <span style="font-size:0.63em;font-weight:700;text-transform:uppercase;
                         letter-spacing:0.1em;color:{accent}">🔍 System Prompt</span>
            <span style="font-size:0.68em;color:#7a9bb5">
              <span style="color:{accent}">{key_count} key lines</span> highlighted
            </span>
          </div>
          <div style="font-size:0.7em;color:#7a9bb5;margin-bottom:10px;line-height:1.5">
            Highlighted lines are the ones that shape the output above.
          </div>
          {prompt_html}
        </div>

      </div>
    </div>"""

# ── Widgets ──────────────────────────────────────────────────────────
header = widgets.HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#1e2d40;padding:14px 18px;border-radius:10px;
            margin:10px 0;border:1px solid #3b5268">
  <div style="font-size:0.6em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Context Window Management</div>
  <h3 style="color:#f0f6ff;margin:0 0 6px;font-size:1.1em">
    Why does each strategy produce a different result?
  </h3>
  <p style="color:#94b8d4;margin:0;font-size:0.78em;line-height:1.6">
    Same input: Zoe's {len(ZOE_HISTORY)}-turn conversation ({token_badge("", orig_tokens, "#7ea8c9")} tokens).<br>
    Select a strategy to see the compressed output and <strong style="color:#cde4f5">which prompt lines cause it</strong>.
  </p>
</div>
""")

strategy_toggle = widgets.ToggleButtons(
    options=list(STRATEGIES.keys()),
    description="",
    style={"button_width": "210px", "description_width": "0px"},
)
output_area = widgets.Output()

def on_change(change):
    if change["name"] == "value":
        with output_area:
            clear_output(wait=True)
            display(HTML(render_panel(change["new"])))

strategy_toggle.observe(on_change, names="value")

display(header)
display(strategy_toggle)
display(output_area)

with output_area:
    display(HTML(render_panel(strategy_toggle.value)))
# Conclusion: which compression strategy works best depending on model size


conclusion_html = """
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#0d1420;border:1px solid #1e293b;
            border-radius:12px;padding:20px 24px;margin-top:10px;color:#e2e8f0">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:8px">Conclusion</div>
  <h3 style="color:#f0f6ff;margin:0 0 16px;font-size:1.05em">
    Choosing a compression strategy depends on your model size
  </h3>

  <div style="display:flex;flex-direction:column;gap:10px;margin-bottom:20px">

    <div style="display:flex;gap:12px;align-items:flex-start;background:#0f1e10;
                border:1px solid #34d39933;border-radius:8px;padding:12px 14px">
      <span style="font-size:1.1em;margin-top:1px">✅</span>
      <div>
        <div style="color:#34d399;font-weight:700;font-size:0.85em;margin-bottom:3px">
          Summarization — works well on small models
        </div>
        <div style="color:#94a3b8;font-size:0.78em;line-height:1.6">
          Asks the model to do what it does best: generate free-form text.
          No strict format requirements means no format failures.
          The output may vary slightly each run, but it will always be usable.
        </div>
      </div>
    </div>

    <div style="display:flex;gap:12px;align-items:flex-start;background:#1f0e0e;
                border:1px solid #f8717133;border-radius:8px;padding:12px 14px">
      <span style="font-size:1.1em;margin-top:1px">❌</span>
      <div>
        <div style="color:#f87171;font-weight:700;font-size:0.85em;margin-bottom:3px">
          Entity Extraction — unreliable on small models
        </div>
        <div style="color:#94a3b8;font-size:0.78em;line-height:1.6">
          Requires strict schema compliance. The 1B model ignored the required fields,
          invented its own structure, and produced invalid JSON — even after adding
          explicit constraints. For structured output, use a larger model (7B+)
          or a model that supports JSON mode / function calling.
        </div>
      </div>
    </div>

    <div style="display:flex;gap:12px;align-items:flex-start;background:#150f1f;
                border:1px solid #a78bfa33;border-radius:8px;padding:12px 14px">
      <span style="font-size:1.1em;margin-top:1px">⚠️</span>
      <div>
        <div style="color:#a78bfa;font-weight:700;font-size:0.85em;margin-bottom:3px">
          Semantic Compression — limited benefit on small models
        </div>
        <div style="color:#94a3b8;font-size:0.78em;line-height:1.6">
          The intent is to preserve causal reasoning, not just surface content.
          In practice, the 1B model produced output nearly identical to Summarization.
          The distinction becomes meaningful at larger model sizes.
        </div>
      </div>
    </div>

  </div>

  <div style="background:#020408;border:1px solid #334155;border-radius:8px;
              padding:12px 16px;font-size:0.8em;color:#94a3b8;line-height:1.8">
    <span style="color:#f0f6ff;font-weight:700">Key takeaway: </span>
    Compression strategy choice is not just about what information you want to preserve —
    it also depends on what your model can reliably produce.
    When working with small local models, prefer strategies with
    <span style="color:#34d399">low output constraints</span> and
    <span style="color:#34d399">flexible formats</span>.
  </div>

</div>
"""

display(HTML(conclusion_html))

⚠️ Part 4c: What Zoe Almost Did Wrong

Before finding the right solution, Zoe tries a few things that seem reasonable — and discovers why each one fails.

These are three of the most common mistakes developers make with LLM context. Each one has a measurable cost.

🎯 Teaching goal: Every anti-pattern below produces a context-overflow error, degraded answer quality, or silent slowness. Run the cell to feel the difference.

👉 Why does this matter for a local 1B model?

On a cloud API (GPT-4, Claude), bad context design costs money. On our local Llama-3.2-1B, it costs time — students will sit waiting. The feedback is immediate and visceral, which makes this the perfect environment to learn context discipline.
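The arithmetic is unforgiving. At the ~25 tokens/sec prompt-processing speed this notebook assumes (the SPEED constant in the cell below):

# Rough wait-time estimate at ~25 tok/s (matches SPEED in the cell below):
for prompt_tokens in (200, 1500, 3800):
    print(f"{prompt_tokens:>4} prompt tokens -> ~{prompt_tokens/25:.0f}s before the reply starts")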

# Anti-pattern demo: three common context mistakes and their token cost on a local 1B model

SPEED = 25   # tokens/sec
N_CTX = 4096

def tok(text):
    return len(model.tokenize(text.encode("utf-8")))

def count_msgs(msgs):
    return sum(tok(m["content"]) for m in msgs)

# ── Anti-Pattern 1: Dump entire history without compression ───────────
raw_history = list(ZOE_HISTORY) * 2
ap1_msgs = (
    [{"role": "system", "content": "You are a helpful assistant."}]
    + raw_history
    + [{"role": "user", "content": "What should I do next?"}]
)
ap1_tok = count_msgs(ap1_msgs)

# ── Anti-Pattern 2: Bloated system prompt with irrelevant rules ───────
bloated_lines = [
    "You are a helpful AI assistant for UC Berkeley students.",
    "Always be polite. Never be rude. Always use proper grammar. Do not use slang.",
    "Remember to be helpful. Be concise. But also be thorough. Include examples.",
    "Do not make things up. Always cite sources when possible. Be encouraging.",
    "Use bullet points when appropriate. Avoid passive voice where possible.",
    "Remember: the student is always right. Treat every question with respect.",
    "The student may be stressed. Be empathetic. Consider cultural differences.",
    "Do not assume gender. Use inclusive language. Be aware of accessibility.",
    "Always end with a summary. Start with the most important point.",
    "If you are unsure, say so. If you know, say so confidently.",
    "Current date: 2025. Location: Berkeley, CA. Language: English (default).",
]
bloated_system = "\n".join(bloated_lines) + "\nExtra padding: " + "x " * 200

ap2_msgs = [
    {"role": "system", "content": bloated_system},
    {"role": "user",   "content": "How do I read a CSV file?"},
]
ap2_tok = count_msgs(ap2_msgs)

# ── Anti-Pattern 3: Raw DB dump injected into context ─────────────────
fake_db_rows = [
    {
        "id": i,
        "user": "zoe@berkeley.edu",
        "timestamp": "2025-01-01T00:00:00Z",
        "session_id": "sess_" + str(i).zfill(4),
        "event": "page_view",
        "page": "/page/" + str(i),
        "metadata": {"browser": "Chrome", "os": "macOS", "ip": "10.0.0." + str(i % 255)},
    }
    for i in range(30)
]
fake_db_dump = str(fake_db_rows)

ap3_msgs = [
    {"role": "system", "content": "You are a helpful assistant. Here is the user activity log: " + fake_db_dump},
    {"role": "user",   "content": "What courses should I take next semester?"},
]
ap3_tok = count_msgs(ap3_msgs)

# ── Good practice: minimal, targeted context ──────────────────────────
good_system = (
    "You are a helpful TA for UC Berkeley's data science program. "
    "The student (Zoe, from Taiwan) is in Data 100, learning LLM context management. "
    "Keep answers concise and include code examples."
)
good_msgs = [
    {"role": "system", "content": good_system},
    {"role": "user",   "content": "How do I read a CSV file?"},
]
good_tok = count_msgs(good_msgs)

# ── Patterns ──────────────────────────────────────────────────────────
PATTERNS = [
    {
        "label":   "Anti-Pattern 1",
        "icon":    "❌",
        "title":   "Dumping full history without compression",
        "code":    "messages = system + ZOE_HISTORY * 2 + [user_question]",
        "problem": "Zoe kept everything to avoid the model forgetting — but every message is re-read from scratch. At 100 turns the context window overflows; model throws an error or silently drops early turns.",
        "tokens":  ap1_tok,
        "accent":  "#f87171",
        "fix":     "Use sliding window or summary compression (Part 4a). The model only needs a compact summary of old turns, not every word.",
    },
    {
        "label":   "Anti-Pattern 2",
        "icon":    "❌",
        "title":   "Bloated system prompt with generic rules",
        "code":    "system = 'Be polite. Be concise. Be thorough...' x 50 lines",
        "problem": "Zoe copied a 'best practices' prompt from a tutorial. Most of those lines describe behaviour the model already defaults to — they burn tokens and dilute the instructions that actually matter.",
        "tokens":  ap2_tok,
        "accent":  "#fb923c",
        "fix":     "Keep system prompt under 150 tokens. Every line should earn its place — if the model would do it anyway, cut it.",
    },
    {
        "label":   "Anti-Pattern 3",
        "icon":    "❌",
        "title":   "Raw database / file dump into context",
        "code":    "system += str(db.query('SELECT * FROM logs LIMIT 30'))",
        "problem": "Zoe dumped the raw activity log to give the model 'more context'. The model didn't complain — but 30 JSON rows is 800+ tokens of irrelevant fields. Useful signal buried in noise, answer quality drops silently.",
        "tokens":  ap3_tok,
        "accent":  "#fbbf24",
        "fix":     "Extract only the fields the model actually needs. More context is not always better — targeted context is.",
    },
    {
        "label":   "Good Practice",
        "icon":    "✅",
        "title":   "Minimal, targeted system prompt",
        "code":    "system = 3-sentence profile + 1 behavioural rule",
        "problem": "After all three failures, Zoe stripped the prompt down to only what the model couldn't infer on its own: who the student is, what they're working on, and one clear behavioural rule.",
        "tokens":  good_tok,
        "accent":  "#34d399",
        "fix":     "",
    },
]

def render_card(p):
    wait    = p["tokens"] / SPEED
    pct     = p["tokens"] / N_CTX * 100
    bar_pct = min(100, pct)
    accent  = p["accent"]
    is_good = p["fix"] == ""   # the Good Practice card has no fix section to render

    fix_html = "" if is_good else f"""
    <div style="margin-top:10px;padding-top:10px;border-top:1px solid #1e293b;
                font-size:0.76em;color:#7a9bb5">
      <strong style="color:#34d399">Fix:</strong> <span style="color:#a6e3a1">{p['fix']}</span>
    </div>"""

    code_safe = p["code"].replace("<","&lt;").replace(">","&gt;")

    return f"""
    <div style="background:#0d1420;border:1px solid {accent}44;border-left:3px solid {accent};
                border-radius:8px;padding:14px 16px;margin-bottom:10px">

      <!-- header row -->
      <div style="display:flex;justify-content:space-between;align-items:baseline;
                  margin-bottom:8px;flex-wrap:wrap;gap:6px">
        <span style="color:{accent};font-weight:700;font-size:0.85em">
          {p['icon']} {p['label']}: {p['title']}
        </span>
        <span style="color:#7a9bb5;font-size:0.72em;font-family:monospace">
          {p['tokens']} tok &nbsp;|&nbsp; {pct:.1f}% of window &nbsp;|&nbsp; ~{wait:.1f}s wait
        </span>
      </div>

      <!-- token bar -->
      <div style="background:#1e293b;border-radius:3px;height:5px;margin-bottom:12px;width:100%">
        <div style="width:{bar_pct:.1f}%;height:5px;background:{accent};border-radius:3px"></div>
      </div>

      <!-- code + problem -->
      <div style="display:grid;grid-template-columns:1fr 1fr;gap:12px;font-size:0.78em">
        <div>
          <div style="color:#7a9bb5;margin-bottom:4px;font-size:0.9em">Code pattern</div>
          <code style="color:#cbd5e1;background:#020408;padding:4px 8px;
                       border-radius:4px;font-family:'Fira Code',monospace;
                       font-size:0.9em;display:block;line-height:1.5">{code_safe}</code>
        </div>
        <div>
          <div style="color:#7a9bb5;margin-bottom:4px;font-size:0.9em">Problem</div>
          <span style="color:#94b8d4;line-height:1.6">{p['problem']}</span>
        </div>
      </div>
      {fix_html}
    </div>"""

cards = "".join(render_card(p) for p in PATTERNS)

html = f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:18px 20px;color:#e2e8f0">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Part 4c</div>
  <h3 style="color:#f0f6ff;margin:0 0 8px;font-size:1.05em">
    Anti-Pattern Cost Comparison
  </h3>

  <!-- Story + Why -->
  <div style="background:#0a1628;border:1px solid #3b5268;border-left:3px solid #38bdf8;
              border-radius:0 8px 8px 0;padding:10px 14px;margin-bottom:14px;
              font-size:0.8em;color:#94b8d4;line-height:1.7">
    Before finding the right solution, Zoe tried three things that <em>seemed</em> reasonable.
    Each one made the assistant slower or less accurate — with no error message to explain why.<br><br>
    On a cloud API, bad context design costs <strong style="color:#f38ba8">money</strong>.
    On a local 1B model, it costs <strong style="color:#f38ba8">time</strong> — and students feel it immediately.
  </div>

  <p style="color:#7a9bb5;font-size:0.78em;margin:0 0 16px;line-height:1.6">
    Same local Llama 1B model. Different context designs. Token count directly determines wait time.
  </p>

  {cards}

  <div style="margin-top:4px;background:#0d1420;border:1px solid #1e293b;
              border-radius:8px;padding:12px 16px;font-size:0.78em;
              color:#7a9bb5;line-height:1.9">
    <div><span style="color:#f87171">Anti-Pattern 1</span> — can fill the entire context window with one bad session.</div>
    <div><span style="color:#fb923c">Anti-Pattern 2</span> — wastes tokens before the model even sees the question.</div>
    <div><span style="color:#fbbf24">Anti-Pattern 3</span> — introduces noise that degrades answer quality silently.</div>
    <div style="margin-top:8px;padding-top:8px;border-top:1px solid #1e293b;color:#94b8d4">
      <strong style="color:#f0f6ff">Key takeaway:</strong>
      Good context design = minimum tokens, maximum signal.
    </div>
  </div>

</div>"""

display(HTML(html))

📚 Part 5: Prof. Eric’s Next Request — Course-Specific Knowledge

Prof. Eric has a new idea:

“Can the assistant answer questions about specific assignments and course policies? That information isn’t in the model’s training data — it changes every semester.”

Zoe realises: context doesn’t have to come only from conversation history. In real products, it also comes from external sources — databases, documents, APIs, course notes.

Retrieval-Augmented Generation (RAG) is the pattern for doing this:

Student question
    ↓
Search course knowledge base for relevant snippets
    ↓
Inject snippets into the messages list
    ↓
Model answers using injected knowledge

Below is a minimal working demo using a fake Berkeley course catalogue — but the injection pattern is identical to what production RAG systems (LlamaIndex, LangChain, pgvector) do.

💡 What to observe: The model has no training knowledge about our fake course catalogue. Watch how it answers without and with the retrieved snippet injected into context.
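Stripped of the progress UI, the whole pattern fits in one function. A minimal sketch, assuming `model` from Step 0 and the keyword `retrieve()` helper defined in the demo cell below (utils.py exports a similar pair, `retrieve` and `rag_chat`):

# RAG in one function (sketch). Steps: 1. search, 2. inject, 3. answer.
def rag_answer(question, top_k=2):
    docs = retrieve(question, top_k=top_k)                       # 1. search
    context = "\n\n".join(f"### {d['title']}\n{d['content']}" for d in docs)
    system = "You are a helpful AI assistant for Berkeley students."
    if docs:                                                     # 2. inject
        system += "\n\n## Retrieved Knowledge\n" + context
    messages = [
        {"role": "system", "content": system},
        {"role": "user",   "content": question},
    ]
    resp = model.create_chat_completion(messages=messages,       # 3. answer
                                        max_tokens=180, temperature=0.7)
    return resp["choices"][0]["message"]["content"].strip()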

# RAG Demo: retrieve -> inject -> compare WITHOUT vs WITH context (real progress integration)
# ═══════════════════════════════════════════════════════════
#  Part 5 — RAG Demo
#  Three real steps, each updates the progress bar as it runs:
#    Step 1 — retrieve(): keyword search over knowledge base
#    Step 2 — model call WITHOUT retrieved context (baseline)
#    Step 3 — model call WITH retrieved context injected
#  PREREQUISITE: model loaded (Step 0), widgets + time imported
# ═══════════════════════════════════════════════════════════
import threading

# ── Knowledge base ────────────────────────────────────────────
KNOWLEDGE_BASE = [
    {
        "id": "course_001",
        "title": "CS 189/289A: Introduction to Machine Learning",
        "content": (
            "Covers theoretical foundations, algorithms, and applications for machine learning. "
            "Topics include supervised methods (linear models, trees, neural networks, ensemble methods), "
            "generative and discriminative probabilistic models, Bayesian learning, density estimation, "
            "clustering, and dimensionality reduction. Programming projects use Python (scikit-learn). "
            "Prerequisites: MATH 53, MATH 54, and CS 70 (or consent of instructor)."
        ),
        "source": "https://www2.eecs.berkeley.edu/Courses/CS189/",
    },
    {
        "id": "course_002",
        "title": "Data C100 / CS C100: Principles and Techniques of Data Science",
        "content": (
            "Covers the data science lifecycle: question formulation, data collection and cleaning, "
            "EDA and visualization, statistical inference, prediction, and decision-making. "
            "Topics include pandas, SQL, regex, linear regression, PCA, and clustering. "
            "Heavy use of Jupyter notebooks. Bridges Data 8 to upper-division CS and statistics courses. "
            "Good precursor to CS 189. "
            "Prerequisites: Data C8 (or STAT 20), CS 61A (or CS 88 / ENGIN 7). "
            "Co-requisite: MATH 54, 56, 110, EECS 16A, or PHYSICS 89."
        ),
        "source": "https://ds100.org/",
    },
    {
        "id": "course_003",
        "title": "CS 194/294-196: Large Language Model Agents",
        "content": (
            "Covers fundamental concepts for LLM agents: foundation of LLMs, essential LLM abilities "
            "for task automation, and infrastructures for agent development. "
            "Topics include RAG, context management, code generation, web automation, and agent design. "
            "Graduate-level (CS 294) and upper-division undergraduate (CS 194). "
            "Recommended background: CS 182, CS 188, or CS 189. "
            "Taught by Prof. Dawn Song. Variable-unit course (1-4 units)."
        ),
        "source": "https://rdi.berkeley.edu/llm-agents/f24",
    },
    {
        "id": "tip_001",
        "title": "Study tip: Managing LLM context in Python",
        "content": (
            "When building LLM applications, keep your messages list lean. "
            "Use summarisation for history older than 10 turns. "
            "Always separate system prompt from user context. "
            "Profile injection works better than dumping all user data inline."
        ),
        "source": "Course notebook -- Part 4",
    },
    {
        "id": "tip_002",
        "title": "Berkeley resource: JupyterHub access",
        "content": (
            "Berkeley students can access the shared JupyterHub at hub.data8.org. "
            "For local LLM models, use llama-cpp-python with n_ctx=4096. "
            "The 1B Llama model runs locally and avoids API costs, though inference is slower."
        ),
        "source": "Course notebook -- setup cell",
    },
]

# ── Retriever ─────────────────────────────────────────────────
STOPWORDS = {
    "i", "am", "a", "an", "the", "is", "are", "was", "be", "to", "of",
    "and", "in", "for", "on", "with", "what", "should", "after", "about",
    "my", "me", "we", "it", "that", "this", "can", "do", "how", "take",
}

def retrieve(query, top_k=2):
    query_words = {w.strip("?.,!") for w in query.lower().split()} - STOPWORDS
    scored = []
    for doc in KNOWLEDGE_BASE:
        doc_words = set((doc["title"] + " " + doc["content"]).lower().split())
        score = len(query_words & doc_words)
        scored.append((score, doc))
    scored.sort(key=lambda x: -x[0])
    return [doc for score, doc in scored[:top_k] if score > 0]

# ── Question ──────────────────────────────────────────────────
QUESTION   = "I am Zoe, a Data 100 student. What course should I take after Data 100 to learn about LLMs?"
SYSTEM_BASE = "You are a helpful AI assistant for Berkeley students."

# ── Widgets ───────────────────────────────────────────────────
out     = widgets.Output()
run_btn = widgets.Button(
    description="▶ Run RAG Demo",
    button_style="primary",
    layout=widgets.Layout(width="160px", margin="0 0 10px 0")
)

display(widgets.HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border:1px solid #3b5268;
            border-radius:12px;padding:14px 18px;margin:10px 0">
  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:6px">Part 5 — RAG Demo</div>
  <div style="background:#020408;border:1px solid #38bdf844;
              border-left:3px solid #38bdf8;border-radius:0 8px 8px 0;
              padding:8px 12px;font-size:0.8em;color:#94b8d4;font-style:italic">
    "{QUESTION}"
  </div>
  <div style="color:#7a9bb5;font-size:0.74em;margin-top:10px">
    This demo makes <strong style="color:#f0f6ff">2 separate model calls</strong> —
    one without context, one with retrieved knowledge injected.
    Progress updates in real time as each step completes.
  </div>
</div>"""))

display(run_btn, out)

# ── Progress helpers ──────────────────────────────────────────
def make_step_html(step, color, title, detail):
    return f"""
    <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                border-left:3px solid {color};padding:6px 12px;
                border-radius:0 6px 6px 0;background:{color}0d;margin-bottom:4px">
      <div style="color:{color};font-size:0.80em;font-weight:700">
        Step {step} of 3 — {title}
      </div>
      <div style="color:#7a9bb5;font-size:0.73em;margin-top:2px;font-style:italic">
        {detail}
      </div>
    </div>"""

bar        = widgets.IntProgress(value=0, min=0, max=100,
               layout=widgets.Layout(width='100%', height='12px'))
label_step = widgets.HTML()
label_time = widgets.HTML()

def set_progress(pct, color, step_html, elapsed=None):
    bar.value        = pct
    bar.style        = {'bar_color': color}
    label_step.value = step_html
    if elapsed is not None:
        label_time.value = (
            f'<div style="font-family:monospace;color:#475569;font-size:0.72em;margin-top:3px">'
            f'⏱ {elapsed:.1f}s elapsed</div>'
        )

# ── Render helpers ────────────────────────────────────────────
def render_doc_card(d):
    return f"""
    <div style="background:#0d1420;border-left:3px solid #a78bfa;
                padding:8px 12px;border-radius:0 6px 6px 0;margin-bottom:6px">
      <div style="color:#a78bfa;font-size:0.75em;font-weight:700;margin-bottom:2px">
        {d['title']}
      </div>
      <div style="color:#7a9bb5;font-size:0.72em;margin-bottom:4px">
        {d['content'][:120]}...
      </div>
      <div style="font-size:0.67em;color:#6b7fa8">
        Source: <a href="{d['source']}" style="color:#7a9bb5">{d['source']}</a>
      </div>
    </div>"""

def render_answer_card(label, accent, tokens, extra, reply):
    trunc = reply[:300] + ("..." if len(reply) > 300 else "")
    return f"""
    <div style="background:#0d1420;border:1px solid {accent}55;
                border-radius:10px;padding:14px">
      <div style="color:{accent};font-weight:700;font-size:0.82em;margin-bottom:4px">
        {label}
      </div>
      <div style="color:#7a9bb5;font-size:0.72em;margin-bottom:8px;font-family:monospace">
        {tokens} tokens &nbsp;|&nbsp; ~{tokens/25:.1f}s inference{extra}
      </div>
      <div style="background:#020408;border:1px solid {accent}22;border-radius:6px;
                  padding:10px 12px;color:#cbd5e1;font-size:0.80em;line-height:1.65">
        {trunc}
      </div>
    </div>"""

# ── Button handler ────────────────────────────────────────────
def on_run(_):
    run_btn.disabled    = True
    run_btn.description = "⏳ Running..."

    with out:
        clear_output()
        display(widgets.VBox([label_step, bar, label_time],
                             layout=widgets.Layout(padding='4px 0')))

    def _run():
        t0 = time.time()

        # Step 1 — Retrieve
        set_progress(5, "#38bdf8",
            make_step_html(1, "#38bdf8", "Searching knowledge base",
                           "Scoring all documents by keyword overlap with the question..."))
        retrieved_docs = retrieve(QUESTION)
        set_progress(30, "#38bdf8",
            make_step_html(1, "#38bdf8", "Knowledge base searched",
                           f"Found {len(retrieved_docs)} relevant documents · "
                           f"stopword-filtered query matched against {len(KNOWLEDGE_BASE)} docs"),
            elapsed=time.time() - t0)

        # Step 2 — Without RAG
        set_progress(35, "#fb923c",
            make_step_html(2, "#fb923c", "Running model call WITHOUT retrieved context",
                           "Generic system prompt only — no course knowledge injected."),
            elapsed=time.time() - t0)
        no_rag_msgs = [
            {"role": "system", "content": SYSTEM_BASE},
            {"role": "user",   "content": QUESTION},
        ]
        no_rag_tok  = sum(len(model.tokenize(m["content"].encode("utf-8"))) for m in no_rag_msgs)
        no_rag_resp = model.create_chat_completion(messages=no_rag_msgs, max_tokens=180, temperature=0.7)
        no_rag_reply = no_rag_resp["choices"][0]["message"]["content"].strip()
        set_progress(65, "#fb923c",
            make_step_html(2, "#fb923c", "Call 1 complete",
                           f"{no_rag_tok} tokens sent · model answered without any course knowledge"),
            elapsed=time.time() - t0)

        # Step 3 — With RAG
        set_progress(70, "#34d399",
            make_step_html(3, "#34d399", "Running model call WITH retrieved context injected",
                           f"Injecting {len(retrieved_docs)} retrieved docs into system prompt..."),
            elapsed=time.time() - t0)
        if retrieved_docs:
            context_block = "\n\n## Retrieved Knowledge\n"
            for d in retrieved_docs:
                context_block += f"\n### {d['title']}\n{d['content']}\n"
            system_with_context = SYSTEM_BASE + context_block
        else:
            system_with_context = SYSTEM_BASE
        rag_msgs = [
            {"role": "system", "content": system_with_context},
            {"role": "user",   "content": QUESTION},
        ]
        rag_tok  = sum(len(model.tokenize(m["content"].encode("utf-8"))) for m in rag_msgs)
        rag_resp = model.create_chat_completion(messages=rag_msgs, max_tokens=180, temperature=0.7)
        rag_reply = rag_resp["choices"][0]["message"]["content"].strip()

        total_elapsed = time.time() - t0
        set_progress(100, "#34d399",
            make_step_html(3, "#34d399", "All steps complete",
                           f"{rag_tok} tokens sent · retrieved context injected · results ready below"),
            elapsed=total_elapsed)

        # Render results
        retrieved_html = "".join(render_doc_card(d) for d in retrieved_docs) or \
            '<div style="color:#6b7fa8;font-size:0.8em">No documents retrieved.</div>'

        result_html = f"""
        <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                    background:#080c12;border-radius:12px;padding:18px 20px;
                    color:#e2e8f0;margin-top:10px">
          <h3 style="color:#f0f6ff;margin:0 0 12px;font-size:1.05em">📚 Part 5 — RAG Demo Results</h3>
          <div style="background:#020408;border:1px solid #38bdf844;
                      border-left:3px solid #38bdf8;border-radius:0 8px 8px 0;
                      padding:10px 14px;font-size:0.80em;color:#94b8d4;
                      margin-bottom:16px;font-style:italic">
            "{QUESTION}"
          </div>
          <div style="margin-bottom:16px">
            <div style="font-size:0.65em;color:#a78bfa;text-transform:uppercase;
                        letter-spacing:0.1em;margin-bottom:8px">
              🔍 Step 1 result — retrieved from knowledge base ({len(retrieved_docs)} docs injected)
            </div>
            {retrieved_html}
          </div>
          <div style="font-size:0.65em;color:#94b8d4;text-transform:uppercase;
                      letter-spacing:0.1em;margin-bottom:8px">
            Steps 2 & 3 — model answers compared
          </div>
          <div style="display:grid;grid-template-columns:1fr 1fr;gap:12px;margin-bottom:14px">
            {render_answer_card("Without RAG", "#f87171", no_rag_tok, "", no_rag_reply)}
            {render_answer_card("With RAG",    "#34d399", rag_tok,
                                f" · {len(retrieved_docs)} docs injected", rag_reply)}
          </div>
          <div style="background:#0d1420;border:1px solid #1e293b;border-radius:8px;
                      padding:12px 16px;font-size:0.78em;color:#7a9bb5;line-height:1.9">
            <span style="color:#f0f6ff;font-weight:700">What just happened:</span>
            The retriever searched our {len(KNOWLEDGE_BASE)}-doc knowledge base, found the most relevant entries,
            and injected them into the system prompt <em>before</em> calling the model.
            The model never "learned" our course catalogue — it just
            <span style="color:#cbd5e1">read it in context</span>.<br>
            Total time: <strong style="color:#f9e2af">{total_elapsed:.1f}s</strong>
            &nbsp;·&nbsp;
            Token overhead from RAG: <strong style="color:#a6e3a1">+{rag_tok - no_rag_tok} tokens</strong><br>
            In production: replace <code style="background:#020408;padding:1px 5px;
            border-radius:3px;color:#a78bfa">retrieve()</code> with a vector DB query
            (FAISS, pgvector, Pinecone) for true semantic similarity.
          </div>
        </div>"""

        out.clear_output()
        out.append_display_data(HTML(result_html))

        run_btn.disabled    = False
        run_btn.description = "▶ Run Again"

    threading.Thread(target=_run, daemon=True).start()

run_btn.on_click(on_run)
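
The "What just happened" note above suggests swapping the keyword-overlap `retrieve()` for a vector-database query. Here is a rough sketch of what that swap could look like, using `sentence-transformers` embeddings and a brute-force cosine search; the model name, `top_k`, and `min_score` are illustrative choices, not part of this notebook:

```python
# Sketch only: a drop-in semantic replacement for the keyword retrieve().
# Assumes KNOWLEDGE_BASE is the same list of {"title", "content"} dicts used above.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small, CPU-friendly model

# Embed every document once at startup; unit-length vectors make dot product = cosine.
_doc_vecs = _embedder.encode(
    [d["title"] + " " + d["content"] for d in KNOWLEDGE_BASE],
    normalize_embeddings=True,
)

def retrieve_semantic(question, top_k=2, min_score=0.3):
    """Return up to top_k docs whose embeddings are most similar to the question."""
    q_vec  = _embedder.encode([question], normalize_embeddings=True)[0]
    scores = _doc_vecs @ q_vec                    # cosine similarity per document
    ranked = np.argsort(scores)[::-1][:top_k]
    return [KNOWLEDGE_BASE[i] for i in ranked if scores[i] >= min_score]
```

A real vector DB (FAISS, pgvector, Pinecone) runs the same similarity search, just with an index instead of a full scan, which is what keeps it fast at thousands of documents.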

⚡ Part 6: “A Student Said They’re a Beginner. Now They’re Claiming to Be an Expert.”

Prof. Eric calls Zoe with an edge case:

“One of my students told the assistant they were a complete beginner. Now, three sessions later, they’re saying they have five years of Python experience. The assistant is still explaining things like they’re five. Can you fix that?”

The problem: the profile was set at the start and never updated. Zoe needs the assistant to detect when a student contradicts their earlier profile and adapt automatically.
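
Conceptually, `detect_profile_changes()` in utils.py layers a model call over a keyword safety net: ask the model to emit a JSON diff of the profile, then fall back to keyword matching when the small model's JSON fails to parse. A simplified sketch of that two-layer idea (the prompt wording and keyword list here are illustrative, not the actual utils.py code):

```python
# Simplified sketch of the two-layer idea behind detect_profile_changes().
# The real utils.py prompt and keyword list may differ; this is illustrative.
import json, re

FALLBACK_KEYWORDS = ["years of experience", "professionally", "i'm an expert"]

def sketch_detect_profile_changes(message, profile, model):
    prompt = (
        "Current user profile (JSON):\n" + json.dumps(profile) +
        "\n\nNew user message:\n" + message +
        "\n\nIf the message contradicts the profile, reply ONLY with a JSON object "
        'of fields to update, e.g. {"expertise": "professional"}. Otherwise reply {}.'
    )
    resp = model.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=60, temperature=0.0,            # deterministic JSON attempt
    )
    text = resp["choices"][0]["message"]["content"]
    try:
        # Layer 1: take the first {...} block; small models often add stray prose.
        updates = json.loads(re.search(r"\{.*\}", text, re.S).group(0))
        if updates:
            return updates
    except (AttributeError, json.JSONDecodeError):
        pass
    # Layer 2: keyword fallback, so the demo still works when the JSON fails.
    if any(kw in message.lower() for kw in FALLBACK_KEYWORDS):
        return {"expertise": "professional"}
    return {}
```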

🎯 Three things to watch for

| # | Goal | What you'll see |
|---|------|-----------------|
| 1 | Detection | AI spots the contradiction and prints ⚠️ MEMORY CONFLICT DETECTED! |
| 2 | Dynamic update | `user_profile["expertise"]` automatically changes from beginner to professional |
| 3 | Behaviour shift | The next response switches from beginner-friendly to concise and technical |

Run the three cells below in order. ↓

# Memory Conflict Detection Demo
# ═══════════════════════════════════════════════════════════
#
#  TRANSPARENCY NOTE — what is hard-coded vs real in this demo:
#
#  HARD-CODED (fixed, won't change each run):
#    - conflict_message       : the sentence Zoe says to trigger the conflict
#    - assistant.recent_history : pre-loaded conversation history (skips wait time)
#    - expert_question        : the follow-up question in Step 3
#    - expert_reply           : the expert-mode answer (written by us, not the model)
#
#  REAL MODEL OUTPUT (runs inference, may vary each run):
#    - assistant.chat(conflict_message) : Step 2 — model detects conflict + replies
#    - detect_profile_changes(...)      : called inside .chat(), model outputs JSON
#
#  Why hard-code some parts?
#  The 1B local model is slow (~25 tok/s). Pre-loading history and the expert reply
#  keeps the demo fast and predictable for teaching. The conflict detection in Step 2
#  is the real inference we want to showcase.
# ═══════════════════════════════════════════════════════════


# ── STEP 1: Initialize session ────────────────────────────────────────
# ⚠️ HARD-CODED: user profile, project profile, and pre-loaded history
# These are fixed so every student sees the same starting state.

assistant = ChatAssistant(
    user_profile={
        "name":             "Zoe",
        "language":         "English",
        "expertise":        "Python beginner",   # ← this is what will be contradicted
        "style_preferences": ["Simple and clear", "Include code examples"]
    },
    project_profile={
        "name":         "LLM Teaching Notebook",
        "description":  "Building a teaching notebook about LLM context management",
        "tools":        ["Python", "Jupyter", "llama-cpp-python"],
        "current_goal": "Teach context management concepts to students"
    },
    model=model,
    max_turns=4,
    chunk_size=2
)

# ⚠️ HARD-CODED: pre-loaded history — no model call, instant
# In a real session these would have been generated by the model.
assistant.recent_history = [
    {"role": "user",      "content": "How do I load a CSV file in Python?"},
    {"role": "assistant", "content": "Use pd.read_csv('filename.csv'). It returns a DataFrame."},
    {"role": "user",      "content": "What about checking for missing values?"},
    {"role": "assistant", "content": "Use df.isnull().sum() to count missing values per column."},
]
assistant.total_turns = 2

# ── Render Step 1 ─────────────────────────────────────────────────────
history_divs = "".join(f"""
<div style="display:flex;gap:12px;padding:8px 12px;border-bottom:1px solid #1e293b;
            background:{'#0d1420' if i%2==0 else '#0a1220'}">
  <span style="font-size:0.72em;font-weight:700;width:90px;flex-shrink:0;
               color:{'#38bdf8' if m['role']=='user' else '#34d399'}">{m['role'].upper()}</span>
  <span style="color:#cbd5e1;font-size:0.78em;line-height:1.6">{m['content']}</span>
</div>""" for i, m in enumerate(assistant.recent_history))

display(HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:18px 20px;color:#e2e8f0">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Part 6 — Step 1 of 3</div>
  <h3 style="color:#f0f6ff;margin:0 0 14px;font-size:1.0em">⚙️ Session Initialized</h3>

  <!-- hard-code banner -->
  <div style="background:#f9e2af18;border:1px solid #f9e2af44;border-left:3px solid #f9e2af;
              border-radius:0 8px 8px 0;padding:8px 14px;margin-bottom:14px;font-size:0.78em">
    <span style="color:#f9e2af;font-weight:700">⚠️ Hard-coded</span>
    <span style="color:#94b8d4"> — profile and history are fixed for this demo.
    No model call was made here.</span>
  </div>

  <div style="display:grid;grid-template-columns:1fr 1fr;gap:12px;margin-bottom:16px">
    <div style="background:#0d1420;border:1px solid #1e293b;border-radius:8px;padding:12px 14px">
      <div style="font-size:0.65em;color:#7ea8c9;text-transform:uppercase;
                  letter-spacing:0.1em;margin-bottom:8px">User Profile</div>
      <div style="font-size:0.78em;color:#94b8d4;line-height:1.8">
        <div><span style="color:#64748b">name      </span> Zoe</div>
        <div><span style="color:#64748b">expertise </span>
          <span style="color:#f87171">Python beginner</span>
          <span style="color:#64748b;font-size:0.85em"> ← will be contradicted in Step 2</span>
        </div>
        <div><span style="color:#64748b">style     </span> Simple, with code examples</div>
      </div>
    </div>
    <div style="background:#0d1420;border:1px solid #1e293b;border-radius:8px;padding:12px 14px">
      <div style="font-size:0.65em;color:#7ea8c9;text-transform:uppercase;
                  letter-spacing:0.1em;margin-bottom:8px">Project Profile</div>
      <div style="font-size:0.78em;color:#94b8d4;line-height:1.8">
        <div><span style="color:#64748b">name  </span> LLM Teaching Notebook</div>
        <div><span style="color:#64748b">tools </span> Python, Jupyter, llama-cpp-python</div>
      </div>
    </div>
  </div>

  <div style="background:#0d1420;border:1px solid #1e293b;border-radius:8px;overflow:hidden">
    <div style="padding:10px 14px;background:#0a1628;font-size:0.65em;color:#7ea8c9;
                text-transform:uppercase;letter-spacing:0.1em">
      💬 Pre-loaded history (2 turns)
      <span style="color:#f9e2af;font-weight:700"> — hard-coded, no model call</span>
    </div>
    {history_divs}
  </div>

  <div style="margin-top:12px;font-size:0.75em;color:#64748b;text-align:center">
    ⬇️ Run the next cell to trigger the conflict
  </div>
</div>
"""))


# ── STEP 2: Trigger conflict + detect ────────────────────────────────
# ⚠️ HARD-CODED: the message Zoe sends — fixed so the demo is predictable.
# ✅ REAL MODEL OUTPUT: assistant.chat() calls the model twice:
#     Call 1 — generate a reply to conflict_message
#     Call 2 — detect_profile_changes() asks the model to output JSON

conflict_message = "Actually, I've been working with Python professionally for 5 years."
# This sentence is chosen to reliably trigger the keyword fallback in detect_profile_changes()
# even if the model fails to return valid JSON ("years of experience" is in the keyword list).

display(HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:18px 20px;
            color:#e2e8f0;margin-top:10px">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Part 6 — Step 2 of 3</div>
  <h3 style="color:#f0f6ff;margin:0 0 10px;font-size:1.0em">⚠️ Conflict Triggered</h3>

  <!-- labels -->
  <div style="display:flex;gap:8px;margin-bottom:12px;flex-wrap:wrap">
    <span style="background:#f9e2af18;border:1px solid #f9e2af44;color:#f9e2af;
                 font-size:0.72em;padding:3px 10px;border-radius:20px">
      ⚠️ Hard-coded: conflict_message
    </span>
    <span style="background:#34d39918;border:1px solid #34d39944;color:#34d399;
                 font-size:0.72em;padding:3px 10px;border-radius:20px">
      ✅ Real model output: reply + conflict detection JSON
    </span>
  </div>

  <div style="background:#020408;border:1px solid #f8717144;border-left:3px solid #f87171;
              border-radius:0 8px 8px 0;padding:10px 14px;font-size:0.8em;color:#fca5a5;
              margin-bottom:4px">
    Zoe says: <em>"{conflict_message}"</em><br>
    <span style="color:#64748b;font-size:0.9em">
      Current profile says: <strong style="color:#f87171">Python beginner</strong>
    </span>
  </div>
  <div style="font-size:0.72em;color:#64748b;margin-bottom:12px;padding-left:4px">
    The model will now be called to: (1) reply to Zoe, (2) check if the profile needs updating.
  </div>
</div>
"""))

# ✅ REAL MODEL OUTPUT — two inference calls happen inside .chat()
reply = assistant.chat(conflict_message)

if assistant.conflicts_detected:
    last       = assistant.conflicts_detected[-1]
    resolution = last['resolution']
    change_lines = []
    if isinstance(resolution, dict):
        for section, fields in resolution.items():
            if isinstance(fields, dict):
                for k, v in fields.items():
                    change_lines.append(
                        f"<div>{k} &nbsp;→&nbsp; <span style='color:#34d399'>{v}</span></div>"
                    )
    resolution_html = "".join(change_lines) if change_lines else str(resolution)

    display(HTML(f"""
    <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                background:#0d1420;border:2px solid #34d399;
                border-radius:10px;padding:16px;margin-top:10px;color:#e2e8f0">
      <div style="color:#34d399;font-weight:700;font-size:0.85em;margin-bottom:10px">
        ✅ Memory Conflict Resolved
        <span style="color:#64748b;font-weight:normal;font-size:0.85em;margin-left:8px">
          — real model output
        </span>
      </div>
      <div style="font-size:0.78em;line-height:1.9;color:#94b8d4">
        <div><span style="color:#64748b">Detected at turn  </span> {last['turn']}</div>
        <div><span style="color:#64748b">Profile updated   </span></div>
        <div style="padding-left:16px;color:#e2e8f0">{resolution_html}</div>
        <div style="margin-top:6px">
          <span style="color:#64748b">AI now sees Zoe as  </span>
          <strong style="color:#34d399">{assistant.user_profile.get('expertise', 'N/A')}</strong>
        </div>
      </div>
    </div>
    """))
else:
    display(HTML("""
    <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                background:#0d1420;border:1px solid #f87171;border-radius:8px;
                padding:12px 16px;margin-top:10px;font-size:0.8em">
      <span style="color:#f87171">⚠️ No conflict detected this run.</span>
      <span style="color:#64748b"> The 1B model may have failed to return valid JSON.
      The keyword fallback also did not trigger — try running again.</span>
    </div>
    """))


# ── STEP 3: Expert mode response + token recap ───────────────────────
# ⚠️ HARD-CODED: expert_question and expert_reply
# Why hard-code the reply?
#   Running a third inference call would add ~30s wait for a teaching demo.
#   The reply below is written to show what an expert-mode response looks like —
#   concise, technical, no hand-holding. Compare it to a beginner response mentally:
#   a beginner answer would start with "pandas is a data analysis library that..."

expert_question = "Should I use polars instead of pandas for performance optimization?"

# ⚠️ HARD-CODED — written by us, not generated by the model this run
expert_reply = (
    "For datasets under ~500MB, pandas is fine. "
    "Polars is worth it when you need lazy evaluation, true parallelism, "
    "or are hitting pandas memory limits. "
    "Key difference: Polars uses Apache Arrow under the hood — "
    "faster groupby and joins, but less ecosystem support."
)

assistant.recent_history.append({"role": "user",      "content": expert_question})
assistant.recent_history.append({"role": "assistant", "content": expert_reply})
assistant.total_turns += 1

final_msgs = [{"role": "system", "content": assistant._build_system_message()}]
final_msgs += assistant.recent_history
t_final    = sum(len(model.tokenize(m["content"].encode("utf-8"))) for m in final_msgs)
SPEED      = 25

display(HTML(f"""
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:18px 20px;
            color:#e2e8f0;margin-top:10px">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Part 6 — Step 3 of 3</div>
  <h3 style="color:#f0f6ff;margin:0 0 14px;font-size:1.0em">🎯 Behaviour Shift — Expert Mode</h3>

  <!-- labels -->
  <div style="display:flex;gap:8px;margin-bottom:12px;flex-wrap:wrap">
    <span style="background:#f9e2af18;border:1px solid #f9e2af44;color:#f9e2af;
                 font-size:0.72em;padding:3px 10px;border-radius:20px">
      ⚠️ Hard-coded: expert_question + expert_reply
    </span>
    <span style="background:#a78bfa18;border:1px solid #a78bfa44;color:#a78bfa;
                 font-size:0.72em;padding:3px 10px;border-radius:20px">
      ✅ Real: system prompt was updated by Step 2's conflict detection
    </span>
  </div>

  <!-- profile change notice -->
  <div style="background:#a78bfa18;border:1px solid #a78bfa44;border-left:3px solid #a78bfa;
              border-radius:0 8px 8px 0;padding:10px 14px;margin-bottom:14px;font-size:0.82em">
    <span style="color:#a78bfa;font-weight:700">✨ System prompt has changed.</span>
    <span style="color:#94b8d4"> The AI now treats Zoe as an </span>
    <span style="color:#34d399;font-weight:700">EXPERT</span>
    <span style="color:#94b8d4"> — this part IS real, updated by Step 2.</span>
  </div>

  <!-- expert exchange -->
  <div style="background:#0d1420;border:1px solid #a78bfa33;border-radius:8px;
              padding:14px;margin-bottom:14px">
    <div style="font-size:0.65em;color:#a78bfa;text-transform:uppercase;
                letter-spacing:0.1em;margin-bottom:10px">
      Simulated expert-mode exchange
      <span style="color:#f9e2af;font-weight:700"> — reply is hard-coded</span>
    </div>
    <div style="font-size:0.78em;color:#38bdf8;margin-bottom:4px">👤 Zoe</div>
    <div style="background:#020408;border-radius:6px;padding:8px 12px;
                color:#cbd5e1;font-size:0.78em;margin-bottom:10px;
                border:1px solid #38bdf822">{expert_question}</div>
    <div style="font-size:0.78em;color:#34d399;margin-bottom:4px">
      🤖 AI (expert mode)
      <span style="color:#f9e2af;font-size:0.85em"> — hard-coded example</span>
    </div>
    <div style="background:#020408;border-radius:6px;padding:8px 12px;
                color:#cbd5e1;font-size:0.78em;border:1px solid #34d39922">{expert_reply}</div>
    <div style="margin-top:10px;padding-top:10px;border-top:1px solid #1e293b;
                font-size:0.72em;color:#7ea8c9;font-style:italic">
      💡 Compare: a <strong>beginner</strong> response would start with
      "pandas is a data analysis library that helps you work with tabular data..."<br>
      This response skips all that — no definition, no hand-holding, straight to the trade-off.
      That shift happened because Step 2 updated the system prompt.
    </div>
  </div>

  <!-- token recap -->
  <div style="background:#0d1420;border:1px solid #fbbf2444;border-radius:8px;padding:14px">
    <div style="font-size:0.65em;color:#fbbf24;text-transform:uppercase;
                letter-spacing:0.1em;margin-bottom:10px">🪙 Token Cost Recap</div>
    <div style="font-size:0.78em;color:#94b8d4;line-height:1.9">
      <div>
        <span style="color:#64748b">Context tokens  </span>
        <strong style="color:#f87171">{t_final}</strong>
        <span style="color:#7ea8c9"> / 4096 ({t_final/4096*100:.1f}% full)</span>
      </div>
      <div>
        <span style="color:#64748b">Est. wait time  </span>
        <span style="color:#fbbf24">~{t_final/SPEED:.1f}s</span>
        <span style="color:#7ea8c9"> at 25 tok/s</span>
      </div>
      <div>
        <span style="color:#64748b">Profile update cost  </span>
        <span style="color:#94b8d4">~1 extra model call to detect the conflict</span>
        <span style="color:#64748b"> — a fixed overhead per turn, not per token</span>
      </div>
    </div>
    <div style="margin-top:10px;padding-top:10px;border-top:1px solid #1e293b;
                font-size:0.75em;color:#64748b;line-height:1.6">
      This is why Part 4 (History Compression) matters —
      on a local 1B model, every saved token = less wait time for students.
    </div>
  </div>

</div>
"""))

# ✅ REAL: show_state() reads from the live assistant object updated by Step 2
assistant.show_state()

============================================================
Turn 3
============================================================
👤 User: "Actually, I've been working with Python professionally for 5 years."

⚠️  MEMORY CONFLICT DETECTED!
♻️  Profile updated: {'expertise': 'professional'}
    Next response will adjust accordingly.

🤖 AI: With 5 years of experience, you're likely familiar with a range of Python libraries and frameworks. What's been the most challenging part of your recent projects?
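
A quick back-of-envelope check of the token recap above, using the same ~25 tok/s figure the notebook assumes for the local 1B model (illustrative numbers only):

```python
# Illustrative only: tokens saved translate directly into seconds saved,
# because context tokens are processed at roughly SPEED tokens per second.
SPEED = 25  # tok/s, the figure used throughout this notebook
for ctx in (500, 1000, 2000, 4000):
    print(f"{ctx:>5} context tokens -> ~{ctx / SPEED:.0f}s before the reply starts")
```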

🎓 Final Demo: Putting It All Together — Zoe’s Finished Assistant

This is the final demo. Every concept from the notebook is now running live inside a single chat UI.

| What you'll see | Where it comes from |
|-----------------|---------------------|
| Responses adapt to skill level | Part 2 — dynamic system prompts |
| Profile updates automatically as you chat | Part 3 — conversation history |
| Context window fills up in real time | Part 4 — token budgets |
| Old messages compress when history gets long | Part 4 — semantic compression |
| Answers about course content the model wasn't trained on | Part 5 — RAG |

Prof. Eric asked for an assistant that adapts to 300 different students automatically. This is Zoe’s answer.

💡 Try these to see the system react:

  • “I actually have 3 years of Python experience” — watch the profile update and the tone shift

  • “What’s on the next homework?” — the model answers from injected context, not training data

  • Keep chatting until you see “— history compressed —” appear — that’s Part 4 running live

# Final Demo -- Putting It All Together: Zoe's finished assistant
# ══════════════════════════════════════════════════════════════
#  Combines Parts 1-6: dynamic system prompt, conversation history,
#  profile update detection, token budget + semantic compression,
#  RAG retrieval, and a threaded UI that stays live during inference.
#
#  PREREQUISITE: All cells above must have been run.
#  Uses: model, detect_profile_changes, compress_semantic,
#        retrieve, KNOWLEDGE_BASE  (all from utils.py / Part 5)
# ══════════════════════════════════════════════════════════════
#  SECTION 1 — Profile extraction from ZOE_HISTORY
# ══════════════════════════════════════════════════════════════
def _extract_profile_from_history(history):
    all_text = " ".join(m["content"] for m in history)
    user_profile = {
        "name":              "Zoe",
        "language":          "English",
        "expertise":         "Python beginner",
        "style_preferences": ["Keep answers concise", "Include code examples"],
    }
    m_origin = _re.search(r'from\s+(Taiwan|Hong Kong|China|Japan|Korea|Singapore)',
                           all_text, _re.I)
    if m_origin:
        user_profile["origin"] = m_origin.group(1)
    m_school = _re.search(r'(?:studying|student)\s+at\s+([\w\s]+?)(?:\.|,|\n)',
                           all_text, _re.I)
    if m_school:
        user_profile["school"] = m_school.group(1).strip()
    topics = list(set(_re.findall(r'learning\s+(\w+)', all_text, _re.I)))
    if topics:
        user_profile["topics"] = ", ".join(topics[:4])
    project_profile = {
        "name":         "LLM Teaching Notebook",
        "description":  "UC Berkeley course project",
        "tools":        ["Python", "Jupyter", "llama-cpp-python"],
        "current_goal": "Teach context management concepts to students",
    }
    m_project = _re.search(r'building\s+(?:a\s+)?([\w\s]+?)(?:\.|,|\n)',
                            all_text, _re.I)
    if m_project:
        project_profile["description"] = m_project.group(1).strip()
    return user_profile, project_profile
# ══════════════════════════════════════════════════════════════
#  SECTION 2 — Startup progress bar
# ══════════════════════════════════════════════════════════════
_startup_bar  = widgets.IntProgress(value=0, min=0, max=100,
                    layout=widgets.Layout(width="100%", height="14px"))
_startup_step = widgets.HTML(value="")
_startup_time = widgets.HTML(value="")
display(widgets.VBox([
    widgets.HTML(value="""
    <div style="font-family:'IBM Plex Mono','Fira Code',monospace;
                background:#080c12;border:1px solid #3b5268;
                border-radius:12px;padding:14px 18px;margin-bottom:8px">
      <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
                  letter-spacing:0.15em;margin-bottom:6px">Final Demo -- Startup</div>
      <div style="color:#94b8d4;font-size:0.80em">
        Loading Zoe's session from <strong style="color:#f0f6ff">ZOE_HISTORY</strong>
        -- runs Part 4 compression once so the demo begins with real context.
      </div>
    </div>"""),
    _startup_step, _startup_bar, _startup_time,
]))
def _set_startup(pct, color, title, detail, elapsed=None):
    _startup_bar.value  = pct
    _startup_bar.style  = {"bar_color": color}
    _startup_step.value = (
        f'<div style="font-family:\'IBM Plex Mono\',monospace;border-left:3px solid {color};'
        f'padding:6px 12px;border-radius:0 6px 6px 0;background:{color}0d;margin:6px 0">'
        f'<div style="color:{color};font-size:0.80em;font-weight:700">{title}</div>'
        f'<div style="color:#7a9bb5;font-size:0.73em;margin-top:2px;font-style:italic">{detail}</div>'
        f'</div>'
    )
    if elapsed is not None:
        _startup_time.value = (
            f'<div style="font-family:monospace;color:#475569;font-size:0.72em;margin-top:2px">'
            f'⏱ {elapsed:.1f}s elapsed</div>'
        )
# Step 1 — entity extraction
_t0 = _time.time()
_set_startup(5, "#38bdf8",
    "Step 1 of 3 -- Extracting profile from ZOE_HISTORY",
    f"Scanning {len(ZOE_HISTORY)} messages for name, origin, school, topics...")
_chat_profile, _chat_project = _extract_profile_from_history(ZOE_HISTORY)
_set_startup(30, "#38bdf8",
    "Step 1 of 3 -- Profile extracted",
    f"name={_chat_profile.get('name')} · "
    f"origin={_chat_profile.get('origin','?')} · "
    f"school={_chat_profile.get('school','?')}",
    elapsed=_time.time() - _t0)
# Step 2 — split history
_set_startup(35, "#fb923c",
    "Step 2 of 3 -- Splitting conversation history",
    "First 4 messages -> live context · rest queued for compression...")
_HISTORY_SEED = 4
_chat_history = list(ZOE_HISTORY[:_HISTORY_SEED])
_older        = ZOE_HISTORY[_HISTORY_SEED:]
_set_startup(55, "#fb923c",
    "Step 2 of 3 -- History split",
    f"{len(_chat_history)} messages -> live history · {len(_older)} -> compression",
    elapsed=_time.time() - _t0)
# Step 3 — semantic compression
_older_tokens = sum(len(model.tokenize(m["content"].encode())) for m in _older)
_set_startup(60, "#34d399",
    "Step 3 of 3 -- Running Part 4 semantic compression",
    f"Compressing {len(_older)} messages ({_older_tokens} tokens) · "
    f"~{_older_tokens // 25}s estimated...")
_chat_summary = None
if _older:
    try:
        _chat_summary = compress_semantic(_older, model)
    except Exception:
        _topics = set()
        for _m in _older:
            _c = _m["content"].lower()
            if "taiwan"   in _c: _topics.add("is from Taiwan")
            if "berkeley" in _c: _topics.add("studies at UC Berkeley")
            if "notebook" in _c: _topics.add("is building a teaching notebook")
            if "token"    in _c: _topics.add("is learning about token budgets")
            if "rag"      in _c: _topics.add("is exploring RAG")
        _chat_summary = (
            f"[Compressed from {len(_older)} messages] Zoe {', '.join(_topics)}."
        ) if _topics else None
_set_startup(100, "#34d399",
    "Step 3 of 3 -- Compression complete",
    "Zoe's session is ready · chat UI loading below...",
    elapsed=_time.time() - _t0)
_conflicts  = []
MAX_HISTORY = 8
# ══════════════════════════════════════════════════════════════
#  SECTION 3 — System prompt builder
# ══════════════════════════════════════════════════════════════
def _build_sys(rag_context=""):
    p, proj = _chat_profile, _chat_project
    lines = [
        "You are Prof. Eric's AI study assistant for UC Berkeley Data 8.",
        "",
        "## Student Profile",
        f"- Name: {p.get('name', 'Student')}",
        f"- Skill level: {p.get('expertise', 'intermediate')}",
        f"- Language: {p.get('language', 'English')}",
        "",
        f"## Course: {proj.get('name', '')}",
        f"- Tools: {', '.join(proj.get('tools', []))}",
        f"- Goal: {proj.get('current_goal', '')}",
    ]
    exp = p.get("expertise", "").lower()
    if any(w in exp for w in ["beginner", "new", "intro"]):
        lines += ["", "## Style",
                  "- Use simple language and analogies",
                  "- Always include a short code example",
                  "- Explain each step clearly"]
    elif any(w in exp for w in ["expert", "senior", "advanced", "professional", "experienced"]):
        lines += ["", "## Style",
                  "- Be concise and technical",
                  "- Skip basic explanations",
                  "- Focus on edge cases and best practices"]
    if _chat_summary:
        lines += ["", "## Earlier conversation (compressed -- Part 4)", _chat_summary]
    if rag_context:
        lines += ["", "## Retrieved Course Knowledge (Part 5 -- RAG)",
                  "# Use the information below to answer the student's question.",
                  rag_context]
    return "\n".join(lines)
def _tok(text):
    return len(model.tokenize(text.encode("utf-8")))
# ══════════════════════════════════════════════════════════════
#  SECTION 4 — Widgets
# ══════════════════════════════════════════════════════════════
chat_log = widgets.HTML(
    value='<div style="color:#585b70;font-size:0.78em;text-align:center;'
          'padding:20px 0">Session loaded — type a message below to begin.</div>',
    layout=widgets.Layout(
        width="100%", height="600px",   # increased from 340px
        border="1px solid #313244", border_radius="8px",
        overflow_y="auto", padding="10px"
    )
)
user_input = widgets.Text(
    placeholder="Type your question...",
    layout=widgets.Layout(width="82%")
)
send_btn = widgets.Button(
    description="Send",
    button_style="primary",
    layout=widgets.Layout(width="16%", margin="0 0 0 2%")
)
profile_html  = widgets.HTML(value="")
context_html  = widgets.HTML(value="")
conflict_html = widgets.HTML(value="")
log_html      = widgets.HTML(value="")
_chat_log_buffer = ""
# ══════════════════════════════════════════════════════════════
#  SECTION 5 — Render helpers
# ══════════════════════════════════════════════════════════════
def render_bubble(role, text, rag_docs=None, system_event=None):
    if system_event:
        color = {"profile": "#f38ba8", "compress": "#6b7fa8"}.get(system_event, "#585b70")
        return (
            f'<div style="text-align:center;color:{color};'
            f'font-size:0.75em;margin:4px 0">{text}</div>'
        )
    if role == "user":
        return (
            f'<div style="display:flex;justify-content:flex-end;margin:6px 0">'
            f'<div style="background:#89b4fa;color:#1e1e2e;padding:9px 14px;'
            f'border-radius:16px 16px 4px 16px;max-width:78%;'
            f'font-size:0.88em;line-height:1.6">{text}</div></div>'
        )
    rag_badge = ""
    if rag_docs:
        titles = " · ".join(d["title"][:35] + ("..." if len(d["title"]) > 35 else "")
                            for d in rag_docs)
        rag_badge = (
            f'<div style="font-size:0.70em;color:#a78bfa;margin-top:6px;'
            f'border-top:1px solid #45475a;padding-top:5px">RAG -- {titles}</div>'
        )
    return (
        f'<div style="display:flex;justify-content:flex-start;margin:6px 0">'
        f'<div style="background:#313244;color:#cdd6f4;padding:9px 14px;'
        f'border-radius:16px 16px 16px 4px;max-width:78%;'
        f'font-size:0.88em;line-height:1.6">{text}{rag_badge}</div></div>'
    )
def _append_chat(html):
    """Append HTML to chat log and scroll to bottom via Javascript."""
    global _chat_log_buffer
    _chat_log_buffer += html
    chat_log.value = _chat_log_buffer
    display(Javascript("""
        setTimeout(function() {
            var els = document.querySelectorAll('.widget-html-content');
            els.forEach(function(el) {
                if (el.scrollHeight > el.clientHeight) {
                    el.scrollTop = el.scrollHeight;
                }
            });
        }, 150);
    """))
def render_profile():
    rows = "".join(
        f'<div style="display:flex;justify-content:space-between;padding:6px 10px;'
        f'background:{"#0d1420" if i%2==0 else "#111827"};border-bottom:1px solid #1e293b">'
        f'<span style="color:#7a9bb5;font-size:0.78em">{k}</span>'
        f'<span style="color:#cdd6f4;font-size:0.78em;font-weight:bold">{v}</span></div>'
        for i, (k, v) in enumerate(
            {k: v for k, v in _chat_profile.items() if k != "style_preferences"}.items()
        )
    )
    profile_html.value = (
        f'<div style="background:#1e1e2e;border-radius:8px;overflow:hidden;margin-bottom:6px">'
        f'<div style="background:#0a1628;padding:8px 10px;'
        f'color:#89b4fa;font-size:0.78em;font-weight:bold">STUDENT PROFILE</div>'
        f'{rows}</div>'
    )
def render_context(pending_msg=""):
    sys_tok  = _tok(_build_sys())
    hist_tok = sum(_tok(m["content"]) for m in _chat_history)
    pend_tok = _tok(pending_msg) if pending_msg else 0
    total    = sys_tok + hist_tok + pend_tok
    pct      = min(100, total / 4096 * 100)
    bar_color = "#f38ba8" if pct > 75 else "#f9e2af" if pct > 40 else "#a6e3a1"
    pending_row = (
        f'<span><span style="color:#fab387">■</span>'
        f'<span style="color:#94b8d4"> Pending {pend_tok} tok</span></span>'
    ) if pend_tok else ""
    context_html.value = (
        f'<div style="background:#1e1e2e;border-radius:8px;padding:10px 12px;margin-bottom:6px">'
        f'<div style="color:#f9e2af;font-size:0.78em;font-weight:bold;margin-bottom:6px">'
        f'CONTEXT WINDOW</div>'
        f'<div style="width:100%;height:12px;background:#313244;border-radius:4px;'
        f'overflow:hidden;margin-bottom:6px">'
        f'<div style="display:inline-block;width:{sys_tok/4096*100:.1f}%;height:100%;background:#f9e2af"></div>'
        f'<div style="display:inline-block;width:{hist_tok/4096*100:.1f}%;height:100%;background:#89b4fa"></div>'
        f'<div style="display:inline-block;width:{pend_tok/4096*100:.1f}%;height:100%;background:#fab387;opacity:0.7"></div>'
        f'</div>'
        f'<div style="display:flex;gap:10px;font-size:0.73em;flex-wrap:wrap">'
        f'<span><span style="color:#f9e2af">■</span><span style="color:#94b8d4"> System {sys_tok} tok</span></span>'
        f'<span><span style="color:#89b4fa">■</span><span style="color:#94b8d4"> History {hist_tok} tok</span></span>'
        f'{pending_row}'
        f'<span style="color:{bar_color};font-weight:bold;margin-left:auto">{pct:.0f}% used</span>'
        f'</div></div>'
    )
def render_conflicts():
    if _conflicts:
        items = "".join(
            f'<div style="color:#f38ba8;font-size:0.76em;padding:4px 10px;'
            f'background:{"#0d1420" if i%2==0 else "#111827"};border-bottom:1px solid #1e293b">'
            f'Turn {c["turn"]}: {c["change"]}</div>'
            for i, c in enumerate(_conflicts)
        )
        conflict_html.value = (
            f'<div style="background:#1e1e2e;border-radius:8px;overflow:hidden;margin-bottom:6px">'
            f'<div style="background:#0a1628;padding:8px 10px;'
            f'color:#f38ba8;font-size:0.78em;font-weight:bold">PROFILE CHANGES</div>'
            f'{items}</div>'
        )
    else:
        conflict_html.value = (
            f'<div style="background:#1e1e2e;border-radius:8px;padding:10px 12px;margin-bottom:6px">'
            f'<div style="color:#7a9bb5;font-size:0.78em">PROFILE CHANGES<br>'
            f'<span style="color:#6b7fa8">None yet -- try "I actually have 3 years of experience"</span>'
            f'</div></div>'
        )
ROLE_LOG_COLOR = {"system": "#f9e2af", "user": "#89b4fa", "assistant": "#a6e3a1"}
def render_live_log(turn, messages, retrieved_docs, reply_tokens,
                    profile_changed, compressed):
    msg_rows = ""
    for i, m in enumerate(messages):
        color = ROLE_LOG_COLOR.get(m["role"], "#cdd6f4")
        tok   = _tok(m["content"])
        preview = m["content"].replace("<", "&lt;").replace(">", "&gt;")
        content_html = (
            f'<pre style="white-space:pre-wrap;margin:4px 0 0 0;font-size:0.72em;'
            f'color:#94b8d4;max-height:100px;overflow-y:auto">{preview}</pre>'
            if m["role"] == "system"
            else f'<div style="color:#94b8d4;font-size:0.75em;margin-top:3px">'
                 f'{preview[:100]}{"..." if len(preview) > 100 else ""}</div>'
        )
        connector = "|--" if i < len(messages) - 1 else "L--"
        msg_rows += (
            f'<div style="margin-bottom:8px">'
            f'<div style="display:flex;align-items:center;gap:6px">'
            f'<span style="color:#585b70;font-size:0.70em;font-family:monospace">{connector}</span>'
            f'<span style="color:{color};font-size:0.72em;font-weight:700;'
            f'background:{color}18;padding:1px 8px;border-radius:10px">{m["role"].upper()}</span>'
            f'<span style="color:#585b70;font-size:0.70em;margin-left:auto">{tok} tok</span>'
            f'</div>{content_html}</div>'
        )
    badges = ""
    if retrieved_docs:
        badges += f'<span style="background:#a78bfa18;border:1px solid #a78bfa44;color:#a78bfa;font-size:0.70em;padding:2px 8px;border-radius:10px;margin-right:4px">RAG: {len(retrieved_docs)} docs</span>'
    if profile_changed:
        badges += '<span style="background:#f38ba818;border:1px solid #f38ba844;color:#f38ba8;font-size:0.70em;padding:2px 8px;border-radius:10px;margin-right:4px">Profile updated</span>'
    if compressed:
        badges += '<span style="background:#fab38718;border:1px solid #fab38744;color:#fab387;font-size:0.70em;padding:2px 8px;border-radius:10px;margin-right:4px">Compressed</span>'
    total_tok = sum(_tok(m["content"]) for m in messages)
    log_html.value = (
        f'<div style="font-family:\'IBM Plex Mono\',\'Fira Code\',monospace;'
        f'background:#080c12;border:1px solid #313244;border-radius:8px;'
        f'padding:12px 14px;margin-bottom:8px">'
        f'<div style="display:flex;justify-content:space-between;align-items:center;'
        f'margin-bottom:8px;padding-bottom:8px;border-bottom:1px solid #1e293b">'
        f'<span style="color:#cdd6f4;font-weight:700;font-size:0.82em">Turn {turn}</span>'
        f'<div>{badges or "<span style=\'color:#585b70;font-size:0.70em\'>no special events</span>"}</div>'
        f'</div>'
        f'<div style="color:#585b70;font-size:0.68em;text-transform:uppercase;'
        f'letter-spacing:0.08em;margin-bottom:8px">'
        f'Messages sent to AI ({len(messages)} total · {total_tok} tokens)</div>'
        f'{msg_rows}'
        f'<div style="margin-top:8px;padding-top:8px;border-top:1px solid #1e293b;'
        f'display:flex;justify-content:space-between;font-size:0.73em">'
        f'<span style="color:#585b70">AI replied</span>'
        f'<span style="color:#a6e3a1">{reply_tokens} tokens generated</span>'
        f'</div></div>'
    ) + log_html.value
# ══════════════════════════════════════════════════════════════
#  SECTION 6 — Send logic (threaded)
# ══════════════════════════════════════════════════════════════
turn_count = [0]

# The "thinking" placeholder bubble is defined once so the exact same string
# can be stripped from the chat buffer when the real reply replaces it.
_LOADING_BUBBLE = (
    '<div style="display:flex;justify-content:flex-start;margin:6px 0" id="loading">'
    '<div style="background:#313244;color:#585b70;padding:9px 14px;'
    'border-radius:16px 16px 16px 4px;font-size:0.88em;font-style:italic">'
    'AI is thinking... '
    '<span style="color:#45475a;font-size:0.85em">(~25 tok/s -- hang tight)</span>'
    '</div></div>'
)
def on_send(btn):
    msg = user_input.value.strip()
    if not msg:
        return
    user_input.value     = ""
    send_btn.disabled    = True
    send_btn.description = "..."
    turn_count[0] += 1
    _append_chat(render_bubble("user", msg))
    _append_chat(_LOADING_BUBBLE)
    render_context(pending_msg=msg)
    def run():
        global _chat_summary
        try:
            _run_inference()
        except Exception as e:
            _append_chat(render_bubble(None, f"Error: {e}", system_event="profile"))
            send_btn.disabled    = False
            send_btn.description = "Send"
    def _run_inference():
        global _chat_summary, _chat_log_buffer
        # Part 5 — RAG retrieve
        retrieved_docs = retrieve(msg)
        rag_context    = "\n".join(
            f"### {d['title']}\n{d['content']}" for d in retrieved_docs
        ) if retrieved_docs else ""
        # Build messages
        messages = [{"role": "system", "content": _build_sys(rag_context=rag_context)}]
        messages += _chat_history
        messages.append({"role": "user", "content": msg})
        # Call 1 — generate reply
        resp         = model.create_chat_completion(messages=messages, max_tokens=200, temperature=0.7)
        reply        = resp["choices"][0]["message"]["content"].strip()
        reply_tokens = _tok(reply)
        _chat_history.append({"role": "user",      "content": msg})
        _chat_history.append({"role": "assistant",  "content": reply})
        # Remove the loading bubble now that the real reply is ready
        _chat_log_buffer = _chat_log_buffer.replace(_LOADING_BUBBLE, "")
        _append_chat(render_bubble("assistant", reply, rag_docs=retrieved_docs or None))
        # Call 2 — conflict detection (keyword-first, model as fallback)
        profile_changed = False
        experience_kws  = ["years of experience", "years experience", "professionally",
                            "i'm an expert", "i am an expert", "senior developer",
                            "professional developer"]
        if any(kw in msg.lower() for kw in experience_kws):
            old = _chat_profile.get("expertise", "?")
            if old != "professional":
                _chat_profile["expertise"] = "professional"
                _conflicts.append({"turn": turn_count[0], "change": f"expertise: {old} -> professional"})
                profile_changed = True
                _append_chat(render_bubble(None, "Profile updated -- tone shifts from next reply",
                                           system_event="profile"))
        else:
            updates = detect_profile_changes(msg, _chat_profile, _chat_project, model)
            if updates:
                updates.pop("conflict", False)
                if "user_profile" in updates:
                    for k, v in updates["user_profile"].items():
                        old = _chat_profile.get(k, "?")
                        if old != v:
                            _chat_profile[k] = v
                            _conflicts.append({"turn": turn_count[0], "change": f"{k}: {old} -> {v}"})
                            profile_changed = True
                    if profile_changed:
                        _append_chat(render_bubble(None, "Profile updated -- tone shifts from next reply",
                                                   system_event="profile"))
        # Part 4 — compress if history too long
        compressed = False
        if len(_chat_history) > MAX_HISTORY:
            to_compress       = _chat_history[:4]
            _chat_history[:4] = []
            try:
                _chat_summary = compress_semantic(to_compress, model)
                compressed    = True
                _append_chat(render_bubble(None, "-- history compressed (Part 4) --",
                                           system_event="compress"))
            except Exception:
                pass
        # Final side panel updates
        render_profile()
        render_context()
        render_conflicts()
        render_live_log(turn_count[0], messages, retrieved_docs,
                        reply_tokens, profile_changed, compressed)
        send_btn.disabled    = False
        send_btn.description = "Send"
    threading.Thread(target=run, daemon=True).start()
send_btn.on_click(on_send)
user_input.on_submit(on_send)
# ══════════════════════════════════════════════════════════════
#  SECTION 7 — Initial render + layout
# ══════════════════════════════════════════════════════════════
render_profile()
render_context()
render_conflicts()
left_panel = widgets.VBox(
    [chat_log, widgets.HBox([user_input, send_btn])],
    layout=widgets.Layout(width="55%", padding="0 12px 0 0")
)
right_panel = widgets.VBox(
    [profile_html, context_html, conflict_html,
     widgets.HTML(value=(
         '<div style="font-family:\'IBM Plex Mono\',monospace;color:#585b70;'
         'font-size:0.70em;text-transform:uppercase;letter-spacing:0.1em;'
         'padding:6px 0 4px 0;border-bottom:1px solid #313244;margin-bottom:6px">'
         'Behind the Scenes -- Live Log'
         '<span style="color:#45475a;font-weight:normal"> (updates each turn)</span>'
         '</div>'
     )),
     log_html],
    layout=widgets.Layout(width="45%", max_height="740px", overflow_y="auto")
)
header = widgets.HTML(value="""
<div style="background:#1e1e2e;border-radius:10px;padding:14px 18px;margin-bottom:14px">
  <div style="display:flex;align-items:center;gap:10px">
    <span style="font-size:1.3em">🎓</span>
    <div>
      <div style="color:#cdd6f4;font-weight:bold;font-size:1em">
        Prof. Eric's Data 8 Study Assistant
      </div>
      <div style="color:#94b8d4;font-size:0.78em">
        Built by Zoe · Parts 1–6 running live · Live log shows exactly what the AI receives
      </div>
    </div>
  </div>
  <div style="margin-top:10px;background:#313244;border-radius:6px;padding:8px 12px;
              color:#94b8d4;font-size:0.78em;line-height:2">
    You are continuing <strong style="color:#cdd6f4">Zoe's session</strong>
    -- her profile and history are already loaded from ZOE_HISTORY.<br>
    <span style="color:#89b4fa">"I actually have 3 years of Python experience"</span>
    -- watch profile update instantly + tone shift next reply<br>
    <span style="color:#fab387">"What courses should I take after Data 100?"</span>
    -- watch RAG badge + injected docs appear in live log<br>
    <span style="color:#a6e3a1">Keep chatting</span>
    -- until you see "-- history compressed --" (Part 4 running live)
  </div>
</div>""")
display(header, widgets.HBox([left_panel, right_panel]))

🗺️ Part 7: “How Would This Actually Work in Production?”

The assistant works. Prof. Eric is ready to deploy it to all 300 students. But Zoe realises: everything in this notebook runs in a single Python process, in memory, for one user at a time.

She needs to answer Prof. Eric’s final question:

“What does this look like as a real system — one that can handle hundreds of students simultaneously, persist their profiles across sessions, and stay fast?”

Every piece of this notebook maps directly to a real backend component. The table and diagram below show how.
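
As a concrete sketch, here is one possible shape of a single request flowing through such a backend. FastAPI and Redis are real libraries used as the diagram names them; `load_profile`, `retrieve_docs`, `build_system_prompt`, and `call_model` are hypothetical stand-ins for the Postgres profile store, the vector DB, and the model server:

```python
# Sketch only: one possible shape of the request path in the diagram below.
# load_profile / retrieve_docs / build_system_prompt / call_model are
# hypothetical stand-ins for Postgres, the vector DB, and the vLLM server.
import json
import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
cache = redis.Redis(decode_responses=True)   # "sticky note": recent turns only

class ChatRequest(BaseModel):
    student_id: str
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    profile = load_profile(req.student_id)                 # long-term store (Postgres)
    history = [json.loads(m)                               # short-term cache (Redis)
               for m in cache.lrange(f"hist:{req.student_id}", -10, -1)]
    docs = retrieve_docs(req.message)                      # vector DB query (Part 5)
    messages = [{"role": "system",
                 "content": build_system_prompt(profile, docs)}]   # Parts 2 + 5
    messages += history + [{"role": "user", "content": req.message}]
    reply = call_model(messages)                           # model server (vLLM)
    cache.rpush(f"hist:{req.student_id}",                  # persist the new turn
                json.dumps({"role": "user", "content": req.message}),
                json.dumps({"role": "assistant", "content": reply}))
    return {"reply": reply}
```

Everything the notebook did in one process (the profile dict, the history list, `retrieve()`, the model call) reappears here as a call to a separate service.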

# Production Architecture: how every notebook concept maps to a real backend system

part7_html = """
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:20px;color:#e2e8f0;margin-bottom:12px">

  <div style="font-size:0.63em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.15em;margin-bottom:4px">Part 7</div>
  <h3 style="color:#f0f6ff;margin:0 0 6px;font-size:1.1em">
    🗺️ How Would This Actually Work in Production?
  </h3>
  <p style="color:#94b8d4;font-size:0.82em;margin:0 0 8px;line-height:1.7">
    The assistant works. Prof. Eric is ready to deploy it to all 300 students.
    But Zoe realises: everything in this notebook runs in a single Python process,
    in memory, for one user at a time.
  </p>
  <div style="background:#020408;border:1px solid #38bdf844;border-left:3px solid #38bdf8;
              border-radius:0 8px 8px 0;padding:10px 14px;font-size:0.82em;
              color:#94b8d4;font-style:italic;margin-bottom:20px;line-height:1.7">
    "What does this look like as a real system — one that can handle hundreds of students
    simultaneously, persist their profiles across sessions, and stay fast?"
  </div>

  <!-- ASCII diagram -->
  <div style="font-size:0.65em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.1em;margin-bottom:8px">📐 System Architecture</div>
  <pre style="background:#020408;border:1px solid #1e293b;border-radius:8px;
              padding:16px;color:#34d399;font-size:0.78em;line-height:1.9;
              overflow-x:auto">
  🧑‍💻 Student's Browser
         |
         v
  +--------------------+
  |   Classroom Door   |  ← Checks who you are and if you're allowed in (Auth)
  +--------+-----------+
           |
           v
  +--------------------+
  |  Safety Filter     |  ← Blocks prompt injection & PII (Llama Guard)
  +--------+-----------+
           |
           v
  +--------------------+
  |   TA / API Server  |  ← Receives the question, decides how to respond
  +--+-----------+-----+
     |           |
     v           v
  📒 Sticky note   📚 Notebook
     (Redis)       (PostgreSQL)
  Last 10 turns    Full learning history
  Fast, temporary  Slower, permanent
     |           |
     v           v
  +--------------------+
  |     AI Model       |  ← Receives full context, generates a reply
  +--------------------+
         |
         v
  📬 Answer streams back to the student
     (one token at a time)
  </pre>

  <!-- Component table -->
  <div style="font-size:0.65em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.1em;margin:18px 0 8px">📋 Notebook vs Real System</div>
  <div style="background:#0d1420;border:1px solid #1e293b;border-radius:8px;overflow:hidden">

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                background:#0a1628;border-bottom:2px solid #1e293b;padding:10px 14px">
      <span style="color:#7ea8c9;font-size:0.72em;text-transform:uppercase;letter-spacing:0.08em">In this notebook</span>
      <span style="color:#7ea8c9;font-size:0.72em;text-transform:uppercase;letter-spacing:0.08em">In a real system</span>
      <span style="color:#7ea8c9;font-size:0.72em;text-transform:uppercase;letter-spacing:0.08em">School analogy</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;border-bottom:1px solid #1e293b;background:#0d1420">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">recent_history = []</span>
      <span style="color:#34d399;font-size:0.8em">Short-term cache (Redis)</span>
      <span style="color:#94b8d4;font-size:0.8em">Scratch paper during class — thrown away after</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;border-bottom:1px solid #1e293b;background:#111827">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">summary = compress(...)</span>
      <span style="color:#34d399;font-size:0.8em">Long-term memory DB (PostgreSQL)</span>
      <span style="color:#94b8d4;font-size:0.8em">Your study notes before finals — kept forever</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;border-bottom:1px solid #1e293b;background:#0d1420">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">retrieve() in RAG</span>
      <span style="color:#34d399;font-size:0.8em">Knowledge search (Vector DB)</span>
      <span style="color:#94b8d4;font-size:0.8em">Going to the library to look something up</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;border-bottom:1px solid #1e293b;background:#111827">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">model.create_chat_completion()</span>
      <span style="color:#34d399;font-size:0.8em">AI model server (vLLM)</span>
      <span style="color:#94b8d4;font-size:0.8em">The professor who actually answers the question</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;border-bottom:1px solid #1e293b;background:#0d1420">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">user_profile = {}</span>
      <span style="color:#34d399;font-size:0.8em">Student profile database</span>
      <span style="color:#94b8d4;font-size:0.8em">The school's student records system</span>
    </div>

    <div style="display:grid;grid-template-columns:1fr 1fr 1fr;
                padding:10px 14px;background:#111827">
      <span style="color:#38bdf8;font-family:monospace;font-size:0.85em">INJECTION_PATTERNS guard</span>
      <span style="color:#34d399;font-size:0.8em">Safety filter (Llama Guard)</span>
      <span style="color:#94b8d4;font-size:0.8em">The security guard at the school entrance</span>
    </div>

  </div>

  <!-- Key insight -->
  <div style="margin-top:14px;background:#0d1420;border:1px solid #fbbf2444;
              border-left:3px solid #fbbf24;border-radius:0 8px 8px 0;
              padding:12px 16px;font-size:0.82em;line-height:1.8">
    <strong style="color:#fbbf24">💡 Zoe's Takeaway:</strong><br>
    <span style="color:#cbd5e1">
      Every <code style="color:#38bdf8;background:#020408;padding:1px 5px;border-radius:3px">dict</code>
      and <code style="color:#38bdf8;background:#020408;padding:1px 5px;border-radius:3px">list</code>
      in this notebook becomes its own service in a real system.<br>
      <strong style="color:#f0f6ff">The concepts are identical</strong> —
      you just swap the simple tools for more powerful ones
      that can handle 300 students at once.
    </span>
  </div>

</div>
"""

flow_html = """
<div style="font-family:'IBM Plex Mono','Fira Code',monospace;
            background:#080c12;border-radius:12px;padding:20px;
            color:#e2e8f0;margin-top:10px">

  <div style="font-size:0.65em;color:#7ea8c9;text-transform:uppercase;
              letter-spacing:0.1em;margin-bottom:12px">⚡ Request Lifecycle — One Student Message</div>

  <!-- step 1 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#f8717122;border:1px solid #f87171;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#f87171">1</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #f8717133;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#f87171;font-size:0.78em;font-weight:700">🛡️ Safety Filter</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">blocks prompt injection & PII (Llama Guard)</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← covered in Part 2</div>
    </div>
  </div>

  <!-- step 2 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#38bdf822;border:1px solid #38bdf8;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#38bdf8">2</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #38bdf833;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#38bdf8;font-size:0.78em;font-weight:700">👤 Load Student Profile</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">name, expertise, style preferences</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← covered in Part 2</div>
    </div>
  </div>

  <!-- step 3 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#a78bfa22;border:1px solid #a78bfa;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#a78bfa">3</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #a78bfa33;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#a78bfa;font-size:0.78em;font-weight:700">📚 Retrieve RAG Docs</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">fetch relevant course materials</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← covered in Part 5</div>
    </div>
  </div>

  <!-- step 4 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#34d39922;border:1px solid #34d399;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#34d399">4</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #34d39933;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#34d399;font-size:0.78em;font-weight:700">🧩 Assemble Context Window</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">system prompt + history + RAG + new message</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← covered in Part 5 visualizer</div>
    </div>
  </div>

  <!-- step 5 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#f9e2af22;border:1px solid #f9e2af;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#f9e2af">5</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #f9e2af33;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#f9e2af;font-size:0.78em;font-weight:700">🤖 LLM Inference → Stream Reply</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">generate response token by token</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← production: vLLM + SSE</div>
    </div>
  </div>

  <!-- step 6 -->
  <div style="display:flex;align-items:stretch;gap:12px;margin-bottom:4px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#fbbf2422;border:1px solid #fbbf24;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#fbbf24">6</div>
      <div style="width:1px;background:#1e293b;flex:1;margin-top:4px"></div>
    </div>
    <div style="background:#0d1420;border:1px solid #fbbf2433;border-radius:8px;
                padding:10px 14px;flex:1;margin-bottom:4px">
      <div style="color:#fbbf24;font-size:0.78em;font-weight:700">🔄 Detect Profile Changes</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">update expertise if user contradicts profile</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← covered in Part 4</div>
    </div>
  </div>

  <!-- step 7 -->
  <div style="display:flex;align-items:stretch;gap:12px">
    <div style="display:flex;flex-direction:column;align-items:center;width:32px;flex-shrink:0">
      <div style="width:28px;height:28px;border-radius:50%;background:#94b8d422;border:1px solid #94b8d4;
                  display:flex;align-items:center;justify-content:center;
                  font-size:0.7em;font-weight:700;color:#94b8d4">7</div>
    </div>
    <div style="background:#0d1420;border:1px solid #94b8d433;border-radius:8px;
                padding:10px 14px;flex:1">
      <div style="color:#94b8d4;font-size:0.78em;font-weight:700">💾 Persist Turn + Update Profile</div>
      <div style="color:#64748b;font-size:0.72em;margin-top:2px">save to Redis (recent) + PostgreSQL (long-term)</div>
      <div style="color:#475569;font-size:0.68em;margin-top:4px">← production: Redis + PostgreSQL</div>
    </div>
  </div>

</div>
"""

display(HTML(part7_html))
display(HTML(flow_html))
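The last box on that map, Redis for recent turns plus PostgreSQL for long-term history, is the one step with no notebook counterpart. As a rough sketch of what it can look like, here is a hypothetical `persist_turn` using the redis-py client; neither the client nor the key layout is part of this notebook, so treat it as an assumption-laden illustration rather than the production design.

```python
# Hypothetical step-7 sketch: keep the hot recent history in Redis, one list
# per student. Assumes `pip install redis` and a server on localhost; both
# are illustrative choices, not dependencies of this notebook.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
MAX_RECENT = 10  # keep only the last N messages hot, like a sliding window

def persist_turn(student_id, role, content):
    """Append one message to the student's recent history, trimming old turns."""
    key = f"chat:{student_id}:recent"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.ltrim(key, -MAX_RECENT, -1)  # drop everything older than the last N
    # A real deployment would also INSERT this row into PostgreSQL here,
    # so the full long-term history survives after the Redis entry is trimmed.

def load_recent(student_id):
    """Rebuild a messages list in the same shape the notebook uses."""
    return [json.loads(m) for m in r.lrange(f"chat:{student_id}:recent", 0, -1)]
```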

📝 Summary: Zoe’s Complete LLM Context Engineering Lab


🗺️ The Journey, Part by Part

| Part | Prof. Eric’s request | Zoe’s solution |
|------|----------------------|----------------|
| 1 | “Can it remember our conversation?” | messages list — pass full history every call |
| 2 | “Can it adapt to 300 different students?” | Profile dict → auto-generated system prompt |
| 3 | “Can the assistant learn who a student is?” | Infer profile from conversation — no form needed |
| 4a | “It’s getting slower every day” | Three compression strategies: truncation, semantic, extraction |
| 4b | (Zoe’s own discovery) | Anti-patterns: what Zoe almost got wrong |
| 5 | “Can it know about our course materials?” | RAG — retrieve and inject external knowledge |
| 6 | “A student contradicted their earlier profile” | Conflict detection — auto-update profile mid-chat |
| Final | All parts running together | Live chat UI: profile + history + RAG + compression |
| 7 | “How would this work in production?” | Architecture map: Redis, PostgreSQL, vLLM, Vector DB |
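The first rows are the foundation everything else builds on. As a reminder of the shape, here is a minimal sketch of the Part 1 + Part 2 pattern; the student profile and the `llm(messages)` callable are hypothetical stand-ins, not the notebook’s actual `ChatAssistant`.

```python
# Parts 1 + 2 in miniature: a profile dict drives the system prompt, and the
# full messages list is re-sent on every call. `llm` is a stand-in for any
# chat-completion callable; the profile values are invented for illustration.
profile = {"name": "Maya", "course": "Data 8", "level": "beginner"}

messages = [{
    "role": "system",
    "content": (f"You are a study assistant for {profile['course']}. "
                f"{profile['name']} is a {profile['level']}, so keep it simple."),
}]

def chat(llm, user_msg):
    messages.append({"role": "user", "content": user_msg})
    reply = llm(messages)                      # pass the FULL history every call
    messages.append({"role": "assistant", "content": reply})
    return reply
```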

🔁 The Mental Model

Every student message flows through the same pipeline:

Student message
    ↓
[ Safety guard ]              ← blocks injection / PII
    ↓
Load student profile          ← Part 2
    ↓
Retrieve RAG docs             ← Part 5 (course materials)
    ↓
Assemble context window       ← Part 4a (token budget)
    ↓
LLM inference → stream reply  ← Production: vLLM + SSE
    ↓
Detect profile changes        ← Part 6 (conflict detection)
    ↓
Persist turn + update profile ← Production: Redis + PostgreSQL

You built every one of those steps in this notebook.
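If you want the whole pipeline in one place, here is a compact orchestration sketch. The helper names echo utils.py, but they are injected as plain callables because their exact signatures are assumptions here, not the notebook’s real wiring.

```python
# One student message through all seven steps. Every helper is passed in as a
# callable; the names mirror utils.py but the signatures are assumed.
def handle_message(llm, profile, history, user_msg, *,
                   is_safe, build_system_message, retrieve,
                   compress, detect_profile_changes, persist):
    # 1. Safety guard: refuse before spending any tokens
    if not is_safe(user_msg):
        return "Sorry, I can't help with that request."

    # 2–3. Profile → system prompt, plus retrieved course docs
    system_prompt = build_system_message(profile)
    docs = retrieve(user_msg)

    # 4. Assemble the context window, compressing old turns to fit the budget
    messages = (
        [{"role": "system", "content": f"{system_prompt}\n\nCourse notes:\n{docs}"}]
        + compress(history)
        + [{"role": "user", "content": user_msg}]
    )

    # 5. Inference (streaming omitted for brevity)
    reply = llm(messages)

    # 6. Conflict detection: update the profile if the message contradicts it
    detect_profile_changes(profile, user_msg)

    # 7. Persist the turn so the next call can rebuild this context
    history += [{"role": "user", "content": user_msg},
                {"role": "assistant", "content": reply}]
    persist(profile, history)
    return reply
```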


Context management is the invisible backbone of every LLM product you’ve ever used. You now understand what most LLM tutorials skip entirely. That’s a real head start. 🎉

# Quiz: LLM Context Engineering — Hangman Edition
# ══════════════════════════════════════════════════════════════
#  5 random questions. Wrong answer = one hangman stroke.
#  6 wrong answers = game over. Pure HTML/JS.
# ══════════════════════════════════════════════════════════════
import json
from IPython.display import display, HTML  # shuffling happens in the browser-side JS

QUESTION_POOL = [
    {
        "q": "Why does the LLM forget what was said in previous turns?",
        "options": ["The model's memory fills up after each reply", "The model is stateless — it only sees what you pass in right now", "The API deletes history to save bandwidth", "The system prompt overwrites previous messages"],
        "answer": 1, "explanation": "LLMs are stateless. Every call starts fresh — the only 'memory' is the messages list you pass in.", "part": "Part 1",
    },
    {
        "q": "What is the correct way to give the model memory of previous turns?",
        "options": ["Maintain a messages list and pass the full history with every API call", "Set a session cookie with the conversation ID", "Call model.remember() before each new message", "Use a larger model with more parameters"],
        "answer": 0, "explanation": "You maintain a messages list and pass the full history with every API call. The model has no built-in memory.", "part": "Part 1",
    },
    {
        "q": "Zoe needs to handle 300 students with different skill levels. What is her solution?",
        "options": ["Write 300 hardcoded system prompts, one per student", "Ask each student to manually set their level in a settings menu", "Store student info in a profile dict and auto-generate the system prompt", "Train a separate model for each skill level"],
        "answer": 2, "explanation": "A profile dict is injected into a template to auto-generate a personalised system prompt dynamically.", "part": "Part 2",
    },
    {
        "q": "What happens to token usage as a conversation grows longer?",
        "options": ["Token usage stays constant — only the latest message is sent", "Token usage grows linearly — every turn re-sends the full history", "Token usage decreases as the model gets more efficient", "Token usage only grows when the system prompt changes"],
        "answer": 1, "explanation": "Because you pass the full messages list every call, token usage grows with every turn.", "part": "Part 2",
    },
    {
        "q": "What does the system prompt control that a user message cannot?",
        "options": ["The model's response length", "The persona, tone, and instructions that frame every reply", "Which GPU the model runs on", "The temperature and sampling parameters"],
        "answer": 1, "explanation": "The system prompt sets the AI's persona, tone, and standing instructions — it shapes every reply.", "part": "Part 2",
    },
    {
        "q": "The assistant is getting slower every day. What is the root cause?",
        "options": ["The model server is overloaded with concurrent users", "The system prompt has grown too long", "The full conversation history is re-sent every call, filling the context window", "RAG retrieval is adding too many documents"],
        "answer": 2, "explanation": "After 40+ turns, the context window is nearly full. Every call processes the entire history from scratch.", "part": "Part 4",
    },
    {
        "q": "Which compression strategy preserves the most semantic meaning from old turns?",
        "options": ["Sliding window — keep only the last N messages", "Hard truncation — drop the oldest messages one by one", "Keyword extraction — keep sentences with important words", "Semantic compression — ask the model to summarise old turns"],
        "answer": 3, "explanation": "Semantic compression asks the model to summarise older turns, preserving meaning while drastically cutting token count.", "part": "Part 4",
    },
    {
        "q": "What is the main risk of sliding window truncation?",
        "options": ["It uses too many tokens", "It loses early context — like the student's name and background", "It makes the model reply more slowly", "It breaks the messages list format"],
        "answer": 1, "explanation": "Sliding window discards the oldest messages — often the ones containing the student's profile and background.", "part": "Part 4",
    },
    {
        "q": "Which of these is an anti-pattern Zoe discovered when managing context?",
        "options": ["Passing the system prompt as the first message", "Dumping the full user profile inline into every user message", "Using RAG to retrieve course documents before each call", "Compressing history older than 10 turns"],
        "answer": 1, "explanation": "Dumping all user data inline into every user message bloats the context. Profile injection belongs in the system prompt.", "part": "Part 4c",
    },
    {
        "q": "What problem does RAG (Retrieval-Augmented Generation) solve?",
        "options": ["It makes the model faster by caching common answers", "It lets the model answer questions about info it was never trained on", "It replaces the system prompt with retrieved documents", "It automatically compresses conversation history"],
        "answer": 1, "explanation": "RAG retrieves relevant documents and injects them into the context before calling the model.", "part": "Part 5",
    },
    {
        "q": "In the RAG demo, how does retrieve() find relevant documents?",
        "options": ["It uses vector embeddings and cosine similarity", "It asks the model which documents are relevant", "It counts keyword overlap between the query and each document", "It fetches documents randomly and filters by date"],
        "answer": 2, "explanation": "The demo uses simple keyword overlap. In production you would use a vector DB for semantic similarity.", "part": "Part 5",
    },
    {
        "q": "What would you use instead of keyword overlap in a production RAG system?",
        "options": ["A SQL LIKE query over a PostgreSQL table", "Vector similarity search using embeddings (FAISS, Pinecone, pgvector)", "A BM25 inverted index with TF-IDF scoring", "A simple Python list comprehension with string matching"],
        "answer": 1, "explanation": "Production RAG uses vector embeddings — nearest-neighbour search finds semantically similar documents.", "part": "Part 5",
    },
    {
        "q": "A student says they are a beginner, then later claims 5 years of experience. What should happen?",
        "options": ["Ignore the new claim and keep the original profile", "Ask the student to confirm which statement is correct", "Detect the contradiction and auto-update the profile", "Start a brand new session with a blank profile"],
        "answer": 2, "explanation": "Conflict detection compares the new message against the stored profile and updates automatically.", "part": "Part 6",
    },
    {
        "q": "In the conflict detection demo, what triggers a profile update without a model call?",
        "options": ["Any message longer than 20 words", "Keywords like 'years of experience' or 'senior developer'", "A change in the student's writing style", "The model's confidence score dropping below 0.5"],
        "answer": 1, "explanation": "Keyword matching runs first — fast and cheap. The model call is only a fallback.", "part": "Part 6",
    },
    {
        "q": "In a production system, what replaces the in-memory messages list for short-term history?",
        "options": ["A PostgreSQL table with one row per message", "A Redis cache — fast, temporary storage for the last 10 turns", "A vector database like FAISS or Pinecone", "A local JSON file written to disk after each turn"],
        "answer": 1, "explanation": "Redis handles short-term, fast-access history. PostgreSQL stores full long-term history.", "part": "Part 7",
    },
    {
        "q": "In the production architecture, what is the role of PostgreSQL?",
        "options": ["Fast cache for the last 10 turns of conversation", "Vector similarity search for RAG retrieval", "Long-term persistent storage for full conversation history and profiles", "Load balancer for distributing requests across model servers"],
        "answer": 2, "explanation": "PostgreSQL is the permanent store — it keeps the full learning history across sessions.", "part": "Part 7",
    },
    {
        "q": "What does vLLM provide in a production deployment?",
        "options": ["A vector database for RAG retrieval", "An efficient model server that handles many concurrent requests", "A compression algorithm for conversation history", "A safety filter that blocks prompt injection"],
        "answer": 1, "explanation": "vLLM is a high-throughput model serving framework that handles batching and concurrent requests efficiently.", "part": "Part 7",
    },
    {
        "q": "Which notebook concept maps to a Safety Filter (Llama Guard) in production?",
        "options": ["The STOPWORDS set in the RAG retriever", "The INJECTION_PATTERNS guard in the system prompt", "The MAX_HISTORY limit in the compression logic", "The conflict detection keyword list"],
        "answer": 1, "explanation": "The INJECTION_PATTERNS guard blocks prompt injection attempts. In production, this becomes a dedicated safety model like Llama Guard.", "part": "Part 7",
    },
    {
        "q": "Why does the Final Demo use threading.Thread for model calls?",
        "options": ["To run multiple model calls simultaneously for faster responses", "To keep the UI live and responsive while the model is generating", "To avoid hitting API rate limits", "To save memory by running the model in a separate process"],
        "answer": 1, "explanation": "Without threading, the UI freezes for 60-90s while the model runs.", "part": "Final Demo",
    },
    {
        "q": "What is the key advantage of widgets.HTML over widgets.Output for the chat log?",
        "options": ["widgets.HTML supports more HTML tags than widgets.Output", "widgets.HTML renders faster because it uses less memory", "widgets.HTML can be updated from any thread and renders immediately", "widgets.HTML automatically scrolls to the bottom of the chat"],
        "answer": 2, "explanation": "widgets.Output blocks all UI updates until the Python function finishes. widgets.HTML.value can be set from a background thread and renders immediately.", "part": "Final Demo",
    },
]

NUM_QUESTIONS = 5
MAX_WRONG     = 6
pool_json     = json.dumps(QUESTION_POOL)

html = f"""
<style>
  .hm-wrap {{
    font-family: 'IBM Plex Mono', 'Fira Code', monospace;
    background: #080c12;
    border-radius: 14px;
    padding: 20px;
    max-width: 860px;
  }}
  .hm-header {{
    background: linear-gradient(135deg, #0a1628, #0d1f3c);
    border: 1px solid #1e3a5f;
    border-radius: 12px;
    padding: 16px 20px;
    margin-bottom: 16px;
    display: flex;
    justify-content: space-between;
    align-items: center;
  }}
  .hm-title {{ color: #f0f6ff; font-weight: 700; font-size: 1.05em; }}
  .hm-sub   {{ color: #475569; font-size: 0.72em; margin-top: 4px; }}

  .hm-body {{
    display: flex;
    gap: 20px;
    align-items: flex-start;
    margin-bottom: 16px;
  }}

  /* Hangman SVG panel */
  .hm-scaffold {{
    flex-shrink: 0;
    background: #0d1420;
    border: 1px solid #1e293b;
    border-radius: 12px;
    padding: 12px;
    display: flex;
    flex-direction: column;
    align-items: center;
    gap: 8px;
    width: 160px;
  }}
  .hm-lives {{
    font-size: 0.72em;
    color: #475569;
    text-align: center;
  }}
  .hm-lives span {{ color: #f87171; font-weight: 700; }}

  /* Question panel */
  .hm-qpanel {{
    flex: 1;
  }}
  .hm-progress {{
    font-size: 0.72em;
    color: #475569;
    margin-bottom: 10px;
    display: flex;
    justify-content: space-between;
  }}
  .hm-progress .correct-count {{ color: #34d399; font-weight: 700; }}
  .hm-progress .wrong-count   {{ color: #f87171; font-weight: 700; }}

  .q-block {{
    background: #0d1420;
    border: 1px solid #1e293b;
    border-radius: 12px;
    padding: 16px 18px;
  }}
  .q-meta {{ display: flex; align-items: center; gap: 8px; margin-bottom: 10px; }}
  .q-part {{
    background: #1e3a5f; color: #38bdf8; font-size: 0.62em; font-weight: 700;
    padding: 2px 9px; border-radius: 20px; text-transform: uppercase; letter-spacing: 0.05em;
  }}
  .q-num  {{ color: #475569; font-size: 0.68em; }}
  .q-text {{ color: #e2e8f0; font-size: 0.88em; font-weight: 600; line-height: 1.6; margin-bottom: 12px; }}

  .opt {{
    background: #111827; border: 1px solid #1e293b; border-radius: 8px;
    padding: 11px 14px; margin-bottom: 7px; cursor: pointer;
    display: flex; align-items: flex-start; gap: 10px;
    transition: border-color 0.15s, background 0.15s;
  }}
  .opt:hover {{ border-color: #38bdf8; background: #0f1f35; }}
  .opt-letter {{
    background: #1e293b; color: #64748b; font-size: 0.70em; font-weight: 700;
    width: 20px; height: 20px; border-radius: 50%;
    display: flex; align-items: center; justify-content: center; flex-shrink: 0; margin-top: 1px;
  }}
  .opt-text {{ color: #94a3b8; font-size: 0.82em; line-height: 1.5; }}

  .opt.correct {{ background: #052e16; border-color: #34d399; cursor: default; }}
  .opt.correct .opt-letter {{ background: #34d399; color: #052e16; }}
  .opt.correct .opt-text   {{ color: #a7f3d0; }}
  .opt.wrong   {{ background: #1c0a0a; border-color: #f87171; cursor: default; }}
  .opt.wrong .opt-letter   {{ background: #f87171; color: #1c0a0a; }}
  .opt.wrong .opt-text     {{ color: #fca5a5; }}
  .opt.locked  {{ cursor: default; }}
  .opt.locked:hover        {{ border-color: #1e293b; background: #111827; }}
  .opt.locked.correct:hover {{ border-color: #34d399; background: #052e16; }}
  .opt.dim     {{ opacity: 0.35; cursor: default; pointer-events: none; }}

  .feedback {{
    margin-top: 8px; padding: 10px 14px; border-radius: 0 8px 8px 0;
    font-size: 0.78em; line-height: 1.6; color: #cbd5e1; display: none;
  }}
  .feedback.show       {{ display: block; }}
  .feedback.correct-fb {{ background: #34d39910; border-left: 3px solid #34d399; }}
  .feedback.wrong-fb   {{ background: #f8717110; border-left: 3px solid #f87171; }}

  .next-btn {{
    margin-top: 12px; padding: 8px 20px; border-radius: 8px;
    background: #1e3a5f; border: 1px solid #38bdf8; color: #38bdf8;
    font-family: inherit; font-size: 0.82em; font-weight: 700;
    cursor: pointer; display: none; transition: background 0.15s;
  }}
  .next-btn:hover {{ background: #0f1f35; }}
  .next-btn.visible {{ display: inline-block; }}

  .final-screen {{
    background: #0d1420; border-radius: 12px; padding: 28px 24px;
    text-align: center; display: none;
  }}
  .final-screen.show {{ display: block; }}
  .final-big {{ font-size: 3.0em; font-weight: 700; line-height: 1; margin-bottom: 8px; }}
  .final-msg {{ font-size: 0.88em; margin-bottom: 16px; }}
  .retry-btn {{
    display: inline-block; padding: 10px 24px; border-radius: 8px;
    font-family: inherit; font-size: 0.85em; font-weight: 700;
    cursor: pointer; border: 1px solid; transition: opacity 0.15s;
  }}
  .retry-btn:hover {{ opacity: 0.75; }}

  /* Hangman SVG strokes */
  .hm-svg line, .hm-svg circle, .hm-svg path {{
    stroke-linecap: round;
    transition: opacity 0.3s;
  }}
  .hm-stroke {{ opacity: 0; }}
  .hm-stroke.drawn {{ opacity: 1; }}
</style>

<div class="hm-wrap">

  <div class="hm-header">
    <div>
      <div style="font-size:0.58em;color:#38bdf8;text-transform:uppercase;letter-spacing:0.15em;margin-bottom:3px">LLM Context Engineering</div>
      <div class="hm-title">🪢 Hangman Quiz</div>
      <div class="hm-sub">Wrong answer = one stroke. 6 strokes = game over.</div>
    </div>
    <div style="text-align:right;color:#475569;font-size:0.70em">
      <div>{NUM_QUESTIONS} questions</div>
      <div style="color:#38bdf8;font-weight:700">{MAX_WRONG} lives</div>
    </div>
  </div>

  <div class="hm-body">

    <!-- Hangman figure -->
    <div class="hm-scaffold">
      <svg class="hm-svg" width="120" height="140" viewBox="0 0 120 140">
        <!-- Gallows (always visible) -->
        <line x1="10" y1="135" x2="110" y2="135" stroke="#1e3a5f" stroke-width="3"/>
        <line x1="30"  y1="135" x2="30"  y2="10"  stroke="#1e3a5f" stroke-width="3"/>
        <line x1="30"  y1="10"  x2="75"  y2="10"  stroke="#1e3a5f" stroke-width="3"/>
        <line x1="75"  y1="10"  x2="75"  y2="28"  stroke="#1e3a5f" stroke-width="3"/>
        <!-- Stroke 1: head -->
        <circle class="hm-stroke" id="hm-s1" cx="75" cy="38" r="10" stroke="#f87171" stroke-width="2.5" fill="none"/>
        <!-- Stroke 2: body -->
        <line class="hm-stroke" id="hm-s2" x1="75" y1="48" x2="75" y2="90" stroke="#f87171" stroke-width="2.5"/>
        <!-- Stroke 3: left arm -->
        <line class="hm-stroke" id="hm-s3" x1="75" y1="60" x2="52" y2="78" stroke="#f87171" stroke-width="2.5"/>
        <!-- Stroke 4: right arm -->
        <line class="hm-stroke" id="hm-s4" x1="75" y1="60" x2="98" y2="78" stroke="#f87171" stroke-width="2.5"/>
        <!-- Stroke 5: left leg -->
        <line class="hm-stroke" id="hm-s5" x1="75" y1="90" x2="52" y2="115" stroke="#f87171" stroke-width="2.5"/>
        <!-- Stroke 6: right leg -->
        <line class="hm-stroke" id="hm-s6" x1="75" y1="90" x2="98" y2="115" stroke="#f87171" stroke-width="2.5"/>
      </svg>
      <div class="hm-lives">Lives left: <span id="lives-left">{MAX_WRONG}</span></div>
      <div id="wrong-pills" style="display:flex;flex-wrap:wrap;gap:3px;justify-content:center;margin-top:4px"></div>
    </div>

    <!-- Question panel -->
    <div class="hm-qpanel">
      <div class="hm-progress">
        <span>Question <span id="q-current">1</span> / {NUM_QUESTIONS}</span>
        <span><span class="correct-count" id="correct-count">0</span> correct &nbsp; <span class="wrong-count" id="wrong-count">0</span> wrong</span>
      </div>
      <div id="q-container"></div>
    </div>

  </div>

  <!-- Final screen (hidden until end) -->
  <div class="final-screen" id="final-screen"></div>

</div>

<script>
(function() {{

  var POOL      = {pool_json};
  var N         = {NUM_QUESTIONS};
  var MAX_WRONG = {MAX_WRONG};

  var wrongCount   = 0;
  var correctCount = 0;
  var qIndex       = 0;
  var questions    = [];
  var gameOver     = false;

  // ── DOM refs ────────────────────────────────────────────────
  var qContainer   = document.getElementById('q-container');
  var livesLeft    = document.getElementById('lives-left');
  var wrongPills   = document.getElementById('wrong-pills');
  var qCurrent     = document.getElementById('q-current');
  var correctEl    = document.getElementById('correct-count');
  var wrongEl      = document.getElementById('wrong-count');
  var finalScreen  = document.getElementById('final-screen');

  // ── Shuffle ─────────────────────────────────────────────────
  function shuffle(arr) {{
    var a = arr.slice();
    for (var i = a.length - 1; i > 0; i--) {{
      var j = Math.floor(Math.random() * (i + 1));
      var t = a[i]; a[i] = a[j]; a[j] = t;
    }}
    return a;
  }}

  // ── Draw hangman stroke ─────────────────────────────────────
  function drawStroke(n) {{
    var el = document.getElementById('hm-s' + n);
    if (el) el.classList.add('drawn');
  }}

  // ── Render one question ─────────────────────────────────────
  function renderQuestion(qi) {{
    var q = questions[qi];
    qCurrent.textContent = qi + 1;

    var opts = '';
    q.options.forEach(function(opt, oi) {{
      opts +=
        '<div class="opt" id="opt-' + oi + '" onclick="hmAnswer(' + oi + ')">' +
          '<div class="opt-letter">' + String.fromCharCode(65 + oi) + '</div>' +
          '<div class="opt-text">' + opt + '</div>' +
        '</div>';
    }});

    qContainer.innerHTML =
      '<div class="q-block">' +
        '<div class="q-meta">' +
          '<span class="q-part">' + q.part + '</span>' +
          '<span class="q-num">Q' + (qi + 1) + '</span>' +
        '</div>' +
        '<div class="q-text">' + q.q + '</div>' +
        opts +
        '<div class="feedback" id="fb"></div>' +
        '<button class="next-btn" id="next-btn" onclick="hmNext()">Next question →</button>' +
      '</div>';
  }}

  // ── Answer handler ───────────────────────────────────────────
  window.hmAnswer = function(oi) {{
    if (gameOver) return;
    var q       = questions[qIndex];
    var correct = q.answer;
    var isRight = (oi === correct);
    var fb      = document.getElementById('fb');
    var nextBtn = document.getElementById('next-btn');

    // Lock all options
    for (var j = 0; j < q.options.length; j++) {{
      var el = document.getElementById('opt-' + j);
      if (!el) continue;
      el.classList.add('locked');
      el.removeAttribute('onclick');
      if (j === correct)       el.classList.add('correct');
      else if (j === oi)       el.classList.add('wrong');
      else                     el.classList.add('dim');
    }}

    if (isRight) {{
      correctCount++;
      correctEl.textContent = correctCount;
      fb.className  = 'feedback show correct-fb';
      fb.innerHTML  = '✅ <strong style="color:#34d399">Correct!</strong> ' + q.explanation;
    }} else {{
      wrongCount++;
      wrongEl.textContent = wrongCount;
      drawStroke(wrongCount);
      livesLeft.textContent = MAX_WRONG - wrongCount;

      // Add wrong pill
      var pill = document.createElement('div');
      pill.style.cssText = 'background:#f8717120;border:1px solid #f87171;border-radius:4px;padding:1px 6px;font-size:0.65em;color:#f87171';
      pill.textContent = '✗';
      wrongPills.appendChild(pill);

      fb.className = 'feedback show wrong-fb';
      fb.innerHTML = '❌ <strong style="color:#f87171">Wrong.</strong> ' + q.explanation;

      if (wrongCount >= MAX_WRONG) {{
        gameOver = true;
        setTimeout(function() {{ showFinal(false); }}, 1200);
        return;
      }}
    }}

    // Show next button (last question flips the label to a results prompt)
    if (qIndex >= N - 1) nextBtn.textContent = '🏁 See results';
    nextBtn.classList.add('visible');
  }};

  // ── Next question ────────────────────────────────────────────
  window.hmNext = function() {{
    qIndex++;
    if (qIndex >= N) {{
      showFinal(true);
    }} else {{
      renderQuestion(qIndex);
    }}
  }};

  // ── Final screen ─────────────────────────────────────────────
  function showFinal(completed) {{
    document.querySelector('.hm-body').style.display = 'none';
    finalScreen.classList.add('show');

    var pct   = Math.round(correctCount / N * 100);
    var color, emoji, msg;

    if (!completed) {{
      color = '#f87171'; emoji = '💀'; msg = 'Game Over! The man has been hanged...';
    }} else if (correctCount === N) {{
      color = '#34d399'; emoji = '🎉'; msg = 'Perfect! Not a scratch on the man!';
    }} else if (correctCount >= Math.ceil(N * 0.7)) {{
      color = '#f9e2af'; emoji = '😅'; msg = 'Close call — but you survived!';
    }} else {{
      color = '#f87171'; emoji = '😬'; msg = 'Lucky escape. Review the notebook!';
    }}

    finalScreen.innerHTML =
      '<div class="final-big" style="color:' + color + '">' + emoji + '</div>' +
      '<div style="color:' + color + ';font-weight:700;font-size:1.1em;margin-bottom:6px">' + msg + '</div>' +
      '<div class="final-msg" style="color:#94b8d4">' +
        'Score: <strong style="color:' + color + '">' + correctCount + '/' + N + ' (' + pct + '%)</strong>' +
        ' &nbsp;·&nbsp; Wrong answers: <strong style="color:#f87171">' + wrongCount + '</strong>' +
      '</div>' +
      '<button class="retry-btn" style="background:' + color + '18;border-color:' + color + ';color:' + color + '" ' +
              'onclick="hmRestart()">🔄 Play Again</button>';
  }}

  // ── Restart ──────────────────────────────────────────────────
  window.hmRestart = function() {{
    // Reset state
    wrongCount = correctCount = qIndex = 0;
    gameOver   = false;

    // Reset hangman strokes
    for (var i = 1; i <= MAX_WRONG; i++) {{
      var el = document.getElementById('hm-s' + i);
      if (el) el.classList.remove('drawn');
    }}

    // Reset counters
    livesLeft.textContent = MAX_WRONG;
    wrongPills.innerHTML  = '';
    correctEl.textContent = '0';
    wrongEl.textContent   = '0';

    // Show game, hide final
    document.querySelector('.hm-body').style.display = 'flex';
    finalScreen.classList.remove('show');

    // New questions
    questions = shuffle(POOL).slice(0, N);
    renderQuestion(0);
  }};

  // ── Kick off ─────────────────────────────────────────────────
  questions = shuffle(POOL).slice(0, N);
  renderQuestion(0);

}})();
</script>
"""

display(HTML(html))