
Data 100 LLM Benchmarking

This notebook measures how well large language models perform on Data 100 exam questions. It runs each model through a set of multiple-choice and short-answer questions, scores them automatically, and produces a comparison table and HTML report.


What you need before starting

Requirement          Details
OpenRouter API key   Free to create at openrouter.ai/keys. You get $1 of free credits on signup.
Exam file            Place your fa25_final.json (or any exam JSON) in an exams/ folder next to this notebook.
API key stored       Run echo 'OPENROUTER_API_KEY=sk-or-v1-...' >> ~/.env && chmod 600 ~/.env in a terminal once.

How it works

Your exam JSON
      │
      ▼
Each question is sent to MODEL A, MODEL B, ...  (via OpenRouter API)
      │
      ▼
MCQ answers  ──► exact-match scoring (no AI involved)
Short answers ──► sent to a separate JUDGE MODEL for scoring
      │
      ▼
Results saved to results/scores_<timestamp>.csv + HTML report

Estimated run time

A 29-question exam (22 MCQ + 7 short answer) across 2 models takes roughly 3–5 minutes at typical OpenRouter speeds. Each MCQ takes ~0.3–0.5s; each short answer takes ~1–2s for the model's response plus ~0.5–1s for the judge to score it.


Table of Contents

  1. Install Dependencies — run once, then restart kernel

  2. API Setup — load your key and verify the connection

  3. Configure Your Eval — choose models, exam, and grading strategy

  4. Load Exam — parse the exam file and preview questions

  5. Run Benchmark — execute the full eval

  6. Results — leaderboard, topic breakdown, per-question detail, HTML export

1. Install Dependencies

Run this cell once. It installs three packages:

  • openai — the Python client used to talk to OpenRouter (OpenRouter is OpenAI API-compatible)

  • pandas — used for result tables and CSV export

  • python-dotenv — reads your API key from ~/.env so it never appears in the notebook

After the cell finishes, restart the kernel (Kernel → Restart) if this is your first time running it. You only need to do this once per environment — the packages persist.

Note: You may see a dependency conflict warning mentioning langchain-openai. This is harmless — it means another package in your environment wants an older version of openai, but your notebook does not use LangChain so the warning does not affect anything here.

import subprocess, sys

pkgs = ['openai', 'pandas', 'python-dotenv']
for pkg in pkgs:
    print(f'Installing {pkg}...')
    subprocess.run([sys.executable, '-m', 'pip', 'install', pkg, '-q', '--upgrade'],
                   check=True)

print('\nAll dependencies installed.')
print('Restart the kernel now if this is your first time, then continue.')
Installing openai...
Installing pandas...
Installing python-dotenv...

All dependencies installed.
Restart the kernel now if this is your first time, then continue.

2. API Setup

Step 1 — Get an API key

Create a free account at openrouter.ai and generate a key at openrouter.ai/keys. New accounts receive $1 of free credits, which is enough to run this benchmark several times with small models.

Step 2 — Store the key securely (one-time setup)

Open a terminal and run:

echo 'OPENROUTER_API_KEY=sk-or-v1-your-key-here' >> ~/.env
chmod 600 ~/.env

~/.env is a hidden file in your home directory. The chmod 600 makes it readable only by your user account. Never paste the key directly into a notebook cell — it would be saved in the notebook file and visible to anyone who opens the notebook or pulls it from a git repository.

If your notebook is in a git repository, also run:

echo '.env' >> .gitignore

Step 3 — Run this cell

The cell reads the key from ~/.env, creates the OpenRouter client, and sends a one-token test request to confirm the connection is working. If the test passes you will see OpenRouter connection verified.

Common errors:

  • OPENROUTER_API_KEY not found — the key is not in ~/.env, or you need to restart the kernel after creating the file

  • Connection failed: 401 — the key is invalid or has been revoked; generate a new one

  • Connection failed: 402 — your account has run out of credits

import os
from pathlib import Path
from openai import OpenAI
from dotenv import load_dotenv

# ── Load API key from ~/.env ──────────────────────────────────────────────
# Expected format in ~/.env:
#   OPENROUTER_API_KEY=sk-or-v1-...
#
# To create it the first time (run once in a terminal):
#   echo 'OPENROUTER_API_KEY=your-key-here' >> ~/.env
#   chmod 600 ~/.env
load_dotenv(Path.home() / '.env')
OPENROUTER_API_KEY = os.environ.get('OPENROUTER_API_KEY', '')

if not OPENROUTER_API_KEY:
    raise EnvironmentError(
        'OPENROUTER_API_KEY not found.\n'
        'Add it to ~/.env:\n'
        '  echo \'OPENROUTER_API_KEY=your-key-here\' >> ~/.env\n'
        'Then restart the kernel.'
    )

client = OpenAI(
    base_url='https://openrouter.ai/api/v1',
    api_key=OPENROUTER_API_KEY,
)

Path('exams').mkdir(exist_ok=True)
Path('results').mkdir(exist_ok=True)

# ── Verify connectivity ───────────────────────────────────────────────────
try:
    test = client.chat.completions.create(
        model='meta-llama/llama-3.2-1b-instruct',
        messages=[{'role': 'user', 'content': 'Reply with the single word OK.'}],
        max_tokens=5,
        temperature=0,
    )
    print('OpenRouter connection verified.')
    print(f'  Response: {test.choices[0].message.content.strip()}')
except Exception as e:
    print(f'Connection failed: {e}')
OpenRouter connection verified.
  Response: OK

3. Configure Your Eval

This is the only cell you need to edit between runs. All variables below flow through to every subsequent step — you do not need to touch cells 4 or 5 unless you are modifying the prompts themselves.


EVAL_MODELS

A list of models to benchmark. Each entry must be an exact OpenRouter model ID. The ID is the URL slug from the model’s OpenRouter page — for example, the model at openrouter.ai/meta-llama/llama-3.2-3b-instruct has the ID meta-llama/llama-3.2-3b-instruct.

To find IDs: go to openrouter.ai/models, click any model, and copy the slug from the URL.
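If you prefer to search from Python, the snippet below is an optional sketch: it assumes OpenRouter's public model-list endpoint (https://openrouter.ai/api/v1/models) and its documented response shape, and uses httpx, which is installed as a dependency of the openai package.

# Sketch: list OpenRouter model IDs matching a keyword (optional helper, not part of the benchmark).
import httpx

resp = httpx.get('https://openrouter.ai/api/v1/models', timeout=30)
resp.raise_for_status()
model_ids = [m['id'] for m in resp.json()['data']]
print([mid for mid in model_ids if 'llama-3.2' in mid])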

Confirmed working IDs (as of Spring 2026):

Model          ID                                  Cost
Llama 3.2 3B   meta-llama/llama-3.2-3b-instruct    ~$0.00/run
Qwen 2.5 7B    qwen/qwen-2.5-7b-instruct           ~$0.01/run
Mistral 7B     mistralai/mistral-7b-instruct       ~$0.01/run

JUDGE_MODEL

The model that grades short-answer questions. This should be different from the models you are evaluating. If a model grades its own answers, it tends to over-score itself because it recognizes its own phrasing. A small but capable model (Claude Haiku, GPT-4o-mini) works well here.

Set to None to fall back to self-grading (not recommended for comparisons).


JUDGE_STRATEGY

Controls the structure of the grading prompt. Three options:

Strategy          Speed    Reliability   Best for
baseline          Fast     Lower         Quick sanity checks
chain_of_thought  Slow     Higher        Auditing judge reasoning
rubric_anchored   Medium   Highest       Final benchmark runs

Run Section 5b to compare strategies on a sample question before committing.


EXAM_SOURCE

Path to the exam file. Options:

  • 'sample' — uses the built-in 11-question sample exam

  • 'exams/fa25_final.json' — your exam JSON (recommended)

  • 'exams/myexam.pdf' — attempts to parse MCQ questions from a PDF (best-effort; answers must be added manually)

# ── Models to evaluate — use OpenRouter model IDs ────────────────────────
# Full list: https://openrouter.ai/models
EVAL_MODELS = [
    'meta-llama/llama-3.2-3b-instruct',
    'qwen/qwen-2.5-7b-instruct',
]

MODEL_DISPLAY = {
    'meta-llama/llama-3.2-3b-instruct': 'Llama-3.2-3B',
    'qwen/qwen-2.5-7b-instruct':        'Qwen-2.5-7B',
    'anthropic/claude-haiku-4-5':        'Claude-Haiku-4-5',
}

# ── Exam source ───────────────────────────────────────────────────────────

# Options:
#   'sample'          -> built-in sample exam
#   'exams/my.json'   -> your own exam JSON
#   'exams/my.pdf'    -> best-effort PDF parse (answers must be filled in manually)
EXAM_SOURCE = 'exams/fa25_final.json'

# ── Question types ────────────────────────────────────────────────────────
RUN_MCQ          = True
RUN_SHORT_ANSWER = True

# ── Judge model ───────────────────────────────────────────────────────────
# A separate model grades short-answer questions. This avoids self-grading
# bias and lets you use a stronger model for scoring independent of cost.
# Set to None to fall back to the same model being evaluated.
JUDGE_MODEL = 'anthropic/claude-haiku-4-5'

# ── Judge prompt strategy ─────────────────────────────────────────────────
# Controls how the judge is prompted to score short answers.
# Run Section 5b to compare strategies before choosing one.
# Options: 'baseline' | 'chain_of_thought' | 'rubric_anchored'
JUDGE_STRATEGY = 'rubric_anchored'

# ── Inference settings ────────────────────────────────────────────────────
TEMPERATURE = 0.0  # 0 = deterministic; best for reproducible evals

print('Config ready.')
Config ready.

4. Load Exam

This cell does three things:

  1. Defines data structures — Question and Exam dataclasses that hold the parsed exam content

  2. Defines prompt templates — what each model sees when answering MCQ and short-answer questions, and the three judge grading prompts

  3. Loads and previews the exam — parses your JSON (or PDF) and displays a summary table

Expected output

After running, you should see something like:

Exam: Data C100/200 Final (Fall 2025)
MCQ: 22  |  Short answer: 7  |  Total: 40 pts

Followed by a table showing every question’s ID, type, topic, point value, and a truncated preview of the question text. Check this table to confirm the exam loaded correctly before running the benchmark.

Exam JSON format

If you are creating your own exam file, each question needs these fields:

{
  "id": "q1",
  "type": "mcq",          // or "short_answer"
  "topic": "pandas",
  "points": 2,
  "question": "Which method removes rows with missing values?",
  "choices": {"A": "dropna()", "B": "fillna()", ...},  // MCQ only
  "answer": "A",
  "rubric": ["criterion 1", "criterion 2"]             // short_answer only
}

Short-answer questions without a rubric field will fall back to baseline scoring.
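If you write your own exam file, a short validation pass like the sketch below can catch missing fields before a run. It is an optional helper (not part of the notebook's pipeline) and checks only the fields the loader in cell 4 actually requires; the path is the same example file used above.

# Sketch: validate an exam JSON before benchmarking (optional helper).
import json

REQUIRED = {'id', 'type', 'points', 'question', 'answer'}

with open('exams/fa25_final.json') as f:
    data = json.load(f)

assert 'exam_name' in data and 'questions' in data, 'Top level needs exam_name and questions'

for q in data['questions']:
    qid = q.get('id', '?')
    missing = REQUIRED - q.keys()
    if missing:
        print(f'{qid}: missing fields {sorted(missing)}')
    if q.get('type') == 'mcq' and 'choices' not in q:
        print(f'{qid}: MCQ has no choices')
    if q.get('type') == 'short_answer' and 'rubric' not in q:
        print(f'{qid}: short answer has no rubric (falls back to baseline scoring)')

print(f"Checked {len(data['questions'])} questions.")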

import json, re, csv, time, gc
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict
import pandas as pd

# ── Data classes ──────────────────────────────────────────────────────────
@dataclass
class Question:
    id: str
    type: str           # 'mcq' or 'short_answer'
    topic: str
    points: int
    question: str
    answer: str
    choices: Optional[dict] = None
    rubric: Optional[list] = None

@dataclass
class Exam:
    name: str
    semester: str
    questions: list = field(default_factory=list)

    @property
    def mcq(self): return [q for q in self.questions if q.type == 'mcq']
    @property
    def short_answer(self): return [q for q in self.questions if q.type == 'short_answer']
    @property
    def total_points(self): return sum(q.points for q in self.questions)

# ── Student-facing prompt templates ──────────────────────────────────────
def mcq_prompt(q):
    choices = '\n'.join(f'  {k}) {v}' for k, v in q.choices.items())
    return [
        {'role': 'system', 'content': 'You are a student taking a Data 100 exam at UC Berkeley. '
                                      'Answer every question directly and concisely.'},
        {'role': 'user',   'content': (
            f'Question [{q.topic}, {q.points} pts]:\n{q.question}\n\n'
            f'Choices:\n{choices}\n\n'
            f'Reply with ONLY the single letter of the correct answer.'
        )},
    ]

def short_answer_prompt(q):
    return [
        {'role': 'system', 'content': 'You are a student taking a Data 100 exam at UC Berkeley. '
                                      'Answer every question directly and concisely.'},
        {'role': 'user',   'content': (
            f'Question [{q.topic}, {q.points} pts]:\n{q.question}\n\n'
            f'Give a clear, concise answer (3-6 sentences).'
        )},
    ]

# ── Judge prompt strategies ───────────────────────────────────────────────
# Three strategies with increasing structure. JUDGE_STRATEGY (set in cell 3)
# selects which one is used in the main benchmark run.

def judge_prompt_baseline(q, student_answer):
    """Minimal prompt. Asks for a score and one sentence of feedback.
    Fast, but prone to inconsistent formatting and score inflation."""
    rubric = '\n'.join(f'  - {r}' for r in (q.rubric or []))
    return [
        {'role': 'system', 'content': 'You are a strict, impartial exam grader.'},
        {'role': 'user', 'content': (
            f'Grade this student response.\n\n'
            f'Question ({q.points} pts): {q.question}\n\n'
            f'Reference answer: {q.answer}\n\n'
            f'Rubric:\n{rubric}\n\n'
            f'Student answer: {student_answer}\n\n'
            f'Respond EXACTLY as:\nSCORE: X/{q.points}\nFEEDBACK: one sentence'
        )},
    ]

def judge_prompt_chain_of_thought(q, student_answer):
    """Ask the judge to reason through each rubric item before scoring.
    More reliable on ambiguous answers at the cost of more output tokens."""
    rubric = '\n'.join(f'  {i+1}. {r}' for i, r in enumerate(q.rubric or []))
    return [
        {'role': 'system', 'content': (
            'You are a strict, impartial exam grader. '
            'Think through your reasoning carefully before assigning a score.'
        )},
        {'role': 'user', 'content': (
            f'Grade this student response step by step.\n\n'
            f'Question ({q.points} pts): {q.question}\n\n'
            f'Reference answer: {q.answer}\n\n'
            f'Rubric criteria:\n{rubric}\n\n'
            f'Student answer: {student_answer}\n\n'
            f'First, evaluate whether the student addressed each rubric criterion. '
            f'Then assign a score.\n\n'
            f'Respond EXACTLY as:\n'
            f'REASONING: <your analysis>\n'
            f'SCORE: X/{q.points}\n'
            f'FEEDBACK: one sentence summary'
        )},
    ]

def judge_prompt_rubric_anchored(q, student_answer):
    """Score each rubric criterion independently (0 or 1), then sum.
    Most structured and reproducible; produces granular feedback."""
    criteria = q.rubric or []
    criterion_lines = '\n'.join(
        f'  Criterion {i+1}: {r}' for i, r in enumerate(criteria)
    )
    score_lines = '\n'.join(
        f'  CRITERION_{i+1}: 0 or 1' for i in range(len(criteria))
    )
    pts_per = round(q.points / len(criteria), 2) if criteria else q.points
    return [
        {'role': 'system', 'content': (
            'You are a strict, impartial exam grader. '
            'Score each criterion independently based only on whether the student addressed it.'
        )},
        {'role': 'user', 'content': (
            f'Grade this student response criterion by criterion.\n\n'
            f'Question: {q.question}\n\n'
            f'Reference answer: {q.answer}\n\n'
            f'Rubric ({len(criteria)} criteria, {pts_per} pts each):\n{criterion_lines}\n\n'
            f'Student answer: {student_answer}\n\n'
            f'Score each criterion 0 (not addressed) or 1 (addressed). Be strict.\n\n'
            f'Respond EXACTLY as:\n'
            f'{score_lines}\n'
            f'FEEDBACK: one sentence'
        )},
    ]

# Map strategy name to function
JUDGE_PROMPTS = {
    'baseline':        judge_prompt_baseline,
    'chain_of_thought': judge_prompt_chain_of_thought,
    'rubric_anchored': judge_prompt_rubric_anchored,
}

def judge_prompt(q, student_answer, strategy=None):
    """Dispatch to the selected judge prompt strategy."""
    s = strategy or JUDGE_STRATEGY
    return JUDGE_PROMPTS[s](q, student_answer)

# ── Score parsers ─────────────────────────────────────────────────────────
def parse_judge_baseline(text, max_points):
    m = re.search(r'SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+)', text, re.I)
    if m:
        earned, out_of = float(m.group(1)), float(m.group(2))
        if out_of > 0:
            return min(round((earned / out_of) * max_points, 2), max_points)
    return 0.0

def parse_judge_rubric_anchored(text, q):
    """Sum CRITERION_N scores and scale to q.points."""
    criteria = q.rubric or []
    if not criteria:
        return parse_judge_baseline(text, q.points)
    total = 0
    for i in range(len(criteria)):
        m = re.search(rf'CRITERION_{i+1}:\s*([01])', text, re.I)
        if m:
            total += int(m.group(1))
    pts_per = q.points / len(criteria)
    return min(round(total * pts_per, 2), q.points)

def parse_score(judge_output, q, strategy=None):
    """Parse score from judge output using the appropriate parser."""
    s = strategy or JUDGE_STRATEGY
    if s == 'rubric_anchored':
        return parse_judge_rubric_anchored(judge_output, q)
    return parse_judge_baseline(judge_output, q.points)

def parse_feedback(text):
    m = re.search(r'FEEDBACK:\s*(.+)', text, re.I | re.DOTALL)
    return m.group(1).strip() if m else text.strip()

def extract_letter(text):
    t = text.strip().upper()
    if t and t[0] in 'ABCD': return t[0]
    m = re.search(r'\b([ABCD])\b', t)
    return m.group(1) if m else 'UNKNOWN'

def load_json_exam(data):
    qs = [Question(id=q['id'], type=q['type'], topic=q.get('topic', 'unknown'),
                   points=q['points'], question=q['question'], answer=q['answer'],
                   choices=q.get('choices'), rubric=q.get('rubric'))
          for q in data['questions']]
    return Exam(name=data['exam_name'], semester=data.get('semester', ''), questions=qs)

def parse_pdf_exam(pdf_path):
    """
    Best-effort MCQ parser for Data 100 PDFs.
    Finds questions starting with a number and answer choices A-D.
    You will likely need to manually clean results for complex questions.
    Answers are set to '?' and must be filled in from a solutions PDF.
    """
    try:
        import pdfplumber
    except ImportError:
        import subprocess, sys
        subprocess.run([sys.executable, '-m', 'pip', 'install', 'pdfplumber', '-q'])
        import pdfplumber

    questions = []
    with pdfplumber.open(pdf_path) as pdf:
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)

    blocks = re.split(r'\n(?=\d+\.\s)', text)
    qnum = 0
    for block in blocks:
        lines = block.strip().split('\n')
        if not lines: continue
        header = lines[0]
        if not re.match(r'^\d+\.', header): continue
        qtext = header + ' ' + ' '.join(
            l for l in lines[1:]
            if not re.match(r'^[A-D][.)\s]', l.strip())
        )
        choices = {}
        for line in lines:
            m = re.match(r'^([A-D])[.)\s]+(.+)', line.strip())
            if m:
                choices[m.group(1)] = m.group(2).strip()
        if len(choices) >= 2:
            qnum += 1
            questions.append(Question(
                id=f'q{qnum}', type='mcq', topic='unknown', points=2,
                question=qtext.strip(), answer='?', choices=choices,
            ))

    name = Path(pdf_path).stem
    print(f'Parsed {len(questions)} MCQ questions from {name}')
    print('NOTE: Answers are set to "?" — fill them in from the solutions PDF.')
    return Exam(name=name, semester='', questions=questions)

# ── Load exam based on EXAM_SOURCE ────────────────────────────────────────
if EXAM_SOURCE == 'sample':
    exam = load_json_exam(SAMPLE_EXAM_DATA)
elif EXAM_SOURCE.endswith('.json'):
    with open(EXAM_SOURCE) as f:
        exam = load_json_exam(json.load(f))
elif EXAM_SOURCE.endswith('.pdf'):
    exam = parse_pdf_exam(EXAM_SOURCE)
else:
    raise ValueError(f'Unknown EXAM_SOURCE: {EXAM_SOURCE}')

print(f'Exam: {exam.name} ({exam.semester})')
print(f'MCQ: {len(exam.mcq)}  |  Short answer: {len(exam.short_answer)}  |  Total: {exam.total_points} pts')
print()
pd.DataFrame([{'id': q.id, 'type': q.type, 'topic': q.topic, 'pts': q.points,
               'question': q.question[:65] + '...' if len(q.question) > 65 else q.question}
              for q in exam.questions])
Exam: Data C100/200 Final (Fall 2025)
MCQ: 22  |  Short answer: 7  |  Total: 47.0 pts


5. Run Benchmark

Run this cell to execute the full evaluation. It loops through every model in EVAL_MODELS and every question in the exam.

What happens for each question

MCQ: The question and choices are sent to the model. The model is asked to reply with a single letter. That letter is compared to the answer key — no AI judge involved.

Short answer: The question is sent to the model, which writes a 3–6 sentence response. That response is then sent to JUDGE_MODEL along with the reference answer and rubric. The judge returns a score and one-sentence feedback.
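Before launching the full loop, you can spot-check the wiring with a single question. The cell below is an optional sketch; it reuses client, exam, EVAL_MODELS, TEMPERATURE, and mcq_prompt from the earlier cells.

# Optional smoke test: send one MCQ to the first eval model before the full run.
q = exam.mcq[0]
resp = client.chat.completions.create(
    model=EVAL_MODELS[0],
    messages=mcq_prompt(q),
    max_tokens=16,
    temperature=TEMPERATURE,
)
print(f'{q.id}: got {resp.choices[0].message.content.strip()}  (expected {q.answer})')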

Reading the live output

=======================================================
Evaluating: Llama-3.2-3B
=======================================================
  Judge: Claude-Haiku-4-5  (strategy: rubric_anchored)

  MCQ (22 questions):
    [PASS] q1a_i   got=B  expected=B  (0.6s)   ← correct answer, time taken
    [FAIL] q1b_iv  got=A  expected=B  (0.5s)   ← wrong answer

  Short answer (7 questions):
    q2a_i  score=2.0/2  (1.2s)
           The student correctly applied linearity of expectation...

Each [PASS]/[FAIL] line shows what the model answered and what was expected. Short-answer lines show the score out of maximum points and the judge’s one-sentence feedback.

After the run

Results are saved to results/scores_<timestamp>.csv. Each run of the benchmark cell starts from an empty all_scores list (it is reset at the top of the cell), so every run produces an independent set of results and its own timestamped CSV.

If a model fails mid-run (bad model ID, connection error, etc.), the exception stops the cell before the CSV is written. Whatever was collected up to that point stays in all_scores in memory and can still be viewed with the Section 6 cells; re-running the benchmark cell resets all_scores, so fix the error and run the cell again in full.


def generate(model_id, messages, max_tokens=256):
    """Send a chat completion request to OpenRouter and return (text, latency_seconds)."""
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=max_tokens,
        temperature=TEMPERATURE,
    )
    text = resp.choices[0].message.content or ''
    return text.strip(), time.perf_counter() - t0

def display_name(model_id):
    return MODEL_DISPLAY.get(model_id, model_id.split('/')[-1])

# ── Main eval loop ────────────────────────────────────────────────────────
all_scores = []
judge_id   = JUDGE_MODEL

for model_id in EVAL_MODELS:
    disp = display_name(model_id)
    print(f"\n{'='*55}\nEvaluating: {disp}\n{'='*55}")

    judge_disp = display_name(judge_id) if judge_id else disp
    if JUDGE_MODEL and JUDGE_MODEL != model_id:
        print(f'  Judge: {judge_disp}  (strategy: {JUDGE_STRATEGY})')
    else:
        print(f'  Judge: {disp} (self-grading — consider setting a separate JUDGE_MODEL)')

    # ── MCQ ──────────────────────────────────────────────────────────────
    if RUN_MCQ and exam.mcq:
        print(f'\n  MCQ ({len(exam.mcq)} questions):')
        for q in exam.mcq:
            msgs = mcq_prompt(q)
            raw, latency = generate(model_id, msgs, max_tokens=16)
            got     = extract_letter(raw)
            correct = got == q.answer.upper()
            pts     = q.points if correct else 0
            mark    = 'PASS' if correct else 'FAIL'
            print(f'    [{mark}] {q.id:4s}  got={got}  expected={q.answer}  ({latency:.1f}s)')
            all_scores.append({
                'model': disp, 'question_id': q.id, 'type': 'mcq', 'topic': q.topic,
                'points_earned': pts, 'max_points': q.points, 'answer': got,
                'correct_answer': q.answer, 'correct': correct,
                'feedback': '', 'latency': round(latency, 2),
                'judge_strategy': '',
            })

    # ── Short answer ──────────────────────────────────────────────────────
    has_sa = RUN_SHORT_ANSWER and len(exam.short_answer) > 0
    effective_judge = judge_id or model_id

    if has_sa:
        print(f'\n  Short answer ({len(exam.short_answer)} questions):')
        for q in exam.short_answer:
            # Student response
            sa_msgs = short_answer_prompt(q)
            raw, latency = generate(model_id, sa_msgs, max_tokens=300)

            # Judge grading
            j_msgs      = judge_prompt(q, raw, strategy=JUDGE_STRATEGY)
            judge_out, _ = generate(effective_judge, j_msgs, max_tokens=350)
            pts          = parse_score(judge_out, q, strategy=JUDGE_STRATEGY)
            feedback     = parse_feedback(judge_out)

            print(f'    {q.id:4s}  score={pts}/{q.points}  ({latency:.1f}s)')
            print(f'         {feedback[:90]}')
            all_scores.append({
                'model': disp, 'question_id': q.id, 'type': 'short_answer', 'topic': q.topic,
                'points_earned': pts, 'max_points': q.points, 'answer': raw[:200],
                'correct_answer': q.answer[:100], 'correct': None,
                'feedback': feedback[:200], 'latency': round(latency, 2),
                'judge_strategy': JUDGE_STRATEGY,
            })

# ── Save to CSV ───────────────────────────────────────────────────────────
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_path  = f'results/scores_{timestamp}.csv'
pd.DataFrame(all_scores).to_csv(csv_path, index=False)
print(f'\nScores saved to {csv_path}')

=======================================================
Evaluating: Llama-3.2-3B
=======================================================
  Judge: Claude-Haiku-4-5  (strategy: rubric_anchored)

  MCQ (22 questions):
    [PASS] q1a_i  got=B  expected=B  (0.3s)
    [PASS] q1a_ii  got=B  expected=B  (0.4s)
    [PASS] q1a_iii  got=C  expected=C  (0.4s)
    [PASS] q1b_ii  got=C  expected=C  (0.3s)
    [PASS] q1b_iii  got=A  expected=A  (0.3s)
    [FAIL] q1b_iv_A  got=A  expected=B  (0.5s)
    [FAIL] q1b_iv_B  got=A  expected=B  (0.3s)
    [PASS] q1b_iv_C  got=A  expected=A  (0.3s)
    [PASS] q1b_viii_semester  got=A  expected=A  (0.3s)
    [FAIL] q1b_viii_hist  got=A  expected=B  (0.3s)
    [PASS] q1b_ix_hist  got=A  expected=A  (0.4s)
    [PASS] q1b_ix_kde  got=A  expected=A  (0.3s)
    [FAIL] q2b_iv  got=A  expected=B  (0.4s)
    [FAIL] q2b_vi  got=A  expected=B  (0.3s)
    [PASS] q3a   got=A  expected=A  (0.4s)
    [FAIL] q3a_nonlinear  got=A  expected=B  (0.4s)
    [PASS] q3c_i_rank  got=A  expected=A  (0.4s)
    [PASS] q3d_regularization  got=A  expected=A  (0.3s)
    [PASS] q4a_bias_model2  got=A  expected=A  (0.4s)
    [FAIL] q4b_irreducible  got=B  expected=A  (0.3s)
    [FAIL] q6b_pca_variance  got=A  expected=C  (0.5s)
    [FAIL] q6c_pca_total_variance  got=B  expected=D  (0.3s)

  Short answer (7 questions):
    q2a_i  score=2.0/2  (1.5s)
         The student correctly applied linearity of expectation, computed E[alpha-hat] = alpha, and
    q2a_ii  score=1.33/2  (1.2s)
         The student correctly applied the variance scaling rule and reached the correct final answ
    q3e_cv  score=0.0/1  (1.6s)
         The student fundamentally misunderstands the cross-validation process by confusing which d
    q5a_logistic_max_accuracy  score=0.0/2  (0.7s)
         The student claims 100% accuracy is possible and does not identify that the data is not li
    q5b_logistic_prob  score=2.0/2  (1.8s)
         The student correctly formed the feature vector, computed z = 9, and applied the sigmoid f
    q5c_ii_precision  score=0.0/1  (1.2s)
         The student incorrectly identified TP=2 and FP=3, leading to an incorrect precision of 2/5
    q7a_clustering  score=0.5/2  (2.1s)
         The student correctly identifies the first merge of {A} and {B}, but fails to apply comple

=======================================================
Evaluating: Qwen-2.5-7B
=======================================================
  Judge: Claude-Haiku-4-5  (strategy: rubric_anchored)

  MCQ (22 questions):
    [PASS] q1a_i  got=B  expected=B  (0.3s)
    [PASS] q1a_ii  got=B  expected=B  (0.3s)
    [PASS] q1a_iii  got=C  expected=C  (1.0s)
    [PASS] q1b_ii  got=C  expected=C  (0.9s)
    [PASS] q1b_iii  got=A  expected=A  (0.9s)
    [PASS] q1b_iv_A  got=B  expected=B  (0.9s)
    [FAIL] q1b_iv_B  got=A  expected=B  (1.0s)
    [PASS] q1b_iv_C  got=A  expected=A  (0.9s)
    [PASS] q1b_viii_semester  got=A  expected=A  (0.9s)
    [PASS] q1b_viii_hist  got=B  expected=B  (0.3s)
    [PASS] q1b_ix_hist  got=A  expected=A  (0.2s)
    [PASS] q1b_ix_kde  got=A  expected=A  (0.8s)
    [FAIL] q2b_iv  got=A  expected=B  (0.9s)
    [PASS] q2b_vi  got=B  expected=B  (0.7s)
    [FAIL] q3a   got=B  expected=A  (0.9s)
    [PASS] q3a_nonlinear  got=B  expected=B  (0.9s)
    [PASS] q3c_i_rank  got=A  expected=A  (1.1s)
    [PASS] q3d_regularization  got=A  expected=A  (0.3s)
    [FAIL] q4a_bias_model2  got=B  expected=A  (0.3s)
    [FAIL] q4b_irreducible  got=B  expected=A  (0.3s)
    [FAIL] q6b_pca_variance  got=D  expected=C  (0.9s)
    [PASS] q6c_pca_total_variance  got=D  expected=D  (0.9s)

  Short answer (7 questions):
    q2a_i  score=2.0/2  (5.0s)
         The student correctly applied linearity of expectation, computed E[alpha-hat] = alpha, and
    q2a_ii  score=2.0/2  (10.6s)
         The student correctly applied both the variance scaling rule and independence property, ar
    q3e_cv  score=1.0/1  (1.8s)
         The student correctly identified that 4 MSE calculations occur per lambda value (one per f
    q5a_logistic_max_accuracy  score=0.0/2  (3.3s)
         The student provides a generic discussion of logistic regression accuracy without identify
    q5b_logistic_prob  score=2.0/2  (2.5s)
         The student correctly formed the feature vector, computed z = 9, and applied the sigmoid f
    q5c_ii_precision  score=0.33/1  (1.6s)
         The student correctly identified TP=1 but missed that row 3 (P(Y=1|x)=0.60, Y=0) is also a
    q7a_clustering  score=1.0/2  (3.3s)
         The student correctly identifies the first two merges but fails to follow complete linkage

Scores saved to results/scores_20260427_195142.csv

5b. Judge Prompt Strategy Comparison

Optional — run this before the main benchmark if you are unsure which JUDGE_STRATEGY to use.

This cell takes the first short-answer question in the exam, generates one student response from EVAL_MODELS[0], and then scores that same response with all three grading strategies side by side. It does not write to all_scores and does not affect the main benchmark.

How to interpret the output

--- baseline ---
  Score:    1.5/2
  Feedback: The student addressed the main concept but missed the example.
  Latency:  0.8s

--- chain_of_thought ---
  Score:    1.0/2
  Feedback: Step 1 was correct, but the example used iloc incorrectly.
  Latency:  2.1s

--- rubric_anchored ---
  Score:    1.33/2
  Feedback: Criterion 1 met, criterion 2 partially met, criterion 3 not addressed.
  Latency:  1.3s

What to look for:

  • Scores agree (within ~0.5 pts): All three strategies are producing consistent results. Use baseline for speed or rubric_anchored for cleaner feedback.

  • Scores disagree widely: The question is ambiguous or the student answer is borderline. Use rubric_anchored — it is the most structured and least sensitive to prompt phrasing.

  • You want to audit the judge’s reasoning: Use chain_of_thought and read the REASONING: section in the raw output.

After reviewing, set JUDGE_STRATEGY in cell 3 and run the main benchmark.

# ── Judge strategy comparison ─────────────────────────────────────────────
# Scores a single student answer with all three strategies side by side.
# Run this cell independently; it does not affect all_scores.

# Pick a short-answer question to test
test_q = exam.short_answer[0] if exam.short_answer else None

if test_q is None:
    print('No short-answer questions in this exam — skipping judge comparison.')
else:
    # Generate one student response from the first eval model
    test_model = EVAL_MODELS[0]
    sa_msgs    = short_answer_prompt(test_q)
    student_ans, _ = generate(test_model, sa_msgs, max_tokens=300)

    print(f'Question:  {test_q.question}')
    print(f'Reference: {test_q.answer}')
    print(f'Student ({display_name(test_model)}): {student_ans}')
    print()

    judge_to_use = JUDGE_MODEL or EVAL_MODELS[0]
    rows = []

    for strategy in ['baseline', 'chain_of_thought', 'rubric_anchored']:
        j_msgs    = judge_prompt(test_q, student_ans, strategy=strategy)
        j_out, latency = generate(judge_to_use, j_msgs, max_tokens=400)
        score     = parse_score(j_out, test_q, strategy=strategy)
        feedback  = parse_feedback(j_out)
        rows.append({
            'strategy':  strategy,
            'score':     f'{score}/{test_q.points}',
            'latency_s': round(latency, 2),
            'feedback':  feedback,
        })
        print(f'--- {strategy} ---')
        print(f'  Score:    {score}/{test_q.points}')
        print(f'  Feedback: {feedback}')
        print(f'  Latency:  {latency:.1f}s')
        print()

    print('Summary:')
    display(pd.DataFrame(rows))  # display() is needed: values inside an if/else block are not auto-rendered by Jupyter
Question:  Define estimator alpha-hat as the sample mean: alpha-hat = (1/100) * sum(Xi) for i=1 to 100. Let alpha = E[Xi] be the true average. Calculate Bias(alpha-hat). Show your work. Recall: Bias(alpha-hat) = E[alpha-hat] - alpha.
Reference: E[alpha-hat] = E[(1/100)*sum(Xi)] = (1/100)*sum(E[Xi]) = (1/100)*(100*alpha) = alpha. Therefore Bias(alpha-hat) = E[alpha-hat] - alpha = alpha - alpha = 0. The estimator is unbiased.
Student (Llama-3.2-3B): To calculate Bias(alpha-hat), we need to find E[alpha-hat] and alpha.

E[alpha-hat] = E[(1/100) * sum(Xi)] = (1/100) * E[sum(Xi)] = (1/100) * 100 * E[Xi] = E[Xi] = alpha.

Since E[alpha-hat] = alpha, Bias(alpha-hat) = E[alpha-hat] - alpha = alpha - alpha = 0.

Therefore, the estimator alpha-hat is unbiased.

--- baseline ---
  Score:    2.0/2
  Feedback: Excellent response that correctly applies linearity of expectation, properly computes E[alpha-hat] = alpha, and correctly concludes the estimator is unbiased with Bias = 0.
  Latency:  1.2s

--- chain_of_thought ---
  Score:    2.0/2
  Feedback: REASONING: 

Let me evaluate the student response against each rubric criterion:

**Criterion 1: Applies linearity of expectation correctly**
The student writes: E[(1/100) * sum(Xi)] = (1/100) * E[sum(Xi)] = (1/100) * 100 * E[Xi]

This shows correct application of linearity of expectation. The student correctly factors out the constant (1/100) and then applies linearity to the sum. However, there is a slight notational issue in the transition from E[sum(Xi)] to 100 * E[Xi]. The student should have written E[sum(Xi)] = sum(E[Xi]) = 100*E[Xi] to be fully explicit about applying linearity to each term in the sum. Despite this minor presentation gap, the mathematical reasoning is sound.

**Criterion 2: Computes E[alpha-hat] = alpha**
The student correctly arrives at E[alpha-hat] = alpha through the calculation shown above. This is correct.

**Criterion 3: States Bias = 0 (unbiased)**
The student explicitly states "Bias(alpha-hat) = E[alpha-hat] - alpha = alpha - alpha = 0" and concludes "the estimator alpha-hat is unbiased." This criterion is fully satisfied.

**Overall Assessment:**
The student demonstrates understanding of all three criteria. The work is clear and reaches the correct conclusion. The only minor weakness is that the transition from E[sum(Xi)] to 100*E[Xi] could have been more explicitly shown as a separate step applying linearity to each individual term, but this is a very minor presentation issue that does not affect the correctness of the solution.

SCORE: 2/2

FEEDBACK
  Latency:  4.0s

--- rubric_anchored ---
  Score:    2.0/2
  Feedback: The student correctly applied linearity of expectation, computed E[alpha-hat] = alpha, and concluded the estimator is unbiased with Bias = 0.
  Latency:  1.1s

Summary:
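For a quick numeric check of agreement, a short follow-up like the sketch below (it assumes the comparison cell above ran and left the rows list in memory) turns the ~0.5-point rule of thumb into a printed verdict.

# Sketch: measure how far the three judge strategies disagree on this one answer.
scores = [float(r['score'].split('/')[0]) for r in rows]
spread = max(scores) - min(scores)
verdict = 'consistent' if spread <= 0.5 else 'inconsistent; prefer rubric_anchored'
print(f'Score spread across strategies: {spread:.2f} pts ({verdict})')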

6. Results

The cells below produce four views of the benchmark results. Run them in order after the benchmark completes.

These cells read from all_scores, which is created in cell 5. If you restarted the kernel after the benchmark, reload the CSV instead:

import pandas as pd
all_scores = pd.read_csv('results/scores_<your_timestamp>.csv').to_dict('records')

6a. Leaderboard

Overall score for each model. Columns:

Column         Meaning
Score          Raw points earned / total possible
Overall %      Combined score across MCQ and short answer
MCQ Accuracy   Percentage of multiple-choice questions answered correctly
Short Answer   Points earned / possible on short-answer questions only

Models are sorted best to worst by overall percentage.

df = pd.DataFrame(all_scores)

def leaderboard(df):
    rows = []
    for model, g in df.groupby('model'):
        earned   = g['points_earned'].sum()
        possible = g['max_points'].sum()
        pct      = 100 * earned / possible if possible else 0
        mcq      = g[g['type'] == 'mcq']
        sa       = g[g['type'] == 'short_answer']
        mcq_acc  = f"{int(100*mcq['correct'].sum()/len(mcq))}%" if len(mcq) else 'N/A'
        sa_score = f"{sa['points_earned'].sum():.1f}/{sa['max_points'].sum()}" if len(sa) else 'N/A'
        rows.append({'Model': model, 'Score': f'{earned:.0f}/{possible}',
                     'Overall %': f'{pct:.0f}%', 'MCQ Accuracy': mcq_acc,
                     'Short Answer': sa_score, '_pct': pct})
    # Sort on the numeric percentage, not the formatted string, so '100%' ranks above '85%'.
    return (pd.DataFrame(rows).sort_values('_pct', ascending=False)
              .drop(columns='_pct').reset_index(drop=True))

leaderboard(df)

6b. Score by Topic

Breaks down each model’s performance by subject area (e.g. pandas, probability, SQL, ML). Useful for identifying which topics are systematically hard or easy for a given model.

A model that scores well overall but poorly on one topic (e.g. 40% on PCA questions) likely lacks training coverage in that area.

topic_df = (df.groupby(['model', 'topic'])
              .apply(lambda g: pd.Series({
                  'earned':   g['points_earned'].sum(),
                  'possible': g['max_points'].sum(),
                  '%':        round(100 * g['points_earned'].sum() / g['max_points'].sum())
              }))
              .reset_index())
topic_df['score'] = topic_df['earned'].astype(str) + '/' + topic_df['possible'].astype(str)
topic_df[['model', 'topic', 'score', '%']].sort_values(['model', '%'], ascending=[True, False])

6c. Per-Question Breakdown

One row per question per model. The result column shows ✓/✗ for MCQ and points/max for short answer. The feedback column shows the judge’s one-sentence comment for short-answer questions.

This view is most useful for spotting specific questions where models diverge — if one model gets a question right and another gets it wrong, that is a meaningful signal about capability differences.

detail = df[['model','question_id','type','topic','points_earned','max_points',
             'answer','correct','feedback','latency']].copy()
detail['result'] = detail.apply(
    lambda r: ('✓' if r['correct'] else '✗') if r['type'] == 'mcq' else f"{r['points_earned']}/{r['max_points']}", axis=1)
detail['answer'] = detail['answer'].str[:60]
detail['feedback'] = detail['feedback'].str[:80]
pd.set_option('display.max_colwidth', 80)
detail[['model','question_id','type','topic','result','answer','feedback','latency']]

6d. Export HTML Report

Generates a self-contained HTML file with the leaderboard and per-question table, styled for easy reading. The file is saved to results/ alongside the CSV.

The report is color-coded by score: green (≥75%), yellow (50–74%), red (<50%). It can be opened in any browser and shared without requiring Python or Jupyter.

from IPython.display import HTML

lb = leaderboard(df)

def color(pct_str):
    p = int(pct_str.replace('%',''))
    return '#4ade80' if p >= 75 else ('#fbbf24' if p >= 50 else '#f87171')

lb_rows = ''.join(
    f"<tr><td>{r['Model']}</td><td>{r['Score']}</td>"
    f"<td style='color:{color(r['Overall %'])};font-weight:600'>{r['Overall %']}</td>"
    f"<td>{r['MCQ Accuracy']}</td><td>{r['Short Answer']}</td></tr>"
    for _, r in lb.iterrows()
)

q_rows = ''.join(
    f"<tr><td>{r['model']}</td><td>{r['question_id']}</td><td>{r['type']}</td>"
    f"<td>{r['topic']}</td>"
    f"<td style='color:{color(str(int(100*r['points_earned']/r['max_points'] if r['max_points'] else 0))+'%')}'>"
    f"{r['points_earned']}/{r['max_points']}</td>"
    f"<td>{'✓' if r['correct'] else ('✗' if r['correct'] is False else '')}</td>"
    f"<td style='color:#94a3b8;font-size:.8em'>{str(r['feedback'])[:80]}</td></tr>"
    for _, r in df.iterrows()
)

html = f"""<!DOCTYPE html><html><head><meta charset='UTF-8'>
<title>Data 100 Benchmark</title>
<style>
  body{{font-family:'Courier New',monospace;background:#0f1117;color:#e2e8f0;padding:2rem}}
  h1{{color:#7dd3fc;margin-bottom:.2rem}} h2{{color:#94a3b8;margin:2rem 0 .6rem;border-bottom:1px solid #1e293b;padding-bottom:.3rem}}
  p{{color:#64748b;font-size:.85rem;margin-bottom:1.5rem}}
  table{{width:100%;border-collapse:collapse;font-size:.85rem}}
  th{{background:#1e293b;color:#7dd3fc;padding:.5rem 1rem;text-align:left}}
  td{{padding:.45rem 1rem;border-bottom:1px solid #1e293b}}
  tr:hover td{{background:#1a2236}}
</style></head><body>
<h1>Data 100 LLM Benchmark</h1>
<p>{exam.name} &mdash; {exam.semester}</p>
<h2>Leaderboard</h2>
<table><tr><th>Model</th><th>Score</th><th>%</th><th>MCQ Acc</th><th>Short Answer</th></tr>
{lb_rows}</table>
<h2>Per-Question Results</h2>
<table><tr><th>Model</th><th>Q</th><th>Type</th><th>Topic</th><th>Score</th><th>MCQ</th><th>Feedback</th></tr>
{q_rows}</table>
</body></html>"""

report_path = csv_path.replace('.csv', '.html')
with open(report_path, 'w') as f:
    f.write(html)
print(f'Report saved → {report_path}')
HTML(f'<a href="{report_path}" target="_blank" style="color:#7dd3fc">Open report ↗</a>')
Report saved → results/scores_20260427_195142.html