
SAT-style Question Answering with GGUF Models

Learning objective: Load a small language model from a shared directory and use it to answer SAT-style questions.

What this notebook teaches

  • how to locate a shared folder of GGUF weights on your laptop or hub

  • how to load a model with llama-cpp-python and walk through a tutor-style prompt

  • how to inspect the model’s answer and reasoning with reflection checkpoints

  • how to compare the model’s reasoning across random, filtered, and batched SAT items

Where the questions come from

  • We download questions from the PineSAT Questionbank API at https://pinesat.com/api/questions.

  • PineSAT hosts community-built SAT-style questions so you can practice with authentic formats without licensing hurdles.

  • The endpoint replies in JSON (JavaScript Object Notation, a plain text key-value format) so we can inspect passages, choices, and answer keys with simple loops.
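For example, a minimal sketch of pulling the English questions and inspecting the first record could look like the cell below. It assumes the field names we inspect later in the notebook (question, choices, correct_answer) and that the choices arrive as a dictionary.

import requests  # call the API over HTTP

# Fetch the English section and peek at the first record
response = requests.get("https://pinesat.com/api/questions", params={"section": "english"})
questions = response.json()

print("Number of questions:", len(questions))
first_item = questions[0]["question"]
print("Question:", first_item["question"])
for label, text in first_item["choices"].items():  # assumes a dict of label -> choice text
    print(label, ":", text)
print("Answer key:", first_item["correct_answer"])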

How to navigate the lesson

  1. Every markdown cell previews the next code cell so you always know why a command matters.

  2. Reflection checkpoints append your thoughts to answers.txt so you can track how your understanding changes.

  3. Later cells batch four easy questions into a table so you can see accuracy trends without extra plotting.

Tip: Keep an eye on the model context size (the amount of text it can read at once) and thread count so you do not overload a shared machine.


from llama_cpp import Llama  # loads llama-cpp-python so we can run GGUF models
import os  # lets us work with file paths
import random  # lets us pick random items from a list
import requests  # lets us call web APIs over HTTP
import pandas as pd  # lets us work with tables of results (DataFrames)
import re  # lets us search text with patterns
import json  # lets us parse structured JSON output from the model
from IPython.display import display  # lets us show interactive widgets inside the notebook
import ipywidgets as widgets  # adds dropdown menus and buttons for simple UI

Check which shared directory is available on this machine. If you are on a hub, the shared directory is usually under /home/jovyan. On a laptop it might live in your Documents folder.


possible_directories = [
    "/home/jovyan/shared/",
    "/home/jovyan/shared_readwrite/",
    "/Users/ericvandusen/SmallLM/Models"
]

existing_directories = []
for directory_path in possible_directories:
    if os.path.exists(directory_path):
        print("Found possible directory:", directory_path)
        existing_directories.append(directory_path)
    else:
        print("Did not find:", directory_path)
Did not find: /home/jovyan/shared/
Did not find: /home/jovyan/shared_readwrite/
Found possible directory: /Users/ericvandusen/SmallLM/Models

Pick a directory to use. We default to the first path that exists, and you can type a different one if you want.


if len(existing_directories) > 0:
    model_directory = existing_directories[0]
    print("Using this directory by default:", model_directory)
else:
    model_directory = input("Type a directory path that contains your .gguf files: ")

print("Current model directory:", model_directory)
Using this directory by default: /Users/ericvandusen/SmallLM/Models
Current model directory: /Users/ericvandusen/SmallLM/Models

List every .gguf model file in the chosen directory so we can pick any model that is available. Use the dropdown to make your choice before running the loader cell.


available_models = []
for filename in os.listdir(model_directory):
    if filename.endswith(".gguf"):
        available_models.append(filename)

if len(available_models) == 0:
    print("No .gguf files found in", model_directory)
else:
    print("Models found in", model_directory)
    dropdown_default = available_models[0]
    for candidate_name in available_models:
        lowercase_name = candidate_name.lower()
        if "qwen" in lowercase_name:
            dropdown_default = candidate_name
            break
    model_dropdown = widgets.Dropdown(
        options=available_models,
        description="Model:",
        value=dropdown_default
    )
    display(model_dropdown)
    print("Use the dropdown to pick a model, then run the next cell to load it.")
Models found in /Users/ericvandusen/SmallLM/Models
Use the dropdown to pick a model, then run the next cell to load it.

Checkpoint #1: Look up the model card for the model you picked. What was that model trained on? What is its context size? How many threads does it use by default?

Load the selected model with llama-cpp-python so it is ready to answer questions.

selected_model_name = model_dropdown.value
model_path = os.path.join(model_directory, selected_model_name)

model = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_threads=4,
    chat_format="chatml",
    verbose=False
)

llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
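If you want to check some of these numbers programmatically instead of from the model card, llama-cpp-python exposes a few of them on the loaded object. This is a small sketch; the exact attributes (especially metadata) depend on your llama-cpp-python version and on what the GGUF file records.

# Inspect the loaded model (attribute names assume a recent llama-cpp-python release)
print("Configured context window:", model.n_ctx())  # the n_ctx value we passed above
print("Model file:", model.model_path)              # path to the loaded .gguf file

# GGUF files carry their own metadata (architecture, training context length, and so on)
for key, value in list(model.metadata.items())[:10]:
    print(key, "=", value)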

Warm-up: ask a simple SAT-style algebra question to confirm that the model responds.

Here is a simple question to get us started. We ask the model to return a structured JSON answer instead of free-form text, which makes grading more reliable. The prompt has three parts:

  • an algebraic equation to solve

  • four answer choices to pick from

  • a strict output rule that forces one JSON field named choice



# Build the question from parts
warmup_question = "If 3x + 5 = 14, what is the value of x?"
warmup_choice_map = {
    "A": "2",
    "B": "3",
    "C": "4",
    "D": "5"
}

warmup_choices_text = "\n".join(
    f"{label}: {value}" for label, value in warmup_choice_map.items()
)

warmup_prompt = f"""Here is an SAT-style multiple-choice question:

Question: {warmup_question}

Choices:
{warmup_choices_text}

Respond ONLY with valid JSON in this exact format: {{"choice": "B"}}
Choose one letter from A, B, C, or D."""

messages = [
    {
        "role": "system",
        "content": (
            "You are an SAT math solver. "
            "Always respond with valid JSON containing a single field 'choice' "
            "set to one of: A, B, C, or D. No explanation, no extra text."
        )
    },
    {"role": "user", "content": warmup_prompt}
]

warmup_response = model.create_chat_completion(
    messages=messages,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "choice": {
                    "type": "string",
                    "enum": ["A", "B", "C", "D"]
                }
            },
            "required": ["choice"]
        }
    },
    temperature=0.0
)

warmup_raw_text = warmup_response["choices"][0]["message"]["content"].strip()
print("Raw model output:", warmup_raw_text)

# Parse with fallback
warmup_choice = ""
try:
    warmup_json = json.loads(warmup_raw_text)
    warmup_choice = str(warmup_json.get("choice", "")).strip().upper()
except json.JSONDecodeError as parse_error:
    print("JSON parse failed:", parse_error)
    # Fallback: scan for a lone A/B/C/D
    fallback_match = re.search(r'\b([A-D])\b', warmup_raw_text.upper())
    if fallback_match:
        warmup_choice = fallback_match.group(1)

if warmup_choice not in ["A", "B", "C", "D"]:
    print("WARNING: Could not extract a valid choice from model output.")
    warmup_choice = "UNKNOWN"

correct = "B"
print(f"Parsed choice:  {warmup_choice}")
print(f"Correct choice: {correct}")
print(f"Result: {'✓ Correct' if warmup_choice == correct else '✗ Wrong'}")
Raw model output: {"choice": "B"}
Parsed choice:  B
Correct choice: B
Result: ✓ Correct
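The cells below repeat this same pattern for every question: build a prompt, request JSON with a single choice field, parse it, and fall back to a regular-expression scan if parsing fails. The later cells keep those steps inline so each one is easy to read on its own, but if you prefer, you could wrap the pattern in a small helper like this sketch (it is not used by the rest of the notebook):

def ask_multiple_choice(llm, question_text, choice_map):
    """Ask one multiple-choice question and return the parsed letter (or UNKNOWN)."""
    labels = list(choice_map.keys())
    options_text = "\n".join(f"{label}: {value}" for label, value in choice_map.items())
    prompt = (
        question_text
        + "\n\nOptions:\n"
        + options_text
        + "\n\nReturn your answer as JSON with this exact shape: {\"choice\": \"A\"}\n"
        + "Use only one label from the provided options."
    )
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful tutor. Return only valid JSON with one field named choice."},
            {"role": "user", "content": prompt},
        ],
        response_format={
            "type": "json_object",
            "schema": {
                "type": "object",
                "properties": {"choice": {"type": "string", "enum": labels}},
                "required": ["choice"],
            },
        },
        temperature=0.0,
    )
    raw_text = response["choices"][0]["message"]["content"]
    try:
        parsed = str(json.loads(raw_text).get("choice", "")).strip().upper()
    except Exception:
        parsed = ""
    if parsed not in labels:
        match = re.search(r"\b(" + "|".join(labels) + r")\b", raw_text.upper())
        parsed = match.group(1) if match else "UNKNOWN"
    return parsed

# Example: re-ask the warm-up question through the helper
print(ask_multiple_choice(model, "Question: " + warmup_question, warmup_choice_map))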

Source an open source set of SAT-style questions

Download the SAT Questionbank from PineSAT so we can pull authentic practice questions. We will use the API endpoint at https://pinesat.com/api/questions, which returns JSON with passages, questions, choices, and answer keys. We can loop through this data to inspect the format and content of the questions.

base_url = "https://pinesat.com/api/questions"

english_questions = requests.get(base_url, params={"section": "english"}).json()
math_questions = requests.get(base_url, params={"section": "math"}).json()

english_nested = pd.DataFrame(english_questions)
math_nested = pd.DataFrame(math_questions)

print("English questions:", len(english_nested))
print("Math questions:", len(math_nested))
English questions: 1443
Math questions: 1031
english_questions[0]  # the full record for the first English item
english_questions[0]["question"]  # the nested question payload
english_questions[0]["question"]['question']  # the question text itself
english_questions[0]["question"]['choices']  # the answer choices
english_questions[0]["question"]['paragraph']  # the passage, if any
english_questions[0]["question"]['correct_answer']  # the answer key

Look at one math question from the bank, without any filtering, so you can see the full structure of a record.

# Pick a math question from the question bank that we can test on
math_questions[0]

{'id': '281a4f3b', 'domain': 'Advanced Math', 'visuals': {'type': 'null', 'svg_content': 'null'}, 'question': {'choices': {'A': 'f(x) = 3,000(0.02)^x', 'B': 'f(x) = 0.98(3,000)^x', 'C': 'f(x) = 3,000(0.002)^x', 'D': 'f(x) = 3,000(0.98)^x'}, 'question': 'A certain college had 3,000 students enrolled in 2015. The college predicts that after 2015, the number of students enrolled each year will be 2% less than the number of students enrolled the year before. Which of the following functions models the relationship between the number of students enrolled, *f(x)*, and the number of years after 2015, *x*?', 'paragraph': 'null', 'explanation': 'Because the change in the number of students decreases by the same percentage each year, the relationship between the number of students and the number of years can be modeled with a decreasing, exponential function in the form *f(x) = a(1 - r)^x*, where *f(x)* is the number of students, *a* is the number of students in 2015, *r* is the rate of decrease each year, and *x* is the number of years since 2015. It’s given that 3,000 students were enrolled in 2015 and that the rate of decrease is predicted to be 2%, or 0.02. Substituting these values into the decreasing exponential function yields f(x) = 3,000(1 - 0.02)^x, which is equivalent to f(x) = 3,000(0.98)^x.', 'correct_answer': 'D'}, 'difficulty': 'Medium'}
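The raw dictionary above is hard to scan. One way to make the nested structure easier to read is to pretty-print it with json.dumps:

# Pretty-print the same record with indentation so the nested structure is easier to read
print(json.dumps(math_questions[0], indent=2, ensure_ascii=False))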

Ask the model - English

We will now pass a random English question to the model, reusing the same structured JSON prompt style from the warm-up. This lets us compare how the model handles an authentic bank question versus a simple one.


# Pick a random question from the English question bank
random_index = random.randint(0, len(english_questions) - 1)
random_entry = english_questions[random_index]

# Pull out the question text, passage, and choices
random_question_text = random_entry["question"]["question"]
random_paragraph_text = random_entry["question"].get("paragraph", "")
random_choices_raw = random_entry["question"]["choices"]

choice_labels = ["A", "B", "C", "D", "E", "F"]
random_choice_map = {}

if isinstance(random_choices_raw, dict):
    for raw_label in random_choices_raw:
        clean_label = str(raw_label).strip().upper()
        random_choice_map[clean_label] = str(random_choices_raw[raw_label])
elif isinstance(random_choices_raw, list):
    for item_index in range(len(random_choices_raw)):
        if item_index < len(choice_labels):
            mapped_label = choice_labels[item_index]
            random_choice_map[mapped_label] = str(random_choices_raw[item_index])
else:
    random_choice_map["A"] = str(random_choices_raw)

valid_labels = []
for label in choice_labels:
    if label in random_choice_map:
        valid_labels.append(label)

random_option_lines = []
for label in valid_labels:
    random_option_lines.append(label + ": " + random_choice_map[label])
random_options_text = "\n".join(random_option_lines)

if len(random_paragraph_text) > 0:
    english_prompt_question_text = "Passage: " + random_paragraph_text + "\n\nQuestion: " + random_question_text
else:
    english_prompt_question_text = "Question: " + random_question_text

english_structured_prompt = """
{question_text}

Options:
{options_text}

Return your answer as JSON with this exact shape:
{{"choice": "A"}}
Use only one label from the provided options.
"""

print("Question:", random_question_text)
if len(random_paragraph_text) > 0:
    print("Paragraph:", random_paragraph_text)
print("Options:")
for option_line in random_option_lines:
    print(option_line)

random_messages = []
random_messages.append({"role": "system", "content": "You are a helpful tutor. Return only valid JSON with one field named choice."})
random_messages.append({"role": "user", "content": english_structured_prompt.format(question_text=english_prompt_question_text, options_text=random_options_text)})

random_response = model.create_chat_completion(
    messages=random_messages,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "choice": {
                    "type": "string",
                    "enum": valid_labels
                }
            },
            "required": ["choice"]
        }
    },
    temperature=0.0
)

random_raw_text = random_response["choices"][0]["message"]["content"]
print("Raw model output:", random_raw_text)

random_choice = ""
try:
    random_json = json.loads(random_raw_text)
    random_choice = str(random_json.get("choice", "")).strip().upper()
except Exception as parse_error:
    print("Could not parse JSON directly:", parse_error)

if random_choice not in valid_labels:
    fallback_pattern = r"\b(" + "|".join(valid_labels) + r")\b"
    fallback_match = re.search(fallback_pattern, random_raw_text.upper())
    if fallback_match is not None:
        random_choice = fallback_match.group(1)

print("Parsed choice:", random_choice)
Question: What is the most likely reason why the author states that the Earth’s magnetic field has reversed “hundreds of times” in the past?
Paragraph: The main purpose of the passage is to argue that it’s difficult to know when the Earth’s magnetic field will reverse. The author emphasizes this point by noting that the field has reversed “hundreds of times” in the past but that “no one knows for sure” when the next reversal will occur. The author also acknowledges that the field is changing right now, suggesting that a reversal could be occurring soon.  In the following passage, the author’s main purpose is to
Options:
A: To provide evidence that the Earth’s magnetic field is unpredictable.
B: To contrast the Earth’s magnetic field with other natural phenomena.
C: To discuss the history of the Earth’s magnetic field.
D: To explain how the Earth’s magnetic field is changing.
Raw model output: {"choice": "A"}
Parsed choice: A
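The bank also stores an answer key for each item, so you can check the parsed choice against it. A small sketch using the correct_answer field we inspected earlier:

# Compare the parsed choice against the bank's answer key for this item
answer_key = str(random_entry["question"].get("correct_answer", "")).strip().upper()
print("Answer key:", answer_key)
print("Result:", "correct" if random_choice == answer_key else "wrong")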

Ask the model - Math

Now let’s repeat the process with a random math question from the bank. This allows us to compare how the model handles different subjects and question formats.

# Pick a random question from the Math question bank
random_index = random.randint(0, len(math_questions) - 1)
random_entry = math_questions[random_index]

# Pull out the question text and choices
random_question_text = random_entry["question"]["question"]
random_choices_raw = random_entry["question"]["choices"]

choice_labels = ["A", "B", "C", "D", "E", "F"]
random_choice_map = {}

if isinstance(random_choices_raw, dict):
    for raw_label in random_choices_raw:
        clean_label = str(raw_label).strip().upper()
        random_choice_map[clean_label] = str(random_choices_raw[raw_label])
elif isinstance(random_choices_raw, list):
    for item_index in range(len(random_choices_raw)):
        if item_index < len(choice_labels):
            mapped_label = choice_labels[item_index]
            random_choice_map[mapped_label] = str(random_choices_raw[item_index])
else:
    random_choice_map["A"] = str(random_choices_raw)

valid_labels = []
for label in choice_labels:
    if label in random_choice_map:
        valid_labels.append(label)

random_option_lines = []
for label in valid_labels:
    random_option_lines.append(label + ": " + random_choice_map[label])
random_options_text = "\n".join(random_option_lines)

math_structured_prompt = """
Question: {question_text}

Options:
{options_text}

Return your answer as JSON with this exact shape:
{{"choice": "A"}}
Use only one label from the provided options.
"""

print("Question:", random_question_text)
print("Options:")
for option_line in random_option_lines:
    print(option_line)

random_messages = []
random_messages.append({"role": "system", "content": "You are a helpful tutor. Return only valid JSON with one field named choice."})
random_messages.append({"role": "user", "content": math_structured_prompt.format(question_text=random_question_text, options_text=random_options_text)})

random_response = model.create_chat_completion(
    messages=random_messages,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "choice": {
                    "type": "string",
                    "enum": valid_labels
                }
            },
            "required": ["choice"]
        }
    },
    temperature=0.0
)

random_raw_text = random_response["choices"][0]["message"]["content"]
print("Raw model output:", random_raw_text)

random_choice = ""
try:
    random_json = json.loads(random_raw_text)
    random_choice = str(random_json.get("choice", "")).strip().upper()
except Exception as parse_error:
    print("Could not parse JSON directly:", parse_error)

if random_choice not in valid_labels:
    fallback_pattern = r"\b(" + "|".join(valid_labels) + r")\b"
    fallback_match = re.search(fallback_pattern, random_raw_text.upper())
    if fallback_match is not None:
        random_choice = fallback_match.group(1)

print("Parsed choice:", random_choice)
Question: If $x + 2y = 6$ and $x - 2y = 4$, what is the value of $x$?
Options:
A: 1
B: 2
C: 5
D: 10
Raw model output: {"choice": "A"}
Parsed choice: A

Checkpoint #2: Why are we testing the model on questions from the domains provided in the bank? Write a short explanation.
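If you want to see what is actually in the bank before answering, you can tally the domain and difficulty fields we saw in the record above. A quick sketch with pandas:

# Tally the domains and difficulties in the math bank to see what we could filter on
math_domains = pd.Series([entry.get("domain", "") for entry in math_questions])
math_difficulties = pd.Series([entry.get("difficulty", "") for entry in math_questions])
print(math_domains.value_counts())
print(math_difficulties.value_counts())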

Build a mini SAT practice set

Let's build a set of test questions that we can then pass to the model in a batch. This allows us to see how the model performs across multiple questions and identify patterns in its strengths and weaknesses.

Filter the bank by difficulty and subject so you can target specific skills.


difficulty_widget = widgets.Dropdown(
    options=["Easy", "Medium", "Hard"],
    value="Easy",
    description="Difficulty:"
)

section_widget = widgets.Dropdown(
    options=["English", "Math"],
    value="English",
    description="Section:"
)

num_questions_widget = widgets.Dropdown(
    options=[2, 4, 8, 10, 12],
    value=4,
    description="# Questions:"
)

display(difficulty_widget)
display(section_widget)
display(num_questions_widget)
print("Choose your settings above, then run the next cell to test the model.")
Choose your settings above, then run the next cell to test the model.

Read your choices from the widgets, filter the bank by difficulty, and build the practice set.

# Read the widget values
chosen_difficulty = difficulty_widget.value
chosen_section = section_widget.value
chosen_count = num_questions_widget.value

# Pick the right question list based on section
if chosen_section == "English":
    question_pool = english_questions
else:
    question_pool = math_questions

# Filter by difficulty and shuffle so we get a fresh sample each run
filtered_questions = []
for question_entry in question_pool:
    if question_entry.get("difficulty", "").lower() == chosen_difficulty.lower():
        filtered_questions.append(question_entry)

random.shuffle(filtered_questions)
practice_set = filtered_questions[:chosen_count]

# Pull out question text, choices, paragraph, and correct answer for each item
practice_questions = []
practice_choices = []
practice_paragraphs = []
practice_answers = []
for question_entry in practice_set:
    practice_questions.append(question_entry["question"]["question"])
    practice_choices.append(question_entry["question"]["choices"])
    practice_paragraphs.append(question_entry["question"].get("paragraph", ""))
    practice_answers.append(question_entry["question"].get("correct_answer", ""))

print("Built a practice set of", len(practice_questions), chosen_difficulty, chosen_section, "questions.")
for item_index in range(len(practice_questions)):
    print("\nQ" + str(item_index + 1) + ":", practice_questions[item_index])
    if len(practice_paragraphs[item_index]) > 0:
        print("Passage:", practice_paragraphs[item_index])
    print("Choices:", practice_choices[item_index])
    print("Correct answer:", practice_answers[item_index])
Built a practice set of 4 Medium Math questions.

Q1: A survey of 100 people found that 60 people like apples, 40 people like oranges, and 20 people like both apples and oranges. How many people like only apples?
Passage: null
Choices: {'A': '20', 'B': '40', 'C': '60', 'D': '80'}
Correct answer: A

Q2: If $x$ is a positive integer, what is the smallest possible value of $x$ such that $\frac{x^2 - 4}{x - 2}$ is an integer?
Passage: null
Choices: {'A': '1', 'B': '2', 'C': '3', 'D': '4'}
Correct answer: D

Q3: A circle has a radius of 5 units.  What is the area, in square units, of the circle?
Passage: null
Choices: {'A': '5\\pi', 'B': '10\\pi', 'C': '25\\pi', 'D': '50\\pi'}
Correct answer: C

Q4: A survey of 200 people found that 120 people like apples, 100 people like oranges, and 60 people like both apples and oranges. How many of the people surveyed like neither apples nor oranges?
Passage: null
Choices: {'A': '20', 'B': '40', 'C': '60', 'D': '80'}
Correct answer: A

Now send each question to the model and collect the results in a table.

In this step, each model call asks for JSON output with a single field named choice. We then parse that JSON and compare it to the answer key.

batch_results = []
choice_labels = ["A", "B", "C", "D", "E", "F"]

for item_index in range(len(practice_questions)):
    passage_text = practice_paragraphs[item_index]
    if len(passage_text) > 0:
        full_question_text = "Passage: " + passage_text + "\n\nQuestion: " + practice_questions[item_index]
    else:
        full_question_text = "Question: " + practice_questions[item_index]

    question_choices_raw = practice_choices[item_index]
    question_choice_map = {}

    if isinstance(question_choices_raw, dict):
        for raw_label in question_choices_raw:
            clean_label = str(raw_label).strip().upper()
            question_choice_map[clean_label] = str(question_choices_raw[raw_label])
    elif isinstance(question_choices_raw, list):
        for choice_index in range(len(question_choices_raw)):
            if choice_index < len(choice_labels):
                mapped_label = choice_labels[choice_index]
                question_choice_map[mapped_label] = str(question_choices_raw[choice_index])
    else:
        question_choice_map["A"] = str(question_choices_raw)

    valid_labels = []
    for label in choice_labels:
        if label in question_choice_map:
            valid_labels.append(label)

    option_lines = []
    for label in valid_labels:
        option_lines.append(label + ": " + question_choice_map[label])
    options_text = "\n".join(option_lines)

    batch_structured_prompt = """
{question_text}

Options:
{options_text}

Return your answer as JSON with this exact shape:
{{"choice": "A"}}
Use only one label from the provided options.
"""

    batch_messages = []
    batch_messages.append({"role": "system", "content": "You are a helpful tutor. Return only valid JSON with one field named choice."})
    batch_messages.append({"role": "user", "content": batch_structured_prompt.format(question_text=full_question_text, options_text=options_text)})

    batch_response = model.create_chat_completion(
        messages=batch_messages,
        response_format={
            "type": "json_object",
            "schema": {
                "type": "object",
                "properties": {
                    "choice": {
                        "type": "string",
                        "enum": valid_labels
                    }
                },
                "required": ["choice"]
            }
        },
        temperature=0.0
    )

    batch_raw_text = batch_response["choices"][0]["message"]["content"]

    extracted_answer = ""
    try:
        batch_json = json.loads(batch_raw_text)
        extracted_answer = str(batch_json.get("choice", "")).strip().upper()
    except Exception:
        extracted_answer = ""

    if extracted_answer not in valid_labels:
        fallback_pattern = r"\b(" + "|".join(valid_labels) + r")\b"
        fallback_match = re.search(fallback_pattern, batch_raw_text.upper())
        if fallback_match is not None:
            extracted_answer = fallback_match.group(1)

    is_correct = False
    correct_answer_text = str(practice_answers[item_index]).strip().upper()
    if len(correct_answer_text) > 0 and len(extracted_answer) > 0:
        if extracted_answer == correct_answer_text[0]:
            is_correct = True

    batch_results.append({
        "Section": chosen_section,
        "Difficulty": chosen_difficulty,
        "Question": practice_questions[item_index],
        "Correct Answer": practice_answers[item_index],
        "Model Guess": extracted_answer,
        "Correct?": is_correct
    })

batch_results_table = pd.DataFrame(batch_results)
batch_results_table
# Let's build a way to retrieve the question and model answer pairs from the batch results table, so we can review them more easily.
batch_results_table[1:3][["Question", "Model Guess", "Correct Answer"]]

Checkpoint #3: Try an easy question from a subject you choose. Did the model get it right? Explain how you checked.


student_reply_three = input("Describe what you tried and whether the model's answer matched the key.\n")
with open('answers.txt', 'a') as answer_file:
    answer_file.write(student_reply_three)
    answer_file.write('\n')
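If you want to review everything you have logged so far, you can read the file back. This assumes answers.txt already exists in the working directory (it is created the first time a checkpoint writes to it).

# Read back every reflection logged so far
with open('answers.txt') as answer_file:
    print(answer_file.read())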

Print a summary showing how many questions the model got right overall.


correct_count = 0
for result_row in batch_results:
    if result_row["Correct?"] == True:
        correct_count = correct_count + 1

print("Score:", correct_count, "out of", len(batch_results))
print(batch_results_table)
Score: 3 out of 4
  Section Difficulty                                           Question  \
0    Math     Medium  A survey of 100 people found that 60 people li...   
1    Math     Medium  If $x$ is a positive integer, what is the smal...   
2    Math     Medium  A circle has a radius of 5 units.  What is the...   
3    Math     Medium  A survey of 200 people found that 120 people l...   

  Correct Answer Model Guess  Correct?  
0              A           A      True  
1              D           A     False  
2              C           C      True  
3              A           A      True  
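An equivalent summary with pandas: because Correct? is a boolean column, its mean is the fraction of questions the model answered correctly.

# The mean of the boolean column is the accuracy for this batch
accuracy = batch_results_table["Correct?"].mean()
print(f"Accuracy: {accuracy:.0%}")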

Summary: You located a shared model directory, picked any GGUF file, loaded it with llama-cpp-python, and exercised it on SAT-style questions with and without filters. You also logged your reflections in answers.txt.