SAT-style Question Answering with GGUF Models¶
Learning objective: Load a small language model from a shared directory and use it to answer SAT-style questions.
What this notebook teaches¶
how to locate a shared folder of GGUF weights on your laptop or hub
how to load a model with llama-cpp-python and walk through a tutor-style prompt
how to inspect the model’s answer and reasoning with reflection checkpoints
how to compare the model’s reasoning across random, filtered, and batched SAT items
Where the questions come from¶
We download questions from the PineSAT Questionbank API at https://pinesat.com/api/questions. PineSAT hosts community-built SAT-style questions so you can practice with authentic formats without licensing hurdles.
The endpoint replies in JSON (JavaScript Object Notation, a plain text key-value format) so we can inspect passages, choices, and answer keys with simple loops.
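As a tiny sketch of what that inspection looks like (the dictionary below is hand-built in the shape of the entries we fetch later, not real PineSAT data):

```python
# A hand-built stand-in for one API entry (not real PineSAT data)
sample_entry = {
    "question": {
        "question": "What is 2 + 2?",
        "choices": {"A": "3", "B": "4", "C": "5", "D": "6"},
        "correct_answer": "B",
    },
    "difficulty": "Easy",
}

# Simple loops and key lookups are all we need to inspect the nested fields
for label, text in sample_entry["question"]["choices"].items():
    print(f"{label}: {text}")
print("Answer key:", sample_entry["question"]["correct_answer"])
```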
How to navigate the lesson¶
Every markdown cell previews the next code cell so you always know why a command matters.
Reflection checkpoints append your thoughts to answers.txt so you can track how your understanding changes.
Later cells batch four easy questions into a table so you can see accuracy trends without extra plotting.
Tip: Keep an eye on the model context size (the amount of text it can read at once) and thread count so you do not overload a shared machine.
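For the thread count, one hedged rule of thumb is to leave a couple of cores free for other users; `os.cpu_count()` reports how many the machine has:

```python
import os

# os.cpu_count() may return None on some platforms, so fall back to 1
total_cores = os.cpu_count() or 1

# Leave two cores free so a shared machine stays responsive for others
suggested_threads = max(1, total_cores - 2)
print(f"{total_cores} cores detected; {suggested_threads} threads is a safe choice")
```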
from llama_cpp import Llama # loads llama-cpp-python so we can run GGUF models
import os # lets us work with file paths
import random # lets us pick random items from a list
import requests # lets us call web APIs over HTTP
import pandas as pd # lets us build and summarize tables of results
import re # lets us search text with patterns
import json # lets us parse structured JSON output from the model
from IPython.display import display # lets us show interactive widgets inside the notebook
import ipywidgets as widgets # adds dropdown menus and buttons for simple UI
Check which shared directory is available on this machine. If you are on a hub, the shared directory is usually under /home/jovyan. On a laptop it might live in your Documents folder.
possible_directories = [
"/home/jovyan/shared/",
"/home/jovyan/shared_readwrite/",
"/Users/ericvandusen/SmallLM/Models"
]
existing_directories = []
for directory_path in possible_directories:
if os.path.exists(directory_path):
print("Found possible directory:", directory_path)
existing_directories.append(directory_path)
else:
print("Did not find:", directory_path)
Found possible directory: /home/jovyan/shared/
Did not find: /home/jovyan/shared_readwrite/
Did not find: /Users/ericvandusen/SmallLM/Models
Pick a directory to use. We default to the first path that exists, and you can type a different one if you want.
if len(existing_directories) > 0:
model_directory = existing_directories[0]
print("Using this directory by default:", model_directory)
else:
model_directory = input("Type a directory path that contains your .gguf files: ")
print("Current model directory:", model_directory)
Using this directory by default: /home/jovyan/shared/
Current model directory: /home/jovyan/shared/
List every .gguf model file in the chosen directory so we can pick any model that is available. Use the dropdown to make your choice before running the loader cell.
available_models = []
for filename in os.listdir(model_directory):
if filename.endswith(".gguf"):
available_models.append(filename)
if len(available_models) == 0:
print("No .gguf files found in", model_directory)
else:
print("Models found in", model_directory)
dropdown_default = available_models[0]
for candidate_name in available_models:
lowercase_name = candidate_name.lower()
if "qwen" in lowercase_name:
dropdown_default = candidate_name
break
model_dropdown = widgets.Dropdown(
options=available_models,
description="Model:",
value=dropdown_default
)
display(model_dropdown)
print("Use the dropdown to pick a model, then run the next cell to load it.")
Models found in /home/jovyan/shared/
Use the dropdown to pick a model, then run the next cell to load it.
Checkpoint #1: Look up the model card for the model you picked. What was that model trained on? What is its context size? How many threads does it use by default?
Load the selected model with llama-cpp-python so it is ready to answer questions.
selected_model_name = model_dropdown.value
model_path = os.path.join(model_directory, selected_model_name)
model = Llama(
model_path=model_path,
n_ctx=2048,
n_threads=4,
chat_format="chatml",
verbose=False
)
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
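The warning above just means we asked for a 2048-token window even though the model was trained with a 32768-token one; that is fine for short questions. As a rough budgeting sketch (the 4-characters-per-token figure is an approximation, not an exact rule):

```python
# Context window and reply length used in this notebook
n_ctx = 2048
max_reply_tokens = 512

# Rough heuristic: one token is about 4 characters of English text
approx_chars_per_token = 4

# Whatever the reply does not use is available for the prompt
prompt_token_budget = n_ctx - max_reply_tokens
prompt_char_budget = prompt_token_budget * approx_chars_per_token
print(f"Keep prompts under roughly {prompt_char_budget} characters")
```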
Warm-up: ask a simple SAT-style algebra question to confirm that the model responds.
Here is a simple question to get us started. We ask the model to follow a strict answer template instead of free-form text, which makes its replies easier to grade. The prompt contains:
an algebraic equation to solve
four answer choices to pick from
a strict output rule that forces the reply into one fixed sentence
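Small models do not always follow the template exactly (they may drop the asterisks or add extra words), so it can help to grade with a tolerant parser. Here is a sketch of one, using a regex with a fallback; `extract_choice` is our own helper, not part of llama-cpp-python:

```python
import re

def extract_choice(raw_text):
    """Pull a single answer letter A-D out of a model reply."""
    # Prefer the letter right after "answer is", allowing optional '*' or '{'
    match = re.search(r"answer is\s*\*?\{?([A-D])", raw_text)
    if match:
        return match.group(1)
    # Fallback: the first standalone capital A-D anywhere in the reply
    fallback = re.search(r"\b([A-D])\b", raw_text)
    return fallback.group(1) if fallback else None

print(extract_choice("The correct answer is A: 2"))   # A
print(extract_choice("The correct answer is *{C}*"))  # C
```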
# Build the question from parts
warmup_question = "If 3x + 5 = 14, what is the value of x?"
warmup_choice_map = {
"A": "2",
"B": "3",
"C": "4",
"D": "5"
}
warmup_choices_text = "\n".join(
f"{label}: {value}" for label, value in warmup_choice_map.items()
)
warmup_prompt = f"""Here is an SAT-style multiple-choice question:
Question: {warmup_question}
Choices:
{warmup_choices_text}
Respond in this exact format: The correct answer is *{{insert correct answer}}*
Choose one letter from A, B, C, or D."""
messages = [
{
"role": "system",
"content": (
"You are an SAT math solver. ",
"Always respond with the following format: The correct answer is *{{insert correct answer}}*"
)
},
{"role": "user", "content": warmup_prompt}
]
warmup_response = model.create_chat_completion(
messages=messages,
max_tokens=512,
temperature=0.6,
top_p=0.95,
)
warmup_raw_text = warmup_response["choices"][0]["message"]["content"].strip()
warmup_raw_text
'The correct answer is A: 2'
Source an open source set of SAT-style questions¶
Download the SAT Questionbank from PineSAT so we can pull authentic practice questions. We will use the API endpoint at https://pinesat.com/api/questions.
base_url = "https://pinesat.com/api/questions"
english_questions = requests.get(base_url, params={"section": "english"}).json()
math_questions = requests.get(base_url, params={"section": "math"}).json()
english_nested = pd.DataFrame(english_questions)
math_nested = pd.DataFrame(math_questions)
print("English questions:", len(english_nested))
print("Math questions:", len(math_nested))English questions: 1443
Math questions: 1031
english_questions[0]
{'id': 'random_id_a1',
'domain': 'Information and Ideas',
'visuals': {'type': 'null', 'svg_content': 'null'},
'question': {'choices': {'A': 'Suppressing opinions robs future generations of the chance to hear them, even if they disagree with them.',
'B': 'It is harmful to silence opinions that are held by a majority of people.',
'C': 'People who dissent from an opinion are more likely to be harmed by its suppression than those who hold it.',
'D': 'It is important to respect all opinions, even if they are wrong.'},
'question': 'What is Mill\'s main point in this passage from "On Liberty"?',
'paragraph': 'In the essay "On Liberty," John Stuart Mill argues that "the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it." What is Mill\'s main point in this passage?',
'explanation': 'Mill argues that suppressing opinions is a harm to everyone, including future generations, because it prevents them from hearing and potentially engaging with these ideas. He also emphasizes that those who disagree with the silenced opinion are more likely to be harmed because it prevents them from developing their own understanding and potentially challenging it.',
'correct_answer': 'A'},
'difficulty': 'Medium'}
english_questions[0]["question"]
{'choices': {'A': 'Suppressing opinions robs future generations of the chance to hear them, even if they disagree with them.',
'B': 'It is harmful to silence opinions that are held by a majority of people.',
'C': 'People who dissent from an opinion are more likely to be harmed by its suppression than those who hold it.',
'D': 'It is important to respect all opinions, even if they are wrong.'},
'question': 'What is Mill\'s main point in this passage from "On Liberty"?',
'paragraph': 'In the essay "On Liberty," John Stuart Mill argues that "the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it." What is Mill\'s main point in this passage?',
'explanation': 'Mill argues that suppressing opinions is a harm to everyone, including future generations, because it prevents them from hearing and potentially engaging with these ideas. He also emphasizes that those who disagree with the silenced opinion are more likely to be harmed because it prevents them from developing their own understanding and potentially challenging it.',
'correct_answer': 'A'}
english_questions[0]["question"]['question']
'What is Mill\'s main point in this passage from "On Liberty"?'
english_questions[0]["question"]['choices']
{'A': 'Suppressing opinions robs future generations of the chance to hear them, even if they disagree with them.',
'B': 'It is harmful to silence opinions that are held by a majority of people.',
'C': 'People who dissent from an opinion are more likely to be harmed by its suppression than those who hold it.',
'D': 'It is important to respect all opinions, even if they are wrong.'}
english_questions[0]["question"]['paragraph']
'In the essay "On Liberty," John Stuart Mill argues that "the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it." What is Mill\'s main point in this passage?'
english_questions[0]["question"]['correct_answer']
'A'
Pick a random question from the bank without filtering so you can see the full structure.
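Since every entry keeps its useful fields nested under the "question" key, a small helper keeps later cells tidy. `unpack_entry` is our own convenience function, and the demo entry below is hand-built in the bank's shape:

```python
def unpack_entry(entry):
    """Return (question text, choices dict, correct letter) from one bank entry."""
    inner = entry["question"]
    return inner["question"], inner["choices"], inner["correct_answer"]

# Hand-built demo entry shaped like the question bank's responses
demo_entry = {
    "question": {
        "question": "If 3x + 5 = 14, what is the value of x?",
        "choices": {"A": "2", "B": "3", "C": "4", "D": "5"},
        "correct_answer": "B",
    }
}

text, choices, answer = unpack_entry(demo_entry)
print(text, "->", answer)
```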
# Pick a math question from the question bank that we can test on
math_questions[0]
{'id': '281a4f3b',
'domain': 'Advanced Math',
'visuals': {'type': 'null', 'svg_content': 'null'},
'question': {'choices': {'A': 'f(x) = 3,000(0.02)^x',
'B': 'f(x) = 0.98(3,000)^x',
'C': 'f(x) = 3,000(0.002)^x',
'D': 'f(x) = 3,000(0.98)^x'},
'question': 'A certain college had 3,000 students enrolled in 2015. The college predicts that after 2015, the number of students enrolled each year will be 2% less than the number of students enrolled the year before. Which of the following functions models the relationship between the number of students enrolled, *f(x)*, and the number of years after 2015, *x*?',
'paragraph': 'null',
'explanation': 'Because the change in the number of students decreases by the same percentage each year, the relationship between the number of students and the number of years can be modeled with a decreasing, exponential function in the form *f(x) = a(1 - r)^x*, where *f(x)* is the number of students, *a* is the number of students in 2015, *r* is the rate of decrease each year, and *x* is the number of years since 2015. It’s given that 3,000 students were enrolled in 2015 and that the rate of decrease is predicted to be 2%, or 0.02. Substituting these values into the decreasing exponential function yields f(x) = 3,000(1 - 0.02)^x, which is equivalent to f(x) = 3,000(0.98)^x.',
'correct_answer': 'D'},
'difficulty': 'Medium'}
Ask the model - English¶
We will now pass the random question to the model with the same tutor-style prompt we used for the warm-up question. This allows us to compare how the model handles a random question versus a simple one.
Ask the model to answer the random question. The same friendly tutor prompt is reused so you can compare responses.
# Pick a random question from the English question bank
random_index = random.randint(0, len(english_questions) - 1)
random_entry = english_questions[random_index]
# Pull out the question text, passage, and choices
random_question_text = random_entry["question"]["question"]
random_paragraph_text = random_entry["question"].get("paragraph", "")
random_choices_raw = random_entry["question"]["choices"]
# If the question contains a passage, input that as the question instead
if random_paragraph_text and random_paragraph_text != "null":
random_question_text = random_paragraph_text
english_structured_prompt = f"""Here is an SAT-style multiple-choice question:
Question: {random_question_text}
Choices:
{random_choices_raw}
Respond in this exact format: The correct answer is *{{insert correct answer}}*
Choose one letter from A, B, C, or D."""
print("Question:", random_question_text)
if random_paragraph_text and random_paragraph_text != "null":
print("Paragraph:", random_paragraph_text)
print("Options:")
for item, amount in random_choices_raw.items():
print(f"* {item}: {amount}")
random_messages = [
{
"role": "system",
"content": (
"You are an SAT math solver. ",
"Always respond with the following format: The correct answer is *{{insert correct answer}}*"
)
},
{"role": "user", "content": english_structured_prompt}
]
random_response = model.create_chat_completion(
messages= random_messages,
max_tokens=512,
temperature=0.6,
top_p=0.95,
)
random_raw_text = random_response["choices"][0]["message"]["content"]
random_raw_text
Question: In the passage, the author suggests that there are a variety of ways in which people can be influenced by others. The author argues that people who do not support certain ideas may be influenced by the opinions of the people who do support the ideas. In addition, the author suggests that people may be influenced by the way that things are presented. For example, the author notes that people may be more likely to agree with an idea if it is presented in a way that makes it seem more appealing or logical. What is the author's main point?
Paragraph: In the passage, the author suggests that there are a variety of ways in which people can be influenced by others. The author argues that people who do not support certain ideas may be influenced by the opinions of the people who do support the ideas. In addition, the author suggests that people may be influenced by the way that things are presented. For example, the author notes that people may be more likely to agree with an idea if it is presented in a way that makes it seem more appealing or logical. What is the author's main point?
Options:
* A: people are always influenced by the opinions of others.
* B: people are more likely to be influenced by the opinions of those who are close to them.
* C: people are influenced by a variety of factors, including the opinions of others and the way in which ideas are presented.
* D: people are more likely to be influenced by the opinions of those who are experts in a particular field.
'The correct answer is *{C}'
Ask the model - Math¶
Now let’s repeat the process with a random math question from the bank. This allows us to compare how the model handles different subjects and question formats.
# Pick a random question from the Math question bank
random_index = random.randint(0, len(math_questions) - 1)
random_entry = math_questions[random_index]
# Pull out the question text and choices
random_question_text = random_entry["question"]["question"]
random_choices_raw = random_entry["question"]["choices"]
math_structured_prompt = f"""Here is an SAT-style multiple-choice question:
Question: {random_question_text}
Choices:
{random_choices_raw}
Respond in this exact format: The correct answer is *{{insert correct answer}}*
Choose one letter from A, B, C, or D."""
print("Question:", random_question_text)
print("Options:")
for item, amount in random_choices_raw.items():
print(f"* {item}: {amount}")
random_messages = [
{
"role": "system",
"content": (
"You are an SAT math solver. ",
"Always respond with the following format: The correct answer is *{{insert correct answer}}*"
)
},
{"role": "user", "content": math_structured_prompt}
]
random_response = model.create_chat_completion(
messages= random_messages,
max_tokens=512,
temperature=0.6,
top_p=0.95,
)
random_raw_text = random_response["choices"][0]["message"]["content"]
random_raw_text
Question: The area of a rectangle is 24 square centimeters. If the length of the rectangle is 6 centimeters, what is the width, in centimeters, of the rectangle?
Options:
* A: 4
* B: 6
* C: 8
* D: 12
'The correct answer is B: 6'
Checkpoint #2: Why are we testing the model with the domains provided in the bank? Write a short explanation.
## YOUR ANSWER HERE
Build a mini SAT practice set¶
Let's build a set of test questions that we can pass to the model in a batch. This lets us see how the model performs across multiple questions and spot patterns in its strengths and weaknesses.
Filter the bank by difficulty and subject so you can target specific skills.
difficulty_widget = widgets.Dropdown(
options=["Easy", "Medium", "Hard"],
value="Easy",
description="Difficulty:"
)
section_widget = widgets.Dropdown(
options=["English", "Math"],
value="English",
description="Section:"
)
num_questions_widget = widgets.Dropdown(
options=[ 2, 4, 8, 10, 12],
value=4,
description="# Questions:"
)
display(difficulty_widget)
display(section_widget)
display(num_questions_widget)
print("Choose your settings above, then run the next cell to test the model.")
Choose your settings above, then run the next cell to test the model.
Now send each question to the model and collect the results in a table.
# Read the widget values
chosen_difficulty = difficulty_widget.value
chosen_section = section_widget.value
chosen_count = num_questions_widget.value
# Pick the right question list based on section
if chosen_section == "English":
question_pool = english_questions
else:
question_pool = math_questions
# Filter by difficulty and shuffle so we get a fresh sample each run
filtered_questions = []
for question_entry in question_pool:
if question_entry.get("difficulty", "").lower() == chosen_difficulty.lower():
filtered_questions.append(question_entry)
random.shuffle(filtered_questions)
practice_set = filtered_questions[:chosen_count]
# Create an empty table to fill with responses
batch_results_table = pd.DataFrame(columns = ['Question', 'Choices', 'Model Response', 'Correct Answer', 'Correct?'])
batch_results_table
# This cell runs the model on each randomly selected question
for question in practice_set:
# Pull out the question text and choices
random_question_text = question["question"]["question"]
random_choices_raw = question["question"]["choices"]
structured_prompt = f"""Here is an SAT-style multiple-choice question:
Question: {random_question_text}
Choices:
{random_choices_raw}
Respond in this exact format: The correct answer is *{{insert correct answer}}*
Choose one letter from A, B, C, or D."""
random_messages = [
{
"role": "system",
"content": (
"You are an SAT math solver. ",
"Respond with the following format, but replace *{{insert correct answer}}* with the right answer: The correct answer is *{{insert correct answer}}*."
)
},
{"role": "user", "content": structured_prompt}
]
random_response = model.create_chat_completion(
messages= random_messages,
max_tokens=512,
temperature=0.6,
top_p=0.95,
)
random_raw_text = random_response["choices"][0]["message"]["content"]
# Check whether the model picked the correct letter
correct_answer = question["question"]["correct_answer"]
# Look for the letter right after "answer is" so capital letters elsewhere in the reply do not count
answer_match = re.search(r"answer is\s*\*?\{?([A-D])", random_raw_text)
model_letter = answer_match.group(1) if answer_match else None
if model_letter == correct_answer:
grade = 'CORRECT'
else:
grade = 'INCORRECT'
new_entry = [random_question_text, random_choices_raw, random_raw_text, correct_answer, grade]
batch_results_table.loc[len(batch_results_table)] = new_entry
batch_results_table
Now, we can examine the questions more closely if we’d like.
# Let's build a way to retrieve the question and model answer pairs from the batch results table, so we can review them more easily.
batch_results_table[1:3]
We can also take a closer look at the questions the model got wrong.
batch_results_table[batch_results_table['Correct?'] == 'INCORRECT']
Checkpoint #3: Try an easy question from a subject you choose. Did the model get it right? Explain how you checked.
student_reply_three = input("Describe what you tried and whether the model's answer matched the key.\n")
with open('answers.txt', 'a') as answer_file:
answer_file.write(student_reply_three)
answer_file.write('\n')
Describe what you tried and whether the model's answer matched the key.
hi
Print a summary showing how many questions the model got right overall.
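pandas can also break the grades down directly with value_counts. This sketch builds a tiny stand-in table so it runs on its own; with the real batch_results_table you would skip that step:

```python
import pandas as pd

# Tiny stand-in for batch_results_table (one grade column is enough here)
demo_results = pd.DataFrame({"Correct?": ["CORRECT", "INCORRECT", "CORRECT", "CORRECT"]})

# normalize=True converts counts to fractions; multiply by 100 for percentages
grade_percentages = demo_results["Correct?"].value_counts(normalize=True) * 100
print(grade_percentages)
```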
len(batch_results_table[batch_results_table['Correct?'] == 'CORRECT']) / len(batch_results_table) * 100
25.0
Summary: You located a shared model directory, picked any GGUF file, loaded it with llama-cpp-python, and exercised it on SAT-style questions with and without filters. You also logged your reflections in answers.txt.