
Talking to OLMo 3: Open-Source AI via the OpenRouter API

Most of the AI models you hear about — like ChatGPT or Claude — are closed-source. That means the company that built them does not share the training code, the training data, or the exact model weights. You can use the model, but you cannot inspect how it was built.

OLMo 3 is different. It was built by AllenAI (the Allen Institute for AI), a non-profit research lab. Everything about OLMo is public: the training code, the training data, and the model weights. Anyone can download it, study it, and build on top of it.

In this notebook, we will call OLMo 3 through OpenRouter — a service that gives you access to hundreds of AI models through a single API. Along the way, we will learn how system messages shape what a model says, and we will run benchmark-style tests to see how the model performs on standardized questions.

Learning Objective: By the end of this notebook, you will be able to call the OpenRouter API to access OLMo 3, explain what system messages are and how they change model behavior, and run benchmark-style tests to evaluate a language model.

Imports

We need three things:

  • os to read environment variables (where we store our API key)

  • dotenv to load those variables from a file

  • openai to send requests — OpenRouter uses the same format as the OpenAI API

import os  # read environment variables stored on this computer

try:
    from dotenv import load_dotenv  # read secrets from a .env file into environment variables
except ImportError:
    %pip install python-dotenv  # %pip installs into the running kernel's environment
    from dotenv import load_dotenv

try:
    from openai import OpenAI  # official OpenAI Python package — also works with OpenRouter
except ImportError:
    %pip install openai  # %pip installs into the running kernel's environment
    from openai import OpenAI

About OpenRouter

OpenRouter is a service that acts as a single front door to hundreds of AI models. Instead of signing up separately for Anthropic, OpenAI, and AllenAI, you create one OpenRouter account, load funds once, and use a single API key to access all of them.

OpenRouter uses the exact same API format as OpenAI. The only difference is the base URL — we point our requests at https://openrouter.ai/api/v1 instead of OpenAI’s servers.

You can browse all available models at: https://openrouter.ai/models

About OLMo 3 and AllenAI

AllenAI (Allen Institute for AI) is a non-profit research lab founded in Seattle. Its mission is to make AI research open and accessible.

OLMo stands for Open Language Model. Unlike GPT or Claude, OLMo is fully open:

  • The training code is on GitHub.

  • The training data is published (called Dolma).

  • The model weights can be downloaded for free.

This openness matters for science: researchers can study exactly why the model behaves the way it does, which is impossible with closed models.

OLMo 3 is the third generation. It comes in different sizes — a 7-billion-parameter version and a 32-billion-parameter version. A parameter is a number inside the model that was learned during training. More parameters generally means the model can store more knowledge and handle harder tasks, but also costs more to run.
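To get a rough sense of what those parameter counts mean in practice, here is a back-of-envelope estimate of how much memory just the weights occupy at 16-bit precision (2 bytes per parameter). This is a simplification: real serving footprints vary with quantization and runtime overhead.

```python
# Back-of-envelope: memory needed just to hold the weights at 16-bit precision.
# Actual deployments vary (quantization, activation memory, overhead).
bytes_per_param = 2  # 16 bits = 2 bytes

for name, params in [("OLMo 3 7B", 7e9), ("OLMo 3 32B", 32e9)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB of weights")
```

This is one reason the 7B model is cheaper and faster to run: it simply takes far less hardware to hold.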

Step 1 — Load Your API Key

An API key is a secret string that proves to OpenRouter’s servers that you are allowed to use their service. It is tied to a billing account, so treat it like a credit card number — never paste it directly into a notebook you will share.

We store the key in a file called .env that lives outside this notebook. That file looks like this:

OPENROUTER_API_KEY="sk-or-v1-..."

The load_dotenv function reads that file and puts the key into memory. We then read it with os.getenv.

# Your file path may differ based on your setup. Adjust as needed.
load_dotenv('/home/jovyan/shared/.env')

openrouter_api_key = os.getenv('OPENROUTER_API_KEY')

print("API Key loaded:", "✅ Ready" if openrouter_api_key else "❌ Not found — check your .env file")

If the key was not found, you can paste it directly here instead (never commit this line to GitHub):

# Uncomment the next line and paste your key if load_dotenv did not find it
# openrouter_api_key = "sk-or-v1-..."  # Replace with your actual OpenRouter API key if needed

Step 2 — Create a Client

A client is a Python object that holds your credentials and handles the technical work of sending HTTP requests. Think of it as picking up a phone and dialing a number — the client makes the connection so you can have a conversation.

Because OpenRouter uses the same format as the OpenAI API, we can use the OpenAI Python class. We just need to tell it two things:

  1. Our OpenRouter API key (api_key)

  2. The URL of OpenRouter’s servers (base_url)

The base_url is the only thing that changes compared to a regular OpenAI setup.

openrouter_client = OpenAI(
    api_key=openrouter_api_key,
    base_url="https://openrouter.ai/api/v1"
)

print("Client created. Ready to talk to OLMo 3 via OpenRouter.")

The OLMo 3 Model IDs

To use a model on OpenRouter, we need its model ID — a string that uniquely identifies it. The format for AllenAI models on OpenRouter is: allenai/model-name.

OLMo 3 comes in two sizes:

Model ID                     Size                   Best for
allenai/olmo-3-7b-instruct   7 billion parameters   Quick answers, lower cost
allenai/olmo-3-32b-instruct  32 billion parameters  Complex reasoning, more thorough answers

The word instruct in the name means the model was fine-tuned to follow instructions and answer questions. A base model (without “instruct”) just predicts the next word and is harder to use in conversation.

Note: Model IDs can change as AllenAI releases new versions. Always check https://openrouter.ai/models and search for “olmo” to confirm the latest IDs.

# Store the model IDs in variables so we can easily change them later
olmo_small = "allenai/olmo-3-7b-instruct"   # 7 billion parameters — faster
olmo_large = "allenai/olmo-3-32b-instruct"  # 32 billion parameters — more capable

print("Small model:", olmo_small)
print("Large model:", olmo_large)

Concept 1 — What is a System Message?

Every conversation you have with a language model can start with a system message. A system message is a special instruction that you give to the model before the conversation begins. The user never sees it — it runs quietly in the background.

Think of it like a stage director giving an actor their role before a performance. The actor (the model) will stay in character for the whole show.

A conversation sent to the API is a list of messages. Each message has two parts:

  • role: who is speaking — either "system", "user", or "assistant"

  • content: what they said

Here is the structure:

messages = [
    {"role": "system",    "content": "You are a helpful science tutor."},
    {"role": "user",      "content": "What is photosynthesis?"},
]

The system message sets the tone, personality, and constraints. Let us see this in action.

# A simple first call: ask OLMo a question with a system message
first_response = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": "You are a helpful science tutor for first-year college students."},
        {"role": "user",   "content": "What is photosynthesis?"}
    ]
)

print(first_response.choices[0].message.content)

Experiment 1 — The Same Question, Three Different System Messages

Now let us see how dramatically a system message can change the answer. We will ask the same question — “What is photosynthesis?” — but give the model three different roles:

  1. A science tutor explaining it simply

  2. A botanist writing for a scientific journal

  3. A chef trying to relate everything back to cooking

Notice how the vocabulary, tone, and detail level all shift.

# Three different system messages for the same question
system_message_tutor = "You are a patient science tutor. Explain things simply using everyday words. Keep your answer to three sentences."
system_message_scientist = "You are a research botanist writing for a peer-reviewed journal. Use precise scientific language. Keep your answer to three sentences."
system_message_chef = "You are a professional chef who explains everything using cooking analogies. Keep your answer to three sentences."

user_question = "What is photosynthesis?"

# --- Role 1: Tutor ---
response_tutor = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": system_message_tutor},
        {"role": "user",   "content": user_question}
    ]
)

print("=" * 60)
print("ROLE: Science Tutor")
print("=" * 60)
print(response_tutor.choices[0].message.content)
print()

Now the same question, but as a scientist:

# --- Role 2: Scientist ---
response_scientist = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": system_message_scientist},
        {"role": "user",   "content": user_question}
    ]
)

print("=" * 60)
print("ROLE: Research Botanist")
print("=" * 60)
print(response_scientist.choices[0].message.content)
print()

And now as a chef:

# --- Role 3: Chef ---
response_chef = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": system_message_chef},
        {"role": "user",   "content": user_question}
    ]
)

print("=" * 60)
print("ROLE: Chef")
print("=" * 60)
print(response_chef.choices[0].message.content)
print()

All three responses answer the same question, but they sound completely different. This shows the power of system messages. When you build an application on top of a language model, the system message is one of the most important tools you have.
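The three calls above all share the same shape, so in your own notebooks you might factor the pattern into a small helper. This is just a convenience sketch, not part of the official API, and `ask_olmo` is a name we are inventing here:

```python
def build_messages(system_message, user_message):
    """Assemble the two-message list the chat API expects."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]

def ask_olmo(client, model, system_message, user_message):
    """Send one system + user pair and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(system_message, user_message),
    )
    return response.choices[0].message.content

# Example usage (uses the client and variables defined earlier in this notebook):
# print(ask_olmo(openrouter_client, olmo_small, system_message_chef, user_question))
```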

Concept 2 — What are Benchmarks?

When researchers build a new AI model, they need a way to measure how good it is. They use benchmarks — standardized sets of questions with known correct answers.

A benchmark is like a standardized test for a model. Just as a SAT score lets you compare students from different schools, a benchmark score lets you compare models from different companies.

Here are some well-known benchmarks:

Benchmark    Full Name                                  What it Tests
MMLU         Massive Multitask Language Understanding   World knowledge across 57 subjects (science, law, math, history, ...)
ARC          AI2 Reasoning Challenge                    Grade-school science questions requiring reasoning
GSM8K        Grade School Math 8K                       Math word problems that require multi-step reasoning
HellaSwag                                               Completing everyday sentences to test common-sense reasoning
TruthfulQA                                              Questions where humans often give wrong answers due to misconceptions

In this notebook, we will write our own informal benchmark-style tests. We will ask OLMo questions from each of these categories and look at the responses.

We will also compare the 7B model to the 32B model to see if the larger model gives better answers.
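Because our benchmark-style prompts ask for the letter first, we can sketch a tiny grading function. The regex below is a deliberate simplification: it grabs the first standalone A-D it sees, which can misfire on prose that happens to begin with the word "A". Real benchmark harnesses use much more careful answer extraction.

```python
import re

def extract_letter(answer_text):
    """Pull the first standalone choice letter (A-D) out of a model answer.
    Naive on purpose: a sentence starting with the word 'A' would also match."""
    match = re.search(r"\b([A-D])\b", answer_text)
    return match.group(1) if match else None

def grade(answer_text, correct_letter):
    """Return True if the extracted letter matches the known correct answer."""
    return extract_letter(answer_text) == correct_letter

print(grade("C) The mitochondrion produces ATP.", "C"))  # True
print(grade("The answer is B.", "C"))                    # False
```

You could call `grade` on each model response below to tally informal scores for the 7B and 32B models.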

Benchmark Test 1 — Commonsense Reasoning (HellaSwag style)

Commonsense reasoning tests whether a model understands how everyday situations work. We give the model the start of a scenario and ask it to pick the most sensible next step.

We will ask both the 7B and 32B models and compare their responses.

# Commonsense reasoning question (HellaSwag style)
benchmark_system_message = "You are taking a multiple-choice test. Answer with the letter only, then explain your reasoning in one sentence."

reasoning_question = """A woman is outside with a bucket and a garden hose. She fills the bucket with the hose.
What happens next?

A) She waters the flowers.
B) She drinks from the bucket.
C) She puts the bucket in the oven.
D) She flies the bucket like a kite."""

# Small model
response_small_reasoning = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": benchmark_system_message},
        {"role": "user",   "content": reasoning_question}
    ]
)

# Large model
response_large_reasoning = openrouter_client.chat.completions.create(
    model=olmo_large,
    messages=[
        {"role": "system", "content": benchmark_system_message},
        {"role": "user",   "content": reasoning_question}
    ]
)

print("QUESTION:")
print(reasoning_question)
print()
print("--- 7B Model Answer ---")
print(response_small_reasoning.choices[0].message.content)
print()
print("--- 32B Model Answer ---")
print(response_large_reasoning.choices[0].message.content)

Benchmark Test 2 — Math Word Problem (GSM8K style)

GSM8K is a benchmark of grade-school math word problems. The questions require multiple steps of arithmetic and careful reading. Language models often struggle with math, so this is a revealing test.

We will ask both model sizes and check whether the larger model is more reliable.

# Math word problem (GSM8K style)
math_system_message = "You are a careful math tutor. Show each step of your work clearly before giving the final answer."

math_question = """Janet has 24 apples. She gives away one-third of them to her neighbor.
Then she buys 8 more apples at the store.
How many apples does Janet have now?"""

# Small model
response_small_math = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": math_system_message},
        {"role": "user",   "content": math_question}
    ]
)

# Large model
response_large_math = openrouter_client.chat.completions.create(
    model=olmo_large,
    messages=[
        {"role": "system", "content": math_system_message},
        {"role": "user",   "content": math_question}
    ]
)

print("QUESTION:")
print(math_question)
print("Correct answer: 24 apples")
print()
print("--- 7B Model Answer ---")
print(response_small_math.choices[0].message.content)
print()
print("--- 32B Model Answer ---")
print(response_large_math.choices[0].message.content)
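Before judging the models, we can verify the expected answer ourselves with plain Python:

```python
# Walk through the word problem step by step
apples = 24
apples -= 24 // 3   # gives away one-third of her 24 apples (8 apples), leaving 16
apples += 8         # buys 8 more at the store
print("Expected answer:", apples)  # 24
```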

Benchmark Test 3 — World Knowledge (MMLU style)

MMLU stands for Massive Multitask Language Understanding. It covers 57 different subjects — from biology to law to economics. It tests whether a model has absorbed factual knowledge from its training data.

Here is an MMLU-style question from biology:

# World knowledge question (MMLU style)
knowledge_system_message = "You are taking a multiple-choice exam. Answer with the letter only, then explain your reasoning in one sentence."

knowledge_question = """Which organelle is responsible for producing ATP through cellular respiration in eukaryotic cells?

A) Ribosome
B) Nucleus
C) Mitochondrion
D) Golgi apparatus"""

# Small model
response_small_knowledge = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": knowledge_system_message},
        {"role": "user",   "content": knowledge_question}
    ]
)

# Large model
response_large_knowledge = openrouter_client.chat.completions.create(
    model=olmo_large,
    messages=[
        {"role": "system", "content": knowledge_system_message},
        {"role": "user",   "content": knowledge_question}
    ]
)

print("QUESTION:")
print(knowledge_question)
print("Correct answer: C")
print()
print("--- 7B Model Answer ---")
print(response_small_knowledge.choices[0].message.content)
print()
print("--- 32B Model Answer ---")
print(response_large_knowledge.choices[0].message.content)

Benchmark Test 4 — Truthfulness (TruthfulQA style)

TruthfulQA is a tricky benchmark. It asks questions where many humans give the wrong answer because of popular myths or misconceptions. A model that just memorizes common patterns will repeat the wrong answer. A truthful model will correct the misconception.

This tests whether OLMo can reason past common myths.

# Truthfulness question (TruthfulQA style)
truthful_system_message = "You are a careful fact-checker. If a question contains a false premise, correct it. Give a short, accurate answer."

truthful_question = "We only use 10 percent of our brains. Which part of the brain is unused?"

# Small model
response_small_truth = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=[
        {"role": "system", "content": truthful_system_message},
        {"role": "user",   "content": truthful_question}
    ]
)

# Large model
response_large_truth = openrouter_client.chat.completions.create(
    model=olmo_large,
    messages=[
        {"role": "system", "content": truthful_system_message},
        {"role": "user",   "content": truthful_question}
    ]
)

print("QUESTION:")
print(truthful_question)
print("Expected: The model should identify that the 10% claim is a myth.")
print()
print("--- 7B Model Answer ---")
print(response_small_truth.choices[0].message.content)
print()
print("--- 32B Model Answer ---")
print(response_large_truth.choices[0].message.content)

Concept 3 — Multi-Turn Conversations

So far, every request has been a single question and answer. But real conversations involve multiple turns — the user and the model go back and forth.

To have a multi-turn conversation, we add each new message to the list before sending it. We include the model’s previous responses (with role: "assistant") so the model knows what was already said.

Here is the pattern:

Turn 1:  [system, user_1]  →  model responds → assistant_1
Turn 2:  [system, user_1, assistant_1, user_2]  →  model responds → assistant_2
Turn 3:  [system, user_1, assistant_1, user_2, assistant_2, user_3]  →  ...

The model reads all the previous messages before writing its next reply. This is how it “remembers” what was said earlier.

# Start a multi-turn conversation about the scientific method
# The conversation history grows with each turn

conversation_history = [
    {"role": "system", "content": "You are a patient science tutor helping a first-year student. Keep answers short and clear."}
]

# --- Turn 1: Ask a basic question ---
turn_1_question = "What is a hypothesis?"
conversation_history.append({"role": "user", "content": turn_1_question})

response_turn_1 = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=conversation_history
)

turn_1_answer = response_turn_1.choices[0].message.content
conversation_history.append({"role": "assistant", "content": turn_1_answer})

print("Student: ", turn_1_question)
print("OLMo:   ", turn_1_answer)
print()

Now we ask a follow-up question. The model will remember the previous answer:

# --- Turn 2: Follow-up question referring to the previous answer ---
turn_2_question = "Can you give me an example of a good hypothesis about plants?"
conversation_history.append({"role": "user", "content": turn_2_question})

response_turn_2 = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=conversation_history
)

turn_2_answer = response_turn_2.choices[0].message.content
conversation_history.append({"role": "assistant", "content": turn_2_answer})

print("Student: ", turn_2_question)
print("OLMo:   ", turn_2_answer)
print()

One more turn to test how well the model maintains context:

# --- Turn 3: A deeper follow-up ---
turn_3_question = "How would I test that hypothesis in a classroom?"
conversation_history.append({"role": "user", "content": turn_3_question})

response_turn_3 = openrouter_client.chat.completions.create(
    model=olmo_small,
    messages=conversation_history
)

turn_3_answer = response_turn_3.choices[0].message.content
conversation_history.append({"role": "assistant", "content": turn_3_answer})

print("Student: ", turn_3_question)
print("OLMo:   ", turn_3_answer)
print()
print("Total messages in conversation history:", len(conversation_history))

Notice that on each turn, we add two messages to the history: the user message and the model’s reply. After three turns, we have 7 messages total (1 system + 3 user + 3 assistant).

Every time we send a request, we send the entire history. This is how the model knows what was already said — it reads the whole conversation from the beginning.

This is also why models have a context window limit: the model can only read a fixed amount of text at once. Once the conversation grows past that limit, older messages must be dropped or summarized.
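One common workaround is to trim the history before each request, keeping the system message plus only the most recent messages. The sketch below illustrates the idea with a message-count cap; real applications usually trim by token count instead, and `trim_history` is a name we are inventing here, not an API feature.

```python
def trim_history(history, max_messages=7):
    """Keep the system message(s) plus the most recent other messages,
    so the total stays at or under max_messages."""
    if len(history) <= max_messages:
        return history
    system_messages = [m for m in history if m["role"] == "system"]
    other_messages = [m for m in history if m["role"] != "system"]
    keep = max_messages - len(system_messages)
    return system_messages + other_messages[-keep:]

# Example: 1 system message plus 4 user/assistant turns = 9 messages
long_history = [{"role": "system", "content": "You are a tutor."}]
for i in range(4):
    long_history.append({"role": "user", "content": f"question {i}"})
    long_history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(long_history)
print(len(long_history), "->", len(trimmed))      # 9 -> 7
print(trimmed[0]["role"], trimmed[1]["content"])  # system question 1
```

The trade-off is that the model genuinely forgets whatever was dropped, so the oldest turns can no longer influence its replies.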

Summary

In this notebook, you:

  • Connected to OpenRouter, a single-API gateway to hundreds of AI models.

  • Used OLMo 3 from AllenAI, a fully open-source language model whose training code, data, and weights are all public.

  • Learned what a system message is: a hidden instruction that sets the model’s role and tone before the conversation begins.

  • Ran benchmark-style tests based on four real research benchmarks:

    • HellaSwag (commonsense reasoning)

    • GSM8K (math word problems)

    • MMLU (world knowledge)

    • TruthfulQA (resisting common misconceptions)

  • Compared the 7B and 32B versions of OLMo 3 to observe how model size affects answer quality.

  • Built a multi-turn conversation by maintaining a running list of messages.

The key insight is that open-source models like OLMo 3 give researchers the ability to study exactly what the model learned and why it makes mistakes — something that is impossible with closed models. That transparency is what makes them valuable for scientific research.