Imagine you are building an AI study buddy for your classmates. Your study buddy needs to answer questions, explain concepts, and remember what was already said in the conversation.
But there is a problem: different AI models cost very different amounts of money. Do you really need the most expensive model, or will the cheapest one do the job?
In this notebook, we will run real experiments to find out. Along the way, you will learn how these models actually work under the hood.
Learning Objective: By the end of this notebook, you will be able to call the Anthropic Claude API, explain what tokens and context windows are, compare model quality against cost, and tune parameters like temperature, top-p, and top-k to control how an AI responds.
import os
try:
from dotenv import load_dotenv
except ImportError:
!pip install python-dotenv
from dotenv import load_dotenv
try:
import anthropic
except ImportError:
!pip install anthropic
import anthropic
Step 1 — Load Your API Key¶
An API key is a secret password that proves to Anthropic’s servers that you are allowed to use their service. It is tied to a billing account, so treat it like a credit card number.
We store the key in a file called .env (outside of this notebook) so it never gets accidentally shared on GitHub.
The file looks like this:
ANTHROPIC_API_KEY="sk-ant-..."
The load_dotenv function reads that file and puts the key into memory as an environment variable.
We then read it with os.getenv.
# Your file path may differ based on your setup. Adjust as needed.
load_dotenv('/home/jovyan/shared/.env')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
print("API Key loaded:", "✅ Ready" if anthropic_api_key else "❌ Not found — check your .env file")API Key loaded: ❌ Not found — check your .env file
If the key was not found, you can paste it directly here instead (never commit this line to GitHub):
# Uncomment the next line and paste your key if load_dotenv did not find it
#anthropic_api_key = "sk-..." # Replace with your actual API key if neededStep 2 — Create a Client¶
A client is a Python object that holds your API key and handles the technical details of sending requests to Anthropic’s servers. Think of it like dialing a phone number — the client makes the connection so you can have a conversation.
claude_client = anthropic.Anthropic(api_key=anthropic_api_key)
print("Client created. Ready to talk to Claude.")# what models are available?
models = claude_client.models.list()
for model in models.data:
print(f"{model.display_name} — {model.id}")Concept 1 — What is a Token?¶
Before we send a single message, we need to understand tokens.
A token is the basic unit of text that a language model reads and writes. It is not a word and it is not a character. It is something in between.
Here are some examples:
The word "hello" is 1 token.
The word "unbelievable" is typically 3 tokens: un, believ, able.
A space before a word is usually part of the token: " hello" is still 1 token.
A short sentence of 10 words is roughly 13–15 tokens.
Why does this matter? Because Anthropic charges you per token, not per word or per message. The longer your prompt and the longer the response, the more tokens are used, and the more you pay.
As a rough rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words. So 1,000 words is about 1,333 tokens.
# Let's estimate tokens for a few example strings using the rough rule of thumb
example_texts = [
"Hello",
"What is the Pythagorean theorem?",
"Explain the difference between supervised and unsupervised machine learning, with examples.",
"Write a 500-word essay on the history of the Roman Empire."
]
print(f"{'Text':<60} {'Chars':>6} {'~Tokens':>8}")
print("-" * 76)
for example_text in example_texts:
character_count = len(example_text)
estimated_token_count = character_count / 4
print(f"{example_text:<60} {character_count:>6} {estimated_token_count:>8.0f}")The estimates above are rough. The real token count depends on the tokenizer — a piece of software that converts text into tokens. Different models can have different tokenizers.
Luckily, when we make an actual API call, Claude tells us the exact token count in the response. We will see this in action shortly.
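The rough estimate is handy, but you can also get an exact count before paying for a generation. Here is a minimal sketch, assuming your version of the anthropic SDK exposes the token-counting endpoint (messages.count_tokens):
# Minimal sketch: count input tokens without generating a response.
# Assumes a recent anthropic SDK version that includes messages.count_tokens.
count_result = claude_client.messages.count_tokens(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "What is the Pythagorean theorem?"}]
)
print(f"Exact input tokens: {count_result.input_tokens}")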
Concept 2 — The Context Window¶
Every language model has a context window — a limit on how much text it can read and write in one conversation.
Think of it like a whiteboard. The model can only see what is written on the whiteboard right now. Once the whiteboard is full, it cannot hold any more text.
The context window is measured in tokens. If the context window is 200,000 tokens, and there are roughly 750 words per 1,000 tokens, that is about 150,000 words — or the length of an entire novel.
Why does this matter for our study buddy? If your conversation grows past the limit, the API cannot accept it all, so you will have to drop or summarize earlier parts; the model effectively "forgets" them. You need to be aware of how much history you are passing in.
All current Claude models share the same context window size:
# Context window information for current Claude models
# Source: https://docs.anthropic.com/en/docs/about-claude/models
claude_model_info = {
"claude-haiku-4-20250514": {"nickname": "Haiku", "context_window": 200000},
"claude-sonnet-4-20250514": {"nickname": "Sonnet", "context_window": 200000},
"claude-opus-4-5": {"nickname": "Opus", "context_window": 200000},
}
print(f"{'Model':<30} {'Nickname':<10} {'Context Window':>15} {'Approx. words':>15}")
print("-" * 72)
for model_id in claude_model_info:
model_details = claude_model_info[model_id]
context_tokens = model_details["context_window"]
approx_words = int(context_tokens * 0.75)
print(f"{model_id:<30} {model_details['nickname']:<10} {context_tokens:>15,} {approx_words:>15,}")Concept 3 — The Claude Model Family¶
Anthropic offers three levels of Claude models. Each level trades off quality for cost and speed:
| Model | Nickname | Best For | Cost (per 1M tokens) |
|---|---|---|---|
| claude-haiku-4-5 | Haiku | Simple Q&A, summaries, quick lookups | $0.80 input / $4.00 output |
| claude-sonnet-4-5 | Sonnet | Analysis, writing, coding, most tasks | $3.00 input / $15.00 output |
| claude-opus-4-5 | Opus | Complex research, hard reasoning, nuanced writing | $15.00 input / $75.00 output |
Haiku is named after the short Japanese poem form — it is fast and lightweight. Sonnet is a longer poem form — more capable. Opus is the grandest form — the most powerful but also the most expensive.
For our study buddy, Haiku might be good enough for simple questions. Let’s find out.
# Pricing per million tokens (as of early 2026 — check https://claude.com/pricing#api for updates)
claude_pricing = {
"claude-haiku-4-5-20251001": {"input_per_million": 0.80, "output_per_million": 4.00},
"claude-sonnet-4-5-20250929": {"input_per_million": 3.00, "output_per_million": 15.00},
"claude-opus-4-5": {"input_per_million": 15.00, "output_per_million": 75.00},
}
Experiment 1 — Ask Haiku a Simple Question¶
Let’s give Haiku its first test. We’ll ask it to explain a basic math concept — the kind of thing a student might ask their study buddy.
The API call has these key parts:
model — which Claude model to use
max_tokens — the maximum number of tokens the response can be (like a word count limit)
system — a background instruction that sets the AI’s role or personality
messages — the list of messages in the conversation so far
haiku_model_id = "claude-haiku-4-5-20251001"
haiku_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=300,
system="You are a friendly study buddy helping a college student understand math and science.",
messages=[
{"role": "user", "content": "What is the Pythagorean theorem? Give a simple example."}
]
)
haiku_answer = haiku_response.content[0].text
print("=== Haiku's Answer ===")
print(haiku_answer)
print()
print(f"Input tokens used: {haiku_response.usage.input_tokens}")
print(f"Output tokens used: {haiku_response.usage.output_tokens}")
print(f"Total tokens used: {haiku_response.usage.input_tokens + haiku_response.usage.output_tokens}")That looked pretty solid for a simple question!
Now let’s ask the same question to Sonnet — the more expensive, more capable model. Can you spot the differences in quality?
sonnet_model_id = "claude-sonnet-4-5-20250929"
sonnet_response = claude_client.messages.create(
model=sonnet_model_id,
max_tokens=300,
system="You are a friendly study buddy helping a college student understand math and science.",
messages=[
{"role": "user", "content": "What is the Pythagorean theorem? Give a simple example."}
]
)
sonnet_answer = sonnet_response.content[0].text
print("=== Sonnet's Answer ===")
print(sonnet_answer)
print()
print(f"Input tokens used: {sonnet_response.usage.input_tokens}")
print(f"Output tokens used: {sonnet_response.usage.output_tokens}")
print(f"Total tokens used: {sonnet_response.usage.input_tokens + sonnet_response.usage.output_tokens}")Comparing Cost: What Did Those Two Calls Actually Cost?¶
Both models answered the same question. But how much did each one cost?
Remember: pricing is per million tokens. So the formula for one call is:
cost = (input_tokens / 1,000,000) * input_price_per_million
     + (output_tokens / 1,000,000) * output_price_per_million
# Calculate cost for the Haiku call
haiku_prices = claude_pricing[haiku_model_id]
haiku_input_tokens = haiku_response.usage.input_tokens
haiku_output_tokens = haiku_response.usage.output_tokens
haiku_cost = (haiku_input_tokens / 1_000_000) * haiku_prices["input_per_million"]
haiku_cost = haiku_cost + (haiku_output_tokens / 1_000_000) * haiku_prices["output_per_million"]
# Calculate cost for the Sonnet call
sonnet_prices = claude_pricing[sonnet_model_id]
sonnet_input_tokens = sonnet_response.usage.input_tokens
sonnet_output_tokens = sonnet_response.usage.output_tokens
sonnet_cost = (sonnet_input_tokens / 1_000_000) * sonnet_prices["input_per_million"]
sonnet_cost = sonnet_cost + (sonnet_output_tokens / 1_000_000) * sonnet_prices["output_per_million"]
print(f"Haiku — {haiku_input_tokens} input + {haiku_output_tokens} output tokens — Cost: ${haiku_cost:.6f}")
print(f"Sonnet — {sonnet_input_tokens} input + {sonnet_output_tokens} output tokens — Cost: ${sonnet_cost:.6f}")
print()
cost_ratio = sonnet_cost / haiku_cost
print(f"Sonnet cost {cost_ratio:.1f}x more than Haiku for this question.")Experiment 2 — A Harder Question¶
The Pythagorean theorem question was simple. Both models probably did well.
Now let’s try something that requires more reasoning — a question about algorithm tradeoffs. This is the kind of question where you might actually need the smarter model.
We will use the same setup, just with a harder question.
hard_question = """Compare bubble sort, merge sort, and quicksort.
For each one, explain: the time complexity, the space complexity, and a real situation
where you would choose that algorithm over the others."""
haiku_hard_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=500,
system="You are a friendly study buddy helping a college student understand computer science.",
messages=[
{"role": "user", "content": hard_question}
]
)
sonnet_hard_response = claude_client.messages.create(
model=sonnet_model_id,
max_tokens=500,
system="You are a friendly study buddy helping a college student understand computer science.",
messages=[
{"role": "user", "content": hard_question}
]
)
print("=== Haiku on the hard question ===")
print(haiku_hard_response.content[0].text)
print()
print("=" * 60)
print()
print("=== Sonnet on the hard question ===")
print(sonnet_hard_response.content[0].text)
Take a moment to read both answers carefully.
Did both models cover all three algorithms?
Were the real-world examples useful?
Was one response noticeably clearer or more complete?
This is what “good enough” looks like in practice. For many questions, Haiku is surprisingly capable. The extra cost of Sonnet is only worth it when you notice a real difference in quality.
Concept 4 — Temperature: How Random Should the AI Be?¶
When a language model generates the next word (or token), it does not pick just one answer. It assigns a probability to every possible next token. Then it samples from those probabilities.
Temperature controls how spread out those probabilities are:
Temperature = 0.0 — The model always picks the single most likely token. The output is essentially deterministic (the same, or very nearly the same, every time you ask). Use this for factual tasks where consistency matters.
Temperature = 1.0 — The model samples according to raw probabilities. The output is more creative and varied. Use this for brainstorming or creative writing.
Temperature > 1.0 — Claude’s API only accepts values from 0.0 to 1.0, but some other providers allow higher temperatures, which make the output even more random. Rarely useful in practice.
Think of temperature as a confidence dial:
Low temperature = the model commits to its best guess.
High temperature = the model is willing to explore unusual ideas.
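To make the dial concrete, here is a tiny illustrative sketch (made-up scores, not Claude’s actual internals) of how temperature reshapes the probability distribution over four candidate next tokens:
import math

# Hypothetical raw scores (logits) for four candidate next tokens — illustration only
logits = {"blue": 4.0, "clear": 2.5, "vast": 1.0, "purple": 0.2}

def softmax_with_temperature(scores, temperature):
    # Dividing the scores by a low temperature sharpens the distribution;
    # temperature = 1.0 leaves the raw probabilities unchanged.
    exp_scores = {token: math.exp(score / temperature) for token, score in scores.items()}
    total = sum(exp_scores.values())
    return {token: value / total for token, value in exp_scores.items()}

for temp in [0.2, 1.0]:
    probabilities = softmax_with_temperature(logits, temp)
    summary = ", ".join(f"{token}: {prob:.2f}" for token, prob in probabilities.items())
    print(f"temperature = {temp}: {summary}")
At temperature 0.2, nearly all of the probability piles onto the top token; at 1.0, the alternatives stay in play.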
Let’s see this in action by asking for a one-sentence summary several times — three runs at temperature 0, then three runs at temperature 1.0.
temperature_prompt = "In one sentence, explain what machine learning is."
# Call the API three times with temperature = 0 (should produce near-identical answers)
print("--- Temperature = 0 (deterministic) ---")
print("Running the same prompt three times...")
print()
for run_number in range(1, 4):
low_temp_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=80,
temperature=0.0,
messages=[
{"role": "user", "content": temperature_prompt}
]
)
print(f"Run {run_number}: {low_temp_response.content[0].text.strip()}")
    print()
With temperature = 0, the answers should be identical or nearly identical every single run. This is great for tasks like: factual lookups, structured data extraction, classification.
Now watch what happens when we crank the temperature up to 1.0.
# Call the API three times with temperature = 1.0 (should produce varied answers)
print("--- Temperature = 1.0 (creative) ---")
print("Running the same prompt three times...")
print()
for run_number in range(1, 4):
high_temp_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=80,
temperature=1.0,
messages=[
{"role": "user", "content": temperature_prompt}
]
)
print(f"Run {run_number}: {high_temp_response.content[0].text.strip()}")
    print()
Notice how the three high-temperature answers differ in wording, structure, and sometimes emphasis. None of them are wrong — they are just different ways of expressing the same idea.
For our study buddy app, we might use:
Temperature = 0 when we need a consistent, factual answer (e.g., “What year did World War I start?”)
Temperature = 0.7 when we want a natural, conversational tone but still mostly accurate
Concept 5 — Top-P: Nucleus Sampling¶
Top-P (also called nucleus sampling) is another way to control randomness. Instead of adjusting the temperature of all probabilities, top-p filters out the low-probability tokens entirely.
Here is how it works:
Rank all possible next tokens from most likely to least likely.
Keep only the top tokens whose combined probability adds up to P.
Sample only from those tokens.
For example, with top_p = 0.9, the model looks at the most probable tokens until their total probability reaches 90%, then ignores everything else.
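Here is a tiny illustrative sketch of that filtering step, using made-up probabilities (not Claude’s internals):
# Hypothetical next-token probabilities, already sorted from most to least likely
token_probs = [("the", 0.50), ("a", 0.25), ("this", 0.15), ("that", 0.07), ("zebra", 0.03)]

def nucleus_filter(sorted_probs, top_p):
    # Keep tokens until their cumulative probability reaches top_p; drop the rest
    nucleus, cumulative = [], 0.0
    for token, probability in sorted_probs:
        nucleus.append(token)
        cumulative += probability
        if cumulative >= top_p:
            break
    return nucleus

print("top_p = 0.9 keeps:", nucleus_filter(token_probs, 0.9))  # ['the', 'a', 'this']
print("top_p = 0.5 keeps:", nucleus_filter(token_probs, 0.5))  # ['the']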
Top-P = 1.0 — Use all tokens (no filtering). This is the default.
Top-P = 0.9 — Filters out rare, unlikely tokens. Responses are more coherent.
Top-P = 0.1 — Only the most probable tokens are kept. Responses are very conservative and predictable.
Temperature vs Top-P:
These two parameters both control randomness, but they do it differently.
You can use them together or separately.
A common setting for creative writing is temperature=0.9, top_p=0.95.
A common setting for factual answers is temperature=0.0 (top-p doesn’t matter much when temperature is 0).
brainstorm_prompt = "Give me one creative analogy that explains what a neural network is."
# Very conservative — only the most common tokens
conservative_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=100,
top_p=0.1,
messages=[
{"role": "user", "content": brainstorm_prompt}
]
)
# Very open — all tokens are in play
open_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=100,
top_p=1.0,
messages=[
{"role": "user", "content": brainstorm_prompt}
]
)
print("--- top_p = 0.1 (conservative nucleus) ---")
print(conservative_response.content[0].text.strip())
print()
print("--- top_p = 1.0 (full nucleus) ---")
print(open_response.content[0].text.strip())
Concept 6 — Top-K: Limiting the Vocabulary¶
Top-K is the simplest way to limit randomness. Instead of working with probabilities at all, it simply says: “Only ever consider the K most likely next tokens.”
Top-K = 1 — Always pick the single most likely token. Same as temperature = 0.
Top-K = 5 — Only consider the 5 most likely tokens. Very constrained.
Top-K = 50 — Consider the 50 most likely tokens. More variety.
Top-K = 250 — A lot of vocabulary to choose from. Can get unusual.
When to use Top-K: Top-K is useful when you want to limit the model to reasonable vocabulary without adjusting temperature. For example, if you are generating product names, you might want Top-K = 40 to stay creative but not generate nonsense words.
Note: You can combine Top-K and Top-P. In practice, whichever filter is more restrictive wins.
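As a toy illustration of that note, again with made-up probabilities rather than Claude’s internals, here are both filters applied in sequence:
# Hypothetical next-token probabilities, sorted from most to least likely
candidates = [("blue", 0.40), ("azure", 0.20), ("grey", 0.15),
              ("cloudy", 0.10), ("teal", 0.08), ("plaid", 0.07)]

# Top-K first: keep only the 4 most likely tokens
pool = candidates[:4]

# Then Top-P: keep tokens until cumulative probability reaches 0.7
kept, cumulative = [], 0.0
for token, probability in pool:
    kept.append(token)
    cumulative += probability
    if cumulative >= 0.7:
        break

print("Tokens still in play:", kept)  # ['blue', 'azure', 'grey']; top-p was the tighter filter here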
topk_prompt = "In one sentence, what color is the sky?"
# Very low top_k — only considers the most likely 3 tokens at each step
low_topk_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=50,
temperature=1.0,
top_k=3,
messages=[
{"role": "user", "content": topk_prompt}
]
)
# High top_k — considers the top 250 tokens at each step
high_topk_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=50,
temperature=1.0,
top_k=250,
messages=[
{"role": "user", "content": topk_prompt}
]
)
print("--- top_k = 3 (very limited vocabulary at each step) ---")
print(low_topk_response.content[0].text.strip())
print()
print("--- top_k = 250 (much wider vocabulary at each step) ---")
print(high_topk_response.content[0].text.strip())
Concept 7 — Conversation History¶
This is one of the most important things to understand about language model APIs:
The model has no memory.
Every API call is completely independent.
The model does not remember the previous call.
It only knows what is in the messages list you send right now.
To give the model memory, you have to manually build up the conversation history. Each time the user says something and the model responds, you add both messages to a list. Then you send that entire list on the next call.
The messages list looks like this:
[
{"role": "user", "content": "What is photosynthesis?"},
{"role": "assistant", "content": "Photosynthesis is the process..."},
{"role": "user", "content": "How does that relate to climate change?"},
]
The third message only makes sense because we included the first two. Without history, the model would not know what “that” refers to.
Let’s build a short multi-turn conversation step by step.
# Start with an empty conversation history
conversation_history = []
study_system_prompt = "You are a friendly study buddy. Keep your answers short and clear."
# --- Turn 1: User asks about photosynthesis ---
first_question = "What is photosynthesis? One sentence only."
conversation_history.append({"role": "user", "content": first_question})
turn1_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=100,
system=study_system_prompt,
messages=conversation_history
)
turn1_answer = turn1_response.content[0].text
conversation_history.append({"role": "assistant", "content": turn1_answer})
print(f"User: {first_question}")
print(f"Claude: {turn1_answer.strip()}")
print(f"(History now has {len(conversation_history)} messages)")
print()
We added the model’s answer to the conversation history. Now we ask a follow-up question that refers back to the first answer. If history is working correctly, Claude will know what we are referring to.
# --- Turn 2: Follow-up using "it" — requires history to make sense ---
second_question = "Where in the plant does it happen?"
conversation_history.append({"role": "user", "content": second_question})
turn2_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=100,
system=study_system_prompt,
messages=conversation_history
)
turn2_answer = turn2_response.content[0].text
conversation_history.append({"role": "assistant", "content": turn2_answer})
print(f"User: {second_question}")
print(f"Claude: {turn2_answer.strip()}")
print(f"(History now has {len(conversation_history)} messages)")
print()
# --- Turn 3: Another follow-up ---
third_question = "Give me one quiz question to test myself on this topic."
conversation_history.append({"role": "user", "content": third_question})
turn3_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=150,
system=study_system_prompt,
messages=conversation_history
)
turn3_answer = turn3_response.content[0].text
conversation_history.append({"role": "assistant", "content": turn3_answer})
print(f"User: {third_question}")
print(f"Claude: {turn3_answer.strip()}")
print(f"(History now has {len(conversation_history)} messages)")The conversation flowed naturally across three turns because we passed the full history each time.
Now let’s see what happens when we forget to include history and send only the follow-up question alone. Claude will not know what “it” refers to.
# What happens without history?
# We send only the follow-up question, with no prior context.
no_history_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=100,
system=study_system_prompt,
messages=[
{"role": "user", "content": "Where in the plant does it happen?"}
]
)
print("Without history, Claude sees only: 'Where in the plant does it happen?'")
print()
print("Claude's answer:")
print(no_history_response.content[0].text.strip())
Without history, Claude guesses what “it” might mean — or asks for clarification. This shows why conversation history is so important in any multi-turn application.
One thing to watch out for: Every message in the history costs tokens. A long conversation can eat up your token budget quickly. As conversations get long, you may need to summarize and trim the history.
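Here is one simple trimming strategy, sketched with the rough 4-characters-per-token estimate from earlier. The trim_history function is a hypothetical helper, not part of the Anthropic SDK, and a real app might summarize old turns instead of dropping them:
def trim_history(history, max_estimated_tokens=1500):
    # Estimate tokens with the ~4 characters per token rule of thumb
    def estimate_tokens(messages):
        return sum(len(message["content"]) // 4 for message in messages)

    trimmed = list(history)
    # Drop the oldest user/assistant pair until the estimate fits the budget,
    # always keeping at least the most recent exchange
    while len(trimmed) > 2 and estimate_tokens(trimmed) > max_estimated_tokens:
        trimmed = trimmed[2:]
    return trimmed

shorter_history = trim_history(conversation_history, max_estimated_tokens=100)
print(f"Kept {len(shorter_history)} of {len(conversation_history)} messages after trimming")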
Putting It All Together: The Full Parameter Set¶
Now that we understand each parameter, let’s see them all used together in a single call. This is what a well-configured API call looks like for our study buddy:
| Parameter | What it does | Our choice |
|---|---|---|
| model | Which Claude version to use | haiku (cheap), sonnet (capable) |
| max_tokens | Maximum length of the response | 300 (a few paragraphs) |
| temperature | How creative/random | 0.3 (mostly factual, slightly natural) |
| top_p | Which tokens are in the sampling pool | 0.95 (shown commented out below) |
| top_k | Max number of tokens to sample from | 50 (shown commented out below) |
| system | Background role for the AI | Study buddy persona |
| messages | The full conversation history | All prior turns |
| stop_sequences | Text that ends generation early | Not used here |
# A fully configured call with all parameters set intentionally
full_config_response = claude_client.messages.create(
model=haiku_model_id,
max_tokens=300,
    temperature=0.3,  # mostly factual, a little natural-sounding
    # top_p=0.95,     # uncomment to trim rare tokens; generally adjust temperature or top_p, not both
    # top_k=50,       # uncomment to cap each step at the 50 most probable tokens
system="You are a friendly, encouraging study buddy helping a college student pass their exams.",
messages=[
{"role": "user", "content": "I'm struggling with recursion. Can you explain it simply?"}
]
)
print("=== Study Buddy Response ===")
print(full_config_response.content[0].text)
print()
print(f"Tokens used — Input: {full_config_response.usage.input_tokens}, Output: {full_config_response.usage.output_tokens}")
# Calculate and show the cost
haiku_price_info = claude_pricing[haiku_model_id]
call_input_cost = (full_config_response.usage.input_tokens / 1_000_000) * haiku_price_info["input_per_million"]
call_output_cost = (full_config_response.usage.output_tokens / 1_000_000) * haiku_price_info["output_per_million"]
total_call_cost = call_input_cost + call_output_cost
print(f"Cost for this single call: ${total_call_cost:.6f}")Cost Comparison: How Much Does This Actually Cost?¶
Let’s calculate the monthly cost for our study buddy under a realistic baseline scenario, then vary the numbers to explore others.
We will use a for loop to go through each Claude model and print the cost.
The formula is:
monthly cost = (sessions/day × 30) × (input_tokens × input_price + output_tokens × output_price) / 1,000,000
# A baseline scenario to compare across models (change the numbers to explore)
scenario_name = "100 sessions/day, 500 input tokens, 400 output tokens each"
daily_sessions = 100
avg_input_tokens_per_session = 500
avg_output_tokens_per_session = 400
days_per_month = 30
monthly_total_sessions = daily_sessions * days_per_month
monthly_total_input_tokens = monthly_total_sessions * avg_input_tokens_per_session
monthly_total_output_tokens = monthly_total_sessions * avg_output_tokens_per_session
print(f"Scenario: {scenario_name}")
print(f"Total sessions per month: {monthly_total_sessions:,}")
print(f"Total input tokens: {monthly_total_input_tokens:,}")
print(f"Total output tokens: {monthly_total_output_tokens:,}")
print()
print(f"{'Model':<30} {'Monthly Cost':>15}")
print("-" * 47)
for model_id in claude_pricing:
model_price_info = claude_pricing[model_id]
monthly_input_cost = (monthly_total_input_tokens / 1_000_000) * model_price_info["input_per_million"]
monthly_output_cost = (monthly_total_output_tokens / 1_000_000) * model_price_info["output_per_million"]
monthly_total_cost = monthly_input_cost + monthly_output_cost
print(f"{model_id:<30} ${monthly_total_cost:>14.2f}")Now try changing the numbers above and re-running the cell.
Some scenarios to explore:
1,000 sessions/day — how does the cost change for each model?
2,000 output tokens per session instead of 400 — how much more expensive is that?
For a classroom with 30 students asking 5 questions each day, what would the monthly Haiku cost be?
Notice how Haiku is dramatically cheaper than Sonnet or Opus. For a study buddy that answers factual questions, Haiku is almost certainly good enough — and much more affordable.
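To make those explorations quicker, here is a small hypothetical helper (not part of any SDK) that wraps the monthly-cost formula. The classroom example assumes the same 500/400 token averages as above:
def estimate_monthly_cost(model_id, sessions_per_day, input_tokens_per_session,
                          output_tokens_per_session, days_per_month=30):
    # monthly cost = sessions × (input × input_price + output × output_price) / 1M
    prices = claude_pricing[model_id]
    total_sessions = sessions_per_day * days_per_month
    input_cost = total_sessions * input_tokens_per_session / 1_000_000 * prices["input_per_million"]
    output_cost = total_sessions * output_tokens_per_session / 1_000_000 * prices["output_per_million"]
    return input_cost + output_cost

# Example: 30 students asking 5 questions each per day, on Haiku
classroom_cost = estimate_monthly_cost("claude-haiku-4-5-20251001",
                                       sessions_per_day=30 * 5,
                                       input_tokens_per_session=500,
                                       output_tokens_per_session=400)
print(f"Classroom scenario on Haiku: ${classroom_cost:.2f} per month")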
Quick Reference: When to Use Which Model and Settings¶
Here is a decision guide based on what you have learned:
Choose Haiku when:
The task is simple: facts, summaries, yes/no, basic Q&A
Cost is a priority
Speed matters (Haiku is also faster)
Choose Sonnet when:
The task requires nuance: analysis, explanation, writing, code
Quality matters more than cost
You tried Haiku and the answers were not good enough
Choose Opus when:
The task is very complex: multi-step reasoning, advanced research, subtle writing
You have a specific benchmark or quality bar that only Opus can reach
Cost is not a major concern
Temperature guide:
0.0 — factual, deterministic, consistent
0.3–0.5 — natural-sounding but mostly accurate
0.7–0.9 — creative, varied
1.0 — the maximum randomness Claude’s API allows; very varied, experimental
Top-P and Top-K:
For most tasks, leave both at their defaults (top_p=1.0, top_k not set)
Lower top_p (e.g., 0.9) when you want to reduce rare/strange word choices
Lower top_k (e.g., 20–50) for very constrained, conservative outputs
Summary: What You Just Learned¶
You have covered a lot of ground in this notebook. Here is what you now know:
Tokens are the basic unit of text for language models — roughly 4 characters or 0.75 words each. Every API call is billed by the token.
Context window is the total amount of text (in tokens) that fits in one conversation. All current Claude models support up to 200,000 tokens, but more history means higher cost.
Model families — Haiku (fast and cheap), Sonnet (balanced), Opus (most capable but expensive). You should always start with the cheapest model and upgrade only when you see a real quality difference.
Temperature controls randomness. Use 0 for consistent factual answers. Use 0.7–1.0 for creative or varied outputs.
Top-P filters the sampling pool by cumulative probability. Values near 1.0 allow all tokens. Lower values (like 0.9) exclude rare, unusual tokens.
Top-K limits sampling to only the K most probable tokens at each step. It is a simpler filter than top-p.
Conversation history is how you give the model memory. The model is stateless — you must send the full history on every call. Each message in the history costs tokens.
You now have all the tools to build a real AI application — and to make smart decisions about cost, quality, and behavior.