Talking to an Inference Server: vLLM and the NRP API

Learning Objective: By the end of this notebook, you will be able to explain what an inference server is, describe what vLLM does, connect to the NRP managed LLM endpoint, list available models, send chat requests, and explain the concepts of a harness and fine-tuning.

What You Will Do¶

Learn what an inference server is and why it matters.
Learn what vLLM is and how it powers fast model serving.
Connect to the NRP (National Research Platform) managed LLM endpoint.
See which models are available on NRP.
Send chat requests to an NRP model.
Explore the tools that surround a language model: system prompts, temperature, and the context window.
Understand what a harness is.
Get a first look at what fine-tuning means.

# os — reads environment variables like API keys from the operating system
import os

# openai — the OpenAI Python client; we use it here to talk to the NRP endpoint, which speaks the same language
try:
    from openai import OpenAI
except ImportError:
    !pip install openai
    from openai import OpenAI

# dotenv — loads secrets from a .env file so we never paste API keys into notebooks
try:
    from dotenv import load_dotenv
except ImportError:
    !pip install python-dotenv
    from dotenv import load_dotenv

Part 1 — What Is an Inference Server?¶

When people talk about “running” a language model, they mean two very different things.

Training is when a model learns from data. Training a large model can take weeks on thousands of GPUs and costs millions of dollars. You do not train a model every time you use it.

Inference is when a trained model reads your prompt and generates a reply. This is what happens every time you send a message to ChatGPT or Claude. Inference is much faster than training, but it still requires a GPU.

An inference server is a program that:

Loads a trained model into GPU memory.
Listens for requests from users (or from your Python code).
Runs the model on each request and sends back the response.
Handles many requests at the same time efficiently.

Think of it like a restaurant kitchen. The chef (the model) is always ready. When an order (a prompt) comes in, the kitchen processes it and sends back a dish (a response). The inference server is the whole kitchen system — it keeps the chef busy, manages the queue of orders, and makes sure nothing is wasted.

Without an inference server, you would need to load the model yourself every time you wanted to use it. That could take minutes just to start.

Part 2 — What Is vLLM?¶

vLLM is one of the most popular open-source inference servers for large language models. The “v” stands for virtual, referring to a memory management technique called PagedAttention that vLLM invented.

Here is the problem vLLM solves.

When a model generates a response, it stores a large amount of temporary data in GPU memory. This data is called the KV cache (Key-Value cache). If you have 50 users talking to the model at the same time, 50 KV caches need to fit in memory. Traditional systems waste space by reserving a fixed block of memory for each user, even if their conversation is short.

vLLM’s PagedAttention borrows an idea from operating systems. Instead of one big reserved block, it divides memory into small pages and allocates them on demand — just like how your computer manages RAM for running programs. The result is that vLLM can serve far more users simultaneously on the same hardware.

Why does this matter for you?

Your class can all hit the same endpoint at once without crashing it.
The API you use looks identical to the OpenAI API, so code you write here works on OpenAI too (and vice versa).
vLLM supports essentially every popular open-source model: Llama, Qwen, Mistral, Phi, and more.

The architecture looks like this:

Your Python Code
      │
      │  HTTP request  (same format as OpenAI)
      ▼
  vLLM Server  ──►  GPU  ──►  Model generates tokens
      │
      │  HTTP response  (text + token counts)
      ▼
Your Python Code

Because vLLM speaks the OpenAI protocol, you can use the openai Python library to talk to it — just by pointing it at a different URL.

Part 3 — The NRP: A Free Research Computing Platform¶

The NRP (National Research Platform) is a network of computers funded by the U.S. National Science Foundation. It is free to use for education and research.

The NRP runs a managed LLM service at: https://llm.nrp-nautilus.io

This service:

Is powered by vLLM.
Hosts several open-source models.
Uses the OpenAI-compatible API format.
Requires an NRP API key (free for academic users).

Documentation: https://nrp.ai/documentation/userdocs/ai/llm-managed/

Because the NRP uses the same API format as OpenAI, connecting to it is almost identical to connecting to OpenAI — you just change the base_url and api_key.

Part 4 — Load Your NRP API Key¶

Your instructor has stored the NRP API key in a shared .env file.

A .env file is a plain text file with lines like:

NRP_API_KEY="your-key-here"

The load_dotenv function reads that file and puts the key into memory. We never paste the key directly into the notebook, because notebooks often get shared or pushed to GitHub by accident.

# Load the API key from the shared .env file
env_file_path = "/home/jovyan/shared/.env"
load_dotenv(env_file_path)

nrp_api_key = os.getenv('NRP_API_KEY')
print("NRP API Key loaded:", "✅ Ready" if nrp_api_key else "❌ Not found — check your .env file")

If the key was not found, you can paste it here instead. Never commit this to GitHub.

# Uncomment the line below and paste your key if load_dotenv did not find it
# nrp_api_key = "your-nrp-key-here"

Part 5 — Connect to the NRP Endpoint¶

We use the standard openai Python library, but we point it at the NRP server instead of OpenAI’s servers.

Two things change compared to a normal OpenAI connection:

base_url — the address of the NRP vLLM server
api_key — your NRP key instead of an OpenAI key

Everything else — how you send messages, how you read responses — is identical.

# The NRP base URL — this is the address of the vLLM inference server
nrp_base_url = "https://llm.nrp-nautilus.io/v1"

# Create a client that talks to the NRP endpoint
nrp_client = OpenAI(
    base_url=nrp_base_url,
    api_key=nrp_api_key
)

print("NRP client created. Endpoint:", nrp_base_url)

Part 6 — What Models Are Available?¶

One of the first things to do when connecting to any LLM API is ask: “What models can I use?”

The client.models.list() method returns all model identifiers the server offers. Each model has a different size, capability, and speed.

The NRP hosts several open-source models. Here is a short guide to the ones you are likely to see:

Model name (short)	Full family	Strengths
Llama (Meta)	Meta-Llama-3.x	Strong general reasoning, widely used in research
Qwen (Alibaba)	Qwen2.5 / Qwen3	Excellent on math and code
Mistral	Mistral-7B	Efficient, good for quick tasks
gpt-oss	Various	Open-source models served under the NRP

All of these are open-source — their weights are publicly available, anyone can download and study them.

# Ask the NRP server which models are available
available_models = nrp_client.models.list()

print("Models available on NRP:")
print()

for model_entry in available_models.data:
    print(" •", model_entry.id)

The models listed above are the ones you can use right now. The list changes as the NRP team adds or retires models.

For the rest of this notebook, we will use one of these models. Update the cell below to use whichever model name appeared in the list above.

# Pick a model from the list above — replace this string with one of the names printed above
# A Llama or Qwen 8B model is a good starting point: fast and capable
chosen_model = available_models.data[0].id

print("Using model:", chosen_model)

Part 7 — Your First Chat Request¶

Let’s send a simple message to the model and read the response.

A chat request has two required parts:

model — which model to use
messages — a list of conversation turns

Each message in the list has a role and content:

"system" — background instructions that tell the model how to behave
"user" — the message from the person talking to the model
"assistant" — a previous reply from the model (used to continue a conversation)

# Send a simple chat message to the NRP model
first_response = nrp_client.chat.completions.create(
    model=chosen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant for a university data science course."},
        {"role": "user",   "content": "In two sentences, what is a large language model?"}
    ]
)

# Pull the text out of the response object
first_answer = first_response.choices[0].message.content

print(first_answer)
print()
print("Tokens used — prompt:", first_response.usage.prompt_tokens,
      "  completion:", first_response.usage.completion_tokens)

Notice the token counts printed at the bottom. Every request consumes tokens. Prompt tokens are the tokens you sent. Completion tokens are the tokens the model generated. Together they determine cost or quota usage.

Now let’s try the same question with a different system message and see how the response changes.

# Same question, different system prompt — the model's tone changes
second_response = nrp_client.chat.completions.create(
    model=chosen_model,
    messages=[
        {"role": "system", "content": "You explain things as if speaking to a curious 10-year-old."},
        {"role": "user",   "content": "In two sentences, what is a large language model?"}
    ]
)

second_answer = second_response.choices[0].message.content

print("=== Friendly explanation ===")
print(second_answer)

The model gave a different answer — not because the question changed, but because the system prompt changed.

The system prompt is one of the most powerful tools around a language model. It lets you shape the model’s persona, tone, language level, and even its values, all without changing any model weights.

Part 8 — Tools Around a Language Model¶

A language model by itself is just a function: you give it text, it gives you text back. The interesting work happens in the tools that surround it.

Here are the main tools:

1. The System Prompt¶

A hidden instruction at the start of every conversation. It sets the model’s role, restrictions, and style. You saw this above. The user never sees it directly.

2. Temperature¶

A number (usually 0 to 2) that controls how predictable the model’s word choices are.

Low temperature (0.0) → the model picks the most likely next word every time. Same answer every run.
High temperature (1.0+) → the model sometimes picks less likely words. More creative, more varied.

3. The Context Window¶

The model can only see a limited amount of text at once. This limit is called the context window. It is measured in tokens. Most modern models have windows of 8,000 to 128,000 tokens. Once the window fills up, old messages fall off the start.

4. Max Tokens¶

The max_tokens parameter caps how long the model’s reply can be. Setting it low forces short answers. Setting it high allows long essays.

5. Stop Sequences¶

A list of strings that tell the model to stop generating as soon as it writes one of them. Useful for structured outputs.

Let’s experiment with temperature to see the effect.

# Ask the same creative question three times with low temperature
creative_prompt = "Give me a one-sentence metaphor for how a neural network learns."

print("=== Low temperature (0.0) — deterministic ===")
for attempt_number in range(1, 4):
    low_temp_response = nrp_client.chat.completions.create(
        model=chosen_model,
        max_tokens=60,
        temperature=0.0,
        messages=[{"role": "user", "content": creative_prompt}]
    )
    print(f"Run {attempt_number}:", low_temp_response.choices[0].message.content)

With temperature = 0, the three runs should produce identical or nearly identical answers.

Now let’s raise the temperature and watch the answers diverge.

# Same question three times with high temperature
print("=== High temperature (1.0) — creative ===")
for attempt_number in range(1, 4):
    high_temp_response = nrp_client.chat.completions.create(
        model=chosen_model,
        max_tokens=60,
        temperature=1.0,
        messages=[{"role": "user", "content": creative_prompt}]
    )
    print(f"Run {attempt_number}:", high_temp_response.choices[0].message.content)

At temperature = 1.0 you should see three different metaphors. All are valid — the model is just sampling from a wider range of possibilities.

When to use low temperature: factual questions, JSON extraction, code generation — any time you need consistency.

When to use high temperature: creative writing, brainstorming, generating diverse options.

Part 9 — What Is a Harness?¶

A harness is a wrapper — a piece of code that connects a language model to a specific task.

Think of a horse harness. On its own, a horse is powerful but unfocused. The harness connects the horse to a plow, a carriage, or a wagon — giving that power a specific direction.

A model harness does the same thing. It takes the general-purpose language model and wraps it in logic that:

Formats the input — builds the prompt from structured data (e.g., a question from a test bank).
Calls the model — sends the formatted prompt to the API.
Parses the output — extracts a structured answer from the model’s text (e.g., finds the letter A, B, C, or D).
Scores or logs — compares the model’s answer to the correct answer and records the result.

The word “harness” is most common in evaluation (testing how well a model performs on a benchmark). For example, the lm-evaluation-harness library from EleutherAI is the standard tool researchers use to benchmark open-source models.

Let’s build a tiny harness by hand so you can see what one looks like.

# A tiny evaluation harness — three multiple-choice questions about AI

eval_questions = [
    {
        "question": "What does the temperature parameter control in a language model?",
        "choices": ["A) The speed of inference", "B) The randomness of the output", "C) The number of layers", "D) The size of the vocabulary"],
        "correct": "B"
    },
    {
        "question": "What is a token?",
        "choices": ["A) A word", "B) A character", "C) A chunk of text that may be a word, part of a word, or punctuation", "D) A sentence"],
        "correct": "C"
    },
    {
        "question": "What does vLLM do?",
        "choices": ["A) Trains language models from scratch", "B) Serves language models efficiently using PagedAttention", "C) Translates text between languages", "D) Generates images from text"],
        "correct": "B"
    }
]

Now we write the harness loop. For each question, the harness:

Builds a prompt that presents the question and the choices.
Calls the model with a low temperature (we want consistent answers for evaluation).
Checks whether the model’s response contains the correct letter.
Records the result.

# Track results
correct_count = 0
total_count = len(eval_questions)

# System prompt tells the model exactly how to respond
eval_system_prompt = (
    "You are an exam-taking assistant. "
    "Read the question and the answer choices. "
    "Reply with only the letter of the correct answer (A, B, C, or D). "
    "Do not write anything else."
)

# Loop over each question
for question_index in range(total_count):
    question_data = eval_questions[question_index]

    # Step 1: Build the prompt
    choices_text = "\n".join(question_data["choices"])
    user_message = question_data["question"] + "\n" + choices_text

    # Step 2: Call the model
    eval_response = nrp_client.chat.completions.create(
        model=chosen_model,
        max_tokens=5,
        temperature=0.0,
        messages=[
            {"role": "system", "content": eval_system_prompt},
            {"role": "user",   "content": user_message}
        ]
    )

    # Step 3: Parse the output
    model_answer = eval_response.choices[0].message.content.strip().upper()

    # Step 4: Score
    is_correct = question_data["correct"] in model_answer
    if is_correct:
        correct_count = correct_count + 1
        result_label = "✅ Correct"
    else:
        result_label = "❌ Wrong (expected " + question_data["correct"] + ")"

    print(f"Q{question_index + 1}: Model answered '{model_answer}' — {result_label}")

print()
print(f"Score: {correct_count} / {total_count}")

That loop — format input, call model, parse output, score — is the skeleton of every evaluation harness.

The lm-evaluation-harness library used by researchers does the same thing, just for thousands of questions and dozens of standardized benchmarks. It is how the community produces the leaderboards you see on sites like the Open LLM Leaderboard.

Part 10 — A Multi-Turn Conversation¶

So far, every request has been a single exchange. But real conversations go back and forth.

Language models are stateless — they have no memory between API calls. You give them memory by passing the entire conversation history with every request.

The messages list grows with each turn:

Turn 1 →  [system, user_1]
Turn 2 →  [system, user_1, assistant_1, user_2]
Turn 3 →  [system, user_1, assistant_1, user_2, assistant_2, user_3]

Let’s build a three-turn conversation by hand.

# Start with just the system message
conversation_history = [
    {"role": "system", "content": "You are a concise AI tutor. Keep every answer to two sentences."}
]

# Turn 1 — user asks about inference servers
conversation_history.append({"role": "user", "content": "What is an inference server?"})

turn1_response = nrp_client.chat.completions.create(
    model=chosen_model,
    max_tokens=80,
    messages=conversation_history
)
turn1_answer = turn1_response.choices[0].message.content
conversation_history.append({"role": "assistant", "content": turn1_answer})

print("User: What is an inference server?")
print("Model:", turn1_answer)
print()

The model answered. Now we add a follow-up question. Notice we pass the same conversation_history list — it already contains the previous exchange.

# Turn 2 — follow-up question (model needs the previous context to understand "it")
conversation_history.append({"role": "user", "content": "How does vLLM make it more efficient?"})

turn2_response = nrp_client.chat.completions.create(
    model=chosen_model,
    max_tokens=80,
    messages=conversation_history
)
turn2_answer = turn2_response.choices[0].message.content
conversation_history.append({"role": "assistant", "content": turn2_answer})

print("User: How does vLLM make it more efficient?")
print("Model:", turn2_answer)
print()
print("Total messages in history:", len(conversation_history))

Part 11 — What Is Fine-Tuning?¶

So far we have used models exactly as they were released by their creators. But real-world applications often need a model to do something more specific — answer only about your company’s products, write in a particular style, or follow a strict output format.

Fine-tuning is the process of taking a pre-trained model and continuing its training on a smaller, task-specific dataset.

Here is the big picture:

Pre-training (done once, very expensive)
  Trillions of tokens from the internet → Base model weights

Fine-tuning (done by you, much cheaper)
  Thousands of curated examples → Fine-tuned model weights

Why fine-tune?¶

Situation	Better approach
You need the model to answer questions about a specific document	RAG — Retrieval-Augmented Generation (no training needed)
You need the model to always respond in a specific JSON format	Fine-tuning on JSON-formatted examples
You need the model to match a writing style exactly	Fine-tuning on examples of that style
You need the model to follow safe, polite guidelines	RLHF — Reinforcement Learning from Human Feedback (a special kind of fine-tuning)

Three types of fine-tuning¶

1. Supervised Fine-Tuning (SFT): You provide input–output pairs. The model learns to match your outputs. Example: pairs of customer service questions and ideal answers.

2. RLHF (Reinforcement Learning from Human Feedback): Human raters compare two model outputs and say which one is better. The model learns to produce outputs like the preferred ones. This is how ChatGPT was made helpful and safe.

3. LoRA (Low-Rank Adaptation): Instead of updating all model weights (which requires a lot of GPU memory), LoRA adds a tiny number of new parameters and only updates those. The original weights stay frozen. This makes fine-tuning possible even on modest hardware.

What fine-tuning is NOT¶

Fine-tuning does not give the model new knowledge (new facts it did not see during pre-training). For that, use RAG. Fine-tuning changes how the model responds, not what it knows.

Seeing Fine-Tuning’s Effect Without Actually Fine-Tuning¶

We can simulate the difference between a base model and a fine-tuned model using the system prompt and temperature — not actual weight changes, but enough to understand the concept.

Imagine we “fine-tuned” a model on formal academic writing. We would expect it to always cite sources, use passive voice, and avoid contractions.

Below we compare two prompts to the same model: one that mimics a base model, one that mimics a fine-tuned model.

# Simulating a "base" model — no special instructions
base_response = nrp_client.chat.completions.create(
    model=chosen_model,
    max_tokens=100,
    messages=[
        {"role": "user", "content": "Explain why the sky is blue."}
    ]
)

print("=== Base model (no system prompt) ===")
print(base_response.choices[0].message.content)

Now the same question, but with a system prompt that mimics what a fine-tuned “academic writing” model might do.

# Simulating a "fine-tuned" model — system prompt enforces a specific style
finetuned_response = nrp_client.chat.completions.create(
    model=chosen_model,
    max_tokens=100,
    messages=[
        {
            "role": "system",
            "content": (
                "You are a scientific writing assistant trained on physics textbooks. "
                "Always use formal academic language. "
                "Refer to physical phenomena by their technical names. "
                "Do not use contractions."
            )
        },
        {"role": "user", "content": "Explain why the sky is blue."}
    ]
)

print("=== Fine-tuned style (academic system prompt) ===")
print(finetuned_response.choices[0].message.content)

The two responses use different vocabulary, structure, and tone — all from the same base model.

A real fine-tuned model would apply this difference automatically without needing a system prompt at all. The style would be baked directly into the weights. That is the key practical difference.

When should you use a system prompt vs. fine-tuning?

System prompt — fast, free, easy to change, works well for most cases.
Fine-tuning — when you need 100% consistency, want to reduce prompt length, or have thousands of labeled examples ready.

Part 12 — Comparing Two NRP Models Side by Side¶

One advantage of the NRP hosting multiple models is that you can compare them on the same task with minimal code changes.

Let’s pick the first two models from the available list and ask them the same reasoning question.

# Gather the model IDs from the list we fetched earlier
all_model_ids = []
for model_entry in available_models.data:
    all_model_ids.append(model_entry.id)

print("All available model IDs:")
for model_id in all_model_ids:
    print(" •", model_id)

Now send the same reasoning problem to the first two models and compare their answers.

# Use the first two models for the comparison
# If there is only one model available, both comparisons will use the same one
model_a_id = all_model_ids[0]
model_b_id = all_model_ids[1] if len(all_model_ids) > 1 else all_model_ids[0]

comparison_question = (
    "A train leaves City A at 9:00 AM traveling at 60 mph. "
    "A second train leaves City B at 10:00 AM traveling toward City A at 80 mph. "
    "The cities are 280 miles apart. At what time do the trains meet?"
)

# Ask Model A
response_a = nrp_client.chat.completions.create(
    model=model_a_id,
    max_tokens=200,
    temperature=0.0,
    messages=[{"role": "user", "content": comparison_question}]
)

# Ask Model B
response_b = nrp_client.chat.completions.create(
    model=model_b_id,
    max_tokens=200,
    temperature=0.0,
    messages=[{"role": "user", "content": comparison_question}]
)

print(f"=== {model_a_id} ===")
print(response_a.choices[0].message.content)
print()
print(f"=== {model_b_id} ===")
print(response_b.choices[0].message.content)

Read the two answers carefully.

Did both models show their work?
Did both arrive at the same answer? (The correct answer is 12:00 PM noon.)
Was one explanation clearer than the other?

Different open-source models have different strengths on math and reasoning tasks. This kind of side-by-side test — called an A/B evaluation — is the simplest way to pick the right model for a job.

Summary¶

In this notebook you:

Learned what an inference server is — a program that loads a model and handles requests from users, like a kitchen that processes orders.
Learned what vLLM is — an open-source inference server that uses PagedAttention to serve many users simultaneously on the same GPU hardware.
Connected to the NRP — the National Research Platform’s managed LLM endpoint, which is powered by vLLM and speaks the same protocol as OpenAI.
Listed and compared NRP models — open-source models including Llama, Qwen, Mistral, and others.
Explored tools around a language model — the system prompt, temperature, context window, and max tokens, and how each one shapes model behavior.
Built a tiny evaluation harness — a loop that formats a question, calls the model, parses the letter answer, and checks it against the correct answer. This is the skeleton of every LLM benchmark.
Simulated a multi-turn conversation — by passing the growing messages list with every call.
Understood fine-tuning — continuing training on a small task-specific dataset to change how a model responds, without giving it new factual knowledge. Three approaches: SFT, RLHF, and LoRA.