
Intro to AI workflows using Small LM from Huggingface

Using llama-cpp-python Package

This notebook is adapted from the Intro to AI workflows using Small LM from Huggingface notebook and demonstrates how to work with small language models using the llama-cpp-python package instead of GPT4All.

What is llama-cpp-python?

llama-cpp-python provides Python bindings for the llama.cpp library, which is a high-performance C++ implementation for running Large Language Models (LLMs) locally. It’s one of the most popular libraries for running GGUF models on consumer hardware.

Key Features of llama-cpp-python:

  1. Efficient Local Inference: Runs models entirely on CPU (with optional GPU acceleration)

  2. GGUF Model Support: Works with quantized models in the GGUF format from Hugging Face

  3. Low Memory Footprint: Supports 4-bit, 8-bit, and other quantization levels

  4. OpenAI-Compatible API: Includes a built-in server that mimics OpenAI’s API

  5. Active Development: Regular updates to support new model architectures

  6. Cross-Platform: Works on Windows, macOS, and Linux

Why use llama-cpp-python over GPT4All?

| Feature | llama-cpp-python | GPT4All |
| --- | --- | --- |
| Model Support | Any GGUF model from Hugging Face | Curated list of ~30 models |
| Updates | Very frequent (follows llama.cpp) | Less frequent |
| API Style | Direct Python + OpenAI-compatible server | Custom Python API |
| Flexibility | More control over inference parameters | Simpler, more abstracted |
| Community | Large, active community | Smaller, focused community |

Fun Fact: GPT4All actually uses llama.cpp as its backend! So this notebook gives you more direct access to the underlying inference engine.

Attribution

This notebook was originally developed from work by Greg Merritt <gmerritt@berkeley.edu> and adapted by Eric Van Dusen. This llama-cpp-python version was created as an alternative implementation.

1. Environment setup

Installing llama-cpp-python

The installation of llama-cpp-python is straightforward but has some important considerations:

  1. CPU-only installation (what we’ll use): pip install llama-cpp-python

  2. GPU acceleration (CUDA): Requires building from source with CUDA support

  3. Metal acceleration (macOS): Automatic on Apple Silicon

Note: The first time you run inference, the model needs to be loaded into memory. This may take a moment depending on the model size and your hardware.

Steps:

  1. Ensure that your Python environment has llama-cpp-python installed

  2. Define the model path where your .gguf model files are stored

  3. Load a model using the Llama class

This notebook assumes that at least one 'Small model' file ending in .gguf has already been downloaded into a directory (see 1-2-HuggingFace_Hub_Download_gguf.ipynb for more).

# Ensure that your Python environment has llama-cpp-python installed
# Note: This uses the installation pattern specified for this notebook
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama

Understanding the Llama Class

The Llama class is the main interface for loading and interacting with GGUF models. Key parameters include:

| Parameter | Description | Default |
| --- | --- | --- |
| model_path | Full path to the .gguf model file | Required |
| n_ctx | Context window size (max tokens the model can see) | 512 |
| n_threads | Number of CPU threads to use | Auto |
| n_gpu_layers | Layers to offload to GPU (0 = CPU only) | 0 |
| verbose | Print loading information | True |
| chat_format | Chat template format (e.g., "chatml", "llama-2") | Auto-detect |

Let’s check out our local filesystem path and whether we have files downloaded

We need to locate where our .gguf model files are stored. Below are examples for different environments.

Approach 1 - if a Shared Hub is being used

# This only worked for FA 25 workshop on Cal ICOR Hub
#!ls /home/jovyan/shared_readwrite
# On Cal-ICOR workshop hub (JupyterCon Nov 2025)
!ls /home/jovyan/shared/ 

Approach 2 - if a local machine is being used

# This is my local path to a directory called shared-rw
!ls shared-rw
# or the full path (this is on my laptop)
!ls /Users/ericvandusen/Documents/GitHub/shared/

1.1 Pick your environment - Local vs Hub - and set the Path

# set the model path parameter depending on where you are computing
model_directory = "/home/jovyan/shared/"
# set the model path parameter depending on where you are computing
#model_directory = "/Users/ericvandusen/Documents/GitHub/shared/"

1.2 Loading the Downloaded Model with llama-cpp-python

In this step, we create a local instance of the model using the Llama class.

Key differences from GPT4All:

  • We provide the full path to the model file (not just the filename)

  • We can specify n_ctx (context window size) directly

  • We have fine-grained control over threading with n_threads

  • The chat_format parameter helps the model understand conversation structure

About the model: qwen2-1_5b-instruct-q4_0.gguf is a 1.5 billion-parameter Qwen2 model that has been quantized to reduce its size and memory usage. The .gguf extension indicates that the model is stored in the GGUF format, which is the standard format for llama.cpp inference.

Note:
Loading the model may take a few seconds. You’ll see verbose output showing the model configuration being loaded.

import os

# Define the model filename
model_name = "qwen2-1_5b-instruct-q4_0.gguf"

# Create the full path to the model
model_path = os.path.join(model_directory, model_name)

# Load the model using llama-cpp-python
# n_ctx: context window size (how many tokens the model can "see" at once)
# n_threads: number of CPU threads (None = auto-detect)
# verbose: whether to print loading information
# chat_format: the chat template format for this model family
model = Llama(
    model_path=model_path,
    n_ctx=2048,           # Context window size
    n_threads=1,          # Use one CPU thread (None = auto-detect)
    n_gpu_layers=999,     # Offload all layers to the GPU if available (0 = CPU only)
    verbose=True,         # Print model loading info
    chat_format="chatml"  # Qwen uses ChatML format
)

print(f"\n✓ Model loaded successfully: {model_name}")

2. Call the model with a simple user message

Using create_chat_completion()

In llama-cpp-python, we use the create_chat_completion() method to generate responses. This method follows the OpenAI Chat Completions API format, making it easy to switch between local models and cloud APIs.

Message Structure:

messages = [
    {"role": "system", "content": "System instructions here"},
    {"role": "user", "content": "User message here"},
    {"role": "assistant", "content": "Previous assistant response (optional)"}
]

This may take a few moments to process.

You may run this multiple times, and will likely get different results. Feel free to change the user_message!

user_message = "Who pays for tariffs on foreign manufactured goods? Consumer or Producer?"  # You can change this prompt

# Create the messages list (OpenAI-compatible format)
messages = [
    {"role": "user", "content": user_message}
]

# Generate a response using create_chat_completion
response = model.create_chat_completion(
    messages=messages
)

# Extract and print the response
print("Response:")
print(response["choices"][0]["message"]["content"])

Understanding the Response Object

The create_chat_completion() method returns a dictionary that follows the OpenAI API response format:

{
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "model_name",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The actual response text..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 50,
        "total_tokens": 60
    }
}

To get the text content, we access: response["choices"][0]["message"]["content"]
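The snippet below is a small helper (our own, not part of llama-cpp-python) that pulls the reply text, finish reason, and token usage out of a response dictionary shaped like the one above:

```python
# Helper for unpacking an OpenAI-style chat-completion response dictionary.
def extract_reply(response):
    """Return (text, finish_reason, total_tokens) from a response dict."""
    choice = response["choices"][0]
    text = choice["message"]["content"]
    finish = choice.get("finish_reason")
    tokens = response.get("usage", {}).get("total_tokens")
    return text, finish, tokens

# Example with a response shaped like the one shown above
sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Tariffs raise prices."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 50, "total_tokens": 60},
}
text, finish, tokens = extract_reply(sample)
print(text, finish, tokens)
```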

3. Passing additional arguments to control generation

The create_chat_completion() method accepts many parameters to control the generation:

| Parameter | Description | Default |
| --- | --- | --- |
| messages | List of message dictionaries | Required |
| max_tokens | Maximum number of tokens to generate | 16 |
| temperature | Controls randomness (0 = deterministic, higher = more random) | 0.8 |
| top_p | Nucleus sampling (consider tokens within top_p cumulative probability) | 0.95 |
| top_k | Only consider the top_k most likely tokens | 40 |
| repeat_penalty | Penalize repeated tokens (1.0 = no penalty) | 1.1 |
| stream | If True, returns a generator for streaming responses | False |

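As a toy illustration (ours, not llama.cpp's actual sampler code), here is roughly what top_k and top_p do to the next-token distribution before a token is sampled:

```python
# Toy sketch of top_k and top_p filtering on a next-token distribution.
def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return kept

probs = {"prices": 0.5, "tariffs": 0.25, "trade": 0.15, "banana": 0.07, "zebra": 0.03}
print(top_k_filter(probs, 2))    # the two most likely tokens
print(top_p_filter(probs, 0.85)) # tokens covering 85% of the probability mass
```

The model then samples only from the filtered set, which is why low top_k or top_p values make output more conservative.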

3a. Using the max_tokens argument to cap the length of the response

Generation will stop abruptly once it reaches the maximum number of tokens, even if the response is mid-sentence.

response_size_limit_in_tokens = 60  # You can change this parameter

user_message = "What is the economic outcome of tariffs on foreign manufactured goods?"

messages = [
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

3b. The temperature argument

LLMs generate one token (“word”) at a time as they complete the text. At each step, there’s a probability distribution over possible next tokens. The temperature parameter controls how this distribution is sampled:

  • temperature = 0 (“cold”): Always picks the most likely token → deterministic output

  • temperature = 0.5-0.7 (“warm”): Balanced creativity and coherence

  • temperature = 1.0 (“hot”): High variety but may be less coherent

  • temperature > 1.0 (“very hot”): Very random, often incoherent
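Under the hood, temperature rescales the model's token scores (logits) before they are turned into probabilities. Here is a toy sketch of that rescaling (a temperature of exactly 0 is treated separately as greedy argmax, since dividing by zero is undefined):

```python
import math

# Toy sketch: lower temperature sharpens the distribution toward the top token,
# higher temperature flattens it toward uniform randomness.
def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.1))  # near-certain top token ("cold")
print(softmax_with_temperature(logits, 1.0))  # the model's raw distribution
print(softmax_with_temperature(logits, 2.0))  # flatter, more random ("hot")
```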

Let’s run the same prompt three times with temperature = 0; we expect identical outputs:

response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.0  # You can change this parameter

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

Let’s repeat with a slightly “hotter” setting of temperature = 0.25; we expect outputs to begin diverging:

response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.25

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

A “very hot” setting of temperature = 1 will result in high variety but may lead to less satisfactory responses:

response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = 1

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

4. Include a hidden “system message” at the start of the conversation

A system message sets the context and personality for the assistant. It’s placed at the beginning of the messages list and influences how the model responds to all subsequent user messages.

In llama-cpp-python, we simply include a message with "role": "system" as the first item in the messages list:

messages = [
    {"role": "system", "content": "Your system instructions here..."},
    {"role": "user", "content": "User's question"}
]

Note: System messages are never guaranteed to remain secret; models can sometimes be prompted to reveal their instructions.

response_size_limit_in_tokens = 100

system_message = """
You are a hard working economics student at UC Berkeley. 
You think that there may be some truth to the things you learn in economics classes.
You wish that the people in the government understood economics.
You think that memes and poems and pop songs are a good way to communicate
Answer in rap lyrics always
"""

user_message = "How will tariffs affect inflation"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

5. “Few-shot” learning: include conversation history to set the tone

Few-shot learning involves providing example prompt/response pairs to establish a pattern. The model will statistically tend to follow this pattern when generating new responses.

In llama-cpp-python, this is elegantly handled by simply including multiple user/assistant message pairs before the actual user query:

messages = [
    {"role": "system", "content": "System prompt"},
    {"role": "user", "content": "Example question 1"},
    {"role": "assistant", "content": "Example answer 1"},
    {"role": "user", "content": "Example question 2"},
    {"role": "assistant", "content": "Example answer 2"},
    {"role": "user", "content": "Actual user question"}  # Real question
]

The model learns the response style from the examples and applies it to the new question.

5a. A “Few-shot” example

In this example, we establish a pattern of concise, informative responses about economics. The model should follow this established conversational style.

Note: We’re using the native message format which llama-cpp-python automatically converts to the appropriate chat template (ChatML for Qwen).

response_size_limit_in_tokens = 200

# System message sets the overall behavior
system_message = """
You are an economics tutor with a focus on international trade.
Answer concisely and clearly, using accessible language.
"""

# Few-shot examples establish the response pattern
messages = [
    {"role": "system", "content": system_message},
    # Example 1
    {"role": "user", "content": "What is a tariff?"},
    {"role": "assistant", "content": "A tariff is a tax imposed by a government on imported goods, often used to protect domestic industries."},
    # Example 2
    {"role": "user", "content": "How do tariffs affect consumer prices?"},
    {"role": "assistant", "content": "Tariffs typically raise the price of imported goods, making them more expensive for consumers."},
    # Example 3
    {"role": "user", "content": "Can tariffs backfire?"},
    {"role": "assistant", "content": "Yes, they can lead to trade wars, hurt exporters, and reduce overall economic efficiency."},
    # Example 4
    {"role": "user", "content": "How do other countries respond to tariffs?"},
    {"role": "assistant", "content": "They often retaliate with their own tariffs, targeting key export sectors."},
    # The actual user question
    {"role": "user", "content": "What is an example of a real-world tariff dispute?"}
]

# Generate response
response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens,
    temperature=0.8
)

print("Response:")
print(response["choices"][0]["message"]["content"])

5b. The importance of proper chat formatting

One advantage of llama-cpp-python is that it automatically handles chat templating when you set the chat_format parameter (we set it to "chatml" for Qwen models).

Behind the scenes, your messages are converted to special tokens like:

<|im_start|>system
You are an economics tutor...
<|im_end|>
<|im_start|>user
What is a tariff?
<|im_end|>
<|im_start|>assistant

If you used the wrong format (or no format), the model might:

  • Continue writing the script rather than responding as an assistant

  • Produce incoherent output

  • Not follow the conversation structure

The chat_format parameter ensures proper formatting automatically!
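To see roughly what this template looks like as a string, here is a hand-rolled sketch of ChatML formatting (llama-cpp-python does this for you internally; exact template details can vary by model family):

```python
# Sketch: build a ChatML prompt string by hand, mirroring what
# chat_format="chatml" does behind the scenes.
def to_chatml(messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are an economics tutor."},
    {"role": "user", "content": "What is a tariff?"},
]
print(to_chatml(messages))
```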

5c. A note about “hallucinations”

It’s popular to use the word “hallucinations” to talk about model output that is very different from what we wanted, or when the output does not seem to make sense.

However, an LLM does not perceive; it merely predicts the most likely next token based on patterns in its training data. The term “hallucination” can be misleading because:

  1. The model isn’t failing — it’s doing exactly what it’s designed to do

  2. It’s a statistical process — sometimes low-probability completions happen

  3. Training data limitations — the model can only draw from what it learned

Understanding this helps us have realistic expectations and design better prompts.

5d. Building a chatbot application

If you wanted to build an extended conversation experience, you would:

  1. Maintain a message history list that grows with each exchange

  2. Append new user messages to the history

  3. Append assistant responses to the history after generation

  4. Pass the entire history to each new call

# Example chatbot loop structure
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    
    response = model.create_chat_completion(messages=messages)
    assistant_message = response["choices"][0]["message"]["content"]
    
    messages.append({"role": "assistant", "content": assistant_message})
    print(f"Assistant: {assistant_message}")

Important: The LLM itself has no “memory” — it’s your application that stores and manages the conversation history. Each call processes the entire conversation from the beginning.
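Because each call re-processes the full history, a long conversation can eventually exceed n_ctx. One common workaround is to trim the oldest exchanges while keeping the system message. The sketch below uses a rough word count as a stand-in for real tokenization:

```python
# Sketch: trim old exchanges so the history fits a context budget.
# A word count is only a rough proxy for the model's actual token count.
def trim_history(messages, max_words=1500):
    """Keep the system message plus the most recent messages under the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = []
    words = sum(len(m["content"].split()) for m in system)
    for msg in reversed(rest):              # walk backwards from the newest message
        words += len(msg["content"].split())
        if words > max_words:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```

In the chatbot loop above, you would call trim_history(messages) before each create_chat_completion call.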

6. Bonus: Streaming responses

llama-cpp-python supports streaming, which lets you see tokens as they’re generated (like ChatGPT). This is useful for:

  • Better user experience (immediate feedback)

  • Long responses (no waiting for complete generation)

  • Real-time applications

Set stream=True to enable streaming:

user_message = "Explain the concept of comparative advantage in international trade."

messages = [
    {"role": "user", "content": user_message}
]

print("Response (streaming):")

# With streaming, we get a generator that yields chunks
stream = model.create_chat_completion(
    messages=messages,
    max_tokens=150,
    stream=True  # Enable streaming
)

# Print each chunk as it arrives
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

print()  # New line at the end
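If you also need the complete reply (for example, to append it to a chatbot's message history), you can accumulate the streamed deltas into one string. The fake_stream generator below just mimics the chunk shape for illustration; in real use you would pass the generator returned by create_chat_completion(..., stream=True):

```python
# Collect streamed deltas into a full reply string.
def collect_stream(stream):
    pieces = []
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

# Stand-in generator mimicking llama-cpp-python's streaming chunk shape
def fake_stream():
    yield {"choices": [{"delta": {"role": "assistant"}}]}  # first chunk: role only
    for word in ["Comparative ", "advantage ", "matters."]:
        yield {"choices": [{"delta": {"content": word}}]}

print(collect_stream(fake_stream()))
```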

Summary

In this notebook, you learned how to:

  1. Install and import llama-cpp-python

  2. Load a GGUF model using the Llama class

  3. Generate responses using create_chat_completion()

  4. Control generation with parameters like max_tokens, temperature, top_p

  5. Use system messages to set assistant behavior

  6. Implement few-shot learning with example conversations

  7. Stream responses for real-time output

Key Advantages of llama-cpp-python:

  • OpenAI-compatible API — easy to switch between local and cloud models

  • Any GGUF model — not limited to a curated list

  • Active development — regular updates for new model architectures

  • Fine-grained control — access to low-level inference parameters

Next Steps:

  • Try different GGUF models from Hugging Face

  • Experiment with different temperature and sampling settings

  • Build a simple chatbot application

  • Explore GPU acceleration for faster inference