
Intro to AI workflows using Small LM from Huggingface

Using llama-cpp-python Package

This notebook is adapted from the Intro to AI workflows using Small LM from Huggingface notebook and demonstrates how to work with small language models using the llama-cpp-python package instead of GPT4All.

What is llama-cpp-python?

llama-cpp-python provides Python bindings for the llama.cpp library, which is a high-performance C++ implementation for running Large Language Models (LLMs) locally. It’s one of the most popular libraries for running GGUF models on consumer hardware.

Key Features of llama-cpp-python:

  1. Efficient Local Inference: Runs models entirely on CPU (with optional GPU acceleration)

  2. GGUF Model Support: Works with quantized models in the GGUF format from Hugging Face

  3. Low Memory Footprint: Supports 4-bit, 8-bit, and other quantization levels

  4. OpenAI-Compatible API: Includes a built-in server that mimics OpenAI’s API

  5. Active Development: Regular updates to support new model architectures

  6. Cross-Platform: Works on Windows, macOS, and Linux

Why use llama-cpp-python over GPT4All?

| Feature | llama-cpp-python | GPT4All |
| --- | --- | --- |
| Model Support | Any GGUF model from Hugging Face | Curated list of ~30 models |
| Updates | Very frequent (follows llama.cpp) | Less frequent |
| API Style | Direct Python + OpenAI-compatible server | Custom Python API |
| Flexibility | More control over inference parameters | Simpler, more abstracted |
| Community | Large, active community | Smaller, focused community |

Fun Fact: GPT4All actually uses llama.cpp as its backend! So this notebook gives you more direct access to the underlying inference engine.

Attribution

This notebook was originally developed from work by Greg Merritt <gmerritt@berkeley.edu> and adapted by Eric Van Dusen. This llama-cpp-python version was created as an alternative implementation.

1. Environment setup

Installing llama-cpp-python

The installation of llama-cpp-python is straightforward but has some important considerations:

  1. CPU-only installation (what we’ll use): pip install llama-cpp-python

  2. GPU acceleration (CUDA): Requires building from source with CUDA support

  3. Metal acceleration (macOS): Automatic on Apple Silicon

Note: The first time you run inference, the model needs to be loaded into memory. This may take a moment depending on the model size and your hardware.

Steps:

  1. Ensure that your Python environment has llama-cpp-python installed

  2. Define the model path where your .gguf model files are stored

  3. Load a model using the Llama class

This notebook assumes that at least one 'Small model' file ending in .gguf has already been downloaded into a directory (see 1-2-HuggingFace_Hub_Download_gguf.ipynb for more).

# Ensure that your Python environment has llama-cpp-python installed
# Note: This uses the installation pattern specified for this notebook
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama

Understanding the Llama Class

The Llama class is the main interface for loading and interacting with GGUF models. Key parameters include:

| Parameter | Description | Default |
| --- | --- | --- |
| model_path | Full path to the .gguf model file | Required |
| n_ctx | Context window size (max tokens the model can see) | 512 |
| n_threads | Number of CPU threads to use | Auto |
| n_gpu_layers | Layers to offload to GPU (0 = CPU only) | 0 |
| verbose | Print loading information | True |
| chat_format | Chat template format (e.g., "chatml", "llama-2") | Auto-detect |

Let’s check out our local filesystem path and whether we have files downloaded

We need to locate where our .gguf model files are stored. Below are examples for different environments.

Approach 1 - if a Shared Hub is being used

# This only worked for FA 25 workshop on Cal ICOR Hub
#!ls /home/jovyan/shared_readwrite
# On Cal-ICOR workshop hub (JupyterCon Nov 2025)
!ls /home/jovyan/shared/ 

Approach 2 - if a local machine is being used

# This is my local path to a directory called shared-rw
!ls shared-rw
# or the full path (this is on my laptop)
!ls /Users/ericvandusen/Documents/GitHub/shared/

1.1 Pick your environment - Local vs Hub - and set the Path

# set the model path parameter depending on where you are computing
model_directory = "/home/jovyan/shared/"
# set the model path parameter depending on where you are computing
#model_directory = "/Users/ericvandusen/Documents/GitHub/shared/"

1.2 Loading the Downloaded Model with llama-cpp-python

In this step, we create a local instance of the model using the Llama class.

Key differences from GPT4All:

  • We provide the full path to the model file (not just the filename)

  • We can specify n_ctx (context window size) directly

  • We have fine-grained control over threading with n_threads

  • The chat_format parameter helps the model understand conversation structure

About the model: qwen2-1_5b-instruct-q4_0.gguf is a 1.5 billion-parameter Qwen2 model that has been quantized to reduce its size and memory usage. The .gguf extension indicates that the model is stored in the GGUF format, which is the standard format for llama.cpp inference.

Note:
Loading the model may take a few seconds. You’ll see verbose output showing the model configuration being loaded.

import os

# Define the model filename
model_name = "qwen2-1_5b-instruct-q4_0.gguf"

# Create the full path to the model
model_path = os.path.join(model_directory, model_name)

# Load the model using llama-cpp-python
# n_ctx: context window size (how many tokens the model can "see" at once)
# n_threads: number of CPU threads (None = auto-detect)
# verbose: whether to print loading information
# chat_format: the chat template format for this model family
model = Llama(
    model_path=model_path,
    n_ctx=2048,           # Context window size
    n_threads=1,          # Use one CPU thread (None = auto-detect)
    n_gpu_layers=999,     # Offload all layers to the GPU if available (0 = CPU only)
    verbose=True,         # Print model loading info
    chat_format="chatml"  # Qwen uses ChatML format
)

print(f"\n✓ Model loaded successfully: {model_name}")

2. Call the model with a simple user message

Using create_chat_completion()

In llama-cpp-python, we use the create_chat_completion() method to generate responses. This method follows the OpenAI Chat Completions API format, making it easy to switch between local models and cloud APIs.

Message Structure:

messages = [
    {"role": "system", "content": "System instructions here"},
    {"role": "user", "content": "User message here"},
    {"role": "assistant", "content": "Previous assistant response (optional)"}
]

This may take a few moments to process.

You may run this multiple times, and will likely get different results. Feel free to change the user_message!

user_message = "Who pays for tariffs on foreign manufactured goods? Consumer or Producer?"  # You can change this prompt

# Create the messages list (OpenAI-compatible format)
messages = [
    {"role": "user", "content": user_message}
]

# Generate a response using create_chat_completion
response = model.create_chat_completion(
    messages=messages
)

# Extract and print the response
print("Response:")
print(response["choices"][0]["message"]["content"])

Understanding the Response Object

The create_chat_completion() method returns a dictionary that follows the OpenAI API response format:

{
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "model_name",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The actual response text..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 50,
        "total_tokens": 60
    }
}

To get the text content, we access: response["choices"][0]["message"]["content"]
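The snippet below is a small helper (our own, not part of llama-cpp-python) that pulls the reply text, finish reason, and token usage out of a response dictionary shaped like the one above:

```python
# Helper for unpacking an OpenAI-style chat-completion response dictionary.
def extract_reply(response):
    """Return (text, finish_reason, total_tokens) from a response dict."""
    choice = response["choices"][0]
    text = choice["message"]["content"]
    finish = choice.get("finish_reason")
    tokens = response.get("usage", {}).get("total_tokens")
    return text, finish, tokens

# Example with a response shaped like the one shown above
sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Tariffs raise prices."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 50, "total_tokens": 60},
}
text, finish, tokens = extract_reply(sample)
print(text, finish, tokens)
```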

3. Passing additional arguments to control generation

The create_chat_completion() method accepts many parameters to control the generation:

| Parameter | Description | Default |
| --- | --- | --- |
| messages | List of message dictionaries | Required |
| max_tokens | Maximum number of tokens to generate | 16 |
| temperature | Controls randomness (0 = deterministic, higher = more random) | 0.8 |
| top_p | Nucleus sampling (consider tokens within top_p cumulative probability) | 0.95 |
| top_k | Only consider the top_k most likely tokens | 40 |
| repeat_penalty | Penalize repeated tokens (1.0 = no penalty) | 1.1 |
| stream | If True, returns a generator for streaming responses | False |

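As a toy illustration (ours, not llama.cpp's actual sampler code), here is roughly what top_k and top_p do to the next-token distribution before a token is sampled:

```python
# Toy sketch of top_k and top_p filtering on a next-token distribution.
def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return kept

probs = {"prices": 0.5, "tariffs": 0.25, "trade": 0.15, "banana": 0.07, "zebra": 0.03}
print(top_k_filter(probs, 2))    # the two most likely tokens
print(top_p_filter(probs, 0.85)) # tokens covering 85% of the probability mass
```

The model then samples only from the filtered set, which is why low top_k or top_p values make output more conservative.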

3a. Using the max_tokens argument to cap the length of the response

Generation will stop abruptly once it reaches the maximum number of tokens, even if the response is mid-sentence.

response_size_limit_in_tokens = 60  # You can change this parameter

user_message = "What is the economic outcome of tariffs on foreign manufactured goods?"

messages = [
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

3b. The temperature argument

LLMs generate one token (“word”) at a time as they complete the text. At each step, there’s a probability distribution over possible next tokens. The temperature parameter controls how this distribution is sampled:

  • temperature = 0 (“cold”): Always picks the most likely token → deterministic output

  • temperature = 0.5-0.7 (“warm”): Balanced creativity and coherence

  • temperature = 1.0 (“hot”): High variety but may be less coherent

  • temperature > 1.0 (“very hot”): Very random, often incoherent
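Under the hood, temperature rescales the model's token scores (logits) before they are turned into probabilities. Here is a toy sketch of that rescaling (a temperature of exactly 0 is treated separately as greedy argmax, since dividing by zero is undefined):

```python
import math

# Toy sketch: lower temperature sharpens the distribution toward the top token,
# higher temperature flattens it toward uniform randomness.
def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.1))  # near-certain top token ("cold")
print(softmax_with_temperature(logits, 1.0))  # the model's raw distribution
print(softmax_with_temperature(logits, 2.0))  # flatter, more random ("hot")
```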

Let’s run the same prompt three times with temperature = 0; we expect identical outputs:

response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.0  # You can change this parameter

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

Let’s repeat with a slightly “hotter” setting of temperature = 0.25; we expect outputs to begin diverging:

response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.25

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

A “very hot” setting of temperature = 1 will result in high variety but may lead to less satisfactory responses:

response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = 1

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

4. Include a hidden “system message” at the start of the conversation

A system message sets the context and personality for the assistant. It’s placed at the beginning of the messages list and influences how the model responds to all subsequent user messages.

In llama-cpp-python, we simply include a message with "role": "system" as the first item in the messages list:

messages = [
    {"role": "system", "content": "Your system instructions here..."},
    {"role": "user", "content": "User's question"}
]

Note: System messages are never guaranteed to remain secret; models can sometimes be prompted to reveal their instructions.

response_size_limit_in_tokens = 100

system_message = """
You are a hard working economics student at UC Berkeley. 
You think that there may be some truth to the things you learn in economics classes.
You wish that the people in the government understood economics.
You think that memes and poems and pop songs are a good way to communicate
Answer in rap lyrics always
"""

user_message = "How will tariffs affect inflation"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

5. “Few-shot” learning: include conversation history to set the tone

Few-shot learning involves providing example prompt/response pairs to establish a pattern. The model will statistically tend to follow this pattern when generating new responses.

In llama-cpp-python, this is elegantly handled by simply including multiple user/assistant message pairs before the actual user query:

messages = [
    {"role": "system", "content": "System prompt"},
    {"role": "user", "content": "Example question 1"},
    {"role": "assistant", "content": "Example answer 1"},
    {"role": "user", "content": "Example question 2"},
    {"role": "assistant", "content": "Example answer 2"},
    {"role": "user", "content": "Actual user question"}  # Real question
]

The model learns the response style from the examples and applies it to the new question.

5a. A “Few-shot” example

In this example, we establish a pattern of concise, informative responses about economics. The model should follow this established conversational style.

Note: We’re using the native message format which llama-cpp-python automatically converts to the appropriate chat template (ChatML for Qwen).

response_size_limit_in_tokens = 200

# System message sets the overall behavior
system_message = """
You are an economics tutor with a focus on international trade.
Answer concisely and clearly, using accessible language.
"""

# Few-shot examples establish the response pattern
messages = [
    {"role": "system", "content": system_message},
    # Example 1
    {"role": "user", "content": "What is a tariff?"},
    {"role": "assistant", "content": "A tariff is a tax imposed by a government on imported goods, often used to protect domestic industries."},
    # Example 2
    {"role": "user", "content": "How do tariffs affect consumer prices?"},
    {"role": "assistant", "content": "Tariffs typically raise the price of imported goods, making them more expensive for consumers."},
    # Example 3
    {"role": "user", "content": "Can tariffs backfire?"},
    {"role": "assistant", "content": "Yes, they can lead to trade wars, hurt exporters, and reduce overall economic efficiency."},
    # Example 4
    {"role": "user", "content": "How do other countries respond to tariffs?"},
    {"role": "assistant", "content": "They often retaliate with their own tariffs, targeting key export sectors."},
    # The actual user question
    {"role": "user", "content": "What is an example of a real-world tariff dispute?"}
]

# Generate response
response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens,
    temperature=0.8
)

print("Response:")
print(response["choices"][0]["message"]["content"])

5b. The importance of proper chat formatting

One advantage of llama-cpp-python is that it automatically handles chat templating when you set the chat_format parameter (we set it to "chatml" for Qwen models).

Behind the scenes, your messages are converted to special tokens like:

<|im_start|>system
You are an economics tutor...
<|im_end|>
<|im_start|>user
What is a tariff?
<|im_end|>
<|im_start|>assistant

If you used the wrong format (or no format), the model might:

  • Continue writing the script rather than responding as an assistant

  • Produce incoherent output

  • Not follow the conversation structure

The chat_format parameter ensures proper formatting automatically!
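To see roughly what this template looks like as a string, here is a hand-rolled sketch of ChatML formatting (llama-cpp-python does this for you internally; exact template details can vary by model family):

```python
# Sketch: build a ChatML prompt string by hand, mirroring what
# chat_format="chatml" does behind the scenes.
def to_chatml(messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are an economics tutor."},
    {"role": "user", "content": "What is a tariff?"},
]
print(to_chatml(messages))
```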

5c. A note about “hallucinations”

It’s popular to use the word “hallucinations” to talk about model output that is very different from what we wanted, or when the output does not seem to make sense.

However, an LLM does not perceive; it merely predicts the most likely next token based on patterns in its training data. The term “hallucination” can be misleading because:

  1. The model isn’t failing — it’s doing exactly what it’s designed to do

  2. It’s a statistical process — sometimes low-probability completions happen

  3. Training data limitations — the model can only draw from what it learned

Understanding this helps us have realistic expectations and design better prompts.

5d. Building a chatbot application

If you wanted to build an extended conversation experience, you would:

  1. Maintain a message history list that grows with each exchange

  2. Append new user messages to the history

  3. Append assistant responses to the history after generation

  4. Pass the entire history to each new call

# Example chatbot loop structure
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    
    response = model.create_chat_completion(messages=messages)
    assistant_message = response["choices"][0]["message"]["content"]
    
    messages.append({"role": "assistant", "content": assistant_message})
    print(f"Assistant: {assistant_message}")

Important: The LLM itself has no “memory” — it’s your application that stores and manages the conversation history. Each call processes the entire conversation from the beginning.
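Because each call re-processes the full history, a long conversation can eventually exceed n_ctx. One common workaround is to trim the oldest exchanges while keeping the system message. The sketch below uses a rough word count as a stand-in for real tokenization:

```python
# Sketch: trim old exchanges so the history fits a context budget.
# A word count is only a rough proxy for the model's actual token count.
def trim_history(messages, max_words=1500):
    """Keep the system message plus the most recent messages under the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = []
    words = sum(len(m["content"].split()) for m in system)
    for msg in reversed(rest):              # walk backwards from the newest message
        words += len(msg["content"].split())
        if words > max_words:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```

In the chatbot loop above, you would call trim_history(messages) before each create_chat_completion call.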

6. Bonus: Streaming responses

llama-cpp-python supports streaming, which lets you see tokens as they’re generated (like ChatGPT). This is useful for:

  • Better user experience (immediate feedback)

  • Long responses (no waiting for complete generation)

  • Real-time applications

Set stream=True to enable streaming:

user_message = "Explain the concept of comparative advantage in international trade."

messages = [
    {"role": "user", "content": user_message}
]

print("Response (streaming):")

# With streaming, we get a generator that yields chunks
stream = model.create_chat_completion(
    messages=messages,
    max_tokens=150,
    stream=True  # Enable streaming
)

# Print each chunk as it arrives
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

print()  # New line at the end
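If you also need the complete reply (for example, to append it to a chatbot's message history), you can accumulate the streamed deltas into one string. The fake_stream generator below just mimics the chunk shape for illustration; in real use you would pass the generator returned by create_chat_completion(..., stream=True):

```python
# Collect streamed deltas into a full reply string.
def collect_stream(stream):
    pieces = []
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

# Stand-in generator mimicking llama-cpp-python's streaming chunk shape
def fake_stream():
    yield {"choices": [{"delta": {"role": "assistant"}}]}  # first chunk: role only
    for word in ["Comparative ", "advantage ", "matters."]:
        yield {"choices": [{"delta": {"content": word}}]}

print(collect_stream(fake_stream()))
```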

Summary

In this notebook, you learned how to:

  1. Install and import llama-cpp-python

  2. Load a GGUF model using the Llama class

  3. Generate responses using create_chat_completion()

  4. Control generation with parameters like max_tokens, temperature, top_p

  5. Use system messages to set assistant behavior

  6. Implement few-shot learning with example conversations

  7. Stream responses for real-time output

Key Advantages of llama-cpp-python:

  • OpenAI-compatible API — easy to switch between local and cloud models

  • Any GGUF model — not limited to a curated list

  • Active development — regular updates for new model architectures

  • Fine-grained control — access to low-level inference parameters

Next Steps:

  • Try different GGUF models from Hugging Face

  • Experiment with different temperature and sampling settings

  • Build a simple chatbot application

  • Explore GPU acceleration for faster inference