Using GPT4All Python package¶
This notebook uses the GPT4All Python package and was taught in Spring 2025 in Econ 148, run on the UC Berkeley DataHub. Here it is adapted for a more general case: local or hub-type deployment.
There are many ways to access small models in a Pythonic way. GPT4All happens to be one package that has been optimized for lightweight deployments and seems to work well on JupyterHub deployments.
Resources¶
GPT4All Github Repo
GPT4All Documentation
Behind the scenes, GPT4All works with the llama.cpp backend.
Attribution¶
Notebook originally developed by Greg Merritt <gmerritt@berkeley
1. Environment setup¶
Ensure that your python environment has gpt4all capability
Define the “model” object to which this notebook’s code will send conversations & prompts
This notebook assumes that at least one ‘small model’ file ending in
.gguf has already been downloaded into a directory (see GPT4All_Download_gguf.ipynb for more).
Do not worry about “Failed to load libllamamodel-mainline-cuda...” errors; this happens when the environment, like ours, does not have GPU support.
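Before the shell-based directory listings used later, a quick standard-library check can confirm that a .gguf file is actually present. This is a sketch (the helper name is ours, not part of GPT4All); point it at whatever path holds your models:

```python
from pathlib import Path

def list_gguf_models(directory):
    """Return the names of any .gguf model files found in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.gguf"))

# Point this at your model directory, e.g. "/home/jovyan/shared/" on a hub
print(list_gguf_models("."))
```

If the printed list is empty, no model has been downloaded into that directory yet.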
# Ensure that your python environment has gpt4all capability
try:
    from gpt4all import GPT4All
except ImportError:
    %pip install gpt4all
    from gpt4all import GPT4All
Let’s check out our local filesystem path and whether we have files downloaded¶
To do this I have two sets of code, one for each environment, with the unused lines commented out with #.
Approach 1 - if a Shared Hub is being used¶
# This only worked for FA 25 workshop on Cal ICOR Hub
#!ls /home/jovyan/shared_readwrite
# On Cal-ICOR workshop hub (JupyterCon Nov 2025)
!ls /home/jovyan/shared/
Approach 2 - if a local machine is being used¶
#This is my local path to a directory called shared-rw
!ls shared-rw
# or the full path (this is on my laptop)
!ls /Users/ericvandusen/Documents/GitHub/SmallLM-SP25/shared-rw
1.1 Pick your environment - Local vs Hub - and set the Path¶
# set the model path parameter depending on where you are computing
path="/home/jovyan/shared/"
#path="/Users/ericvandusen/Documents/GitHub/SmallLM-SP25/shared-rw"
1.2 Loading the Downloaded GPT4All Model¶
In this step, we create a local instance of the GPT4All model that we’ve already downloaded.
The GPT4All class loads the quantized .gguf model file into memory and prepares it for inference.
This allows us to run the model entirely offline, using only CPU resources.
We specify:
model_name – the filename of the model (e.g., qwen2-1_5b-instruct-q4_0.gguf). Below, qwen2-1_5b-instruct-q4_0.gguf is a 1.5 billion-parameter Qwen2 model that has been quantized to reduce its size and memory usage. The .gguf extension indicates that the model is stored in the GGUF format, compatible with local inference frameworks like llama.cpp and GPT4All.
model_path – the directory path where that model file is stored
verbose=True – to print detailed information during loading
Note:
If you see a pink or red error box after running this cell, don’t worry — it’s not a failure.
It simply indicates that your system does not have a GPU or CUDA configured.
GPT4All will automatically switch to CPU inference, which will work fine for our purposes.
# This calls in the model that we have downloaded already
model = GPT4All(
model_name="qwen2-1_5b-instruct-q4_0.gguf",
model_path=path,
verbose=True
)
# If you see a pink error box, do not worry - we don't have a GPU and CUDA set up
2. Call the model with a GPT4All chat session containing a simple user message¶
This code pretends that a person submitted a message (prompt) to your application; your application then takes this user_message and passes it to the LLM model for response generation. The response is printed.
This may take a few moments to process.
You may run this multiple times and will likely get different results. Also feel free to replace user_message with a prompt of your own!
user_message = "Who pays for tariffs on foreign manufactured goods? Consumer or Producer?" # You can change this prompt
with model.chat_session():
    print("Response:")
    response = model.generate(
        prompt=user_message
    )
    print(f"{response}")
3. Passing additional arguments to the chat session model call¶
We can pass more than just a prompt to the GPT4All chat-session model call. The complete list is shown here:
prompt: The prompt for the model to complete.
max_tokens: The maximum number of tokens to generate.
temp: The model temperature. Larger values increase creativity but decrease factuality.
top_k: Randomly sample from the top_k most likely tokens at each generation step. Set this to 1 for greedy decoding.
top_p: Randomly sample at each generation step from the top most likely tokens whose probabilities add up to top_p.
min_p: Randomly sample at each generation step from the top most likely tokens whose probabilities are at least min_p.
repeat_penalty: Penalize the model for repetition. Higher values result in less repetition.
repeat_last_n: How far in the model's generation history to apply the repeat penalty.
n_batch: Number of prompt tokens processed in parallel. Larger values decrease latency but increase resource requirements.
n_predict: Equivalent to max_tokens, exists for backwards compatibility.
streaming: If True, this method will instead return a generator that yields tokens as the model generates them.
callback: A function with arguments token_id:int and response:str, which receives the tokens from the model as they are generated and stops the generation by returning False.
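Most of these arguments appear in the sections that follow; streaming does not, so here is a minimal sketch of how a streamed response could be consumed. The collect_stream helper and the stand-in token list are illustrative additions, not part of GPT4All; with a real model you would pass streaming=True to model.generate, which then returns a generator of tokens:

```python
def collect_stream(token_iter, max_chars=200):
    """Join streamed tokens into one string, stopping once max_chars is reached."""
    pieces, total = [], 0
    for tok in token_iter:
        pieces.append(tok)
        total += len(tok)
        if total >= max_chars:
            break
    return "".join(pieces)

# With a real model, inside a chat session:
# stream = model.generate(prompt=user_message, streaming=True)
# print(collect_stream(stream))

# Stand-in token generator so the helper can be exercised without a model
demo_tokens = iter(["Tariffs ", "raise ", "import ", "prices."])
print(collect_stream(demo_tokens))  # Tariffs raise import prices.
```

Streaming lets an application display partial output as it arrives, rather than waiting for the full generation to finish.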
3a. Using the max_tokens argument to cap the length of the response¶
A GPT4All chat completion will stop generating words (tokens) abruptly once it has produced (at most) the specified maximum number of tokens assigned to the optional max_tokens parameter. The response may cut off mid-sentence, even if the response is not finished.
response_size_limit_in_tokens = 60 # You can change this parameter
user_message = "What is the ecomomic outcome of tariffs on foreign manufactured goods?"
with model.chat_session():
    print("Response:")
    response = model.generate(
        prompt=user_message,
        max_tokens=response_size_limit_in_tokens
    )
    print(f"{response}")
3b. The temperature argument¶
LLMs generate one token (“word”) at a time as they complete the chat you give them. As the LLM completes the chat, there is a single statistically most-likely token to “come next” at each step. However, a model will generally also have additional -- but less-likely -- tokens as candidate alternatives at each step. Which should it choose?
The value of the temperature argument will affect the likelihood that the model may randomly generate a less-probable token at each chat completion step.
A temperature of 0 -- “cold,” if you like -- will constrain the model to always pick the most-likely token (“word”) at each chat completion step.
Let’s run the same chat completion three times, but with temp = 0; we expect that each of the three runs will give precisely the same output, choosing the model’s most-statistically-likely next token at each step of the generation:
response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.0 # You can change this parameter
user_message = "How will tariffs affect the prices of foreign manufactured goods"
for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt=user_message,
            max_tokens=response_size_limit_in_tokens,
            temp=temperature
        )
        print(f"{response}\n")
Let’s repeat that, but with a slightly “hotter” temperature of temp = 0.25; we expect the outputs to begin to diverge from one another:
response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = .25
user_message = "How will tariffs affect elections"
for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt=user_message,
            max_tokens=response_size_limit_in_tokens,
            temp=temperature
        )
        print(f"{response}\n")
A “very hot” temperature of temp = 1 will result in a high variety of responses, but may lead to “very unlikely” responses that may be less satisfactory:
response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = 1
user_message = "How will tariffs affect elections"
for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt=user_message,
            max_tokens=response_size_limit_in_tokens,
            temp=temperature
        )
        print(f"{response}\n")
4. Include a hidden “system message” at the start of the conversation, before the user prompt¶
If chatbots were thinking entities, we developers might like to give them “instructions” regarding what we want them to do for users. However, chatbots just call LLMs to advance a conversation.
A “system message” is often thought of as instructions given to a chatbot. Functionally, it serves as a “conversation starter” to which the LLM does not respond directly; it is effectively “prepended” to the first user prompt in the conversation.
So, when you set a system message in your application, every conversation that your chatbot app gives to the LLM for advancing a conversation always has this “system message” quietly inserted at the very beginning of the conversation -- whether the user likes it or not!
Note that these “system messages” are never guaranteed to remain secret, no matter how cleverly you may try to craft them; models can be prompted to reveal the contents of their system message.
response_size_limit_in_tokens = 100
system_message = """
You are a hard working economics student at UC Berkeley.
You think that there may be some truth to the things you learn in economics classes.
You wish that the people in the government understood economics.
You think that memes and poems and pop songs are a good way to communicate
Answer in rap lyrics always
"""
user_message = "How will tariffs affect inflation "
with model.chat_session(system_prompt=system_message):
    print("Response:")
    response = model.generate(
        prompt=user_message,
        max_tokens=response_size_limit_in_tokens
    )
    print(f"{response}")
5. “Few-shot” learning: include a pre-made conversation history to set the tone of subsequent response generations¶
Another way to guide a language model is to provide a “few shots”: a sequence of sample prompt/response (or user/assistant) dialogue pairs that establish a pattern to the conversation; our model will statistically tend to follow the established conversation pattern when it responds to a new prompt from a user.
The “few-shot” label is commonly used for this technique, but, in truth, this is simply a “pre-loaded” initial conversation in which both sample prompts and sample responses were written beforehand by the developer; when the real user engages in a new conversation via your application, they do not know that their first prompt is appended by your application to this hidden, pre-written conversation.
5a. A “Few-shot” example¶
In this example, we include such a fake conversation history, intended to help set the tone of responses. This conversation history consists of pairs of prompts/responses (user:/assistant:), but the user: lines were not written by a user, and the assistant: lines were not generated by the LLM! These were drafted by the developer, and are included to establish a baseline conversational style.
Here the developer made some choices about how the cat should respond to questions. The sample responses are brief, and each contains a word or two at the end that describes some kind of ~expression~ of the imaginary cat. Hopefully the next response generated will fit this pattern -- although this is never guaranteed!
Note 1: response_size_limit_in_tokens has been set to 200, but we’ll hope that the model follows the conversational history example and keeps responses brief.
Note 2: We use a template appropriate to the model being used (qwen2.5) to give semantic structure to the conversation; more on this in the example to follow.
# qwen2.5 template
prompt_template = """
<|im_start|>user
{0}
<|im_end|>
<|im_start|>assistant
{1}
<|im_end|>
"""
# Define the system message and chat history
system_message = """
You are an economics tutor with a focus on international trade.
Answer concisely and clearly, using accessible language.
"""
chat_history = [
{"role": "user", "content": "What is a tariff?"},
{"role": "assistant", "content": "A tariff is a tax imposed by a government on imported goods, often used to protect domestic industries."},
{"role": "user", "content": "How do tariffs affect consumer prices?"},
{"role": "assistant", "content": "Tariffs typically raise the price of imported goods, making them more expensive for consumers."},
{"role": "user", "content": "Can tariffs backfire?"},
{"role": "assistant", "content": "Yes, they can lead to trade wars, hurt exporters, and reduce overall economic efficiency."},
{"role": "user", "content": "How do other countries respond to tariffs?"},
{"role": "assistant", "content": "They often retaliate with their own tariffs, targeting key export sectors."}
]
new_user_message = "What is an example of a real-world tariff dispute?"
# Append the new user message
chat_history.append({"role": "user", "content": new_user_message})
# Format the conversation history
formatted_prompt = ""
for message in chat_history:
    formatted_prompt += f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
print(f"Formatted prompt:\n{formatted_prompt}")
# Combine with model session
with model.chat_session(system_prompt=system_message, prompt_template=prompt_template):
    print("Response:")
    response = model.generate(
        prompt=formatted_prompt,
        max_tokens=response_size_limit_in_tokens,
        temp=0.8
    )
    # Output the assistant's reply
    print(response)
5b. Why we need to conform to the model’s conversation template: a counter-example¶
Above, we wrapped the conversation history elements in tags according to the template syntax published with this model. Different models use different template syntax. (Some model-running frameworks & supporting SDKs help abstract this away, so you may not have to worry about it too much in some applications.)
What if we make a bogus, over-simplified template that just packages the full user: and assistant: conversation history into one big lump? It’s as if the user’s initial prompt was one single blob of text, a scripted dialogue, without any special distinctions of the elements to indicate to the model that they are conversation history prompt/response pairs.
When we give an LLM this blob of a script, it may try to simply continue the script, like a playwright writing a continuing dialogue between two actors, rather than take the role of the “assistant” and “speak the next line” of the dialogue! (Run several times to get varied results.)
Note: The way we lump the history into one blob is to give a bogus template ({0}) that serves to lump the full conversation history into one element that appears to be one single user prompt. The prompt value is exactly the same as the proper templated example above, but we give the model different parsing instructions via this reductive template!
# Bogus, over-simplified template (not the real qwen2.5 template)
prompt_template = "{0}"
# Define the system message and chat history
system_message = """
You are an economics tutor who specializes in international trade.
Keep answers concise and informative. Provide real-world context when possible.
"""
chat_history = [
{"role": "user", "content": "What is a tariff?"},
{"role": "assistant", "content": "A tariff is a tax on imported goods, usually used to protect domestic industries or raise government revenue."},
{"role": "user", "content": "How do tariffs impact consumers?"},
{"role": "assistant", "content": "They usually raise prices on imported goods, which can lead to higher costs for consumers."},
{"role": "user", "content": "Why do countries use tariffs?"},
{"role": "assistant", "content": "To shield domestic producers from foreign competition, or as leverage in trade negotiations."},
{"role": "user", "content": "Do tariffs always work?"},
{"role": "assistant", "content": "Not always. They can provoke retaliation, distort markets, and reduce overall trade efficiency."}
]
new_user_message = "Can you give an example of a recent tariff conflict?"
# Append the new user message to the chat history
chat_history.append({"role": "user", "content": new_user_message})
# Format the conversation history for the model
formatted_prompt = ""
for message in chat_history:
    formatted_prompt += f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
print(f"Formatted prompt:\n{formatted_prompt}")
# Combine the system prompt and history
with model.chat_session(system_prompt=system_message, prompt_template=prompt_template):
    # Generate the assistant's response
    print("Response:")
    response = model.generate(
        prompt=formatted_prompt,
        max_tokens=200,
        temp=0.8
    )
    # Print the final response
    print(response)
5c. A note about “hallucinations”¶
It’s popular to use the word “hallucinations” to talk about model output that is very different from what we wanted, or when the output does not seem to make sense.
However, an LLM does not perceive; it merely continues a conversation. Can it literally hallucinate?
In such situations, the model is not crashing or failing or broken or sending errors; it is working exactly as it’s designed to work.
What’s certain about such situations is that there is a disconnect between a model’s output and our hopes / expectations for its output. The more we can understand about models’ behaviors, the less we may be surprised by their output, even if that output is not what we were hoping the model would generate.
Model responses to 5b. are likely something that nobody would ever want. However, the model is working as designed.
5d. Can you imagine how you might code a chatbot application?¶
If you wanted to develop an application that provided the user with an extended conversation experience, your application would capture the history of user prompts and model responses; for every new user prompt, your application would bundle the (growing) conversation history in precisely the way done above for the “few-shot” example. The pieces and the syntax are the same, but the history of prompts & responses would be dynamically generated by your app’s user and the LLM, and the conversation history would be managed by your application.
This is important: the LLM itself has no “memory” and can never store a conversation. It takes an application to store and manage conversations. In many contemporary examples, each new user input to an extended-conversation chatbot app results in a wholesale from-the-beginning processing of the historical conversation. There are frameworks that let your app cache the “tokenized” version of your conversation history, so that the LLM does not have to freshly encode the history with each subsequent prompt, but these are not ubiquitous.
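The bookkeeping described above can be sketched with a couple of helpers. These are hypothetical names of our own, reusing the qwen2 ChatML tags from section 5; generate_fn stands in for a call to model.generate inside a chat session:

```python
def format_chatml(history):
    """Render a role-tagged history with the <|im_start|>/<|im_end|> tags used above."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in history
    )

def chat_turn(history, user_text, generate_fn):
    """Append the user's message, generate a reply from the full history, store it."""
    history.append({"role": "user", "content": user_text})
    reply = generate_fn(format_chatml(history))
    history.append({"role": "assistant", "content": reply})
    return reply

# Stand-in for model.generate; a real app would call the LLM here
stub_model = lambda prompt: f"(reply to a {len(prompt)}-char history)"
history = []
chat_turn(history, "What is a tariff?", stub_model)
chat_turn(history, "Who ends up paying it?", stub_model)
print(len(history))  # 4: two user turns plus two stored assistant replies
```

Note how every turn re-formats the whole (growing) history: this is the from-the-beginning reprocessing described above, and it is the application, not the LLM, that holds the memory.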