!! This notebook is not meant to be run, just to explain the setup !!¶
Teaching LLM workflow by using open source models and Hugging Face Hub¶
The Hugging Face Hub is a platform that hosts thousands of pre-trained models, datasets, and demos. It’s the go-to source for downloading quantized GGUF models that can run efficiently on CPU.
This information is current as of the writing of this notebook in December 2025, and the landscape is changing rapidly.
Why Hugging Face Hub?¶
Vast model selection: Access to thousands of GGUF models, not limited to a curated subset
Standard format: All models use the GGUF format compatible with llama.cpp and llama-cpp-python
Configurable caching: Control where models are downloaded and stored
Simple Python API: easy-to-use hf_hub_download function
Shared Filesystem¶
In the setup where I was teaching, I used this notebook to download models from Hugging Face into a shared-readwrite folder that students could access on JupyterHub. This was possible because the JupyterHub I used for teaching had a shared folder system.
Your use case may vary. It could look like...
Shared read-write directory on JupyterHub
Each student downloads their own models
Download models to local machine
# Ensure that your python environment has the huggingface_hub package installed.
try:
    from huggingface_hub import hf_hub_download
except ImportError:
    %pip install huggingface_hub
    from huggingface_hub import hf_hub_download
Which model to download¶
In the use case for teaching on a JupyterHub with a CPU, I was looking for small models:
~1-2 billion parameters
Quantized (weights are stored in blocks of 4-bit integers, each block with a scale factor and zero point; see the sketch below)
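To make that concrete, here is a minimal illustrative sketch of block quantization in NumPy. This is a toy example, not the actual GGUF Q4_K_M layout, which uses a more elaborate block structure:
import numpy as np
# Toy asymmetric 4-bit quantization of one block of 32 weights
weights = np.random.randn(32).astype(np.float32)
scale = (weights.max() - weights.min()) / 15   # 4 bits -> integer levels 0..15
zero_point = weights.min()
codes = np.round((weights - zero_point) / scale).astype(np.uint8)
dequantized = codes * scale + zero_point       # approximate reconstruction
print("max abs error:", np.abs(weights - dequantized).max())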
You can explore the world of models at the Hugging Face Model List.
You can explore rankings based on benchmark metrics at llm-explorer.
When searching for GGUF models, look for these quantization suffixes (a small helper sketch follows this list):
Q4_K_M or Q4_0: 4-bit quantization, good balance of size and quality
Q5_K_M: Slightly larger but better quality
Q8_0: 8-bit quantization, best quality but larger file size
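If you are unsure which quantizations a repo actually offers, a small helper like the sketch below can list the .gguf files and pick a preferred suffix. It uses list_repo_files from huggingface_hub; the preference order is just an assumption, and pick_gguf is a hypothetical name:
from huggingface_hub import list_repo_files

def pick_gguf(repo_id, preferences=("Q4_K_M", "Q4_0", "Q5_K_M", "Q8_0")):
    # Return the first GGUF file whose name matches our preference order
    gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
    for pref in preferences:
        for f in gguf_files:
            if pref.lower() in f.lower():
                return f
    return None

print(pick_gguf("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"))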
Recommended small models for teaching (Fall 2025)¶
| Model | Repo ID | Filename | Size |
|---|---|---|---|
| TinyLlama 1.1B | TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF | tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf | ~670 MB |
| Qwen2 1.5B | Qwen/Qwen2-1.5B-Instruct-GGUF | qwen2-1_5b-instruct-q4_0.gguf | ~900 MB |
| Llama 3.2 1B | bartowski/Llama-3.2-1B-Instruct-GGUF | Llama-3.2-1B-Instruct-Q4_K_M.gguf | ~700 MB |
| Phi-3 Mini | bartowski/Phi-3-mini-4k-instruct-GGUF | Phi-3-mini-4k-instruct-Q4_K_M.gguf | ~2.3 GB |
| DeepSeek R1 1.5B | bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | ~1 GB |
Some newer models to check out (Spring 2026)¶
| Model | Repo ID | Filename | Size |
|---|---|---|---|
| Gemma 3 1B | bartowski/google_gemma-3-1b-it-GGUF | google_gemma-3-1b-it-Q4_K_M.gguf | ~700 MB |
| Gemma 3 4B | ggml-org/gemma-3-4b-it-GGUF | gemma-3-4b-it-Q4_K_M.gguf | ~2.5 GB |
| Qwen2.5 1.5B | bartowski/Qwen2.5-1.5B-Instruct-GGUF | Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | ~1 GB |
| Qwen2.5 3B | bartowski/Qwen2.5-3B-Instruct-GGUF | Qwen2.5-3B-Instruct-Q4_K_M.gguf | ~1.9 GB |
| SmolLM2 360M | bartowski/SmolLM2-360M-Instruct-GGUF | SmolLM2-360M-Instruct-Q4_K_M.gguf | ~230 MB |
| SmolLM2 1.7B | bartowski/SmolLM2-1.7B-Instruct-GGUF | SmolLM2-1.7B-Instruct-Q4_K_M.gguf | ~1 GB |
| Llama 3.2 3B | bartowski/Llama-3.2-3B-Instruct-GGUF | Llama-3.2-3B-Instruct-Q4_K_M.gguf | ~2 GB |
Embeddings Model Download¶
We also need to download an embeddings model for vector search. An embeddings model converts text into a list of numbers (a vector) so that similar passages end up close together in that number space. This is the core of RAG (Retrieval-Augmented Generation): you embed your documents once, then embed each question and find the closest matching passages.
| Model | Repo ID | Use via | Size | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | sentence-transformers/all-MiniLM-L6-v2 | sentence-transformers | ~90 MB | Fast, battle-tested, widely used in RAG tutorials |
| Gemma Embedding | google/embedding-gemma collection | sentence-transformers | ~300 MB | Google’s dedicated embedding models, newer and higher quality |
all-MiniLM-L6-v2 is downloaded automatically by the sentence-transformers library the first time you use it — no manual hf_hub_download call needed. The Gemma Embedding models are also loaded via sentence-transformers using their Hugging Face repo ID directly.
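As a minimal sketch of how this looks in code (assuming sentence-transformers is installed), you might embed a couple of passages and a question, then rank the passages by cosine similarity:
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
passages = [
    "Llamas are domesticated South American camelids.",
    "The GGUF format stores quantized model weights for llama.cpp.",
]
question = "What file format do quantized models use?"
# Encode to vectors, then score each passage against the question
passage_vecs = embedder.encode(passages, convert_to_tensor=True)
question_vec = embedder.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_vec, passage_vecs)[0]
for passage, score in zip(passages, scores):
    print(f"{score.item():.3f}  {passage}")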
Let’s check out our local filesystem path and where we will download the files¶
Approach 1 - If a Shared Hub is being used¶
# Cloudbank workshop Hub specific path
!ls /home/jovyan/shared
# This is my local path to a directory called shared-readwrite
!ls /home/jovyan/shared-readwrite
Approach 2 - If a local machine is being used¶
# or the full path (this is on my laptop)
!ls /Users/ericvandusen/SmallLM/Models/
Set the path where the models will download¶
# Path for Shared Hub - change this to match your JupyterHub's shared directory
# Examples: /home/jovyan/shared, /home/jovyan/shared_readwrite, /home/jovyan/_shared/course-name
shared_model_path = "/home/jovyan/shared-readwrite"  # Update this path as needed
# Path for Local
#shared_model_path = "/Users/ericvandusen/SmallLM/Models/"
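Optionally, a quick sanity check (my own addition, not required) confirms the directory exists before any downloads start:
import os
# Fail fast if the target directory is missing or mistyped
assert os.path.isdir(shared_model_path), f"No such directory: {shared_model_path}"
print("Models will be saved to:", shared_model_path)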
Downloading Models with Hugging Face Hub¶
The hf_hub_download function downloads a specific file from a Hugging Face repository.
Key Parameters:¶
| Parameter | Description |
|---|---|
| repo_id | The repository identifier (e.g., "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF") |
| filename | The specific file to download (e.g., "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf") |
| local_dir | Directory where the file will be stored (for shared access) |
| local_dir_use_symlinks | Set to False to copy files instead of creating symlinks (deprecated in recent huggingface_hub releases, which copy into local_dir by default) |
Default Behavior vs Shared Repository¶
By default, hf_hub_download stores files in ~/.cache/huggingface/hub/, which is user-specific. To enable shared access for students, we use the local_dir parameter to specify a shared directory.
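If you would rather redirect the entire cache instead of passing local_dir on every call, one alternative is the HF_HOME environment variable. The subdirectory name below is just an example, and the variable must be set before huggingface_hub is imported in a fresh session:
import os
# Point the whole Hugging Face cache at a shared location (example path)
os.environ["HF_HOME"] = "/home/jovyan/shared-readwrite/hf-cache"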
Download TinyLlama 1.1B (Recommended for teaching)¶
# Download TinyLlama model to shared directory
model_path = hf_hub_download(
repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
local_dir=shared_model_path,
local_dir_use_symlinks=False
)
print(f"Model downloaded to: {model_path}")
Download Qwen2 1.5B Instruct¶
# Download Qwen2 model to shared directory
model_path = hf_hub_download(
repo_id="Qwen/Qwen2-1.5B-Instruct-GGUF",
filename="qwen2-1_5b-instruct-q4_0.gguf",
local_dir=shared_model_path,
local_dir_use_symlinks=False
)
print(f"Model downloaded to: {model_path}")
Download Llama 3.2 1B Instruct¶
# Download Llama 3.2 model to shared directory
model_path = hf_hub_download(
repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",
local_dir=shared_model_path,
local_dir_use_symlinks=False
)
print(f"Model downloaded to: {model_path}")
Download DeepSeek R1 Distill 1.5B¶
# Download DeepSeek model to shared directory
model_path = hf_hub_download(
repo_id="bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
filename="DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",
local_dir=shared_model_path,
local_dir_use_symlinks=False
)
print(f"Model downloaded to: {model_path}")
Let’s make a general function to download any model by specifying the repo ID and filename¶
This function makes it easier to download additional models in the future; just call it with different parameters.
It takes a dictionary mapping repo_id to filename and uses the same local_dir for all downloads.
1. Specify the models to download in a dictionary with repo_id as key and filename as value
2. Call the download_models function with the dictionary and shared model path
def download_models(models_dict, shared_model_path):
    for repo_id, filename in models_dict.items():
        model_path = hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            local_dir=shared_model_path
        )
        print(f"Downloaded: {model_path}")
MODELS_Mar26 = {
"ggml-org/gemma-3-4b-it-GGUF": "gemma-3-4b-it-Q4_K_M.gguf",
"bartowski/SmolLM2-1.7B-Instruct-GGUF": "SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
"bartowski/Qwen2.5-3B-Instruct-GGUF": "Qwen2.5-3B-Instruct-Q4_K_M.gguf",
}
# DON'T RUN THIS AGAIN UNLESS YOU WANT TO RE-DOWNLOAD THE MODELS
download_models(MODELS_Mar26, shared_model_path)
Let’s now check which models we have¶
!ls /home/jovyan/shared
#!ls /Users/ericvandusen/SmallLM/Models/
Testing the Downloaded Model with llama-cpp-python¶
Let’s verify that our downloaded model works correctly by loading it with llama-cpp-python and generating a simple response.
# Ensure llama-cpp-python is installed
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama
import os
# Path to our downloaded model
model_file = os.path.join(shared_model_path, "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
# Load the model
print(f"Loading model from: {model_file}")
llm = Llama(
model_path=model_file,
n_ctx=2048,
verbose=True,
# n_threads=1, #CPU settings
# n_gpu_layers=-1 #GPU Settings
)
print("\n✓ Model loaded successfully!")
# Test generation
response = llm.create_chat_completion(
messages=[
{"role": "user", "content": "Hello! Can you tell me a fun fact about llamas?"}
],
max_tokens=100
)
print("Response:")
print(response["choices"][0]["message"]["content"])
Bonus: Searching for Models on Hugging Face¶
You can also use the Hugging Face Hub API to search for models programmatically.
from huggingface_hub import HfApi, list_models
# Search for GGUF models
api = HfApi()
# Find models with "gguf" in the name, sorted by downloads
models = list(api.list_models(
search="gguf",
sort="downloads",
limit=20
))
print("Top 20 GGUF models by downloads:")
print("-" * 60)
for model in models:
    print(f"{model.id}")
# List files in a specific repository to find available quantizations
from huggingface_hub import list_repo_files
repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
files = list_repo_files(repo_id)
print(f"Files in {repo_id}:")
print("-" * 60)
for f in files:
    if f.endswith(".gguf"):
        print(f)
Summary¶
In this notebook, you learned how to:
Install the huggingface_hub package
Download GGUF models using hf_hub_download
Configure shared storage for classroom environments
Test downloaded models with llama-cpp-python
Search for models on Hugging Face programmatically
Key Advantages of Hugging Face Hub:¶
Huge model selection: Thousands of GGUF models available
Configurable caching: Easy to set up shared directories for classrooms
Automatic versioning: Models are versioned and can be pinned to specific commits (see the example after this list)
Simple API: just two parameters needed: repo_id and filename
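To illustrate the versioning point, hf_hub_download accepts a revision argument that pins a download to a branch, tag, or commit hash. "main" is shown here; substituting a full commit hash makes the download exactly reproducible:
# Pin a download to a specific revision of the repo
pinned_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    revision="main",
)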
Next Steps:¶
See LlamaCpp_SmallLM_Demo.ipynb for detailed usage of downloaded models
Explore different quantization levels (Q4, Q5, Q8) for your use case
Try models from different families (Llama, Qwen, Phi, etc.)