
Hugging Face Hub - Setup by Downloading Models

!! This notebook is not meant to be run, just to explain the setup !!

Teaching LLM workflow by using open source models and Hugging Face Hub

The Hugging Face Hub is a platform that hosts thousands of pre-trained models, datasets, and demos. It’s the go-to source for downloading quantized GGUF models that can run efficiently on CPU.

This information is current as of the writing of this notebook (December 2025); the model landscape is changing rapidly.

Why Hugging Face Hub?

  • Vast model selection: Access to thousands of GGUF models, not limited to a curated subset

  • Standard format: All models use the GGUF format compatible with llama.cpp and llama-cpp-python

  • Configurable caching: Control where models are downloaded and stored

  • Simple Python API: Easy-to-use hf_hub_download function

Shared Filesystem

In the setup where I was teaching, I used this notebook to download models from Hugging Face into a shared read-write folder on JupyterHub, where the students could access them. This was possible because the JupyterHub I used for teaching had a shared folder system.

Your use case may vary. It could look like...

  • Shared read-write directory on JupyterHub

  • Each student downloads their own models

  • Download models to local machine

# Ensure that your python environment has huggingface_hub package installed.
try:
    from huggingface_hub import hf_hub_download
except ImportError:
    %pip install huggingface_hub
    from huggingface_hub import hf_hub_download

Which model to download

In the use case for teaching on a JupyterHub with a CPU, I was looking for small models:

  • ~1-2 billion parameters

  • Quantized (weights are stored in blocks as 4-bit integers, plus a per-block scale factor and zero point)
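To make the quantization idea concrete, here is a toy pure-Python sketch of 4-bit block quantization. This is illustrative only; real GGUF schemes such as Q4_K_M use more sophisticated block layouts, but the core idea (small integers plus a scale and zero point per block) is the same.

```python
# Toy 4-bit block quantization: each block of weights is stored as
# integers in 0..15 plus a per-block scale factor and zero point.

def quantize_block(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0           # map the value range onto 16 levels
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                     # 4-bit ints + scale + zero point

def dequantize_block(q, scale, zero):
    return [zero + qi * scale for qi in q]

block = [0.12, -0.53, 0.97, 0.04, -0.31, 0.66]
q, scale, zero = quantize_block(block)
restored = dequantize_block(q, scale, zero)
print(q)         # small integers in 0..15
print(restored)  # close to the original weights, within one quantization step
```

The storage win is that each weight needs only 4 bits instead of 16 or 32, at the cost of a small rounding error bounded by the block's scale factor.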

You can explore the world of models at: Hugging Face Model List

You can explore rankings based on benchmark metrics at llm-explorer

When searching for GGUF models, look for:

  • Q4_K_M or Q4_0: 4-bit quantization, good balance of size and quality

  • Q5_K_M: Slightly larger but better quality

  • Q8_0: 8-bit quantization, best quality but larger file size
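A rough back-of-the-envelope way to relate parameter count and quantization level to file size is parameters × bits-per-weight ÷ 8. This is only an approximation I'm assuming here (it ignores metadata and the higher-precision tensors that mixed quantizations keep), but it lands close to the sizes in the table below:

```python
def approx_gguf_size_mb(n_params, bits_per_weight):
    """Rough size estimate: params * bits / 8, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e6

# Q4_K_M averages roughly 4-5 bits per weight; Q8_0 uses about 8.
print(f"TinyLlama 1.1B @ ~4.5 bits: ~{approx_gguf_size_mb(1.1e9, 4.5):.0f} MB")
print(f"TinyLlama 1.1B @ ~8 bits:   ~{approx_gguf_size_mb(1.1e9, 8):.0f} MB")
```

The first estimate (~620 MB) is in the same ballpark as the ~670 MB Q4_K_M file listed below, which is a useful sanity check before downloading to a quota-limited shared directory.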

| Model | Repo ID | Filename | Size |
| --- | --- | --- | --- |
| TinyLlama 1.1B | TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF | tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf | ~670 MB |
| Qwen2 1.5B | Qwen/Qwen2-1.5B-Instruct-GGUF | qwen2-1_5b-instruct-q4_0.gguf | ~900 MB |
| Llama 3.2 1B | bartowski/Llama-3.2-1B-Instruct-GGUF | Llama-3.2-1B-Instruct-Q4_K_M.gguf | ~700 MB |
| Phi-3 Mini | bartowski/Phi-3-mini-4k-instruct-GGUF | Phi-3-mini-4k-instruct-Q4_K_M.gguf | ~2.3 GB |
| DeepSeek R1 1.5B | bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | ~1 GB |

Some newer models to check out (Spring 2026)

| Model | Repo ID | Filename | Size |
| --- | --- | --- | --- |
| Gemma 3 1B | bartowski/google_gemma-3-1b-it-GGUF | google_gemma-3-1b-it-Q4_K_M.gguf | ~700 MB |
| Gemma 3 4B | ggml-org/gemma-3-4b-it-GGUF | gemma-3-4b-it-Q4_K_M.gguf | ~2.5 GB |
| Qwen2.5 1.5B | bartowski/Qwen2.5-1.5B-Instruct-GGUF | Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | ~1 GB |
| Qwen2.5 3B | bartowski/Qwen2.5-3B-Instruct-GGUF | Qwen2.5-3B-Instruct-Q4_K_M.gguf | ~1.9 GB |
| SmolLM2 360M | bartowski/SmolLM2-360M-Instruct-GGUF | SmolLM2-360M-Instruct-Q4_K_M.gguf | ~230 MB |
| SmolLM2 1.7B | bartowski/SmolLM2-1.7B-Instruct-GGUF | SmolLM2-1.7B-Instruct-Q4_K_M.gguf | ~1 GB |
| Llama 3.2 3B | bartowski/Llama-3.2-3B-Instruct-GGUF | Llama-3.2-3B-Instruct-Q4_K_M.gguf | ~2 GB |

Embeddings Model Download

We also need to download an embeddings model for vector search. An embeddings model converts text into a list of numbers (a vector) so that similar passages end up close together in that number space. This is the core of RAG (Retrieval-Augmented Generation): you embed your documents once, then embed each question and find the closest matching passages.

| Model | Repo ID | Use via | Size | Notes |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | sentence-transformers/all-MiniLM-L6-v2 | sentence-transformers | ~90 MB | Fast, battle-tested, widely used in RAG tutorials |
| Gemma Embedding | google/embedding-gemma collection | sentence-transformers | ~300 MB | Google’s dedicated embedding models, newer and higher quality |

all-MiniLM-L6-v2 is downloaded automatically by the sentence-transformers library the first time you use it — no manual hf_hub_download call needed. The Gemma Embedding models are also loaded via sentence-transformers using their Hugging Face repo ID directly.
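The "similar passages end up close together" idea can be illustrated with a toy example using hand-made vectors and cosine similarity. Real embedding models produce vectors with hundreds of dimensions, but the distance math used to find the closest passage is the same; the 3-d vectors below are invented for illustration, not output from any model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend 3-d "embeddings" for three passages (hand-made, not from a model)
docs = {
    "llamas are animals":   [0.9, 0.1, 0.0],
    "alpacas are animals":  [0.8, 0.2, 0.1],
    "python is a language": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "tell me about llamas"

best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # an animal passage ranks far above the programming passage
```

In a real RAG pipeline the only change is that the vectors come from an embedding model: embed every document once up front, embed each question at query time, and retrieve the highest-similarity passages.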

Let’s check the local filesystem paths where we will download the files

Approach 1 - If a Shared Hub is being used

# Cloudbank workshop Hub specific path
!ls /home/jovyan/shared
# This is my local path to a directory called shared-readwrite
!ls /home/jovyan/shared-readwrite

Approach 2 - If a local machine is being used

# or the full path (this is on my laptop)
!ls /Users/ericvandusen/SmallLM/Models/

Set the path where the models will download

# Path for Shared Hub - change this to match your JupyterHub's shared directory
# Examples: /home/jovyan/shared, /home/jovyan/shared_readwrite, /home/jovyan/_shared/course-name
shared_model_path = "/home/jovyan/shared-readwrite"  # Update this path as needed
# Path for Local
#shared_model_path = "/Users/ericvandusen/SmallLM/Models/"

Downloading Models with Hugging Face Hub

The hf_hub_download function downloads a specific file from a Hugging Face repository.

Key Parameters:

| Parameter | Description |
| --- | --- |
| repo_id | The repository identifier (e.g., "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF") |
| filename | The specific file to download (e.g., "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf") |
| local_dir | Directory where the file will be stored (for shared access) |
| local_dir_use_symlinks | Set to False to copy files instead of creating symlinks (deprecated in recent huggingface_hub releases, which always copy files into local_dir) |

Default Behavior vs Shared Repository

By default, hf_hub_download stores files in ~/.cache/huggingface/hub/, which is user-specific. To enable shared access for students, we use the local_dir parameter to specify a shared directory.
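On a shared hub you may want to skip the call entirely if another instructor has already fetched a file into the shared directory. A minimal stdlib sketch of that check (the paths here are hypothetical examples, mirroring the ones used below):

```python
import os

shared_model_path = "/home/jovyan/shared-readwrite"  # hypothetical shared dir
filename = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

target = os.path.join(shared_model_path, filename)
if os.path.exists(target):
    print(f"Already present, skipping download: {target}")
else:
    print(f"Not found, would call hf_hub_download for {filename}")
```

This keeps repeated notebook runs from hammering the network or surprising students with long download times.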

# Download TinyLlama model to shared directory
model_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    local_dir=shared_model_path,
    local_dir_use_symlinks=False
)

print(f"Model downloaded to: {model_path}")

Download Qwen2 1.5B Instruct

# Download Qwen2 model to shared directory
model_path = hf_hub_download(
    repo_id="Qwen/Qwen2-1.5B-Instruct-GGUF",
    filename="qwen2-1_5b-instruct-q4_0.gguf",
    local_dir=shared_model_path,
    local_dir_use_symlinks=False
)

print(f"Model downloaded to: {model_path}")

Download Llama 3.2 1B Instruct

# Download Llama 3.2 model to shared directory
model_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    local_dir=shared_model_path,
    local_dir_use_symlinks=False
)

print(f"Model downloaded to: {model_path}")

Download DeepSeek R1 Distill 1.5B

# Download DeepSeek model to shared directory
model_path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",
    local_dir=shared_model_path,
    local_dir_use_symlinks=False
)

print(f"Model downloaded to: {model_path}")

Let’s make a general function to download any model by specifying the repo ID and filename

This function makes it easier to download additional models later: it takes repo_id and filename as parameters and uses the same local_dir for all downloads. To use it:

  1. Specify the models to download in a dictionary with repo_id as key and filename as value

  2. Call the download_models function with the dictionary and the shared model path

def download_models(models_dict, shared_model_path):
    for repo_id, filename in models_dict.items():
        model_path = hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            local_dir=shared_model_path
        )
        print(f"Downloaded: {model_path}")
MODELS_Mar26 = {
    "ggml-org/gemma-3-4b-it-GGUF":               "gemma-3-4b-it-Q4_K_M.gguf",
    "bartowski/SmolLM2-1.7B-Instruct-GGUF":      "SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
    "bartowski/Qwen2.5-3B-Instruct-GGUF":        "Qwen2.5-3B-Instruct-Q4_K_M.gguf",
}
# DON'T RUN THIS AGAIN UNLESS YOU WANT TO RE-DOWNLOAD THE MODELS
download_models(MODELS_Mar26, shared_model_path)

Let’s now check which models we have

!ls /home/jovyan/shared
#!ls /Users/ericvandusen/SmallLM/Models/

Testing the Downloaded Model with llama-cpp-python

Let’s verify that our downloaded model works correctly by loading it with llama-cpp-python and generating a simple response.

# Ensure llama-cpp-python is installed
try:
    from llama_cpp import Llama
except ImportError:
    %pip install llama-cpp-python
    from llama_cpp import Llama
import os

# Path to our downloaded model
model_file = os.path.join(shared_model_path, "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")

# Load the model
print(f"Loading model from: {model_file}")
llm = Llama(
    model_path=model_file,
    n_ctx=2048,
    verbose=True,
    # n_threads=1, #CPU settings
    # n_gpu_layers=-1 #GPU Settings
)

print("\n✓ Model loaded successfully!")
# Test generation
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello! Can you tell me a fun fact about llamas?"}
    ],
    max_tokens=100
)

print("Response:")
print(response["choices"][0]["message"]["content"])

Bonus: Searching for Models on Hugging Face

You can also use the Hugging Face Hub API to search for models programmatically.

from huggingface_hub import HfApi, list_models
# Search for GGUF models
api = HfApi()

# Find models with "gguf" in the name, sorted by downloads
models = list(api.list_models(
    search="gguf",
    sort="downloads",
    limit=20
))

print("Top 20 GGUF models by downloads:")
print("-" * 60)
for model in models:
    print(f"{model.id}")
# List files in a specific repository to find available quantizations
from huggingface_hub import list_repo_files

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
files = list_repo_files(repo_id)

print(f"Files in {repo_id}:")
print("-" * 60)
for f in files:
    if f.endswith(".gguf"):
        print(f)

Summary

In this notebook, you learned how to:

  1. Install the huggingface_hub package

  2. Download GGUF models using hf_hub_download

  3. Configure shared storage for classroom environments

  4. Test downloaded models with llama-cpp-python

  5. Search for models on Hugging Face programmatically

Key Advantages of Hugging Face Hub:

  • Huge model selection: Thousands of GGUF models available

  • Configurable caching: Easy to set up shared directories for classrooms

  • Automatic versioning: Models are versioned and can be pinned to specific commits

  • Simple API: only two required parameters, repo_id and filename

Next Steps:

  • See LlamaCpp_SmallLM_Demo.ipynb for detailed usage of downloaded models

  • Explore different quantization levels (Q4, Q5, Q8) for your use case

  • Try models from different families (Llama, Qwen, Phi, etc.)