Quantization Demo (Data 8–style, NumPy + scikit-learn)

This notebook demonstrates the idea of model quantization (FP32 → INT8) without any deep learning frameworks. We will:

Train a simple Logistic Regression classifier on the classic digits dataset (8×8 grayscale).
Measure accuracy, size, and inference latency for the FP32 model.
“Quantize” the learned weights to INT8 (per-tensor scale + zero-point).
Run inference by dequantizing on the fly (simple and portable) and compare metrics.
(Optional) Make a tiny plot for size/time comparisons.

Teaching goals: show the trade-offs (size vs. accuracy vs. speed) with simple, readable code—one small step at a time.

0. (Optional) Install packages¶

If your environment doesn’t already have scikit-learn and numpy, run the cell below. If they are already installed, it’s safe to skip or re-run.


# %pip install -q scikit-learn numpy matplotlib
# (Uncomment if needed. In many JupyterHub environments, these are preinstalled.)

1. Imports¶

Short, standard imports.


import time
import sys
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Matplotlib is only used for a small optional plot at the end.
import matplotlib.pyplot as plt

2. Load & prepare the digits dataset¶

8×8 grayscale images (features = 64 pixels).
Scale pixel values to 0..1 for stable logistic regression.
Make a train/test split.


X, y = load_digits(return_X_y=True)  # X: [N, 64], y: digits 0..9
X = X.astype(np.float32) / 16.0      # original pixels are 0..16
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

Xtr.shape, Xte.shape

3. Train a baseline Logistic Regression (FP32)¶

We keep it simple: one-vs-rest logistic regression. Then we measure training time.


clf = LogisticRegression(
    max_iter=200,
    solver="lbfgs",
    multi_class="ovr",
    n_jobs=1  # keep deterministic/portable for class demos
)

t0 = time.perf_counter()
clf.fit(Xtr, ytr)
t1 = time.perf_counter()

train_time_s = t1 - t0
print("FP32 train time (s):", round(train_time_s, 4))

4. FP32 accuracy, inference time, and size¶

We measure accuracy, do a tiny warm-up, and time one prediction pass.
We also compute model size from the coefficient arrays.


# Accuracy
yhat_fp32 = clf.predict(Xte)
acc_fp32 = (yhat_fp32 == yte).mean()

# Inference timing (with a quick warm-up)
_ = clf.predict(Xte[:64])  # warm-up
t2 = time.perf_counter()
_ = clf.predict(Xte)
t3 = time.perf_counter()
inf_ms_fp32 = (t3 - t2) * 1000

# Size in bytes (weights + bias)
size_fp32_bytes = clf.coef_.nbytes + clf.intercept_.nbytes

print("FP32 acc:", round(acc_fp32, 4))
print("FP32 inference time (ms):", round(inf_ms_fp32, 3))
print("FP32 size (KB):", round(size_fp32_bytes/1024, 2))

5. INT8 quantization helpers (per-tensor)¶

We quantize each row of the weight matrix (one per class) to int8 using a simple scale and zero-point.


def quantize_per_tensor(w: np.ndarray):
    """
    Map float32 array w -> int8 values with per-tensor scale & zero-point.
    Returns (q, scale, zp).
    """
    w_min, w_max = float(w.min()), float(w.max())
    if w_max == w_min:
        # Edge case: constant tensor
        scale, zp = 1.0, 0
        q = np.zeros_like(w, dtype=np.int8)
    else:
        scale = (w_max - w_min) / 255.0
        # Shift so that -128..127 covers [w_min, w_max] approximately
        zp = int(round(-w_min / scale)) - 128
        q = np.clip(np.round(w / scale + zp), -128, 127).astype(np.int8)
    return q, scale, zp

def dequantize_per_tensor(q: np.ndarray, scale: float, zp: int):
    """
    Map int8 values back to float32 using scale & zero-point.
    """
    return (q.astype(np.float32) - zp) * scale

6. Quantize the trained weights to INT8¶

We keep biases as float32 (common practice) and quantize the weight vectors.


W = clf.coef_.astype(np.float32)       # shape: [10, 64]
b = clf.intercept_.astype(np.float32)  # shape: [10]

qW_list, scales, zps = [], [], []
for k in range(W.shape[0]):
    q, s, z = quantize_per_tensor(W[k])
    qW_list.append(q); scales.append(s); zps.append(z)

qW = np.stack(qW_list, axis=0)                    # [10, 64] int8
scales = np.array(scales, dtype=np.float32)       # [10]
zps    = np.array(zps, dtype=np.int32)            # [10]

size_int8_bytes = qW.nbytes + b.nbytes            # weights int8, biases fp32
print("INT8 size (KB):", round(size_int8_bytes/1024, 2))
print("Compression ×:", round(size_fp32_bytes / max(1, size_int8_bytes), 2))

7. INT8 inference (dequantize on the fly)¶

For simplicity (and portability), we dequantize back to float32 before the dot product. This keeps the code readable while still demonstrating size and accuracy effects.


def predict_int8(X):
    # Dequantize each class's weights (one-time per call, still fast on CPU)
    W_deq = np.vstack([dequantize_per_tensor(qW[k], scales[k], zps[k]) for k in range(qW.shape[0])])
    logits = X @ W_deq.T + b  # [N,64] @ [64,10] -> [N,10]
    return logits.argmax(axis=1)

# Warm-up
_ = predict_int8(Xte[:64])

t4 = time.perf_counter()
yhat_int8 = predict_int8(Xte)
t5 = time.perf_counter()

acc_int8 = (yhat_int8 == yte).mean()
inf_ms_int8 = (t5 - t4) * 1000

print("INT8 acc:", round(acc_int8, 4))
print("INT8 inference time (ms):", round(inf_ms_int8, 3))

8. Summary table¶

A tiny dictionary so students can see the comparison clearly.


summary = {
    "FP32_acc": round(acc_fp32, 4),
    "INT8_acc": round(acc_int8, 4),
    "FP32_KB": round(size_fp32_bytes/1024, 2),
    "INT8_KB": round(size_int8_bytes/1024, 2),
    "FP32_ms": round(inf_ms_fp32, 3),
    "INT8_ms": round(inf_ms_int8, 3),
}
summary

9. (Optional) Quick plot¶

One chart at a time; no custom colors. This plot compares model size (KB).


labels = ["FP32", "INT8"]
sizes  = [summary["FP32_KB"], summary["INT8_KB"]]

plt.figure()
plt.bar(labels, sizes)
plt.ylabel("Model size (KB)")
plt.title("FP32 vs INT8 size")
plt.show()

10. Discussion prompts (for class)¶

Accuracy usually changes very little on simple datasets like digits.
Model size drops roughly 4× when moving from float32 to int8 weights.
In this simple NumPy demo, speed may not improve dramatically because the math is tiny and Python overhead dominates.
In real systems, vectorized int8 kernels (e.g., BLAS, ONNX Runtime, or framework backends) can speed up large models.
Trade-offs: size vs. accuracy vs. speed — the right choice depends on your device and latency goals.