Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

1) Load the data

Estimated time: 45–75 minutes

Learning goals:

  • Understand the MRSA dataset and key variables (counts, county, year, population).

  • Use Python (pandas, numpy) to explore and visualize health data.

  • Apply simple mathematical models (exponential and logistic) to infection counts and interpret parameters in a public health context.

  • Use interactive widgets to explore how models and data change by county and year.

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.optimize import curve_fit
import ipywidgets as widgets
from IPython.display import display

plt.style.use('seaborn')
print('Libraries loaded')

1) Load the data

Run the cell below to load the provided CSV files. If a file is missing, update the path to where the data files are stored on your machine.

# Try common relative paths used in the original notebook
import os
base = 'data'
files = {
    'mrsa_merged': 'mrsa_merged.csv',
    'infec_pop_merge': 'infec_pop_merge.csv',
    'mrsa_2013': 'mrsa_2013.csv'
}
datasets = {}
for name, fname in files.items():
    p = os.path.join(base, fname)
    if os.path.exists(p):
        try:
            datasets[name] = pd.read_csv(p)
            print(f'Loaded {p} — {datasets[name].shape[0]} rows, {datasets[name].shape[1]} cols')
        except Exception as e:
            print('Failed to read', p, e)
    else:
        print('Missing file:', p)

# Fall back to empty DataFrames for safety if files missing
mrsa_merged = datasets.get('mrsa_merged', pd.DataFrame())
infec_pop_merge = datasets.get('infec_pop_merge', pd.DataFrame())
mrsa_2013 = datasets.get('mrsa_2013', pd.DataFrame())

# Quick preview
preview_datasets = [
    ("mrsa_merged", mrsa_merged),
    ("infec_pop_merge", infec_pop_merge),
    ("mrsa_2013", mrsa_2013),
]

for name, df in preview_datasets:
    print(f"\n== {name} ==")
    display(df.head())

2) Quick data checks (helpful hints)

  • County, Year, and Infections (or a similar count column) are the key columns used below. If your dataset uses different column names, rename them accordingly before proceeding.

  • If population is provided, we will calculate rates per 100,000 people for fair comparisons between counties.

3) Mathematical modeling — why and what

In public health, simple models help summarize how infections grow and what controls them. Two common, easy‑to‑understand models are:

  • Exponential growth: early outbreaks often look exponential when unconstrained. Parameters: initial scale and growth rate.

  • Logistic growth: growth that saturates because of limited susceptible population or interventions. Parameters: carrying capacity, growth rate, and midpoint.

Below we define both models and provide code to fit them to county‑level yearly MRSA counts. Interpretation tips are included in the markdown cells that follow the fits.

# Define models
def exp_model(t, a, b):
    return a * np.exp(b * t)

def logistic_model(t, K, r, t0):
    return K / (1 + np.exp(-r * (t - t0)))

print('Model functions defined')
# Helper: prepare a county time series of total infections per year
def county_time_series(df, county_name, year_col='Year', county_col='County', count_col=None):
    # Try to guess the count column if not provided
    if count_col is None:
        candidates = [c for c in df.columns if 'infect' in c.lower() or 'count' in c.lower() or 'reports' in c.lower()]
        count_col = candidates[0] if candidates else None
    if count_col is None:
        raise ValueError('Could not infer count column; provide count_col explicitly')
    sub = df[df[county_col] == county_name].groupby(year_col)[count_col].sum().reset_index()
    sub = sub.sort_values(year_col)
    years = sub[year_col].astype(int).values
    counts = sub[count_col].astype(float).values
    return years, counts

# Fit function that attempts both models and plots results
def fit_and_plot_models(df, county, count_col=None):
    years, counts = county_time_series(df, county, count_col=count_col)
    if len(years) < 3:
        print('Not enough data points for reliable fitting for', county)
        return
    t = years - years.min()

    # Fit exponential (linearize with log for initial guess)
    try:
        # provide initial guesses to help convergence
        p0 = [max(1.0, counts[0]), 0.1]
        popt_exp, _ = curve_fit(exp_model, t, counts, p0=p0, maxfev=10000)
    except Exception as e:
        popt_exp = None
        print('Exponential fit failed:', e)

    # Fit logistic: need a good initial guess for K, r, t0
    try:
        K0 = max(counts) * 2
        r0 = 0.5
        t0_0 = t.mean()
        popt_log, _ = curve_fit(logistic_model, t, counts, p0=[K0, r0, t0_0], maxfev=10000)
    except Exception as e:
        popt_log = None
        print('Logistic fit failed:', e)

    # Prepare figure
    plt.figure(figsize=(8, 4))
    plt.plot(years, counts, 'o-', label='Observed')
    t_fine = np.linspace(t.min(), t.max(), 50)
    years_fine = years.min() + t_fine
    if popt_exp is not None:
        plt.plot(years_fine, exp_model(t_fine, *popt_exp), label=f'Exponential: a={popt_exp[0]:.2f}, b={popt_exp[1]:.3f}')
    if popt_log is not None:
        plt.plot(years_fine, logistic_model(t_fine, *popt_log), label=f'Logistic: K={popt_log[0]:.1f}, r={popt_log[1]:.3f}, t0={popt_log[2]:.2f}')
    plt.xlabel('Year')
    plt.ylabel('Infections (counts)')
    plt.title(f'County: {county}')
    plt.legend()
    plt.tight_layout()
    plt.show()

    # Return parameters for inspection
    return {'years': years, 'counts': counts, 'exp_params': popt_exp, 'log_params': popt_log}

3) Fit models interactively

Use the widget below to select a county and fit both exponential and logistic models. Click the “Run” button after choosing a county. Interpretations of parameters are provided after the output.

# Build a dropdown of available counties from the merged dataset if present
if not mrsa_merged.empty:
    counties = sorted(mrsa_merged['County'].dropna().unique().tolist())
elif not mrsa_2013.empty:
    counties = sorted(mrsa_2013['County'].dropna().unique().tolist())
else:
    counties = ['Los Angeles', 'San Francisco', 'Sonoma']

county_dd = widgets.Dropdown(options=counties, value=counties[0], description='County:')
run_btn = widgets.Button(description='Run fit')
out = widgets.Output()

def on_run(b):
    with out:
        out.clear_output()
        try:
            # try common column name 'Infections' first, otherwise infer inside helper
            if 'Infections' in (mrsa_merged.columns if not mrsa_merged.empty else []):
                res = fit_and_plot_models(mrsa_merged, county_dd.value, count_col='Infections')
            elif 'Infections' in (mrsa_2013.columns if not mrsa_2013.empty else []):
                res = fit_and_plot_models(mrsa_2013, county_dd.value, count_col='Infections')
            else:
                # try generic dataframe if available
                target_df = mrsa_merged if not mrsa_merged.empty else mrsa_2013
                res = fit_and_plot_models(target_df, county_dd.value)
            print('
Parameter summary:')
            print(res)
            print('
Interpretation tips:
- Exponential growth rate (b): positive values indicate growth; larger values mean faster increases.
- Logistic carrying capacity (K): approximate upper limit of counts if the logistic model fits well.
- Growth rate (r) and midpoint (t0): together they describe how quickly the saturation happens and when.
')
        except Exception as e:
            print('Error running fit:', e)

run_btn.on_click(on_run)
display(widgets.HBox([county_dd, run_btn]))
display(out)

4) Data analysis with interactive regression (population vs infections)

We often compare infection counts to population size to compute per‑capita rates. The widget below shows a scatter plot for a selected year with a linear best fit line. Use the slider to change the year and inspect how the relationship evolves.

def plot_pop_vs_inf(year):
    # Attempt to find county totals for the given year
    df = infec_pop_merge if not infec_pop_merge.empty else mrsa_merged
    if df.empty:
        print('No population+infection merged dataset available in `infec_pop_merge` or `mrsa_merged`.')
        return
    # Expect columns like 'Year', 'County', 'Population', 'Infections' or similar
    if 'Population' not in df.columns or 'Infections' not in df.columns:
        # try to infer names
        pop_col = [c for c in df.columns if 'pop' in c.lower() or 'population' in c.lower()]
        inf_col = [c for c in df.columns if 'infect' in c.lower() or 'count' in c.lower()]
        if pop_col:
            pop_col = pop_col[0]
        else:
            print('Could not find a population column')
            return
        if inf_col:
            inf_col = inf_col[0]
        else:
            print('Could not find an infection count column')
            return
    else:
        pop_col = 'Population'
        inf_col = 'Infections'

    sub = df[df['Year'] == year].groupby('County')[[pop_col, inf_col]].sum().reset_index()
    sub['rate_per_100k'] = 100000 * sub[inf_col] / sub[pop_col]
    fig = px.scatter(sub, x=pop_col, y=inf_col, hover_name='County', size='rate_per_100k',
                     title=f'Population vs Infections — {year}',
                     labels={pop_col: 'Population', inf_col: 'Infections'})
    # add linear fit line
    x = sub[pop_col].values
    y = sub[inf_col].values
    if len(x) > 1:
        m, c = np.polyfit(x, y, 1)
        x_line = np.linspace(x.min(), x.max(), 50)
        y_line = m * x_line + c
        fig.add_traces(px.line(x=x_line, y=y_line, labels={'x': pop_col, 'y': inf_col}).data)
        fig.update_layout(showlegend=False)
    fig.show()

# Interactive control: pick a year
years_available = sorted(infec_pop_merge['Year'].unique().tolist()) if not infec_pop_merge.empty else []
if years_available:
    year_slider = widgets.IntSlider(value=years_available[0], min=min(years_available), max=max(years_available), step=1, description='Year:')
    widgets.interact(plot_pop_vs_inf, year=year_slider)
else:
    print('No years found in `infec_pop_merge` to build the year slider')

5) Exercises (short) — tailored for pre‑nursing students

  • Exercise 1: Select an urban county and a rural county. Use the model widget to fit both counties. Which model (exponential or logistic) seems to describe each county better? What do the parameter values suggest about control of MRSA in each county?

  • Exercise 2: Using the population vs infection widget, pick two years and compare the slope of the population–infections regression. How did the per‑capita infection trend change?

  • Exercise 3 (optional, challenge): Modify the fit_and_plot_models function to fit only the years 2013–2016 and compare parameters to the full time range. What changed?

Write short answers in new markdown cells below each question.


If you’d like, I can also:

  • Add more step‑by‑step explanations for the code cells for students newer to programming, or

  • Export a PDF copy of this adjusted notebook for distribution.

Tell me which you’d prefer next.