
Excel to Pandas Transition (Zillow Dataset) - Basic

In this introductory notebook, you’ll work with real Zillow housing data, including metro-level home values (ZHVI) and rent estimates (ZORI). We’ll focus on the core Excel-to-Pandas workflow so you can confidently load, inspect, reshape, and analyze the data step by step.

After completing this Basic notebook, continue to Excel_to_Pandas_Zillow_Advanced.ipynb for deeper analysis patterns, richer transformations, and more advanced visualizations.


Professor: Khai Lu
Developer: Rinrada Maneenop
Estimated Time: ~20 minutes


Contents:

  • Introduction: Why Transition from Excel to Pandas?

  • Setting Up: Loading Real Zillow Data

    • Understanding the Wide Format

    • Reshaping with pd.melt() — Excel’s “Unpivot”

  • Basic Data Operations

    • Reading Data

    • Viewing and Inspecting Data

  • Excel Functions Translated to Pandas

    • VLOOKUP/XLOOKUP -> merge() and map()

    • FILTER -> Boolean Indexing

    • SUMIF/COUNTIF -> Conditional Aggregations

    • IF Statements -> np.where() and apply()

Please run the cell below to import all the necessary libraries so your notebook can run without issues!
NOTE: DO NOT edit the cell below, please only run it.

# Import necessary libraries
import sys
from pathlib import Path

# Install ipywidgets into this kernel — do NOT use bare `pip install` here (NameError: pip is not defined)
import subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ipywidgets"], check=False)

# Local module interact.py lives next to this notebook — add it to path when cwd differs
# (e.g. JupyterHub/cloud kernels often start in /tmp or home, not this folder)
for _p in (
    Path.cwd(),
    Path.cwd() / "excel_to_pandas_zillow",
    Path.cwd() / "cis26" / "excel_to_pandas_zillow",
):
    if (_p / "interact.py").exists():
        sys.path.insert(0, str(_p.resolve()))
        break

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown
from interact import (
    explore_data_widget,
    filter_markets_widget,
    pivot_table_widget,
    metro_trends_widget,
)
import warnings
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print("\n📌 Note: This notebook uses interactive widgets (ipywidgets)")
print("   You'll see dropdowns and sliders throughout — just run each cell and use the controls!")

Introduction: Why Transition from Excel to Pandas?

Excel is a powerful tool for data analysis, but it has limitations:

  • Scale: Excel struggles with datasets larger than ~1 million rows

  • Reproducibility: Manual operations are hard to document and reproduce

  • Automation: Repetitive tasks require VBA or manual effort

  • Version Control: Tracking changes is difficult

Pandas, a Python library, addresses these issues:

  • Handles millions of rows efficiently

  • All operations are code-based and reproducible

  • Easy to automate workflows

  • Integrates with version control systems (Git)

  • Free and open-source

In this notebook, you’ll learn how to translate your Excel skills into Pandas using real housing market data from Zillow Research!

Key Concept: DataFrames vs Spreadsheets

In Excel, you work with worksheets containing cells organized in rows and columns.

In Pandas, you work with DataFrames — similar structure but with powerful programmatic capabilities:

| Excel Concept | Pandas Equivalent |
|---|---|
| Workbook | Multiple DataFrames |
| Worksheet | DataFrame |
| Column | Series (df['column_name']) |
| Row | Row (df.loc[row_index]) |
| Cell | Single value (df.loc[row, col]) |
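
A minimal sketch of these analogies on a toy DataFrame (the region names and values below are made up for illustration):

```python
import pandas as pd

# A tiny "worksheet" as a DataFrame (toy numbers, illustration only)
df = pd.DataFrame({
    "RegionName": ["Los Angeles, CA", "New York, NY"],
    "HomeValue": [210_000, 350_000],
})

column = df["HomeValue"]       # a column is a Series
row = df.loc[0]                # a row, selected by its index label
cell = df.loc[0, "HomeValue"]  # a single cell
print(cell)
```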

Setting Up: Loading Real Zillow Data

We’ll work with two real datasets published by Zillow Research:

  1. home_values (Zillow Home Value Index, ZHVI) — the typical home value by metro area, monthly, going back to 2000

  2. rent_values (Zillow Observed Rent Index, ZORI) — the typical observed rent by metro area, monthly

These are the same datasets analysts at real estate firms and banks use for housing market research.

Load Zillow Home Value Index (home_values)

  • Excel: File → Open → Browse → select the CSV

  • Pandas: use pd.read_csv() with a URL or file path

  • Also load the rent dataset (rent_values) for the same metros

# URLs from https://www.zillow.com/research/data/
zhvi_URL = (
    "https://files.zillowstatic.com/research/public_csvs/zhvi/"
    "Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
)
zori_URL = (
    "https://files.zillowstatic.com/research/public_csvs/zori/"
    "Metro_zori_uc_sfrcondomfr_sm_month.csv"
)

print("⏳ Downloading home value data (ZHVI) from Zillow Research...")
home_values_wide = pd.read_csv(zhvi_URL)
print(f"✅ Home values loaded: {home_values_wide.shape[0]} metro areas × {home_values_wide.shape[1]} columns")

print("\n⏳ Downloading rent data (ZORI) from Zillow Research...")
rent_values_wide = pd.read_csv(zori_URL)
print(f"✅ Rent values loaded: {rent_values_wide.shape[0]} metro areas × {rent_values_wide.shape[1]} columns")
# Let's look at what the raw home value data looks like
print("Raw home value data (first 5 rows, first 10 columns):")
display(home_values_wide.iloc[:5, :10])

print("\n📌 Notice the structure:")
print("  • Each ROW = one metro area (e.g., 'Los Angeles, CA')")
print("  • Each COLUMN after StateName = a monthly date (e.g., '2000-01-31')")
print("  • Each VALUE = the typical home value in $ for that metro + month")
print(f"\n  Total date columns: {home_values_wide.shape[1] - 5}")

💡 Understanding the ‘Wide’ Format

The Zillow CSV is in wide format — each date is its own column. This is common in Excel:

| RegionName | 2000-01-31 | 2000-02-29 | 2000-03-31 | ... |
|---|---|---|---|---|
| Los Angeles, CA | 210,000 | 212,000 | 214,000 | ... |
| New York, NY | 350,000 | 351,000 | 353,000 | ... |

For most pandas analysis, we need long format — each row is one observation (metro + date + value):

| RegionName | Date | HomeValue |
|---|---|---|
| Los Angeles, CA | 2000-01-31 | 210,000 |
| Los Angeles, CA | 2000-02-29 | 212,000 |
| New York, NY | 2000-01-31 | 350,000 |

Excel Equivalent: Data → Get & Transform → Unpivot Columns
Pandas Equivalent: pd.melt()

Excel ‘Unpivot’ → pd.melt()

  • In Excel: Data → Get & Transform → Unpivot Columns

  • In pandas: use pd.melt to reshape wide date columns into rows

  • Keep metadata columns as identifiers and unpivot the date columns

# Identify the metadata columns (non-date columns)
meta_cols = ['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName']

# Identify the date columns (everything else)
date_cols = [c for c in home_values_wide.columns if c not in meta_cols]

print(f"Metadata columns: {meta_cols}")
print(f"Number of date columns: {len(date_cols)}")
print(f"Date range: {date_cols[0]} → {date_cols[-1]}")

home_values = pd.melt(
    home_values_wide,
    id_vars=meta_cols,        # columns to KEEP as-is
    value_vars=date_cols,     # columns to UNPIVOT into rows
    var_name='Date',          # name for the new 'column header' column
    value_name='HomeValue'    # name for the new 'value' column
)

# Convert Date string to an actual datetime object
home_values['Date'] = pd.to_datetime(home_values['Date'])

# Drop rows where HomeValue is missing (some early months have no data)
home_values = home_values.dropna(subset=['HomeValue'])

print(f"\n✅ Reshaped to long format: {home_values.shape[0]:,} rows × {home_values.shape[1]} columns")
print("\nFirst 5 rows of long-format data:")
display(home_values.head())

# Do the same for rent_values (rent data)
rent_meta_cols = ['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName']
rent_date_cols = [c for c in rent_values_wide.columns if c not in rent_meta_cols]

rent_values = pd.melt(
    rent_values_wide,
    id_vars=rent_meta_cols,
    value_vars=rent_date_cols,
    var_name='Date',
    value_name='Rent'
)
rent_values['Date'] = pd.to_datetime(rent_values['Date'])
rent_values = rent_values.dropna(subset=['Rent'])

print(f"✅ rent_values long format (rent values): {rent_values.shape[0]:,} rows × {rent_values.shape[1]} columns")
display(rent_values.head())

Column definitions (long-format data)

home_values (Zillow Home Value Index, one row per metro per month):

  • RegionID — Zillow’s unique numeric ID for the region.

  • SizeRank — Rank by market size (0 = largest, e.g. United States or New York; higher = smaller market).

  • RegionName — Name of the metro or area (e.g. “New York, NY”, “Los Angeles, CA”).

  • RegionType — Geography type: country (e.g. United States), or msa = Metropolitan Statistical Area (a core city plus surrounding counties the U.S. government groups as one economic region; “metro” in the data means MSA).

  • StateName — State code (e.g. NY, CA); missing for U.S. or non-state regions.

  • Date — Month of the observation (month-end date).

  • HomeValue — Typical home value in dollars (ZHVI, smoothed, seasonally adjusted, mid-tier).

rent_values (Zillow Observed Rent Index, one row per metro per month):

  • RegionID — Zillow’s unique numeric ID for the region.

  • SizeRank — Rank by market size (0 = largest).

  • RegionName — Name of the metro or area.

  • RegionType — Geography type: country or msa (Metropolitan Statistical Area; see above).

  • StateName — State code; missing for U.S. or non-state regions.

  • Date — Month of the observation.

  • Rent — Typical observed monthly rent in dollars (ZORI, smoothed).


Basic Data Operations

Reading and Inspecting Data

In Excel, you open a file by clicking it. In Pandas, you use read_csv().

Excel: File → Open → Browse → Select file
Pandas: df = pd.read_csv('filename.csv') or df = pd.read_csv('https://...')

Essential Inspection Methods

| Excel Action | Pandas Method | Purpose |
|---|---|---|
| Scroll to see data | df.head() or df.tail() | View first/last rows |
| Ctrl+Down to see size | df.shape | Get (rows, columns) |
| Right-click column header | df.info() | See column types and non-null counts |
| Select column → look at status bar | df.describe() | Get summary statistics |
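
For reference, here are those same inspection methods in plain code. The toy frame below stands in for home_values; the real data answers the same calls identically:

```python
import pandas as pd

# Toy frame standing in for home_values (same methods apply to the real data)
df = pd.DataFrame({
    "RegionName": ["Los Angeles, CA", "New York, NY", "Chicago, IL"],
    "HomeValue": [900_000.0, 650_000.0, 320_000.0],
})

print(df.head(2))     # first rows — like scrolling to the top of a sheet
print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for numeric columns
```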
# Interactive widget to explore different inspection methods
explore_data_widget(home_values)

Excel Functions Translated to Pandas

1. VLOOKUP / XLOOKUP → merge() and map()

VLOOKUP looks up a value in one table and pulls a column from another table.

Excel Formula: =VLOOKUP(RegionName, RentTable, RentColumn, FALSE)

Pandas Equivalent: df.merge() — joins two DataFrames like a SQL JOIN

We’ll use this to attach rent data (rent_values) to home value data (home_values) for the same metro areas.

Using map() as a simpler VLOOKUP

  • Goal: build a dictionary from RegionName to latest HomeValue

  • Use .map() to attach that lookup value to each row

  • Equivalent to an Excel VLOOKUP returning a single column

# Find the most recent date in the dataset
latest_date = home_values['Date'].max()
print(f"Most recent month in dataset: {latest_date.strftime('%B %Y')}")

# Create a dictionary mapping RegionName → latest HomeValue
# This is like building a lookup table in Excel
latest_values = (
    home_values[home_values['Date'] == latest_date]
    .set_index('RegionName')['HomeValue']
    .to_dict()
)

print(f"\nSample entries from our lookup dictionary:")
sample_keys = list(latest_values.keys())[:5]
for k in sample_keys:
    print(f"  {k}: ${latest_values[k]:,.0f}")

# Now apply the lookup to the full dataset using map()
# Excel equivalent: =VLOOKUP(A2, LookupTable, 2, FALSE)
home_values['LatestValue'] = home_values['RegionName'].map(latest_values)

# Snapshot of the latest month, used throughout the rest of the notebook
latest_home_values = home_values[home_values['Date'] == latest_date].copy()

# Backwards-compatible name used elsewhere in the notebook
home_values_latest = latest_home_values

print("\nUsing map() to attach each metro's latest value to every row:")
display(home_values[['RegionName', 'StateName', 'Date', 'HomeValue', 'LatestValue']]
        .head(10))

Using merge() as a full VLOOKUP

  • Goal: combine latest home values and rents for each metro

  • Start from home_values (home values) and join rent_values (rents) on RegionName

  • Keep all metros from home_values and fill missing rents with NaN

latest_rent_values = rent_values[rent_values['Date'] == rent_values['Date'].max()][['RegionName', 'Rent']]
latest_home_snapshot = home_values[home_values['Date'] == latest_date][['RegionName', 'StateName', 'HomeValue', 'SizeRank']]

# This is like doing =VLOOKUP(RegionName, rent_valuesTable, RentColumn, FALSE)
# but getting ALL columns from the rent table at once
housing = latest_home_snapshot.merge(
    latest_rent_values,
    on='RegionName',
    how='left'  # Keep all metros from the home value table, fill NaN if no rent data
)

print(f"Merged dataset shape: {housing.shape}")
print("\nHousing data with home values AND rents:")
display(housing.dropna().sort_values('SizeRank').head(10))

💡 map() vs merge() — When to Use Which?

| Situation | Use |
|---|---|
| You need ONE value from a lookup table | map() |
| You need MULTIPLE columns from another table | merge() |
| You want SQL-style join behavior (inner, outer, left) | merge() |
| Fastest option for a simple single-column lookup | map() |
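
A minimal side-by-side sketch of the two approaches, on toy tables (not the Zillow data):

```python
import pandas as pd

# Toy tables to contrast the two approaches
homes = pd.DataFrame({"RegionName": ["A", "B", "C"], "HomeValue": [100, 200, 300]})
rents = pd.DataFrame({"RegionName": ["A", "B"], "Rent": [10, 20]})

# map(): pull ONE column via a dict lookup (NaN where there is no match)
rent_lookup = rents.set_index("RegionName")["Rent"].to_dict()
homes["Rent_via_map"] = homes["RegionName"].map(rent_lookup)

# merge(): pull ALL columns, with SQL-style join control
merged = homes.merge(rents, on="RegionName", how="left")
print(merged)
```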

2. FILTER → Boolean Indexing

FILTER in Excel (or AutoFilter) selects rows that meet a condition.

Excel: Data → Filter → select criteria
Pandas: df[condition] — called Boolean Indexing

Simple filter: California metros only

  • Excel: AutoFilter on StateName = 'CA'

  • Pandas: subset rows where StateName equals 'CA'

  • Then sort metros by HomeValue descending

ca_metros = latest_home_values[latest_home_values['StateName'] == 'CA']

print(f"California metros: {len(ca_metros)}")
display(ca_metros[['RegionName', 'HomeValue']].sort_values('HomeValue', ascending=False))

Multi-condition filters

  • AND condition: CA metros with HomeValue > 800000

  • OR condition: metros in TX or FL

  • Use & and | with boolean expressions in pandas

expensive_ca = latest_home_values[
    (latest_home_values['StateName'] == 'CA') &
    (latest_home_values['HomeValue'] > 800_000)
]
print("CA metros with home values > $800K:")
display(expensive_ca[['RegionName', 'HomeValue']].sort_values('HomeValue', ascending=False))

tx_fl = latest_home_values[
    (latest_home_values['StateName'] == 'TX') |
    (latest_home_values['StateName'] == 'FL')
]
print(f"\nTX or FL metros: {len(tx_fl)}")
display(tx_fl[['RegionName', 'StateName', 'HomeValue']].sort_values('HomeValue', ascending=False).head(10))

Filtering for multiple values with isin()

  • Mimic Excel’s AutoFilter with multiple checked states

  • Define a list of target states like CA, NY, TX, FL, WA

  • Use df['StateName'].isin(target_states) to select matching rows
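
The steps above can be sketched as follows; the toy frame stands in for latest_home_values so the cell is self-contained:

```python
import pandas as pd

# Toy snapshot standing in for latest_home_values
df = pd.DataFrame({
    "RegionName": ["Los Angeles, CA", "New York, NY", "Dallas, TX",
                   "Miami, FL", "Boise, ID"],
    "StateName": ["CA", "NY", "TX", "FL", "ID"],
})

target_states = ["CA", "NY", "TX", "FL", "WA"]  # like checking several boxes in AutoFilter
selected = df[df["StateName"].isin(target_states)]
print(selected["RegionName"].tolist())
```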

# Interactive Filter Widget
filter_markets_widget(latest_home_values)

3. SUMIF / COUNTIF → Conditional Aggregations

SUMIF and COUNTIF calculate totals or counts based on conditions.

Excel Formula: =AVERAGEIF(StateName, "CA", HomeValue)
Pandas Equivalent: Filter then aggregate, or use groupby()

SUMIF / COUNTIF equivalents

  • AVERAGEIF: compute average HomeValue for California metros

  • COUNTIF: count how many metros there are in each state

  • Use filtering plus .mean() and value_counts() in pandas

ca_avg = latest_home_values[latest_home_values['StateName'] == 'CA']['HomeValue'].mean()
print(f"Average home value in CA: ${ca_avg:,.0f}")

print("\nNumber of metro areas per state (like COUNTIF):")
metros_per_state = latest_home_values['StateName'].value_counts()
display(metros_per_state.head(10))

groupby() as multi-state SUMIF/AVERAGEIF

  • Aggregate HomeValue by StateName in one step

  • Compute median, mean, count, and max per state

  • Excel equivalent would require separate SUMIF/AVERAGEIF formulas for each state

state_summary = latest_home_values.groupby('StateName')['HomeValue'].agg(
    Median_Home_Value='median',
    Mean_Home_Value='mean',
    Num_Metros='count',
    Max_Home_Value='max'
).round(0).sort_values('Median_Home_Value', ascending=False)

print("Home value statistics by state (top 10 most expensive):")
display(state_summary.head(10))

Time-based aggregation: national home value by year

  • Add a Year column derived from the Date

  • Use groupby('Year') and an aggregation like median

  • Excel equivalent: AVERAGEIFS with a year criterion

home_values['Year'] = home_values['Date'].dt.year

annual_median = home_values.groupby('Year')['HomeValue'].median().round(0)

print("Median metro home value by year:")
display(annual_median.tail(10))

print(f"\n📈 Home values from {annual_median.index[0]} to {annual_median.index[-1]}:")
print(f"   ${annual_median.iloc[0]:,.0f} → ${annual_median.iloc[-1]:,.0f}")
print(f"   Total increase: {((annual_median.iloc[-1] / annual_median.iloc[0]) - 1) * 100:.0f}%")

4. IF Statements → np.where() and apply()

Excel’s IF function creates conditional logic in new columns.

Excel Formula: =IF(HomeValue > 500000, "Expensive", "Affordable")
Pandas Equivalent: np.where() or apply()

Simple IF classification

  • Excel: =IF(HomeValue > 500000, "Expensive", "Affordable")

  • Pandas: use np.where to assign “Expensive” or “Affordable” labels

  • Creates an Affordability column based on HomeValue

latest_home_values_copy = home_values_latest.copy()
latest_home_values_copy['Affordability'] = np.where(
    latest_home_values_copy['HomeValue'] > 500_000,
    'Expensive',
    'Affordable'
)

print("Markets classified by affordability:")
display(latest_home_values_copy[['RegionName', 'StateName', 'HomeValue', 'Affordability']]
        .sort_values('HomeValue', ascending=False)
        .head(10))

print("\nCounts:")
print(latest_home_values_copy['Affordability'].value_counts())

Nested IF logic for market tiers

  • Excel example: IF(HomeValue>1000000, "Ultra-Luxury", IF(HomeValue>600000, "Luxury", ...))

  • In pandas, write a helper function and apply it to HomeValue

  • Produces categorical labels like Ultra-Luxury, Luxury, Mid-Range, Affordable

def classify_market(home_value):
    if home_value > 1_000_000:
        return 'Ultra-Luxury'
    elif home_value > 600_000:
        return 'Luxury'
    elif home_value > 400_000:
        return 'Mid-Range'
    else:
        return 'Affordable'

latest_home_values_copy['MarketTier'] = latest_home_values_copy['HomeValue'].apply(classify_market)

print("Market tier distribution:")
print(latest_home_values_copy['MarketTier'].value_counts())

print("\nSample of each tier:")
display(latest_home_values_copy.groupby('MarketTier')
        .apply(lambda x: x.nsmallest(2, 'SizeRank'))
        [['RegionName', 'StateName', 'HomeValue', 'MarketTier']]
        .reset_index(drop=True))

np.select() for multi-level conditions

  • Use multiple boolean conditions and matching labels

  • Cleaner than deeply nested np.where calls

  • Similar to Excel’s IFS() with several thresholds

conditions = [
    latest_home_values_copy['HomeValue'] > 1_000_000,
    latest_home_values_copy['HomeValue'] > 600_000,
    latest_home_values_copy['HomeValue'] > 400_000,
]
choices = ['Ultra-Luxury', 'Luxury', 'Mid-Range']

latest_home_values_copy['MarketTier_v2'] = np.select(conditions, choices, default='Affordable')

print("Using np.select() — same result, cleaner code for many conditions:")
print(latest_home_values_copy['MarketTier_v2'].value_counts())

Quick Reference Cheat Sheet

| Excel Function | Pandas Equivalent | Zillow Example |
|---|---|---|
| File → Open CSV | pd.read_csv(url) | pd.read_csv(zhvi_URL) |
| Data → Unpivot | pd.melt() | Wide home_values → long format |
| AutoFilter | Boolean indexing | df[df['StateName'] == 'CA'] |
| FILTER (multi-value) | .isin() | df[df['StateName'].isin(['CA','NY'])] |
| VLOOKUP (1 col) | series.map(dict) | Map RegionName → latest home value |
| VLOOKUP / XLOOKUP | df.merge() | Join home values + rents by RegionName |
| SUMIF | groupby().sum() | Total value by state |
| AVERAGEIF | groupby().mean() | Avg home value by state |
| COUNTIF | value_counts() / groupby().size() | Count metros per state |
| IF | np.where() | Classify Affordable vs Expensive |
| Nested IFS | np.select() | Market tier classification |
| IF (complex) | .apply(func) | Custom category logic |
| YEAR(Date) + summary | .dt.year + groupby() | Annual median home value |
| AVERAGE | .mean() | National median/mean home value snapshot |
| MAX/MIN | .max() / .min() | Most/least expensive metro |