
InvJNB: Are Emily and Greg More Employable Than Lakisha and Jamal?

Standalone Investigation Notebook
Use with CourseKata intro chapters or as standalone
Kernel: R

Summary of Notebook

In 2004, researchers sent identical resumes with different names to job postings and recorded callback rates. This investigation explores whether names associated with different races affect callback rates, starting with gender comparisons and moving to race/name analysis. You will comprehensively explore the dataset, compute descriptive statistics, visualize patterns, and optionally use advanced techniques like shuffle tests and multiple variable analyses.

Includes

  • Comprehensive data exploration (all 63 variables)

  • Descriptive comparisons by gender and race

  • Name-race relationship exploration

  • Optional: Distribution triad and DGP concepts

  • Optional: Shuffle tests for evaluating chance vs. pattern

  • Optional: Multiple variable analysis and effect sizes

Approximate time to complete Notebook: 60-140 mins (depends on optional sections)

Core sections (1.0-5.0): 60-75 mins
With all optional advanced sections: 120-140 mins

Intro — Approximate Time: 3-5 mins

The 2004 Resume Study

In 2004, researchers Marianne Bertrand and Sendhil Mullainathan conducted a field experiment. They created resumes that were identical except for the names—some had names typically associated with White people (like Emily and Greg) and others had names typically associated with Black people (like Lakisha and Jamal). They sent these resumes to job postings and recorded which ones received callbacks.

The Hiring Process:

  • Researchers responded to help-wanted ads in Boston and Chicago newspapers

  • They sent nearly 5,000 resumes to over 1,300 employment ads

  • The ads were for sales, administrative support, and clerical positions

  • Resumes were randomly assigned names that were perceived as either distinctively White or distinctively Black

  • Qualifications were kept similar across both groups (education, experience, skills)

  • The key outcome: whether the applicant received a callback (a phone call or email requesting an interview)

Key Research Questions:

  • Do names signal race to employers?

  • Does this affect callback rates?

  • What does this mean for creating effective resumes?

Study Reference:
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94(4), 991-1013.

At the beginning of each notebook, load the packages you will use. Always run this first.

# Install coursekata if not already installed
if (!require("coursekata", quietly = TRUE)) {
    install.packages("coursekata", repos = "https://cloud.r-project.org", quiet = TRUE)
}

suppressPackageStartupMessages({
    library(coursekata)
    library(dplyr)
    library(tidyr)
})
-- CourseKata packages ------------------------------------ coursekata 0.19.0 --
v dslabs              0.9.0       v Metrics             0.1.4
x Lock5withR                      v lsr                 0.5.2
x fivethirtyeightdata             v mosaic              1.9.2
x fivethirtyeight                 v supernova           3.0.0

1.0 — Approximate Time: 12-15 mins

1.0 — Comprehensive Data Exploration

Before analyzing anything, we need to understand what data we have. This dataset contains information about resumes sent to job postings and whether they received callbacks.

1.1 — Load the data and check its dimensions. How many rows (resumes) and columns (variables) are in this dataset? What does each row represent?

# Load data from the same folder as this notebook

df <- read.csv("labor_market_discrimination.csv")

# How many rows and columns?
dim(df)

# What does each row represent?
head(df, 3)

Sample Responses

  • Each row represents one resume sent to a job posting

  • There are 4,870 resumes and 63 variables in the dataset

  • Each row contains information about the resume characteristics, the job it was sent to, and the outcome (callback)

1.2 — Look at all the variable names. Which ones seem most relevant for studying callback rates? Try to categorize them: resume characteristics, job/context variables, and outcome variables.

# View all column names
names(df)
length(names(df))

# Print them in a more readable format
cat("Total variables:", length(names(df)), "\n\n")
Total variables: 63 

Sample Responses

Outcome variables:

  • call - whether resume received callback (main outcome)

Resume characteristics:

  • first_name, sex, race - name and demographics

  • education, years_exp, n_jobs - qualifications

  • honors, volunteer, military, computer_skills, special_skills - additional resume features

  • emp_holes, occup_specific, occup_broad - employment history details

Job/context variables:

  • city, kind, ad_id - job posting identifiers

  • exp_min_req, school_req, eoe - job requirements

  • Many variables starting with frac_ or l_ - neighborhood characteristics

  • Variables like manager, supervisor, etc. - job type indicators

1.3 — Use glimpse() to see the structure of the data. Which variables look categorical? Which look quantitative? What does call contain (0s and 1s)?

# Get a better view of the data structure
glimpse(df)
Rows: 4,870
Columns: 63
$ education            <int> 4, 3, 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 1, 4~
$ n_jobs               <int> 2, 3, 1, 4, 3, 2, 2, 4, 3, 2, 4, 4, 4, 2, 2, 3, 3~
$ years_exp            <int> 6, 6, 6, 6, 22, 6, 5, 21, 3, 6, 8, 8, 4, 4, 5, 4,~
$ honors               <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ volunteer            <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1~
$ military             <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ emp_holes            <int> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0~
$ occup_specific       <int> 17, 316, 19, 313, 313, 266, 13, 313, 316, 263, 37~
$ occup_broad          <int> 1, 6, 1, 5, 5, 4, 1, 5, 6, 4, 6, 6, 1, 1, 4, 6, 4~
$ work_in_school       <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1~
$ email                <int> 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1~
$ computer_skills      <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0~
$ special_skills       <int> 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0~
$ first_name           <chr> "Allison", "Kristen", "Lakisha", "Latonya", "Carr~
$ sex                  <chr> "f", "f", "f", "f", "f", "m", "f", "f", "f", "m",~
$ race                 <chr> "w", "w", "b", "b", "w", "w", "w", "b", "b", "b",~
$ h                    <int> 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1~
$ l                    <int> 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0~
$ call                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ city                 <chr> "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",~
$ kind                 <chr> "a", "a", "a", "a", "a", "s", "s", "a", "s", "s",~
$ ad_id                <int> 384, 384, 384, 384, 385, 386, 386, 385, 386, 386,~
$ frac_black           <dbl> 0.989359558, 0.080735669, 0.104300544, 0.33616510~
$ frac_white           <dbl> 0.005500220, 0.888373852, 0.837370217, 0.63736963~
$ l_med_hh_inc         <dbl> 9.527484, 10.408828, 10.466754, 10.431908, 9.8762~
$ frac_dropout         <dbl> 0.27415097, 0.23368677, 0.10133465, 0.10884786, 0~
$ frac_colp            <dbl> 0.03766151, 0.08728535, 0.59169549, 0.40657592, 0~
$ l_inc                <dbl> 8.706325, 9.532859, 10.540329, 10.412141, 8.72826~
$ col                  <int> 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1~
$ exp_min_req          <chr> "5", "5", "5", "5", "some", "", "", "some", "", "~
$ school_req           <chr> "", "", "", "", "", "", "", "", "", "", "", "", "~
$ eoe                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ parent_sales         <dbl> NA, NA, NA, NA, 9.4, 40.0, 40.0, 9.4, 40.0, 40.0,~
$ parent_emp           <dbl> NA, NA, NA, NA, 143, 135, 135, 143, 135, 135, 143~
$ branch_sales         <dbl> NA, NA, NA, NA, 9.4, 40.0, 40.0, 9.4, 40.0, 40.0,~
$ branch_emp           <dbl> NA, NA, NA, NA, 143, 135, 135, 143, 135, 135, 143~
$ fed                  <int> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA,~
$ frac_black_emp_zip   <dbl> NA, NA, NA, NA, 0.204763502, 0.008141458, 0.00814~
$ frac_white_emp_zip   <dbl> NA, NA, NA, NA, 0.7270460, 0.9734131, 0.9734131, ~
$ l_med_hh_inc_emp_zip <dbl> NA, NA, NA, NA, 10.619399, 11.137956, 11.137956, ~
$ frac_dropout_emp_zip <dbl> NA, NA, NA, NA, 0.07049314, 0.04795827, 0.0479582~
$ frac_colp_emp_zip    <dbl> NA, NA, NA, NA, 0.3699031, 0.4133062, 0.4133062, ~
$ l_inc_emp_zip        <dbl> NA, NA, NA, NA, 10.007352, 10.393723, 10.393723, ~
$ manager              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ supervisor           <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ secretary            <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0~
$ off_support          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ sales_rep            <int> 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1~
$ retail_sales         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ req                  <int> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0~
$ exp_req              <int> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0~
$ com_req              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0~
$ educ_req             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ comp_req             <int> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0~
$ org_req              <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0~
$ manuf                <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ trans_com            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ bank_real            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ trade                <int> 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1~
$ bus_service          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0~
$ oth_service          <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0~
$ miss_ind             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ ownership            <chr> "", "", "", "", "Nonprofit", "Private", "Private"~

Sample Responses

  • call contains 0s and 1s: 0 = no callback, 1 = callback

  • Categorical variables: first_name, sex, race, city, kind, occup_broad, etc.

  • Quantitative variables: education, years_exp, n_jobs, variables with frac_ (fractions), l_ (logged values)

1.4 — Let’s focus on the key variables we’ll need for our analysis: first_name, sex, race, and call (the outcome). We’ll also keep education and years_exp for later sections. Create a focused dataset with just these columns. What patterns do you notice in the first few rows?

# Create a focused dataset with only the key variables
df <- df %>% 
  select(first_name, sex, race, call, education, years_exp)

# Look at first few rows
head(df, 10)

# Check the dimensions
cat("\nFocused dataset dimensions:", dim(df)[1], "rows x", dim(df)[2], "columns\n")

# Check unique values for key categorical variables
cat("\nUnique values for sex:", unique(df$sex), "\n")
cat("Unique values for race:", unique(df$race), "\n")
cat("\nCallback outcomes:\n")
table(df$call)

Focused dataset dimensions: 4870 rows x 6 columns

Unique values for sex: f m 
Unique values for race: w b 

Callback outcomes:
   0    1 
4478  392 

Sample Responses

  • Reduced the dataset from 63 variables to just 6 key variables

  • Now df contains only: first_name, sex, race, call, education, years_exp

  • sex has values “f” and “m” (female, male)

  • race has values “w” and “b” (white, black)

  • call is mostly 0s (no callback), some 1s (callback)

  • First names vary - some may sound more associated with certain races

  • Education and experience levels vary across resumes

  • All future analysis will use this streamlined dataset

2.0 — Approximate Time: 5-7 mins

2.0 — Overall Callback Rate

Before comparing groups, let’s establish a baseline: what’s the overall callback rate?

2.1 — Compute the overall callback rate. What does this number represent in context?

mean(df$call, na.rm = TRUE)

Sample Responses

  • The overall callback rate is about 0.08, or 8% (392 of 4,870 resumes)

  • This means roughly 8 out of every 100 resumes sent received a callback

  • This is our baseline for comparing groups
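Because call is coded 0/1, mean() is just the proportion of callbacks. As a sanity check, we can rebuild the outcome vector from the counts shown earlier (4,478 resumes with no callback, 392 with a callback; the reconstruction itself is illustrative):

```r
# With a 0/1 outcome, the mean is the proportion of 1s.
# Rebuild the outcome vector from the callback counts reported above.
calls <- c(rep(0, 4478), rep(1, 392))
mean(calls)   # 392 / 4870, about 0.0805 -- roughly 8 callbacks per 100 resumes
```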

3.0 — Approximate Time: 10-12 mins

3.0 — Callback Rate by Gender

Let’s start with a simple comparison: do callback rates differ by gender?

3.1 — Prediction (no code): Which gender do you expect to have a higher callback rate? Why?

3.2 — Compute callback rates by sex. What is the observed difference (Female - Male)?

# Counts by gender and callback status
tally(~ call + sex, data = df)

# Proportions by group
df %>% group_by(sex) %>% summarize(callback_rate = mean(call, na.rm = TRUE))

# Calculate the difference
callback_by_sex <- df %>% group_by(sex) %>% summarize(callback_rate = mean(call, na.rm = TRUE))
callback_f <- callback_by_sex$callback_rate[callback_by_sex$sex == "f"]
callback_m <- callback_by_sex$callback_rate[callback_by_sex$sex == "m"]
cat("\nDifference (Female - Male):", callback_f - callback_m)
     sex
call     f    m
  0   3437 1041
  1    309   83

Difference (Female - Male): 0.008644571
# Visualize the callback rates by gender
gf_props(~ sex, fill = ~ factor(call), data = df) %>%
  gf_labs(x = "Gender", fill = "Callback", y = "Proportion", 
          title = "Callback Rates by Gender")
[Figure: Callback Rates by Gender]

3.3 — Interpret the difference in context. Avoid causal language—describe what you see in the data.

3.4 — Is this difference practically meaningful? What would make it more or less meaningful?

Sample Responses

  • Prediction: Answers vary - no strong prior expectation, or expect small/no difference

  • Rates: Female resumes had a callback rate of about 0.083 (8.3%), male resumes about 0.074 (7.4%), a difference of about 0.009

  • Interpretation: In this dataset, resumes with female names had a callback rate about 0.9 percentage points higher than resumes with male names

  • Practical: Depends on volume of applications, stakes for individuals, fairness standards. Small differences matter when scaled up to many applications.
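The observed rates can be recovered by hand from the tally counts shown earlier (309 of 3,746 female-name resumes and 83 of 1,124 male-name resumes received callbacks); this small sketch converts the gap into percentage points:

```r
# Callback rates computed from the raw counts in the tally output
rate_f <- 309 / (3437 + 309)   # female-name resumes
rate_m <- 83 / (1041 + 83)     # male-name resumes

# Express the difference in percentage points rather than a proportion
diff_pp <- (rate_f - rate_m) * 100
round(c(female = rate_f, male = rate_m, diff_pp = diff_pp), 4)
# female 0.0825, male 0.0738, diff_pp 0.8645
```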

4.0 — Approximate Time: 15-18 mins

4.0 — Names, Race, and Callbacks

This is the core research question: Do names signal race, and does this affect callbacks?

4.1 — Explore the first names in the dataset. Look at some names associated with White applicants and some associated with Black applicants. Which names seem clearly associated with each group?

# Look at names by race
names_by_race <- df %>% select(first_name, race) %>% distinct() %>% arrange(race, first_name)

# White-sounding names
cat("Sample of White-sounding names:\n")
head(names_by_race %>% filter(race == "w") %>% pull(first_name), 15)

cat("\n\nSample of Black-sounding names:\n")
head(names_by_race %>% filter(race == "b") %>% pull(first_name), 15)
Sample of White-sounding names:


Sample of Black-sounding names:

Sample Responses

  • White-sounding names: Emily, Greg, Allison, Kristen, Carrie, Geoffrey, Matthew, Todd, etc.

  • Black-sounding names: Lakisha, Jamal, Latonya, Kenya, Aisha, Tyrone, Leroy, Tamika, Keisha, etc.

  • Names do appear to be clearly associated with different racial groups

4.2 — Check the relationship between names and the race variable. Are names a good indicator of race in this dataset?

# Cross-tabulation of first_name and race
# Let's look at a few examples
df %>% 
  select(first_name, race) %>% 
  distinct() %>% 
  arrange(race, first_name) %>% 
  head(20)

# Check if names map perfectly to race
name_race_check <- df %>% 
  select(first_name, race) %>% 
  distinct() %>% 
  group_by(first_name) %>% 
  summarize(n_races = n_distinct(race), races = paste(unique(race), collapse = ", "))

# Are there any names that appear in both groups?
name_race_check %>% filter(n_races > 1)

4.3 — Research question: Are names indicative of race? Based on what you see, can employers infer race from names?

4.4 — Now compute callback rates by race. What is the difference?

# Counts by race and callback status
tally(~ call + race, data = df)

# Proportions by group
callback_by_race <- df %>% group_by(race) %>% summarize(callback_rate = mean(call, na.rm = TRUE))
callback_by_race

# Calculate the difference
callback_w <- callback_by_race$callback_rate[callback_by_race$race == "w"]
callback_b <- callback_by_race$callback_rate[callback_by_race$race == "b"]
cat("\nDifference (White - Black):", callback_w - callback_b)
     race
call     b    w
  0   2278 2200
  1    157  235

Difference (White - Black): 0.03203285
# Visualize callback rates by race
gf_props(~ race, fill = ~ factor(call), data = df) %>%
  gf_labs(x = "Race (name-associated)", fill = "Callback", y = "Proportion",
          title = "Callback Rates by Race")
[Figure: Callback Rates by Race]

4.5 — Interpret the difference descriptively (avoid causal language). What do you observe?

4.6 — What cautions should we keep in mind when interpreting these differences? What other factors might be at play?

Sample Responses

  • Names appear to be strong indicators of race in this dataset

  • White-sounding names received a callback rate of about 0.097 (9.7%), Black-sounding names about 0.064 (6.4%), a difference of about 0.032

  • Interpretation: In this dataset, resumes with White-sounding names had a callback rate about 3.2 percentage points higher than resumes with Black-sounding names

  • Cautions:

    • Names proxy race imperfectly (not all people with these names are of that race)

    • Other resume characteristics might differ between groups (though study design tried to control this)

    • Different job types might have different patterns

    • Although names were randomly assigned in the original experiment (which supports causal claims about name effects), our descriptive comparison here does not by itself establish causation

    • Ethics: These are real people’s experiences with potential discrimination

5.0 — Approximate Time: 8-10 mins

5.0 — What Makes a Resume Successful?

Since we’re interested in creating effective resumes, let’s explore what resume characteristics relate to callbacks.

5.1 — Explore how education level relates to callback rates. Does higher education predict more callbacks?

5.2 — What about years of experience? Do more experienced applicants get more callbacks?

# Education and callbacks
education_callbacks <- df %>% 
  group_by(education) %>% 
  summarize(
    n = n(),
    callback_rate = mean(call, na.rm = TRUE)
  ) %>%
  arrange(education)

education_callbacks

# Years of experience and callbacks
# Create experience groups for easier comparison
df <- df %>% 
  mutate(
    exp_group = case_when(
      years_exp <= 2 ~ "0-2 years",
      years_exp <= 5 ~ "3-5 years",
      years_exp <= 10 ~ "6-10 years",
      TRUE ~ "10+ years"
    ),
    # Order the groups explicitly so tables and plots don't sort alphabetically
    exp_group = factor(exp_group, levels = c("0-2 years", "3-5 years", "6-10 years", "10+ years"))
  )

exp_callbacks <- df %>% 
  group_by(exp_group) %>% 
  summarize(
    n = n(),
    callback_rate = mean(call, na.rm = TRUE)
  )

exp_callbacks
# Visualize education and callbacks
gf_props(~ factor(education), fill = ~ factor(call), data = df) %>%
  gf_labs(x = "Education Level", fill = "Callback", y = "Proportion",
          title = "Callback Rates by Education Level")

# Visualize experience and callbacks
gf_props(~ exp_group, fill = ~ factor(call), data = df) %>%
  gf_labs(x = "Years of Experience", fill = "Callback", y = "Proportion",
          title = "Callback Rates by Experience")
[Figure: Callback Rates by Education Level]
[Figure: Callback Rates by Experience]

5.3 — Based on what you’ve explored, what would you focus on if you were creating a resume? What seems most important?

5.4 — How do these resume characteristics compare to the differences we saw by race and gender?

Sample Responses

  • Education: [Describe patterns - higher education may/may not show clear pattern]

  • Experience: [Describe patterns - more experience may relate to callbacks]

  • Resume focus: Education, relevant experience, skills, clear formatting

  • Comparison: Race/gender differences might be as large or larger than education/experience differences, highlighting the importance of understanding discrimination patterns

5.5 — Creating Effective Resumes Today

Since this study was conducted in 2004, job applications have moved almost entirely online. Most companies now use Applicant Tracking Systems (ATS), software that scans and filters resumes based on keywords, formatting, and qualifications before a human ever sees them. Many also use AI to evaluate resumes beyond basic keyword matching. While this study showed hiring bias exists, building strong skills and networking significantly improve your chances of success.
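To make the keyword idea concrete, here is a toy illustration of keyword matching (purely hypothetical; real ATS software is more sophisticated, and these keywords and resume text are invented for illustration):

```r
# Hypothetical keywords pulled from a job description
job_keywords <- c("excel", "customer service", "scheduling", "crm")

# A resume summary, lowercased so matching is case-insensitive
resume_text <- tolower("Managed scheduling and CRM updates; advanced Excel reporting.")

# Which keywords appear verbatim in the resume text?
matched <- sapply(job_keywords, function(kw) grepl(kw, resume_text, fixed = TRUE))
matched
sum(matched) / length(job_keywords)   # fraction of keywords covered (3/4 here)
```

Even this crude matcher shows why echoing the job description's own phrasing can matter when software screens resumes before a human does.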

Resume Basics

  • Use simple fonts and standard headings; avoid tables and images

  • Match keywords from the job description to your resume

  • Build skills through free resources: Coursera, LinkedIn Learning, YouTube, Khan Academy

Some Tools to Streamline Applications

  • Simplify: Auto-fills applications and tracks progress

  • Teal: Resume builder with ATS optimization

  • Huntr: Job search tracker

Beyond the Resume: Networking

Connections often matter as much as applications. Attend career fairs, reach out to alumni in your field, connect with professionals on LinkedIn, request informational interviews, and build strong references with professors and supervisors.

Wrap-Up — Approximate Time: 3-5 mins

Summary and Implications

  • What did we learn about callback rates by gender and race?

  • What are the limitations of this study design?

  • What are the implications for:

    • Creating effective resumes?

    • Understanding discrimination in hiring?

    • The broader question of fairness in employment?

Sample Responses

  • Callback rates differed by gender and race, with notable differences for race

  • Names are strong indicators of race in this dataset

  • Limitations: Random assignment of names supports causal claims about name effects, but the results cover only two cities, certain job types, and one time period

  • Resume implications: Focus on qualifications, but also be aware that factors beyond your control may affect outcomes

  • Discrimination: Patterns suggest potential discrimination in hiring processes

  • Fairness: Raises important questions about how to ensure fair employment practices


Optional Advanced Sections

The sections below use more advanced statistical concepts. You can skip these if you’re focusing on descriptive analysis, or include them if you want to explore inference and modeling.

Total additional time: ~30-45 mins

A1.0 — Optional Advanced: Hypotheses and Models — Approximate Time: 5-7 mins

Prerequisite: Chapter 01 or 08 concepts (DGP, distribution triad)

Models and the Data Generating Process (DGP)

We can think about two competing models for what might have generated our data:

  • Empty model: Data = overall mean callback rate + random error (gender/race unrelated to callbacks)

  • Explanatory model: Data = mean + effect of gender/race + error (gender/race related to callbacks)

We use the distribution triad to evaluate these models:

  • The DGP (the process that generated the data - empty or explanatory model?)

  • The Sample Distribution (our observed data - the resumes that were sent)

  • The Sampling Distribution (what we’d expect to see if we repeated the process many times)

A1.1 — In words, what does the empty model say about gender and callbacks?

A1.2 — In words, what does the explanatory model say, and how does it differ from the empty model?

Sample Responses

  • Empty model: Gender does not affect expected callback probability; any differences are just random variation

  • Explanatory model: Expected callback probability differs by gender; there’s a systematic effect
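These two competing models can be written down directly in R. Below is a minimal sketch using a tiny made-up data frame so it runs standalone; with the real data you would pass df from section 1.4 instead:

```r
# Toy data: four female-name and four male-name resumes (hypothetical values)
toy <- data.frame(
  sex  = c("f", "f", "f", "f", "m", "m", "m", "m"),
  call = c(1, 1, 0, 0, 0, 0, 0, 1)
)

# Empty model: one overall mean callback rate, regardless of sex
empty_model <- lm(call ~ NULL, data = toy)

# Explanatory model: a (possibly) different mean callback rate for each sex
sex_model <- lm(call ~ sex, data = toy)

coef(empty_model)  # (Intercept) = overall rate (3/8 = 0.375 here)
coef(sex_model)    # (Intercept) = rate for "f" (0.5); sexm = difference m - f (-0.25)
```

With the real data, lm(call ~ NULL, data = df) and lm(call ~ sex, data = df) give the same empty-vs-explanatory comparison in CourseKata notation.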

A2.0 — Optional Advanced: Shuffle Test for Gender — Approximate Time: 8-10 mins

Prerequisite: Chapter 04 or 08 (shuffle tests, null distributions)

A2.0 — Shuffle Test for Gender

A2.1 — If gender were truly unrelated to callbacks (empty model), what would the difference in callback rates look like across many random samples?

A2.2 — Use the shuffle test: randomly shuffle the call labels many times, calculate the difference each time, and see where our observed difference falls in this null distribution.

# Observed difference (Female - Male)
obs_diff_gender <- with(df, mean(call[sex == "f"], na.rm = TRUE) - mean(call[sex == "m"], na.rm = TRUE))
cat("Observed difference (Female - Male):", obs_diff_gender, "\n\n")

# Shuffle test: Generate null distribution
set.seed(123)
n_sims <- 2000
sim_diffs_gender <- replicate(n_sims, {
  perm <- sample(df$call)
  mean(perm[df$sex == "f"], na.rm = TRUE) - mean(perm[df$sex == "m"], na.rm = TRUE)
})

# Visualize the null distribution
gf_histogram(~ sim_diffs_gender, bins = 30) %>%
  gf_vline(xintercept = ~ obs_diff_gender, color = "red", linewidth = 1.5) %>%
  gf_labs(x = "Difference in Callback Rate (Female - Male)",
          y = "Frequency",
          title = "Null Distribution: What if Gender Had No Effect?")

# Calculate how often we see a difference as extreme or more extreme
p_estimate <- mean(abs(sim_diffs_gender) >= abs(obs_diff_gender))
cat("\nProportion of simulations with |difference| >= |observed|:", p_estimate)
Observed difference (Female - Male): 0.008644571 


Proportion of simulations with |difference| >= |observed|: 0.3865
[Figure: Null Distribution for the Gender Shuffle Test]

A2.3 — Where does the observed difference fall relative to the null distribution? What does this suggest about which model (empty vs. explanatory) is more plausible?

A2.4 — Using the distribution triad, explain what we’ve learned: What does the null distribution tell us about the DGP, sample, and sampling distribution?

Sample Responses

  • Under empty model, differences should center around 0 with variation

  • Observed difference is [in the tail/center] of the null distribution

  • If in tail: empty model less plausible; if central: empty model plausible

  • Distribution triad: Null distribution shows what the sampling distribution would look like if empty model (DGP) were true. Our observed sample difference is [consistent/inconsistent] with this.

A3.0 — Optional Advanced: Shuffle Test for Race — Approximate Time: 8-10 mins

Prerequisite: Chapter 04 or 08 (shuffle tests)

A3.0 — Shuffle Test for Race

A3.1 — Now perform the same shuffle test for race. Generate a null distribution assuming race has no effect, and compare the observed race difference to this distribution.

# Observed difference (White - Black)
obs_diff_race <- with(df, mean(call[race == "w"], na.rm = TRUE) - mean(call[race == "b"], na.rm = TRUE))
cat("Observed difference (White - Black):", obs_diff_race, "\n\n")

# Shuffle test: Generate null distribution
set.seed(456)
n_sims <- 2000
sim_diffs_race <- replicate(n_sims, {
  perm <- sample(df$call)
  mean(perm[df$race == "w"], na.rm = TRUE) - mean(perm[df$race == "b"], na.rm = TRUE)
})

# Visualize the null distribution
gf_histogram(~ sim_diffs_race, bins = 30) %>%
  gf_vline(xintercept = ~ obs_diff_race, color = "red", linewidth = 1.5) %>%
  gf_labs(x = "Difference in Callback Rate (White - Black)",
          y = "Frequency",
          title = "Null Distribution: What if Race Had No Effect?")

# Calculate how often we see a difference as extreme or more extreme
p_estimate_race <- mean(abs(sim_diffs_race) >= abs(obs_diff_race))
cat("\nProportion of simulations with |difference| >= |observed|:", p_estimate_race)
Observed difference (White - Black): 0.03203285 


Proportion of simulations with |difference| >= |observed|: 0
[Figure: Null Distribution for the Race Shuffle Test]

A3.2 — Compare the results for gender and race. Which difference appears more extreme relative to chance? What does this suggest?

A3.3 — Multiple comparisons consideration: We’ve now tested both gender and race. Why might this matter when interpreting our results?

Sample Responses

  • The race difference is far more extreme relative to chance: none of the 2,000 shuffles produced a difference as large as the observed one, while roughly 39% matched or exceeded the observed gender difference

  • This suggests the race effect is stronger and more consistent than the gender effect

  • Multiple comparisons: When we test multiple hypotheses, we increase chance of finding “significant” results by chance alone. Need to be cautious about interpreting multiple tests.
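The multiple comparisons point can be made concrete: if each test uses a 5% cutoff, the chance of at least one false alarm under the empty model grows with the number of tests. A small sketch (assuming independent tests, which is a simplification):

```r
# Familywise false-alarm probability for k independent tests at level alpha
familywise <- function(k, alpha = 0.05) 1 - (1 - alpha)^k

round(familywise(1:5), 4)
# with just our two tests (gender and race), the chance of at least one
# false alarm by chance alone is already 1 - 0.95^2 = 0.0975
```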

A4.0 — Optional Advanced: Multiple Variable Analysis — Approximate Time: 12-15 mins

Prerequisite: Later chapters (group models, interactions, faceted plots)

A4.0 — Multiple Variable Analysis

A4.1 — Does the race effect vary by education level? For example, is the callback rate difference between White and Black applicants the same for college graduates as for those with less education?

A4.2 — Explore callback rates across combinations of race and education. Create a visualization that shows this relationship.

# Callback rates by race and education
race_edu_callbacks <- df %>% 
  group_by(race, education) %>% 
  summarize(
    n = n(),
    callback_rate = mean(call, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(race, education)

race_edu_callbacks

# Calculate differences within each education level
race_edu_diff <- race_edu_callbacks %>%
  pivot_wider(names_from = race, values_from = callback_rate) %>%
  mutate(difference = w - b)

race_edu_diff
# Faceted plot: callback rates by race, separated by education level
gf_props(~ race, fill = ~ factor(call), data = df) %>%
  gf_facet_grid(. ~ factor(education)) %>%
  gf_labs(x = "Race", fill = "Callback", y = "Proportion",
          title = "Callback Rates by Race, Across Education Levels")

# Alternative: Grouped bar chart showing rates directly
df %>%
  group_by(race, education) %>%
  summarize(callback_rate = mean(call, na.rm = TRUE), .groups = 'drop') %>%
  gf_col(callback_rate ~ factor(education), fill = ~ race, position = "dodge") %>%
  gf_labs(x = "Education Level", y = "Callback Rate", fill = "Race",
          title = "Callback Rates by Race and Education")
[Figure: Callback Rates by Race, Across Education Levels]
[Figure: Callback Rates by Race and Education]

A4.3 — What do you notice? Does the race difference appear consistent across education levels, or does it vary? What might this mean?

A4.4 — Why might it be important to look at multiple variables together rather than just one at a time?

Sample Responses

  • Race difference may [be similar/vary] across education levels

  • If consistent: effect is robust across qualifications

  • If varies: might suggest discrimination is more pronounced at certain education levels, or that other factors interact

  • Multiple variables: Helps understand whether effects are consistent across contexts, or if some groups are affected differently

A5.0 — Optional Advanced: Effect Size and Practical Significance — Approximate Time: 5-7 mins

A5.0 — Effect Size and Practical Significance

A5.1 — Calculate the callback rate difference in percentage points (not just proportions). For race, what is the difference?

A5.2 — If 100 resumes were sent for White-sounding names and 100 for Black-sounding names, how many more callbacks would White-sounding names receive on average?

# Calculate difference in percentage points
callback_by_race <- df %>% group_by(race) %>% summarize(callback_rate = mean(call, na.rm = TRUE))
diff_prop <- callback_by_race$callback_rate[callback_by_race$race == "w"] - 
             callback_by_race$callback_rate[callback_by_race$race == "b"]
diff_percent <- diff_prop * 100

cat("Difference in callback rate:", round(diff_prop, 4), "or", round(diff_percent, 2), "percentage points\n\n")

# Calculate practical impact: if 100 resumes sent for each group
white_rate <- callback_by_race$callback_rate[callback_by_race$race == "w"]
black_rate <- callback_by_race$callback_rate[callback_by_race$race == "b"]

cat("If 100 resumes sent for White-sounding names:", round(white_rate * 100, 1), "callbacks\n")
cat("If 100 resumes sent for Black-sounding names:", round(black_rate * 100, 1), "callbacks\n")
cat("Difference:", round((white_rate - black_rate) * 100, 1), "more callbacks for White-sounding names\n")

# Scale it up: if 1000 resumes sent
cat("\nIf 1000 resumes sent for each group:\n")
cat("White-sounding names:", round(white_rate * 1000, 0), "callbacks\n")
cat("Black-sounding names:", round(black_rate * 1000, 0), "callbacks\n")
cat("Difference:", round((white_rate - black_rate) * 1000, 0), "more callbacks for White-sounding names")
Difference in callback rate: 0.032 or 3.2 percentage points

If 100 resumes sent for White-sounding names: 9.7 callbacks
If 100 resumes sent for Black-sounding names: 6.4 callbacks
Difference: 3.2 more callbacks for White-sounding names

If 1000 resumes sent for each group:
White-sounding names: 97 callbacks
Black-sounding names: 64 callbacks
Difference: 32 more callbacks for White-sounding names

A5.3 — Is this difference “practically significant”? At what scale (10 resumes? 100? 1000?) does the difference become more meaningful?

A5.4 — Why is it important to think about both statistical significance (from shuffle tests) and practical significance (effect size)?

Sample Responses

  • The difference is about 3.2 percentage points

  • At a scale of 100 resumes per group: about 3 more callbacks for White-sounding names

  • Practical significance: Even “small” percentage point differences matter when scaled to many applicants. At individual level, each lost callback matters.

  • Both important: Statistical significance tells us pattern is unlikely due to chance; practical significance tells us if the effect size matters in real-world terms.