InvJNB: Why Are Soccer Stars Born in January?

Standalone Investigation Notebook
Use with CourseKata intro chapters or as standalone
Kernel: R

Summary of Notebook

Have you noticed that many professional soccer players seem to be born in January, February, or March? This investigation explores the “relative age effect”—a phenomenon where players born early in the year have advantages in youth sports due to age cutoffs. Students will examine birth month distributions of professional soccer players, compare observed patterns to expected uniform distributions, and explore how the effect varies across countries and positions.

Includes

  • Distribution analysis of birth months

  • Expected vs. observed frequency comparisons

  • Visualization of categorical distributions

  • Cross-group comparisons (countries, positions)

  • Optional: Quantifying the effect and historical trends

Approximate time to complete Notebook: 60-90 mins (depends on optional sections)

Core sections (1.0-5.0): 60-75 mins
With all optional advanced sections: 85-90 mins

Intro — Approximate Time: 3-5 mins

The Relative Age Effect in Soccer

If you look at professional soccer players, you might notice something surprising: many of them were born in January, February, or March. This isn’t a coincidence—it’s called the relative age effect.

The Problem:

  • Youth sports teams group players by birth year (e.g., all players born in 2005 play together)

  • A player born in January 2005 is nearly a full year older than a player born in December 2005

  • Older players in the same age group tend to be bigger, stronger, and more coordinated

  • They get more playing time, better coaching, and more opportunities

  • This advantage compounds over years, leading to overrepresentation of early-born players in professional leagues
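To put a number on the age gap described above, here is a minimal base-R sketch. The two birth dates are illustrative members of the 2005 age group, and the tryout date is a made-up value, not something taken from the dataset.

```r
# Two players in the same 2005 birth-year group
jan_born <- as.Date("2005-01-15")
dec_born <- as.Date("2005-12-20")

# Hypothetical tryout date (an assumption for illustration only)
tryout <- as.Date("2015-09-01")

# Age in years on a given date (365.25 days per year as an approximation)
age_years <- function(birth, on) as.numeric(on - birth) / 365.25

gap_days <- as.numeric(dec_born - jan_born)
age_jan  <- age_years(jan_born, tryout)
age_dec  <- age_years(dec_born, tryout)

cat("Gap in days:", gap_days, "\n")
cat("Ages at tryout:", round(age_jan, 2), "vs", round(age_dec, 2), "\n")
cat("Older player has lived", round(100 * (age_jan / age_dec - 1), 1), "% longer\n")
```

The same gap in days is a larger share of a young child's life than of a teenager's, which is one way to see why the advantage is largest at the youngest ages.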

Research Question: Are professional soccer players born disproportionately in the first few months of the year?

Study References:

  • Musch, J., & Grondin, S. (2001). Unequal competition as an impediment to personal development: A review of the relative age effect in sport. Developmental Review, 21(2), 147-167.

  • Helsen, W. F., et al. (2005). The relative age effect in youth soccer across Europe. Journal of Sports Sciences, 23(6), 629-636.

At the beginning of each notebook, load the packages you will use. Always run this first.

# Install coursekata if not already installed
if (!require("coursekata", quietly = TRUE)) {
    install.packages("coursekata", repos = "https://cloud.r-project.org")
}

suppressPackageStartupMessages({
    library(coursekata)
    library(dplyr)
    library(lubridate)
    library(tidyr)
})

1.0 — Approximate Time: 10-12 mins

1.0 — Data Exploration

Let’s start by loading and exploring the FIFA player data.

1.1 — Load the data and check its dimensions. How many players are in the dataset? What variables do we have?

# Load data from the same folder as this notebook
df <- read.csv("fifa_players.csv")

# Check dimensions
dim(df)

# Look at variable names
names(df)

# Look at first few rows
head(df[, c("name", "birth_date", "nationality", "positions")], 10)

Sample Responses

  • There are 17,954 players in the dataset

  • Key variables include: name, birth_date, nationality, positions, and many skill ratings

  • The birth_date is in format like “6/24/1987” (month/day/year)

1.2 — We need to extract the birth month from the birth_date. Create a new variable called birth_month that contains just the month (1-12, where 1 = January, 12 = December).

# Convert birth_date to date format and extract month
df <- df %>%
  mutate(
    birth_date_parsed = as.Date(birth_date, format = "%m/%d/%Y"),
    birth_month = month(birth_date_parsed)
  )

# Check a few examples
head(df[, c("name", "birth_date", "birth_month")], 10)

# Verify we have all 12 months
table(df$birth_month)
   1    2    3    4    5    6    7    8    9   10   11   12 
2002 2091 1815 1533 1526 1384 1396 1333 1403 1242 1151 1078 

Sample Responses

  • Successfully extracted birth_month as numbers 1-12

  • All 12 months are represented in the data

  • January = 1, February = 2, ..., December = 12
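One thing worth checking after this step: `as.Date()` returns `NA` without raising an error when a string does not match the format. The sketch below uses made-up date strings in the month/day/year format described above (the last entry is deliberately malformed) and shows one way to count parse failures. It uses base R only; `lubridate::month()` would give the same month numbers.

```r
# Made-up examples in month/day/year format; the last one is deliberately malformed
birth_date <- c("6/24/1987", "2/5/1985", "12/30/1992", "not recorded")

parsed <- as.Date(birth_date, format = "%m/%d/%Y")

# Base-R way to pull the month as an integer
birth_month <- as.integer(format(parsed, "%m"))
birth_month

# Count and inspect values that failed to parse
sum(is.na(parsed))
birth_date[is.na(parsed)]
```

If the count of `NA` values is not zero on the real data, those rows would silently drop out of the month table, so it is worth looking at them before moving on.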

2.0 — Approximate Time: 12-15 mins

2.0 — Birth Month Distribution

Now let’s examine the distribution of birth months among professional soccer players.

2.1 — Count how many players were born in each month using tally(). Which months have the most players? Which have the fewest?

# Count players by birth month
tally(~ birth_month, data = df)

# Create a summary table with month names
month_names <- c("January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December")

birth_month_counts <- df %>%
  group_by(birth_month) %>%
  summarize(count = n()) %>%
  mutate(month_name = month_names[birth_month]) %>%
  arrange(birth_month)

birth_month_counts
birth_month
   1    2    3    4    5    6    7    8    9   10   11   12 
2002 2091 1815 1533 1526 1384 1396 1333 1403 1242 1151 1078 

Sample Responses

  • February (2,091), January (2,002), and March (1,815) have the highest counts

  • December (1,078) and November (1,151) have the lowest counts

  • There’s a clear pattern: more players born in early months, fewer in late months

2.2 — Visualize the birth month distribution using gf_bar(). What pattern do you see?

# Create month name variable for better labels
df <- df %>%
  mutate(month_name = factor(month_names[birth_month], levels = month_names))

# Bar chart showing counts
gf_bar(~ month_name, data = df) %>%
  gf_labs(x = "Birth Month", y = "Number of Players",
          title = "Distribution of Birth Months Among Professional Soccer Players")
[Plot: Distribution of Birth Months Among Professional Soccer Players]

2.3 — Now create a bar chart using gf_props() to show proportions instead of counts. Does this change how you interpret the pattern?

# Bar chart showing proportions
gf_props(~ month_name, data = df) %>%
  gf_labs(x = "Birth Month", y = "Proportion of Players",
          title = "Proportion of Players by Birth Month")
[Plot: Proportion of Players by Birth Month]

Sample Responses

  • The pattern is the same in both visualizations—early months have higher bars

  • Proportions make it easier to see the relative differences (in this dataset January holds about 11.2% of players and December about 6.0%, a gap of roughly 5 percentage points)

  • The visual pattern clearly shows a decline from early months to late months
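For readers who want the proportions as numbers rather than bar heights, here is a small base-R sketch. The counts are copied from the `table(df$birth_month)` output earlier in this notebook; the vector name `counts` is just a label chosen for this example.

```r
# Observed counts per birth month, copied from the table() output in section 1.2
counts <- c(January = 2002, February = 2091, March = 1815, April = 1533,
            May = 1526, June = 1384, July = 1396, August = 1333,
            September = 1403, October = 1242, November = 1151, December = 1078)

# Convert counts to proportions of all players
props <- counts / sum(counts)
round(100 * props, 1)

# Percentage-point gap between January and December
round(100 * (props["January"] - props["December"]), 1)
```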

3.0 — Approximate Time: 15-18 mins

3.0 — Expected vs. Observed

If birth months were completely random (uniformly distributed), what would we expect to see?

3.1 — If birth months were uniformly distributed, how many players would we expect in each month? (Hint: total players divided by 12)

# Calculate expected count per month (uniform distribution)
total_players <- nrow(df)
expected_per_month <- total_players / 12

cat("Total players:", total_players, "\n")
cat("Expected players per month (if uniform):", round(expected_per_month, 2), "\n")

# Add expected values to our summary
birth_month_summary <- df %>%
  group_by(birth_month, month_name) %>%
  summarize(observed = n(), .groups = 'drop') %>%
  mutate(expected = expected_per_month,
         difference = observed - expected)

birth_month_summary
Total players: 17954 
Expected players per month (if uniform): 1496.17 

Sample Responses

  • Expected count per month: 17,954 / 12 ≈ 1,496.17

  • If birth months were random, each month should have approximately the same number of players

  • The difference column shows how far each month deviates from expected
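Dividing by 12 treats every month as equally long. A slightly refined baseline, sketched here as an optional extension rather than as part of the notebook's method, weights each month by its number of days. It assumes a 365-day year and ignores leap days and any seasonal pattern in births.

```r
# Days per month in a non-leap year (an assumption of this sketch)
month_days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(month_days) <- month.name

total_players <- 17954  # from nrow(df) in the notebook output

# Baseline 1: every month equally likely
expected_uniform <- rep(total_players / 12, 12)

# Baseline 2: every day equally likely, so longer months expect more players
expected_by_days <- total_players * month_days / sum(month_days)

round(data.frame(days = month_days, expected_uniform, expected_by_days), 1)
```

The two baselines differ by at most a few percent per month, which is small next to the observed differences, so the simpler total/12 calculation is a reasonable starting point.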

3.2 — Create a visualization that compares observed and expected counts. Which months are overrepresented (observed > expected)? Which are underrepresented (observed < expected)?

# Prepare data for comparison plot
comparison_data <- birth_month_summary %>%
  select(month_name, observed, expected) %>%
  pivot_longer(cols = c(observed, expected), 
               names_to = "type", 
               values_to = "count")

# Create comparison plot
gf_col(count ~ month_name, fill = ~ type, data = comparison_data, position = "dodge") %>%
  gf_labs(x = "Birth Month", y = "Number of Players", fill = "Type",
          title = "Observed vs. Expected Birth Month Distribution") %>%
  gf_hline(yintercept = ~ expected_per_month, linetype = "dashed", color = "gray")
[Plot: Observed vs. Expected Birth Month Distribution]

3.3 — Look at the differences (observed - expected). What’s the largest positive difference? The largest negative difference? What does this tell us?

# Show differences sorted
birth_month_summary %>%
  arrange(desc(difference)) %>%
  select(month_name, observed, expected, difference)

Sample Responses

  • Largest positive difference: February, about 595 players above expected (January is next at about 506 above)

  • Largest negative difference: December, about 418 players below expected (November is next at about 345 below)

  • This tells us that early months have many more players than expected by chance, while late months have fewer

  • The pattern is consistent with the relative age effect hypothesis
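One common way to summarize these differences in a single number is a chi-square goodness-of-fit test. This goes beyond what the notebook asks for and is offered only as a sketch. It uses the monthly counts printed earlier and base R's `chisq.test()`, with the uniform distribution as the null hypothesis.

```r
# Observed counts per month, copied from the notebook output
observed <- c(2002, 2091, 1815, 1533, 1526, 1384, 1396, 1333, 1403, 1242, 1151, 1078)
names(observed) <- month.name

# Expected counts if all 12 months were equally likely
expected <- rep(sum(observed) / 12, 12)

# Chi-square statistic by hand: sum of (observed - expected)^2 / expected
chi_sq_manual <- sum((observed - expected)^2 / expected)

# Same test with the built-in function
fit <- chisq.test(observed, p = rep(1 / 12, 12))

chi_sq_manual
fit$statistic
fit$p.value
```

A large statistic with a very small p-value says the counts are hard to reconcile with a uniform distribution. It does not by itself say why the counts differ.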

4.0 — Approximate Time: 10-12 mins

4.0 — The Relative Age Effect Explained

Now that we’ve seen the pattern, let’s understand why it happens.

4.1 — In youth soccer, players are typically grouped by birth year. If the cutoff is January 1st, who would be in the same age group: a player born January 15, 2005 and a player born December 20, 2005, or a player born January 15, 2005 and a player born January 15, 2006?

4.2 — Why does being older within the same age group give an advantage? Think about physical and cognitive development.

Sample Responses

  • January 15, 2005 and December 20, 2005 would be in the same age group (both born in 2005)

  • The January 2005 player is a little over 11 months older than the December 2005 player

  • Advantages of being older:

    • Physical: Bigger, stronger, faster, more coordinated

    • Cognitive: Better decision-making, game understanding

    • Social: More confidence, leadership opportunities

  • These advantages lead to: more playing time, better coaching attention, selection for elite teams

  • The advantage compounds over years, making it harder for late-born players to catch up

4.3 — Why is the effect especially strong in Italy? (Hint: Think about how Italy structures their youth leagues differently from other countries.)

Sample Responses

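  • Italy uses a calendar-year cutoff (January 1st), so the oldest players in each age group are the ones born in January

  • Any one-year age band allows an age gap of up to 12 months; a different cutoff (e.g., August 1st for a school year) does not remove the advantage, it shifts it to the months just after that cutoff

  • When players from countries with different cutoffs are pooled, their peaks fall in different months, which dilutes the January-March pattern

  • This is one reason the January-March pattern can look especially pronounced among Italian players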
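The cutoff dates mentioned in these responses can be made concrete with a small helper that converts a birth month into a position within the selection year. The function name `relative_month` and the August example are illustrative choices, not something defined elsewhere in this notebook.

```r
# Position of a birth month within the selection year, given the cutoff month.
# 1 = oldest in the age group, 12 = youngest.
relative_month <- function(birth_month, cutoff_month = 1) {
  ((birth_month - cutoff_month) %% 12) + 1
}

birth_month <- 1:12

data.frame(
  month      = month.abb,
  jan_cutoff = relative_month(birth_month, cutoff_month = 1),
  aug_cutoff = relative_month(birth_month, cutoff_month = 8)
)
```

With a January cutoff, January-born players are the oldest in their group. With an August cutoff, August-born players take that spot and July-born players become the youngest.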

5.0 — Approximate Time: 12-15 mins

5.0 — Comparing Across Groups

Does the relative age effect vary by country or position? Let’s find out.

5.1 — Compare the birth month distribution for Italian players versus players from other countries. Is the effect stronger in Italy?

# Create Italy vs. Other comparison
df <- df %>%
  mutate(country_group = ifelse(nationality == "Italy", "Italy", "Other Countries"))

# Count by country group and month
italy_comparison <- df %>%
  group_by(country_group, month_name, birth_month) %>%
  summarize(count = n(), .groups = 'drop') %>%
  group_by(country_group) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()

# Show summary
italy_comparison %>%
  group_by(country_group) %>%
  summarize(
    jan_mar = sum(proportion[birth_month %in% 1:3]),
    oct_dec = sum(proportion[birth_month %in% 10:12]),
    .groups = 'drop'
  )
# Visualize with faceted plot
gf_props(~ month_name, data = df) %>%
  gf_facet_grid(. ~ country_group) %>%
  gf_labs(x = "Birth Month", y = "Proportion of Players",
          title = "Birth Month Distribution: Italy vs. Other Countries")
[Plot: Birth Month Distribution: Italy vs. Other Countries]

5.2 — Does the effect vary by position? Compare goalkeepers, defenders, midfielders, and forwards. (Hint: You may need to extract the primary position from the positions variable.)

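# Extract primary position (first position listed)
# Assumes positions is a comma-separated list such as "CF,RW,ST"
df <- df %>%
  mutate(
    first_position = trimws(sub(",.*", "", positions)),
    primary_position = case_when(
      first_position == "GK" ~ "Goalkeeper",
      first_position %in% c("CB", "LB", "RB", "LWB", "RWB") ~ "Defender",
      first_position %in% c("CM", "CAM", "CDM", "LM", "RM") ~ "Midfielder",
      first_position %in% c("ST", "CF", "LW", "RW") ~ "Forward",
      TRUE ~ "Other"
    )
  )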

# Filter to main positions and visualize
df_main_positions <- df %>%
  filter(primary_position != "Other")

gf_props(~ month_name, data = df_main_positions) %>%
  gf_facet_grid(. ~ primary_position) %>%
  gf_labs(x = "Birth Month", y = "Proportion of Players",
          title = "Birth Month Distribution by Position")
[Plot: Birth Month Distribution by Position]
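A note on the position grouping: for players with several listed positions, matching a code anywhere in the string and matching only the first listed code can give different answers. The sketch below uses made-up position strings and assumes the codes are separated by commas, which should be checked against the actual `positions` column.

```r
# Made-up position strings; the comma separator is an assumption about the data format
positions <- c("GK", "CM,CB", "ST,LW", "LWB,LM")

# Approach 1: match a defender code anywhere in the string
is_defender_any <- grepl("CB|LB|RB|LWB|RWB", positions)

# Approach 2: keep only the first listed code, then match it exactly
first_position <- trimws(sub(",.*", "", positions))
is_defender_first <- first_position %in% c("CB", "LB", "RB", "LWB", "RWB")

data.frame(positions, first_position, is_defender_any, is_defender_first)
```

The second row is the interesting one: a player listed as "CM,CB" counts as a defender under the first approach but as a midfielder when only the first listed position is used.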

5.3 — What do you notice about the patterns across countries and positions? Are there any interesting differences?

Sample Responses

  • Italy typically shows a stronger effect (more pronounced early-month overrepresentation)

  • The effect appears across all positions, though may vary slightly

  • Goalkeepers might show a slightly different pattern (physical size matters more)

  • The pattern is consistent but not identical across groups, suggesting the relative age effect is robust but may interact with other factors

Wrap-Up — Approximate Time: 3-5 mins

Summary and Implications

  • What did we learn about birth month distributions in professional soccer?

  • Why does the relative age effect occur, and what are its consequences?

  • What are the implications for fairness in youth sports?

  • What potential solutions could address this issue?

Sample Responses

  • Professional soccer players are disproportionately born in early months (Jan-Mar) and underrepresented in late months (Oct-Dec)

  • The relative age effect occurs because age cutoffs in youth sports create advantages for older players within the same age group

  • Consequences: Late-born players get fewer opportunities, less development, and are less likely to reach professional levels

  • Implications: The system is unfair to late-born players, potentially missing talented athletes

  • Potential solutions:

    • Use different age cutoffs (e.g., mid-year instead of calendar year)

    • Create narrower age bands

    • Weight teams by birth month to ensure balanced competition

    • Increase awareness among coaches and parents


Optional Advanced Sections

The sections below explore additional aspects of the relative age effect.

Total additional time: ~20-30 mins

A1.0 — Optional Advanced: Quantifying the Effect — Approximate Time: 8-10 mins

A1.0 — Quantifying the Effect

A1.1 — Calculate what percentage of players were born in Q1 (January-March) versus Q4 (October-December). How much larger is Q1?

A1.2 — Calculate the ratio: (Q1 players) / (Q4 players). What does this ratio tell us?

# Calculate Q1 vs Q4
quarter_summary <- df %>%
  mutate(
    quarter = case_when(
      birth_month %in% 1:3 ~ "Q1 (Jan-Mar)",
      birth_month %in% 4:6 ~ "Q2 (Apr-Jun)",
      birth_month %in% 7:9 ~ "Q3 (Jul-Sep)",
      birth_month %in% 10:12 ~ "Q4 (Oct-Dec)"
    )
  ) %>%
  group_by(quarter) %>%
  summarize(
    count = n(),
    proportion = n() / nrow(df),
    .groups = 'drop'
  )

quarter_summary

# Calculate Q1/Q4 ratio
q1_count <- sum(df$birth_month %in% 1:3)
q4_count <- sum(df$birth_month %in% 10:12)
q1_q4_ratio <- q1_count / q4_count

cat("\nQ1 (Jan-Mar) players:", q1_count, "\n")
cat("Q4 (Oct-Dec) players:", q4_count, "\n")
cat("Q1/Q4 ratio:", round(q1_q4_ratio, 2), "\n")
cat("This means there are", round(q1_q4_ratio, 2), "times as many Q1 players as Q4 players\n")

Q1 (Jan-Mar) players: 5908 
Q4 (Oct-Dec) players: 3471 
Q1/Q4 ratio: 1.7 
This means there are 1.7 times as many Q1 players as Q4 players

Sample Responses

  • Q1 represents about 32.9% of players (5,908 of 17,954) and Q4 about 19.3% (3,471 of 17,954)

  • The Q1/Q4 ratio in this dataset is about 1.7, meaning roughly 70% more players were born in Q1 than in Q4

  • If birth months were uniform, we’d expect 25% in each quarter and a ratio of 1.0

  • The deviation from 1.0 quantifies the strength of the relative age effect
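The quarter shares and the ratio can also be checked directly from the monthly counts, without the full data frame. This is a base-R sketch using the counts printed earlier in the notebook; the variable names are chosen for this example only.

```r
# Monthly counts (January through December), copied from the notebook output
counts <- c(2002, 2091, 1815, 1533, 1526, 1384, 1396, 1333, 1403, 1242, 1151, 1078)

# Assign months 1-12 to quarters 1-4, then total the counts per quarter
quarter <- rep(1:4, each = 3)
quarter_counts <- tapply(counts, quarter, sum)
quarter_share  <- quarter_counts / sum(counts)

round(100 * quarter_share, 1)

# Ratio of Q1 to Q4 players (1.0 would mean no difference)
q1_q4_ratio <- quarter_counts[[1]] / quarter_counts[[4]]
round(q1_q4_ratio, 2)
```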

A2.0 — Optional Advanced: Historical Trends

A2.0 — Historical Trends

A2.1 — Extract the birth year from the data. Does the relative age effect vary by decade? Has awareness of the issue changed the pattern over time?

# Extract birth year and create decade groups
df <- df %>%
  mutate(
    birth_year = year(birth_date_parsed),
    birth_decade = (birth_year %/% 10) * 10
  )

# Calculate Q1 proportion by decade
decade_effect <- df %>%
  mutate(q1_born = birth_month %in% 1:3) %>%
  group_by(birth_decade) %>%
  summarize(
    total = n(),
    q1_count = sum(q1_born),
    q1_proportion = mean(q1_born),
    .groups = 'drop'
  ) %>%
  arrange(birth_decade)

decade_effect

# Visualize trend
gf_col(q1_proportion ~ factor(birth_decade), data = decade_effect) %>%
  gf_hline(yintercept = 0.25, linetype = "dashed", color = "red") %>%
  gf_labs(x = "Birth Decade", y = "Proportion Born in Q1 (Jan-Mar)",
          title = "Relative Age Effect Over Time",
          subtitle = "Red line shows expected 25% if uniform")
[Plot: Relative Age Effect Over Time]

Sample Responses

  • The effect may be consistent across decades, suggesting it’s a persistent structural issue

  • Or it may show slight changes if awareness has led to policy adjustments

  • The pattern typically remains above the 25% expected line, indicating the effect persists
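The decade grouping relies on integer division. A short sketch with made-up birth years shows what `%/%` does and why group sizes matter when reading the per-decade proportions.

```r
# Made-up birth years for illustration
birth_year <- c(1979, 1987, 1990, 1999, 2001)

# Integer division drops the last digit; multiplying by 10 restores the decade start
birth_decade <- (birth_year %/% 10) * 10
birth_decade

# Group sizes per decade: small groups give noisy proportions
table(birth_decade)
```

On the real data, the oldest and youngest decades are likely to contain far fewer players than the middle ones, so it is worth reading the `total` column of `decade_effect` before interpreting a high or low Q1 proportion.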

A3.0 — Optional Advanced: Other Sports Comparison — Approximate Time: 5-7 mins

A3.0 — Other Sports Comparison

A3.1 — The relative age effect has been documented in many sports. Research shows it’s particularly strong in sports where physical maturity matters. How might the effect differ between soccer (where skill and endurance matter) versus sports like ice hockey (where size and strength are more important)?

A3.2 — If you had access to NHL player data, what pattern would you expect to see? Would it be stronger or weaker than in soccer?

Sample Responses

  • The effect is found across many sports: soccer, hockey, baseball, tennis, etc.

  • It tends to be stronger in sports where physical maturity provides clear advantages

  • NHL might show a similar or even stronger pattern due to the importance of size and strength

  • The effect is a systemic issue in youth sports organization, not specific to one sport

  • Understanding this helps us see it’s not about talent distribution, but about how we structure competition