The Data Science Lifecycle

This notebook walks through one pass of the lifecycle with a concrete example. We will:

  1. Start with a question (e.g., who are the students? what are their majors?)

  2. Acquire data (load tables we need)

  3. Explore the data (inspect structure, peek at rows)

  4. Answer a question (e.g., how many students?)

The example is adapted from Data 100 and DSC 80.

The lifecycle at a glance

The data science process moves from question → data → analysis. The diagram below summarizes the main stages. In this notebook we do one pass: question, acquire data, explore, then answer one simple question.

Diagram of the data science lifecycle: question, data, analysis.

1. Starting with a question

Lifecycle stage: Question

Starting with a question: who are we studying, and what do we want to learn?

We start with a goal: learn something about the students in the course. Here are some simple questions we might ask:

  1. How many students do we have?

  2. What are their majors?

  3. What year are they?

  4. How did major enrollment change over time?

In this walkthrough we use a small dataset (names and majors) to illustrate the next steps.

2. Data acquisition and cleaning

Lifecycle stage: Data

Data acquisition and cleaning: getting and preparing the data.

To answer our questions we need data. Here we load two CSV files: one with majors and one with names. In later chapters you will learn more ways to collect and load data; for now we use pandas to read the files.

import pandas as pd

# Load data (paths relative to this notebook)
majors = pd.read_csv("data/majors-sp24.csv")
names = pd.read_csv("data/names-sp24.csv")
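To see what pd.read_csv returns without needing the course files, here is a minimal sketch that reads CSV text from memory. The column names and rows below are illustrative assumptions, not the contents of the real majors file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# Column names and values are assumptions made for this sketch.
csv_text = """Majors,Terms in Attendance
Computer Science BA,4
Data Science BA,6
Economics BA,8
"""

# pd.read_csv accepts a file path or a file-like object and returns a DataFrame.
majors_demo = pd.read_csv(io.StringIO(csv_text))

print(type(majors_demo).__name__)
print(majors_demo.shape)  # (number of rows, number of columns)
```

The first line of the text becomes the column names, and each later line becomes one row.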

3. Exploratory data analysis

Lifecycle stage: Explore

Exploratory data analysis: understanding the structure and content of the data.

Before answering our questions we need to understand the data: its structure, columns, and any obvious issues. Let’s peek at the data.

Peeking at the data

# First 20 rows of the majors table
majors.head(20)
# First 5 rows of the names table (default)
names.head()
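Besides head(), a few attributes summarize the structure of a table directly. The sketch below uses a small made-up table (the column names are assumptions) rather than the real course data.

```python
import pandas as pd

# Hypothetical stand-in for the majors table; names and values are assumptions.
majors_demo = pd.DataFrame({
    "Majors": ["Computer Science BA", "Data Science BA", "Economics BA"],
    "Terms in Attendance": [4, 6, 8],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = majors_demo.shape

# columns holds the column labels; dtypes gives the type stored in each column.
column_names = list(majors_demo.columns)
column_types = majors_demo.dtypes

print(n_rows, n_cols)
print(column_names)
print(column_types)
```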

We can see the structure: majors has columns like major and terms in attendance; names has a single name column. Notice that some names are capitalized and some are not; that is the kind of inconsistency we often clean before deeper analysis.
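One possible way to make the capitalization consistent is sketched below. The names and the lowercase column label "name" are made up for illustration, and str.title() is just one reasonable choice, not necessarily the cleaning step used later in the course.

```python
import pandas as pd

# Hypothetical names with inconsistent capitalization.
names_demo = pd.DataFrame({"name": ["alice", "Bob", "CAROL"]})

# str.title() capitalizes the first letter of each word and lowercases the rest.
names_demo["name_clean"] = names_demo["name"].str.title()

print(names_demo["name_clean"].tolist())  # ['Alice', 'Bob', 'Carol']
```

Note that str.title() also changes names with internal capitals (for example, "McDonald" becomes "Mcdonald"), so it is worth checking the result before relying on it.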

4. Answering a question

Lifecycle stage: Analysis

We asked: How many students do we have? Now we can answer using the data we loaded and explored.

# How many students? (each row in majors is one student)
len(majors)
1276
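The same approach extends to the second question from our list, "What are their majors?". The sketch below counts rows and then tallies majors on a small made-up table; the column name "Majors" and the values are assumptions, and the counts are not results from the real data.

```python
import pandas as pd

# Hypothetical table with one row per student.
majors_demo = pd.DataFrame({
    "Majors": [
        "Data Science BA",
        "Computer Science BA",
        "Data Science BA",
        "Economics BA",
        "Data Science BA",
    ]
})

# How many students? One row per student, so count the rows.
n_students = len(majors_demo)

# What are their majors? value_counts() tallies each distinct value,
# sorted from most to least common.
major_counts = majors_demo["Majors"].value_counts()

print(n_students)
print(major_counts)
```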

That completes one pass through the lifecycle: question (who are the students?) → data (load majors and names) → explore (peek at structure) → answer (how many?).