# Initialize Otter
import otter
grader = otter.Notebook("introduction.ipynb")
Introduction to Jupyter Notebooks¶
Welcome to a Jupyter Notebook! Notebooks are documents that support interactive computing in which code is interwoven with text, visualizations, and more.
The way notebooks are formatted encourages exploration, allowing users to iteratively update code and document the results. In use cases such as data exploration and communication, notebooks excel. Science (and computational work in general) has become quite sophisticated: models are built upon experiments that are conducted on large swaths of data, methods and results are abstracted away into symbols, and papers are full of technical jargon. A static document like a paper might not be sufficient to both effectively communicate a new discovery and allow someone else to discover it for themselves.
In this notebook, some more advanced topics are marked "optional". You can simply read over these sections; don't worry about fully understanding them unless you are really interested.
Learning Outcomes¶
Working through this notebook, you will learn about:
The history behind Jupyter notebooks and why they are used in computing
How Jupyter notebooks are structured and how to use them
Python fundamentals and working with tabular data
Note: This notebook introduces a number of Python concepts that will be new to some and review to others. Take a look at the entire notebook to gauge your familiarity with the content before getting started. Start early, and don't be discouraged if each section requires a different time commitment!
A Brief History¶
The Jupyter Notebook is an interactive computational environment that supports over 40 different programming languages. Fernando Perez, a professor in the Statistics department here at UC Berkeley, co-founded Project Jupyter in 2014.
Though the Jupyter Notebook interface has been around only about a decade, the first notebook interface, Mathematica, was released over 30 years ago in 1988.
Fun Etymology Fact!
Project Jupyter’s name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R ("ju" from "Julia", "pyt" from "Python", and "er" from "R"; all together you get "ju" + "pyt" + "er" = "jupyter"). The word notebook is an homage to Galileo’s notebooks in which he documents his discovery of the moons of Jupiter.
Why Use Notebooks?¶
Notebooks are used for literate programming, a programming paradigm introduced by Donald Knuth in 1984, in which code is accompanied by plain, explanatory language.
This approach to programming treats software as works of literature (Knuth, “Literate Programming”), helping readers build a strong conceptual map of what is happening in the code.
In addition to code and natural language, notebooks can include diagrams, visualizations, and rich media, making them useful in any discipline. They are also popular in education as a tool for engaging students at various skill levels with scaffolded and diverse lessons.
Notebook Structure¶
Cell Types¶
A notebook is composed of rectangular sections called cells. There are 2 kinds of cells: markdown and code.
A markdown cell, such as this one, contains text.
A code cell contains code. In this class, we’ll be using Python, but in other data science classes you might use languages such as Julia or R.
Running Cells¶
To “run” a code cell (i.e. tell the computer to perform the programmed instructions in the cell), select it and either:
Press Shift+Enter to run the cell and move to (select) the following cell.
Press Command/Control+Enter to run the cell but stay on the same cell. This can be used to re-run the same cell repeatedly.
Click the Run button in the toolbar at the top of the screen.
Results and Outputs of a Cell¶
When you run a cell, one of several things can happen, depending on the type and contents of the cell:
Running a markdown cell renders the text inside of it.
Running a code cell executes the code and displays the result below the cell.
This output may be text, a number, a visualization, or nothing at all, depending on the code!
If a code cell is running, you will see an asterisk (*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number in brackets will replace the asterisk and any output from the code will appear under the cell. This number goes up by one each time you run a code cell, telling you the order in which the code cells in your notebook have been run.
Let’s try it! Run the cell below to see the output. Feel free to play around with the code—try changing ‘World’ to your name, and re-run it multiple times to see how the number to the left increments.
print("Hello World!") # Run the cell by using one of the methods we mentioned above!
Comments¶
Notice the blue text that starts with a # in the code cell above. This is a comment. The leading # tells the computer to ignore whatever text follows it on that line. Comments help programmers organize their code and make it easier to interpret. Writing helpful comments is an essential habit when collaborating on a notebook.
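As a small illustration of the two common comment placements, here is a minimal sketch. The variable names `greeting` and `tagged` are made up for this example and do not appear elsewhere in the notebook.

```python
# A full-line comment: Python skips this entire line.
greeting = "Hello"  # An inline comment: everything after the # is ignored.

# A '#' inside quotes is part of the string, not the start of a comment.
tagged = greeting + " #notebook"
print(tagged)
```

Only the `#` characters outside of quotes start a comment, which is why the printed text still contains one.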
Editing the Notebook¶
You can change the text in a markdown cell by clicking it twice. Text in markdown cells is written in Markdown, a formatting language for plain text, so you may see some funky symbols should you try and edit a markdown cell we’ve already written. Once you’ve made changes to a markdown cell, you can exit editing mode by running the cell the same way you’d run a code cell. Try double-clicking this text to see what some markdown formatting looks like.
Manipulating Cells¶
Cells can be added or deleted anywhere in a notebook. You can add cells by pressing the plus sign icon in the menu bar, to the right of the save icon. This will add (by default) a code cell immediately below your current highlighted cell.
To convert a cell to markdown, you can press ‘Cell’ in the menu bar, select ‘Cell Type’, and finally pick the desired option. This works the other way around too!
To delete a cell, simply press the scissors icon in the menu bar. A common fear is deleting a cell that you needed -- but don’t worry! This can be undone using ‘Edit’ > ‘Undo Delete Cells’! If you accidentally delete content in a cell, you can use Ctrl + Z to undo.
Shortcuts¶
This section is optional.
Select a cell by clicking on the empty space to the left of the text (there will be a blue bar to the left of the cell at this point)
To add a cell below the selected one, press the b key (b for below)
To add a cell above the selected one, press the a key (a for above)
To delete a cell, press the d key twice (d for delete, twice to ensure the action)
To copy a cell, press the c key (c for copy)
To cut a cell, press the x key (same as the general cut text command)
To paste a cell, press the v key (same as the general paste text command)
To convert a cell to a markdown cell, press the m key (m for markdown)
To convert a cell to a code cell, press the y key
Saving and Loading¶
Your notebook will automatically save your text and code edits, as well as any results of your code cells. However, you can also manually save the notebook in its current state by using Ctrl + S, clicking the floppy disk icon in the toolbar at the top of the page, or by going to the ‘File’ menu and selecting ‘Save and Checkpoint’.
Next time you open your notebook, it will look the same as when you last saved it!
Python Basics¶
Python is a programming language—a way for us to communicate with the computer and give it instructions.
Just like any language, Python has a set vocabulary made up of words it can understand, and a syntax which provides the rules for how to structure our commands and give instructions.
Math¶
Python is a great language for math, as it is easy to understand, and looks very similar to what it would look like in a regular scientific calculator.
+ is the addition operator
- is the subtraction operator and can also act as a negative sign if next to a number (e.g., -2 vs - 2)
* is the multiplication operator
** is the exponentiation operator
/ is the division operator
() is the grouping operator
There are two types of numbers in Python: integers, also known as int (e.g., 4, 1000), and decimal numbers, which are referred to as float (e.g., 12.0, 3.1415).
When using the / operator, even if the result is a whole number, the result will be a float. For example, 10 / 5 returns 2.0, not 2.
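The short sketch below checks this behavior with the built-in `type` function. The name `quotient` is made up for this example, and the built-in `int` conversion goes slightly beyond what this section covers; it is shown only as one possible way to get a whole number back.

```python
quotient = 10 / 5             # division always produces a float

print(quotient)               # 2.0
print(type(quotient))         # <class 'float'>
print(type(4), type(12.0))    # <class 'int'> <class 'float'>

# int() converts a float to an integer by dropping the decimal part.
print(int(quotient))          # 2
```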
Let’s look at some examples of using these operators. As usual, feel free to play around with these cells or even add new ones to explore how these operations work!
3 / 4
1 + 3
2 ** 3
4 ** .5
(6 + 4) * 2 - 15
Strings¶
Strings are what we call words or text in Python. A string is surrounded by either single ('') or double ("") quotes. Here are some examples of strings:
"This is a string"
'This is too'
Errors¶
Errors in programming are common and to be expected! Don’t be afraid when you see an error, because more often than not the solution lies in the error message itself. Let’s see what an error looks like. Run the cell below to see the output.
print('This line is missing something.' # We are missing a closing parenthesis here!
The last line of the error output attempts to tell you what went wrong.
The syntax of a language is its structure, and this SyntaxError tells you that you have created an illegal structure. Specifically, the error message incomplete input lets you know that the computer expected a character in your code that wasn’t found. In this case, we forgot to add a closing parenthesis ) to the end of our print statement.
There’s a lot of terminology in programming languages, but you don’t need to know it all in order to program effectively. If the terms in an error message confuse you, copying the entire message and searching it online is a tried and true first step.
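For readers who are curious, the error can also be inspected from code instead of read off the screen. This is a minimal sketch that goes beyond what this notebook requires: the built-in `compile` function checks a string of source code without running it, and `try`/`except` catches the resulting `SyntaxError`. The names `broken_source` and `caught` are made up for this example. The exact wording of the message differs between Python versions, so no expected output is shown.

```python
# The same broken line as above, stored as a string (no closing parenthesis).
broken_source = "print('This line is missing something.'"

caught = None
try:
    # compile() parses the code without executing it.
    compile(broken_source, "<example>", "exec")
except SyntaxError as error:
    caught = error
    print(type(error).__name__)  # the kind of error
    print(error.msg)             # the human-readable message
```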
Variables¶
In this Jupyter Notebook you will be assigning data, figures, numbers, text, or other objects to variables. Variables are stored in a computer’s memory, and can be used over and over again in future calculations.
Sometimes, instead of trying to work with raw information all the time in a long calculation you will want to store it as a variable for easy access in future calculations. Check out how we can use variables to our advantage below!
The following are all valid variable names:
pants, pan_cakes, _, _no_fun, potato940, bowser_32, FOO, BaR, bAr
These are invalid names:
123, 1_fun, f@ke, fun time, fun_times!!, 00f00
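One way to check names like these yourself is the string method `isidentifier`, which reports whether a piece of text has the shape of a valid Python name. The sketch below is only an illustration; the list names `valid_candidates` and `invalid_candidates` are made up for this example.

```python
valid_candidates = ["pants", "pan_cakes", "_", "_no_fun", "potato940",
                    "bowser_32", "FOO", "BaR", "bAr"]
invalid_candidates = ["123", "1_fun", "f@ke", "fun time", "fun_times!!", "00f00"]

# Print each candidate next to True (valid shape) or False (invalid shape).
for name in valid_candidates + invalid_candidates:
    print(repr(name), name.isidentifier())
```

Note that `isidentifier` only checks the shape of a name. Reserved words such as `for` pass this check even though they cannot be used as variable names.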
Assignment Statements¶
We use assignment statements to create a variable:
x = 1 + 2 + 3 + 4
The first part of the statement is the name of the variable, in this case, x.
After the variable name, we write an equal sign (=).
On the right side of the equal sign, we give the variable a value. In this case, we assign x to be the result of adding 1 + 2 + 3 + 4.
x # just run this cell
You can also use previously assigned variables when assigning new ones, such as:
y = x * 2
y
Variable Scope¶
Variable scope is a relatively complex topic that we don’t need to be overly concerned with yet. That said, when a variable is used in the definition of another, we can encounter behavior we don’t expect if we aren’t careful. Run the following five cells in sequence.
x = 5 # Assigns `x` to the value 5
y = x * 2 # Assigns `y` to the result of `x` * 2
y # Returns the current value of `y`
x = 10 # Re-assigns `x` to a new value
y # Returns the current value of `y`
Why did the value of y stay the same after we changed the value of x?
If a variable that is used in the definition of another variable (x) changes, the cell containing the assignment of the outer variable (y) must be re-run to take the inner variable’s new value into account.
Try rerunning the second and third cells above where we assign y and return its output. Notice how the value of y only takes in the updated value of x after we re-run the cell where it is assigned!
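The same sequence can be condensed into a single cell. This is a minimal sketch; the names `y_before_rerun` and `y_after_rerun` are made up here to record the value of `y` at each stage.

```python
x = 5
y = x * 2              # y is computed once, from the value x has right now
x = 10                 # changing x afterwards does not touch y
y_before_rerun = y     # still based on the old x

y = x * 2              # re-running the assignment picks up the new x
y_after_rerun = y

print(y_before_rerun, y_after_rerun)  # 10 20
```

An assignment stores the result of the right-hand side at the moment it runs; it does not keep a live link to the variables used in the calculation.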
Variable Examples¶
Let’s look at a couple examples of when using variables can help us immensely!
Example 1: Seconds in a Year¶
Let’s say we want to find out how many seconds are in a year. We could calculate it directly as follows:
365 * 24 * 60 * 60
However, someone reading this may not understand what we are calculating or why, and we have no way to use the components or result of this calculation in other cells. Let’s see how we can improve this process using variables:
days = 365 # The days in a year
hours = 24 # The hours in a day
minutes = 60 # The minutes in an hour
seconds = 60 # The seconds in a minute
seconds_per_year = days * hours * minutes * seconds # The seconds in a year
seconds_per_year
While lengthier, this method is far easier to understand, and we can use our new variable seconds_per_year to answer other questions!
Say we wanted to find the number of seconds in half a year, 7 years, 234 years, or even 3.1415 years. Without variables, calculating the number of seconds in each time period would be tedious and repetitive. With variables, it’s much easier:
print("Seconds in half a year:", seconds_per_year / 2)
print("Seconds in seven years:", seconds_per_year * 7)
print("Seconds in two hundred and thirty-four years:", seconds_per_year * 234)
print("Seconds in 3.1415 years:", seconds_per_year * 3.1415)
Example 2: Mitosis Mania¶
Mitosis is a process that copies and separates chromosomes in a cell to create two identical daughter cells. If mitosis happens once per hour, the equation to represent the number of cells after x hours is:
$$y = 2^x + b$$
We can use variables to easily calculate the number of cells y, and we can set x (the number of hours) and b (the number of cells we started with).
b = 5 # The intercept
x = 4 # Try changing this value to see how the output changes!
y = 2**x + b
print(f"On the line y = 2^{x} + {b}, at x = {x}, y is equal to {y}")
Lists¶
The value of a variable can take on any number of types—it isn’t limited to being an integer, float, or string.
One such type is a list, which can be used to store multiple values of any type. Run the cells below.
list_of_integers = [4,9,16]
list_of_integers
mixed_list = ["string", 2]
mixed_list
Loops¶
This section is advanced/optional.
That code above is repetitive. Instead of typing that out, we can use a for loop. The cell below does the exact same thing as the cell above, but it is shorter and more straightforward. Check out this documentation on for loops if you want to learn how this code works.
running_totals = [1]
for _ in range(6):
    running_totals = running_totals + [sum(running_totals)]
running_totals
Functions¶
We’ve seen that we can use variables to store and name values, but operations can also be named. A named operation is called a function. Python has some functions built into it.
round # A built-in function.
Functions get used in call expressions, where a function is named and given values to operate on inside a set of parentheses. The round function returns the number it was given, rounded to the nearest whole number.
round(1988.74699) # A call expression using the `round` function
The values a function is called on are called arguments, and each function places limitations on the number and type of arguments it can be called on.
For instance, the min (minimum) function takes as many integers or floats as you’d like, separated by commas, or a single list, and returns the smallest value.
min(9, -34, 0, 99)
min([9, -34, 0, 99])
User-Defined Functions¶
This section is advanced/optional
One of the most useful features in Python is the ability to define your own functions using a def statement. Here is an example of one such function based on our earlier example of mitosis:
def mitosis(x, b): # Returns the number of cells in `x` hours with `b` existing cells.
    return 2**x + b
Now we can use this function just like a built-in function!
mitosis
mitosis?
print("Cells in 2 hours:", mitosis(2,50))
print("Cells in 57 hours:", mitosis(57,50))
print("Cells in 6.022 hours:", mitosis(6.022,50))
Practice¶
The abs function takes one argument (just like round)
The max function takes one or more arguments (just like min)
Try calling abs and max in the cell below. What does each function do?
Also try calling each function incorrectly, such as with the wrong number of arguments. What kinds of error messages do you see?
... # replace the "..." with calls to abs and max
Dot Notation¶
Python has a lot of built-in functions (that is, functions that are already named and defined in Python), but even more functions are stored in collections called modules. In the cell below, we import the math module so we can use it. As in a call like np.mean(), we access a module’s functions by typing the name of the module, then the name of the function we want from it, separated by a dot (.).
Note: If you type the name of a module, but can't remember the name of the function you're looking for, type a dot ., then press the Tab key to bring up an auto-complete menu to help you find the function you're looking for!
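If Tab completion is not available (for example, when running Python outside of Jupyter), the built-in `dir` function offers a rough equivalent by listing the names a module provides. The sketch below is only an illustration; the name `public_names` is made up for this example.

```python
import math

# dir() lists every name in the module; skip the ones starting with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]

print(public_names[:10])             # a first look at what the module offers
print("factorial" in public_names)   # True
print(math.sqrt(16))                 # 4.0, using dot notation to call a function
```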
import math
math.factorial(5) # A call expression with the factorial function from the math module
Double click to edit this markdown cell with your answer
Random numbers and sampling¶
Random sampling plays a key role in data science. The random module implements functions for random sampling and random number generation. For example, the cell below generates a random integer between 1 and 50, inclusive.
Note that any whole number between 1 and 50 has an equal probability of being selected; in other words, the sampling probabilities are uniform.
Try running this cell multiple times by holding Command (Mac)/Control (Windows) and pressing Enter repeatedly; notice how the output changes even though the code stays the same.
import random
random.randint(1,50)
Tables¶
In most data science contexts, when interacting with data you will be working with tables. In this section, we will cover how to examine and manipulate data using Python.
Tables are the fundamental way we organize and display data. Run the cell below to load a dataset. We’ll be working with this data in a future notebook. This data set provides access to counts of COVID-19 cases by CDCR institution and case status by day. It also provides testing volume data.
from datascience import *
prisons = Table.read_table("covid19dashboard.csv") # Here we see an assignment statement
prisons
This table is organized into columns, one for each category of information collected, and rows, each containing all the information collected about a particular instance of data. In this case, each row contains information about a different state prison (no repeats).
Every table has attributes that give information about the table, such as the number of rows and the number of columns. Attributes you’ll use frequently include num_rows and num_columns, which give the number of rows and columns in the table, respectively. These are accessed using dot notation. Because they are attributes rather than functions, we don’t follow them with parentheses as we did with print in our Hello World! example earlier.
prisons.num_columns # Get the number of columns
prisons.num_rows # Get the number of rows
Type your answer here, replacing this text.
In other situations, we will want to sort, filter, or group our data. To manipulate the data stored in a table, we will use various table functions. These will be explained as we encounter them, so as not to overwhelm you!
Now that you have a basic grasp on Python and the kinds of information we’ll be working with, we can move on to where our data came from and how to interact with it.
Notebooks in Practice¶
With proprietary software like Mathematica, users are supposed to trust the results returned and are unable to check the code. In contrast, Jupyter is open-source, which fosters transparency and encourages programmers to understand and reproduce the work of others.
Theodore Gray, the co-founder of Wolfram Research who was also involved in creating the Mathematica interface, said about Jupyter,
“I think what they have is acceptance from the scientific community as a tool that is considered to be universal.”
In other words, Jupyter Notebooks support the computational work of researchers from different fields, enabling researchers in very different domains to share research tools and methods and to learn from one another.
The versatility of the Notebook has important consequences for data science and the workflows that are involved when working with data in settings other than research, such as for education and community science projects. The process of working with data can be messy and nonlinear, which a Jupyter notebook handles well because of its flexibility.
The power of the notebook lies in its ability to include a variety of media with the computation as a means to maintain accountability, integrity, and transparency for both the author of the notebook and the audiences that you share your work with.
Type your answer here, replacing this text.
Congratulations on finishing Notebook 1!
The cell below generates a link to download your notebook as a zip file, which you can then submit.
Before downloading, run all cells and save the notebook using Command/Control + S or by clicking the save icon in the toolbar at the top of the notebook. This is very important to ensure that all of your work shows up in the downloaded file!
Submission¶
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!
# Save your notebook first, then run this cell to export your submission.
grader.export()