Legal Studies 190 - Data, Prediction, and Law
Welcome to our class! This introductory notebook reviews concepts that you may already be familiar with from Data 8 or similar courses. The basic strategies and tools for data analysis covered in this notebook will be the foundations of this class. It gives an overview of our software and some programming concepts.
Table of Contents
2 - Coding Concepts
1 - Python Basics
2 - Tables
Our Computing Environment, Jupyter notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.
Text cells
In a notebook, each rectangle containing text or code is called a cell.
Text cells (like this one) can be edited by double-clicking on them. They’re written in a simple format called Markdown to add formatting and section headings. You don’t need to learn Markdown, but you might want to.
After you edit a text cell, click the “run cell” button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)
Understanding Check 1 This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the “run cell” ▶| button. This sentence, for example, should be deleted. So should this one.
Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.
To run the code in a code cell, first click on that cell to activate it. It’ll be highlighted with a little green or blue rectangle. Next, either press ▶| or hold down the shift key and press return or enter.
Try running this cell:
print("Hello, World!")
And this one:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")
The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every print expression prints a line. Run the next cell and notice the order of the output.
print("First this line is printed,")
print("and then this one.")
Don’t be scared if you see a “Kernel Restarting” message! Your data and work will still be saved. Once you see “Kernel Ready” in a light blue box on the top right of the notebook, you’ll be ready to work again. You should rerun any cells with imports, variables, and loaded data.
Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you’ll need to create your own cells for text and code.
To add a cell, click the + button in the menu bar. It’ll start out as a text cell. You can change it to a code cell by clicking inside it so it’s highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing “Code”.
Errors
Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:
- The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
- The rules are rigid. If you’re proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.
Whenever you write code, you’ll make mistakes. When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.
Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on.
We have made an error in the next cell. Run it and see what happens.
print("This line is missing something."
You should see something like this (minus our annotations):
The last line of the error output attempts to tell you what went wrong. The syntax of a language is its structure, and this SyntaxError tells you that you have created an illegal structure. “EOF” means “end of file,” so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.
There’s a lot of terminology in programming languages, but you don’t need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. (Of course, if you’re frustrated, feel free to ask a friend or post on the class Piazza.)
Understanding Check 2 Try to fix the code above so that you can run the cell and see the intended message instead of an error.
Programming Concepts
Now that you are comfortable with our computing environment, we are going to be moving into more of the fundamentals of Python, but first, run the cell below to ensure all the libraries needed for this notebook are installed.
Part 1: Python basics
Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.
A. Expressions
The departure point for all programming is the concept of the expression. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions.
# Examples of expressions:
#addition
print(2 + 2)
#string concatenation
print('me' + ' and I')
#you can print a number with a string if you cast it
print("me" + str(2))
#exponents
print(12 ** 2)
You will notice that a cell automatically displays only the value of its last expression. If you want to see the values of earlier expressions, you need to call print on them. Try adding print statements to other expressions to get them to display.
Data Types
In Python, all things have a type. In the above example, you saw integers (positive and negative whole numbers) and strings (sequences of characters, often thought of as words or sentences). We denote strings by surrounding the desired value with quotes. For example, “Data Science” and “2017” are strings, while bears and 2020 (both without quotes) are not strings (bears without quotes would be interpreted as a variable). You’ll also be using decimal numbers in Python, which are called floats (positive and negative decimal numbers).
You’ll also often run into booleans. They can take on one of two values: True or False. Booleans are often used to check conditions; for example, we might have a list of dogs, and we want to sort them into small dogs and large dogs. One way we could accomplish this is to say either True or False for each dog after seeing if the dog weighs more than 15 pounds.
We’ll soon be going over additional data types. Below is a table that summarizes the information in this section:
Variable Type | Definition | Examples |
---|---|---|
Integer | Positive and negative whole numbers | 42 , -10 , 0 |
Float | Positive and negative decimal numbers | 73.9 , 2.4 , 0.0 |
String | Sequence of characters | "Go Bears!" , "variables" |
Boolean | True or false value | True , False |
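To connect the table to running code, here is a small sketch that asks Python for the type of a few of the example values using the built-in type function. The variable names dog_weight and is_large_dog are made up for illustration, following the dog example above.

```python
# Ask Python for the type of a few values from the table above.
print(type(42))           # an integer (int)
print(type(73.9))         # a float
print(type("Go Bears!"))  # a string (str)
print(type(True))         # a boolean (bool)

# Hypothetical weight in pounds, used only to illustrate the dog example.
dog_weight = 22

# A comparison produces a boolean: True if the dog weighs more than 15 pounds.
is_large_dog = dog_weight > 15
print(is_large_dog)
```

Changing dog_weight to a smaller number and rerunning the cell should flip the printed boolean.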
B. Variables
In the example below, a and b are Python objects known as variables. We are giving an object (in this case, an integer and a float, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. Variables are stored within the notebook’s environment, meaning stored variable values carry over from cell to cell.
a = 4
b = 10/5
Notice that when you create a variable, unlike what you previously saw with the expressions, it does not print anything out.
# Notice that 'a' retains its value.
print(a)
a + b
Question 1: Variables
See if you can write a series of expressions that creates two new variables called x and y and assigns them values of 10.5 and 7.2. Then assign their product to the variable combo and print it.
# Fill in the missing lines to complete the expressions.
x = ...
...
...
print(...)
Check to see if the value you get for combo is what you expect it to be.
C. Lists
The next topic is particularly useful in the kind of data manipulation that you will see throughout this class. The following few cells will introduce the concept of lists (and their counterpart, numpy arrays). Read through the following cell to understand the basic structure of a list.
A list is an ordered collection of objects. They allow us to store and access groups of variables and other objects for easy access and analysis. Check out this documentation for an in-depth look at the capabilities of lists.
To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list.
# an empty list
lst = []
print(lst)
# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)
To access a value in the list, put the index of the item you wish to access in brackets following the variable that stores the list. Lists in Python are zero-indexed, so the indices for lst are 0, 1, 2, 3, 4, 5, and 6.
# Elements are selected like this:
example = lst[2]
# The above line selects the 3rd element of lst (list indices are 0-offset) and sets it to a variable named example.
print(example)
It is important to note that when you store a list to a variable, you are actually storing the pointer to the list. That means if you assign your list to another variable, and you change the elements in your other variable, then you are changing the same data as in the original list.
a = [1,2,3] #original list
b = a #b now points to list a
b[0] = 4
print(a[0]) #prints 4 since we modified the first element of the list pointed to by both a and b
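If you want a second list that can change without affecting the first, one common approach is to build a copy with the built-in list function. The sketch below contrasts the two cases; the names original, alias, and independent_copy are chosen only for illustration.

```python
original = [1, 2, 3]

# A second name for the SAME list: changes through alias affect original.
alias = original

# A new list holding the same elements: changes here do not affect original.
independent_copy = list(original)

alias[0] = 4
independent_copy[1] = 99

print(original)          # the first element changed through alias
print(independent_copy)  # only the copy has the 99
```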
Slicing lists
As you can see from above, lists do not have to be made up of elements of the same kind. Indices do not have to be taken one at a time, either. Instead, we can take a slice of indices and return the elements at those indices as a separate list.
### This line will store the elements at indices 1 (inclusive) through 4 (exclusive) of lst, that is, its second through fourth elements, as a new list called lst_2:
lst_2 = lst[1:4]
lst_2
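A few more slices may help make the inclusive start and exclusive stop concrete. This is a small sketch on a made-up list called letters; leaving out the start or the stop means "from the beginning" or "to the end".

```python
# A hypothetical list used only to illustrate slicing.
letters = ['a', 'b', 'c', 'd', 'e', 'f']

middle = letters[1:4]       # indices 1, 2, 3
first_three = letters[:3]   # indices 0, 1, 2
from_index_3 = letters[3:]  # index 3 through the end

print(middle)
print(first_three)
print(from_index_3)
print(len(middle))          # a slice from 1 to 4 has 4 - 1 = 3 elements
```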
Question 2: Lists
Build a list of length 10 containing whatever elements you’d like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it.
### Fill in the ellipses to complete the question.
my_list = ...
my_list_sliced = my_list[...]
last_of_sliced = ...
print(...)
Lists can also be operated on with a few built-in analysis functions. These include min and max, among others. Lists can also be concatenated together. Find some examples below.
# A list containing six integers.
a_list = [1, 6, 4, 8, 13, 2]
# Another list containing six integers.
b_list = [4, 5, 2, 14, 9, 11]
print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))
# Concatenate a_list and b_list:
c_list = a_list + b_list
print('Concatenated:', c_list)
D. Numpy Arrays
Closely related to the concept of a list is the array, a nested sequence of elements that is structurally identical to a list. Arrays, however, can be operated on arithmetically with much more versatility than regular lists. For the purpose of later data manipulation, we’ll access arrays through Numpy, which will require an import statement.
Now run the next cell to import the numpy library into your notebook, and examine how numpy arrays can be used.
import numpy as np
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:')
print(example_array_2)
# Double the values in example_array and print the new array.
double_array = example_array*2
print('Doubled Array:')
print(double_array)
This behavior differs from that of a list. See below what happens if you multiply a list.
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2
Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. This is the reason that we will sometimes use Numpy over lists. Other mathematical operations have interesting behaviors with lists that you should explore on your own.
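As a starting point for that exploration, here is a short plain-Python sketch of what * and + do with a list. The name short_list is made up for illustration, and the try/except block is only there so the cell keeps running after the error.

```python
# A hypothetical short list used only for this illustration.
short_list = [1, 2, 3]

# Multiplying by a number repeats the list.
repeated = short_list * 2
print(repeated)

# Adding another list concatenates the two lists.
extended = short_list + [4]
print(extended)

# Adding a number to a list is not allowed and raises a TypeError.
raised_type_error = False
try:
    short_list + 1
except TypeError as error:
    raised_type_error = True
    print('TypeError:', error)
```

With a numpy array, by contrast, both multiplying by a number and adding a number are applied to every element.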
E. Looping
Loops are often useful in manipulating, iterating over, or transforming large lists and arrays. The first type we will discuss is the for loop. For loops are helpful in traversing a list and performing an action at each element. For example, the following code moves through every element in example_array, adds 5 to it, and appends the result to a new list.
new_list = []
for element in example_array:
    new_element = element + 5
    new_list.append(new_element)
new_list
The most important line in the above cell is the “for element in...” line. This statement sets the structure of our loop, instructing the machine to stop at every number in example_array, perform the indicated operations, and then move on. Once Python has stopped at every element in example_array, the loop is completed and the final line, which outputs new_list, is executed. It’s important to note that “element” is an arbitrary variable name used to represent whichever value the loop is currently operating on. We can change the variable name to whatever we want and achieve the same result, as long as we stay consistent. For example:
newer_list = []
for completely_arbitrary_name in example_array:
    newer_element = completely_arbitrary_name + 5
    newer_list.append(newer_element)
newer_list
For loops can also iterate over ranges of numerical values. If I wanted to alter example_array without copying it over to a new list, I would use a numerical iterator to access list indices rather than the elements themselves. This iterator, called i, would range from 0, the value of the first index, to 9, the value of the last. I can make sure of this by using the built-in range and len functions.
for i in range(len(example_array)):
    example_array[i] = example_array[i] + 5
example_array
Other types of loops
The while loop repeatedly performs operations until a conditional is no longer satisfied. A conditional is a boolean expression, that is, an expression that evaluates to True or False.
In the below example, an array of integers 0 to 9 is generated. When the program enters the while loop on the subsequent line, it notices that the maximum value of the array is less than 50. Because of this, it adds 1 to the fifth element, as instructed. Once the instructions embedded in the loop are complete, the program refers back to the conditional. Again, the maximum value is less than 50. This process repeats until the fifth element, now the maximum value of the array, is equal to 50, at which point the conditional is no longer true and the loop ends.
while_array = np.arange(10) # Generate our array of values
print('Before:', while_array)
while(max(while_array) < 50): # Set our conditional
    while_array[4] += 1 # Add 1 to the fifth element if the conditional is satisfied
print('After:', while_array)
Question 3: Loops
In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following:
- Iterate over the entire array, checking if each element is a multiple of 5
- If an element is not a multiple of 5, add 1 to it repeatedly until it is
- Iterate back over the list and print each element.
Hint: To check if an integer x is a multiple of y, use the modulus operator %. Typing x % y will return the remainder when x is divided by y. Therefore, (x % y != 0) will return True when y does not divide x, and False when it does.
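To see the hint in action, here is a quick sketch with two made-up numbers, one that is not a multiple of 5 and one that is. The variable names are only for illustration.

```python
# 12 divided by 5 leaves a remainder, so 12 is not a multiple of 5.
remainder_of_12 = 12 % 5
print(remainder_of_12)
print(12 % 5 != 0)   # the "not a multiple" check

# 50 divided by 5 leaves no remainder, so 50 is a multiple of 5.
remainder_of_50 = 50 % 5
print(remainder_of_50)
print(50 % 5 != 0)
```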
# Make use of iterators, range, length, while loops, and indices to complete this question.
question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])
for i in range(len(...)):
    while(...):
        question_3[i] = ...
for element in question_3:
    print(...)
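The following cell should return True if your code is correct.
answer = np.array([15, 35, 50, 0, 25, 30, 20, 105, 45, 15, 80])
# np.array_equal compares the two arrays as a whole; question_3 == answer would instead give one True/False per element.
np.array_equal(question_3, answer)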
F. Functions!
Functions are useful when you want to repeat a series of steps on multiple different objects, but don’t want to type out the steps over and over again. Many functions are built into Python already; for example, you’ve already made use of len() to retrieve the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.
Functions generally take a set of parameters (also called inputs), which define the objects they will use when they are run. For example, the len() function takes a list or array as its parameter, and returns the length of that list.
The following cell gives an example of an extremely simple function, called add_two, which takes as its parameter an integer and returns that integer with, you guessed it, 2 added to it.
# An adder function that adds 2 to the given n.
def add_two(n):
    return n + 2
add_two(5)
Easy enough, right? Let’s look at a function that takes two parameters, compares them somehow, and then returns a boolean value (True or False) depending on the comparison. The is_multiple function below takes as parameters an integer m and an integer n, checks if m is a multiple of n, and returns True if it is. Otherwise, it returns False.
if statements, just like while loops, are dependent on boolean expressions. If the conditional is True, then the following indented code block will be executed. If the conditional evaluates to False, then the code block will be skipped over. Read more about if statements here.
def is_multiple(m, n):
    if (m % n == 0):
        return True
    else:
        return False
is_multiple(12, 4)
is_multiple(12, 7)
Sidenote: Another way to write is_multiple is below. Think about why it works.
def is_multiple(m, n):
    return m % n == 0
Since functions are so easily replicable, we can include them in loops if we want. For instance, our is_multiple function can be used to check if a number is prime! See for yourself by testing some possible prime numbers in the cell below.
# Change possible_prime to any integer to test its primality
# NOTE: If you happen to stumble across a large (> 8 digits) prime number, the cell could take a very, very long time
# to run and will likely crash your kernel. Just click kernel>interrupt if it looks like it's caught.
possible_prime = 9999991
for i in range(2, possible_prime):
    if (is_multiple(possible_prime, i)):
        print(possible_prime, 'is not prime')
        break
    if (i >= possible_prime/2):
        print(possible_prime, 'is prime')
        break
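The same idea can be wrapped in a function of its own so that it returns a boolean instead of printing. The sketch below is one possible version, not part of the original lab: the name is_prime is made up, and it stops checking once the divisor squared exceeds the candidate (a different stopping rule from the possible_prime/2 check above, which means fewer iterations).

```python
def is_multiple(m, n):
    # True when n divides m evenly.
    return m % n == 0

def is_prime(candidate):
    # Hypothetical helper: returns True if candidate is prime, False otherwise.
    if candidate < 2:
        return False
    for divisor in range(2, candidate):
        # Once divisor * divisor exceeds candidate, no larger divisor can work.
        if divisor * divisor > candidate:
            break
        if is_multiple(candidate, divisor):
            return False
    return True

for number in [1, 2, 9, 97]:
    print(number, is_prime(number))
```

Because the function returns True or False, it can be used directly as the conditional of an if statement or a while loop.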
Part 2: Tables
We will be using datascience tables for much of this class to organize and sort through tabular data. datascience is a library that was developed here at Berkeley and is used for manipulating tabular data. It has a user-friendly API, and can be used to answer difficult questions in relatively few commands. Like we did with numpy, we will have to import datascience.
from datascience import *
Creating Tables
When dealing with a collection of things with multiple attributes, it can be useful to put the data in a table. Tables are a nice way of organizing data in a 2-dimensional data set. For example, take a look at the table below.
Table.read_table('../data/incident/36828-0004-Data.tsv', delimiter='\t')
This table is from the Incident Record-Type File of the NCVS. See page 31 of the codebook (on bCourses) for a description of the survey. To create this table, we have drawn the data from the path ../data/incident, stored in a file called 36828-0004-Data.tsv. In general, to import data from a .csv file, we write Table.read_table("file_name"). Values in a .csv file are separated by commas, and this is the format typically used with the datascience package. In this case, our data is stored as a .tsv file, so values are separated by tabs, and we must indicate that when reading in the data with the optional parameter delimiter.
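To see why the delimiter matters, here is a small sketch that uses Python's built-in csv module rather than the datascience library. The fruit data is made up for illustration, and the text is read from strings instead of files so the cell is self-contained.

```python
import csv
import io

# The same hypothetical data, written once with commas and once with tabs.
csv_text = "fruit,price\nApple,1\nOrange,0.75\n"
tsv_text = "fruit\tprice\nApple\t1\nOrange\t0.75\n"

# Commas are the default separator; tabs must be requested explicitly.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter='\t'))
print(csv_rows == tsv_rows)

# Reading tab-separated text with the default separator leaves each row as one unsplit value.
unsplit_rows = list(csv.reader(io.StringIO(tsv_text)))
print(unsplit_rows[0])
```

The delimiter parameter plays the same role in the read_table call above: it tells the reader which character separates one value from the next.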
We can also create our own tables from scratch without having to import data from another file. Let’s say we have two arrays, one with a list of fruits, and another with a list of their price at the Berkeley Student Food Collective. Then, we can create a new Table with each of these arrays as columns with the with_columns method:
fruit_names = make_array("Apple", "Orange", "Banana")
fruit_prices = make_array(1, 0.75, 0.5)
fruit_table = Table().with_columns("Fruit", fruit_names,
"Price ($)", fruit_prices)
fruit_table
The with_columns method takes in pairs of column labels and arrays, and creates a new table with each array as a column of the table. Finally, to create a new empty table (with no columns or rows), we simply write
empty_table = Table()
empty_table
We typically start off with empty tables when we need to add rows inside for loops, which we’ll see later.
Accessing Values
Often, it is useful to access only the rows, columns, or values related to our analysis. We’ll look at several ways to cut down our table into smaller, more digestible parts.
Let’s go back to our table of incidents.
** Exercise 1 **
Below, assign a variable named incidents to the data from the 36828-0004-Data.tsv file with the path ../data/incident/, then display the table. (Hint: use the read_table function from the previous section and don’t forget about the parameter delimiter).
# YOUR CODE HERE
incidents = Table.read_table("../data/incident/36828-0004-Data.tsv", delimiter='\t')
incidents
Notice that not all of the rows are displayed–in fact, there are over 10000 rows in the table! By default, we are shown the first 10 rows.
However, let’s say we wanted to grab only the first five rows of this table. We can do this by using the take function; it takes in a list or range of numbers, and creates a new table with rows from the original table whose indices are given in the array or range. Remember that in Python, indices start at 0! Below are a few examples:
incidents.take([1, 3, 5]) # Takes rows with indices 1, 3, and 5 (the 2nd, 4th, and 6th rows)
incidents.take(7) # Takes the row with index 7 (8th row)
incidents.take(np.arange(7)) # Takes the rows with indices 0, 1, ... 6
Similarly, we can also choose to display certain columns of the table. There are two methods to accomplish this, and both methods take in lists of either column indices or column labels:
- The select method creates a new table with only the columns indicated in the parameters.
- The drop method creates a new table with all columns except those indicated by the parameters (i.e. the parameters are dropped).
Some examples:
incidents.select(["V4065", "IDPER"]) # Selects only "V4065" and "IDPER" columns
incidents.drop([0, 1]) # Drops the columns with indices 0 and 1
incidents.select([1, 68]).take([1, 2, 3, 5]) # Select only columns with indices 1 and 68,
# then only the rows with indices 1, 2, 3, 5
** Exercise 2**
To make sure you understand the take, select, and drop functions, try creating a new Table with whether the incident was reported to the police (page 66 of the codebook) and the lead-in variable for the relationship to the offender (page 69), with only the first 3 rows:
# YOUR CODE HERE
Finally, the where function is similar to the take function in that you choose certain rows; however, rather than specifying the indices of the selected rows, we give two arguments:
- A column label
- A condition that each row should match, called the predicate
In other words, we call the where function like so: table_name.where(column_name, predicate).
There are many types of predicates, but some of the more common ones that you are likely to use are:
Predicate | Example | Result |
---|---|---|
are.equal_to | are.equal_to(50) | Find rows with values equal to 50 |
are.not_equal_to | are.not_equal_to(50) | Find rows with values not equal to 50 |
are.above | are.above(50) | Find rows with values above (and not equal to) 50 |
are.above_or_equal_to | are.above_or_equal_to(50) | Find rows with values above 50 or equal to 50 |
are.below | are.below(50) | Find rows with values below 50 |
are.between | are.between(2, 10) | Find rows with values above or equal to 2 and below 10 |
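To make the boundary cases in the table concrete, here is a plain-Python analogue of three of the predicates, applied to a made-up list of values. This is only a sketch of the comparisons the table describes, not how the datascience library implements them; each line keeps the values v for which the comparison is True.

```python
# A hypothetical column of values used only for this illustration.
values = [1, 2, 5, 10, 50, 75]

# Analogue of are.equal_to(50)
equal_to_50 = [v for v in values if v == 50]

# Analogue of are.above(50): 50 itself is excluded
above_50 = [v for v in values if v > 50]

# Analogue of are.between(2, 10): 2 is included, 10 is excluded
between_2_and_10 = [v for v in values if 2 <= v < 10]

print(equal_to_50)
print(above_50)
print(between_2_and_10)
```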
Here are some examples of using the where function:
The variable V4526AA (page 70) indicates whether or not the incident is suspected of being a hate crime. A value of 1 corresponds to yes. The below query will find all incidents that are suspected of being a hate crime or being of prejudice or bigotry.
incidents.where("V4526AA", are.equal_to(1))
The variable V4364 corresponds to the value of property taken. The variable takes values between 0 and 99996. With the following where statement, we’ll find the incidents where the value of property taken is reported to be between \$10000 and \$99996, inclusive.
incidents.where("V4364", are.between_or_equal_to(10000, 99996))
Attributes
Using the methods that we have learned, we can now dive into calculating statistics from data in tables. Two useful attributes (variables, not methods!) of tables are num_rows and num_columns. They store the number of rows and the number of columns in a given table, respectively. For example:
num_incidents = incidents.num_rows
print("Number of rows: ", num_incidents)
num_attributes = incidents.num_columns
print("Number of columns: ", num_attributes)
Notice that we do not put () after num_rows and num_columns, as we do for methods.
Sorting
It can be very useful to sort our tables according to some column. The sort function does exactly that; it takes the column that you want to sort by. By default, the sort function sorts the table in ascending order of the data in the column indicated; however, you can change this by setting the optional parameter descending=True.
Below is an example using the same variable as above, V4364, which is the value of property taken.
monetary_loss = incidents.where("V4364", are.between_or_equal_to(1, 99996)).select(2, 3, 'V4364')
monetary_loss.sort('V4364') # Sort table by value of property taken in ascending order
The above code will sort the table by the column V4364 from least to greatest. Below, we’ll sort it from greatest to least.
monetary_loss.sort('V4364', descending=True) # Sort table by value of property taken in descending order (highest at top)
Summary
As a summary, here are the functions we learned about during this notebook:
Name | Example | Purpose |
---|---|---|
Table | Table() | Create an empty table, usually to extend with data |
Table.read_table | Table.read_table("my_data.csv") | Create a table from a data file |
with_columns | tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2)) | Create a copy of a table with more columns |
column | tbl.column("N") | Create an array containing the elements of a column |
sort | tbl.sort("N") | Create a copy of a table sorted by the values in a column |
where | tbl.where("N", are.above(2)) | Create a copy of a table with only the rows that match some predicate |
num_rows | tbl.num_rows | Compute the number of rows in a table |
num_columns | tbl.num_columns | Compute the number of columns in a table |
select | tbl.select("N") | Create a copy of a table with only some of the columns |
drop | tbl.drop("2*N") | Create a copy of a table without some of the columns |
take | tbl.take(np.arange(0, 6, 2)) | Create a copy of the table with only the rows whose indices are in the given array |
Some materials in this notebook were taken from Data 8, CS 61A, and DS Modules lessons.