Legal Studies 190 - Data, Prediction, and Law

Welcome to our class! This introductory notebook will review concepts that you may already be familiar with from Data 8 or similar courses. The basic strategies and tools for data analysis covered in this notebook will be the foundations of this class. It gives an overview of our software and some programming concepts.

Table of Contents

1 - Computing Environment

2 - Coding Concepts

       1 - Python Basics

       2 - Tables

Our Computing Environment, Jupyter notebooks

This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.

Text cells

In a notebook, each rectangle containing text or code is called a cell.

Text cells (like this one) can be edited by double-clicking on them. They’re written in a simple format called Markdown to add formatting and section headings. You don’t need to learn Markdown, but you might want to.

After you edit a text cell, click the “run cell” button at the top that looks like ▶ to confirm any changes. (Try not to delete the instructions of the lab.)
Understanding Check 1 This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the “run cell” ▶ button. This sentence, for example, should be deleted. So should this one.

Code cells

Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it. It’ll be highlighted with a little green or blue rectangle. Next, either press ▶ or hold down the shift key and press return or enter.

Try running this cell:

print("Hello, World!")

And this one:

print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every print expression prints a line. Run the next cell and notice the order of the output.

print("First this line is printed,")
print("and then this one.")

Don’t be scared if you see a “Kernel Restarting” message! Your data and work will still be saved. Once you see “Kernel Ready” in a light blue box on the top right of the notebook, you’ll be ready to work again. You should rerun any cells with imports, variables, and loaded data.

Writing Jupyter notebooks

You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you’ll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar. It’ll start out as a text cell. You can change it to a code cell by clicking inside it so it’s highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing “Code”.

Errors

Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:

  1. The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
  2. The rules are rigid. If you’re proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.

Whenever you write code, you’ll make mistakes. When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell. Run it and see what happens.

print("This line is missing something."

You should see something like this (minus our annotations):

The last line of the error output attempts to tell you what went wrong. The syntax of a language is its structure, and this SyntaxError tells you that you have created an illegal structure. “EOF” means “end of file,” so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There’s a lot of terminology in programming languages, but you don’t need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. (Of course, if you’re frustrated, feel free to ask a friend or post on the class Piazza.)

Understanding Check 2 Try to fix the code above so that you can run the cell and see the intended message instead of an error.

Programming Concepts

Now that you are comfortable with our computing environment, we are going to be moving into more of the fundamentals of Python, but first, run the cell below to ensure all the libraries needed for this notebook are installed.

Part 1: Python basics

Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.

A. Expressions

The departure point for all programming is the concept of the expression. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions.

# Examples of expressions:

#addition
print(2 + 2)

#string concatenation 
print('me' + ' and I')

#you can print a number with a string if you cast it 
print("me" + str(2))

#exponents
print(12 ** 2)

In a notebook, running a cell automatically displays only the value of the last expression in that cell. If you want to see the values of earlier expressions, you need to call print on them, as the cell above does. Try removing print from some of the above expressions and notice which values still get displayed.

Data Types

In Python, all things have a type. In the above example, you saw integers (positive and negative whole numbers) and strings (sequences of characters, often thought of as words or sentences). We denote strings by surrounding the desired value with quotes. For example, “Data Science” and “2017” are strings, while bears and 2020 (both without quotes) are not strings (bears without quotes would be interpreted as a variable). You’ll also be using decimal numbers in Python, which are called floats (positive and negative decimal numbers).

You’ll also often run into booleans. They can take on one of two values: True or False. Booleans are often used to check conditions; for example, we might have a list of dogs, and we want to sort them into small dogs and large dogs. One way we could accomplish this is to say either True or False for each dog after seeing if the dog weighs more than 15 pounds.

We’ll soon be going over additional data types. Below is a table that summarizes the information in this section:

| Variable Type | Definition | Examples |
| --- | --- | --- |
| Integer | Positive and negative whole numbers | 42, -10, 0 |
| Float | Positive and negative decimal numbers | 73.9, 2.4, 0.0 |
| String | Sequence of characters | "Go Bears!", "variables" |
| Boolean | True or false value | True, False |
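If you are ever unsure what type a value has, Python's built-in type function will tell you. The short sketch below checks one value of each type from the table; dog_weight is a made-up number used only to show how a comparison produces a boolean.

```python
# Inspect the type of one example value per row of the table above.
examples = [42, 73.9, "Go Bears!", True]
for value in examples:
    print(repr(value), "has type", type(value).__name__)

# Quotes make the difference: "2020" is a string, 2020 is an integer.
year_as_string = "2020"
year_as_integer = int(year_as_string)   # convert a string of digits to an integer
print(type(year_as_string).__name__, type(year_as_integer).__name__)

# A comparison evaluates to a boolean (dog_weight is a made-up value).
dog_weight = 22
is_large_dog = dog_weight > 15
print(is_large_dog)
```

Converting between types, as int does here, is often called casting. It is the same idea as the str(2) call you saw in the expressions example.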

B. Variables

In the example below, a and b are Python objects known as variables. We are giving an object (in this case, an integer and a float, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. Variables are stored within the notebook’s environment, meaning stored variable values carry over from cell to cell.

a = 4
b = 10/5

Notice that when you create a variable, unlike what you previously saw with the expressions, it does not print anything out.

# Notice that 'a' retains its value.
print(a)
a + b

Question 1: Variables

See if you can write a series of expressions that creates two new variables called x and y and assigns them values of 10.5 and 7.2. Then assign their product to the variable combo and print it.

# Fill in the missing lines to complete the expressions.
x = ...
...
...
print(...)

Check to see if the value you get for combo is what you expect it to be.

C. Lists

The next topic is particularly useful in the kind of data manipulation that you will see throughout this class. The following few cells will introduce the concept of lists (and their counterpart, numpy arrays). Read through the following cell to understand the basic structure of a list.

A list is an ordered collection of objects. Lists let us store groups of values and other objects for easy access and analysis. Check out this documentation for an in-depth look at the capabilities of lists.

To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list.

# an empty list
lst = []
print(lst)

# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)

To access a value in the list, put the index of the item you wish to access in brackets following the variable that stores the list. Lists in Python are zero-indexed, so the indices for lst are 0, 1, 2, 3, 4, 5, and 6.

# Elements are selected like this:
example = lst[2]

# The above line selects the 3rd element of lst (list indices are 0-offset) and sets it to a variable named example.
print(example)

It is important to note that when you store a list to a variable, you are actually storing the pointer to the list. That means if you assign your list to another variable, and you change the elements in your other variable, then you are changing the same data as in the original list.

a = [1,2,3] #original list
b = a #b now points to list a 
b[0] = 4 
print(a[0]) #prints 4 since we modified the first element of the list pointed to by a and b 
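If you want a second list that you can change without affecting the first, you need to make a copy. Here is a minimal sketch using the list method copy; the variable names original, alias, and independent are ours, chosen only for illustration.

```python
original = [1, 2, 3]
alias = original                # a second name for the SAME list
independent = original.copy()   # a NEW list with the same elements

alias[0] = 4          # changes the list shared by original and alias
independent[1] = 99   # changes only the copy

print(original)
print(independent)
print(alias is original)        # "is" checks whether two names refer to the same object
print(independent is original)
```

Writing list(original) or original[:] would also produce a copy.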

Slicing lists

As you can see from above, lists do not have to be made up of elements of the same kind. Indices do not have to be taken one at a time, either. Instead, we can take a slice of indices and return the elements at those indices as a separate list.

### This line will store the elements of lst at indices 1 (inclusive) through 4 (exclusive), i.e. the second through fourth elements, as a new list called lst_2:
lst_2 = lst[1:4]

lst_2
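Slices have a few shortcuts that are worth knowing. The sketch below uses a made-up list called letters (not part of this lab) to show what happens when you leave out the start or the stop, use a negative index, or add a step.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # hypothetical example list

first_three = letters[:3]     # start omitted: the slice begins at index 0
from_index_3 = letters[3:]    # stop omitted: the slice runs to the end
last_item = letters[-1]       # negative indices count from the end
every_other = letters[::2]    # a third number sets the step size

print(first_three)
print(from_index_3)
print(last_item)
print(every_other)
```

Note that a slice always creates a new list, so changing first_three would leave letters untouched.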

Question 2: Lists

Build a list of length 10 containing whatever elements you’d like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it.

### Fill in the ellipses to complete the question.
my_list = ...

my_list_sliced = my_list[...]

last_of_sliced = ...

print(...)

Lists can also be operated on with a few built-in analysis functions. These include min and max, among others. Lists can also be concatenated together. Find some examples below.

# A list containing six integers.
a_list = [1, 6, 4, 8, 13, 2]

# Another list containing six integers.
b_list = [4, 5, 2, 14, 9, 11]

print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))

# Concatenate a_list and b_list:
c_list = a_list + b_list
print('Concatenated:', c_list)

D. Numpy Arrays

Closely related to the concept of a list is the array, a sequence of elements that is structurally similar to a list. Arrays, however, can be operated on arithmetically with much more versatility than regular lists. For the purpose of later data manipulation, we’ll access arrays through Numpy, which will require an import statement.

Now run the next cell to import the numpy library into your notebook, and examine how numpy arrays can be used.

import numpy as np
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:')
print(example_array_2)

# Double the values in example_array and print the new array.
double_array = example_array*2
print('Doubled Array:')
print(double_array)

This behavior differs from that of a list. See below what happens if you multiply a list.

example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2

Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. This is the reason that we will sometimes use Numpy over lists. Other mathematical operations have interesting behaviors with lists that you should explore on your own.
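For comparison, here is a rough sketch of how you could double each element of a plain list without Numpy. It uses a list comprehension, a piece of Python syntax this notebook has not introduced, and a short stand-in list instead of the full example_list.

```python
short_list = [1, 2, 3]   # a short stand-in for example_list

repeated = short_list * 2                # repetition: two copies back to back
doubled = [x * 2 for x in short_list]    # element-wise doubling with a list comprehension
joined = short_list + [10, 20]           # + concatenates lists; it does not add element-wise

print(repeated)
print(doubled)
print(joined)
```

With a Numpy array, a single multiplication does the work of the comprehension, which is why arrays are convenient for arithmetic on whole columns of data.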

E. Looping

Loops are often useful in manipulating, iterating over, or transforming large lists and arrays. The first type we will discuss is the for loop. For loops are helpful in traversing a list and performing an action at each element. For example, the following code moves through every element in example_array, adds 5 to it, and appends the result to a new list.

new_list = []

for element in example_array:
    new_element = element + 5
    new_list.append(new_element)

new_list

The most important line in the above cell is the “for element in...” line. This statement sets the structure of our loop, instructing the machine to stop at every number in example_array, perform the indicated operations, and then move on. Once Python has stopped at every element in example_array, the loop is completed and the final line, which outputs new_list, is executed. It’s important to note that “element” is an arbitrary variable name used to represent whichever element the loop is currently operating on. We can change the variable name to whatever we want and achieve the same result, as long as we stay consistent. For example:

newer_list = []

for completely_arbitrary_name in example_array:
    newer_element = completely_arbitrary_name + 5
    newer_list.append(newer_element)
    
newer_list

For loops can also iterate over ranges of numerical values. If I wanted to alter example_array without copying it over to a new list, I would use a numerical iterator to access list indices rather than the elements themselves. This iterator, called i, would range from 0, the value of the first index, to 9, the value of the last. I can make sure of this by using the built-in range and len functions.

for i in range(len(example_array)):
    example_array[i] = example_array[i] + 5

example_array
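The same index-based pattern works on plain lists. The sketch below uses a made-up list called scores, and also shows the built-in enumerate function, which this notebook does not otherwise cover, as one way to get the index and the element at the same time.

```python
scores = [10, 20, 30]   # hypothetical example list

# Update the list in place by looping over its indices 0, 1, 2.
for i in range(len(scores)):
    scores[i] = scores[i] + 5

# enumerate yields (index, element) pairs, so both are available in the loop body.
for i, score in enumerate(scores):
    print("index", i, "holds", score)
```

Looping over indices is only needed when you want to change the list itself. If you just want to read each element, "for element in scores" is simpler.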

Other types of loops

The while loop repeatedly performs operations until a conditional is no longer satisfied. A conditional is a boolean expression, that is an expression that evaluates to True or False.

In the below example, an array of integers 0 to 9 is generated. When the program enters the while loop on the subsequent line, it notices that the maximum value of the array is less than 50. Because of this, it adds 1 to the fifth element, as instructed. Once the instructions embedded in the loop are complete, the program refers back to the conditional. Again, the maximum value is less than 50. This process repeats until the fifth element, now the maximum value of the array, is equal to 50, at which point the conditional is no longer true and the loop breaks.

while_array = np.arange(10)        # Generate our array of values

print('Before:', while_array)

while(max(while_array) < 50):      # Set our conditional
    while_array[4] += 1            # Add 1 to the fifth element if the conditional is satisfied 
    
print('After:', while_array)

Question 3: Loops

In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following:

  1. Iterate over the entire array, checking if each element is a multiple of 5
  2. If an element is not a multiple of 5, add 1 to it repeatedly until it is
  3. Iterate back over the list and print each element.

Hint: To check if an integer x is a multiple of y, use the modulus operator %. Typing x % y will return the remainder when x is divided by y. Therefore, (x % y != 0) will return True when y does not divide x, and False when it does.
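To get a feel for the modulus operator before using it in a loop, here is a quick sketch with a few hand-picked numbers. The variable names are ours and are not needed for the question.

```python
remainder = 12 % 5             # 12 = 2 * 5 + 2, so the remainder is 2
no_remainder = 50 % 5          # 5 divides 50 evenly, so the remainder is 0

needs_work = (12 % 5 != 0)     # True: 12 is not a multiple of 5
already_done = (50 % 5 != 0)   # False: 50 is a multiple of 5
zero_case = (0 % 5 != 0)       # False: 0 counts as a multiple of 5

print(remainder, no_remainder)
print(needs_work, already_done, zero_case)
```

The last case matters here because question_3 contains a 0, which should be left unchanged.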

# Make use of iterators, range, length, while loops, and indices to complete this question.
question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])

for i in range(len(...)):
    while(...):
        question_3[i] = ...
        
for element in question_3:
    print(...)

The following cell should return True if your code is correct.

answer = np.array([15, 35, 50, 0, 25, 30, 20, 105, 45, 15, 80])
np.array_equal(question_3, answer)  # True only if every element matches

F. Functions!

Functions are useful when you want to repeat a series of steps on multiple different objects, but don’t want to type out the steps over and over again. Many functions are built into Python already; for example, you’ve already made use of len() to retrieve the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.

Functions generally take a set of parameters (also called inputs), which define the objects they will use when they are run. For example, the len() function takes a list or array as its parameter, and returns the length of that list.

The following cell gives an example of an extremely simple function, called add_two, which takes as its parameter an integer and returns that integer with, you guessed it, 2 added to it.

# An adder function that adds 2 to the given n.
def add_two(n):
    return n + 2
add_two(5)

Easy enough, right? Let’s look at a function that takes two parameters, compares them somehow, and then returns a boolean value (True or False) depending on the comparison. The is_multiple function below takes as parameters an integer m and an integer n, checks if m is a multiple of n, and returns True if it is. Otherwise, it returns False.

if statements, just like while loops, are dependent on boolean expressions. If the conditional is True, then the following indented code block will be executed. If the conditional evaluates to False, then the code block will be skipped over. Read more about if statements here.

def is_multiple(m, n):
    if (m % n == 0):
        return True
    else:
        return False
print(is_multiple(12, 4))
print(is_multiple(12, 7))

Sidenote: Another way to write is_multiple is below. Think about why it works.

def is_multiple(m, n):
    return m % n == 0

Since functions are so easy to reuse, we can call them inside loops if we want. For instance, our is_multiple function can be used to check if a number is prime! See for yourself by testing some possible prime numbers in the cell below.

# Change possible_prime to any integer to test its primality
# NOTE: If you test a large (> 8 digits) prime number, the cell could take a very, very long time
# to run. Just click Kernel > Interrupt if it looks like it's stuck.

possible_prime = 9999991

for i in range(2, possible_prime):
    if (is_multiple(possible_prime, i)):
        print(possible_prime, 'is not prime')   
        break
    if (i >= possible_prime/2):
        print(possible_prime, 'is prime')
        break
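The loop above can also be wrapped in a function of its own, so that testing a new number is a single call. The sketch below is one possible version and not part of the original lab. It swaps in a slightly different stopping rule: it only tries divisors up to the square root of the candidate, which is enough because any larger divisor would have to be paired with a smaller one. The name is_prime is our own.

```python
def is_multiple(m, n):
    """Return True if m is a multiple of n."""
    return m % n == 0

def is_prime(candidate):
    """Return True if candidate is prime, using trial division up to its square root."""
    if candidate < 2:
        return False                  # 0, 1, and negative numbers are not prime
    i = 2
    while i * i <= candidate:         # stop once i exceeds the square root
        if is_multiple(candidate, i):
            return False              # found a divisor, so candidate is not prime
        i += 1
    return True                       # no divisor found

small_primes = [n for n in range(20) if is_prime(n)]
print(small_primes)
```

Because the loop stops at the square root, this version does far fewer checks on large primes than a loop that runs up to possible_prime/2.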

Part 2: Tables

We will be using datascience tables for much of this class to organize and sort through tabular data. datascience is a library that was developed here at Berkeley and is used for manipulating tabular data. It has a user-friendly API, and can be used to answer difficult questions in relatively few commands. Like we did with numpy, we will have to import datascience.

from datascience import *

Creating Tables

When dealing with a collection of things with multiple attributes, it can be useful to put the data in a table. Tables are a nice way of organizing data in a 2-dimensional data set. For example, take a look at the table below.

Table.read_table('../data/incident/36828-0004-Data.tsv', delimiter='\t')

This table is from the Incident Record-Type File of the NCVS. See page 31 of the codebook (on bCourses) for a description of the survey. To create this table, we have drawn the data from the path ../data/incident/, stored in a file called 36828-0004-Data.tsv. In general, to import data from a .csv file, we write Table.read_table("file_name"). Information in a .csv file is separated by commas, and this is the format typically used with the datascience package. In this case, our data is stored as a .tsv, so information is separated by tabs, and we must indicate that when reading in the data with the optional parameter delimiter.
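To see why the delimiter matters, here is a small sketch that uses Python's built-in csv module rather than datascience, with a tiny made-up tab-separated text held in memory instead of the NCVS file. It is only meant to illustrate the idea of a delimiter, not how Table.read_table works internally.

```python
import csv
import io

# A tiny made-up tab-separated "file" held in memory (not the NCVS data).
tsv_text = "Fruit\tPrice\nApple\t1\nOrange\t0.75\n"

# Reading with the tab delimiter splits each line into separate fields.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter='\t'))
header, data = rows[0], rows[1:]
print(header)
print(data)

# Reading the same text with the default comma delimiter leaves each line as one field.
wrong_rows = list(csv.reader(io.StringIO(tsv_text)))
print(wrong_rows[0])
```

Notice that every value comes back as a string when read this way; converting columns to numbers is a separate step.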

We can also create our own tables from scratch without having to import data from another file. Let’s say we have two arrays, one with a list of fruits, and another with a list of their prices at the Berkeley Student Food Collective. Then, we can create a new Table with each of these arrays as columns with the with_columns method:

fruit_names = make_array("Apple", "Orange", "Banana")
fruit_prices = make_array(1, 0.75, 0.5)
fruit_table = Table().with_columns("Fruit", fruit_names,
                                  "Price ($)", fruit_prices)
fruit_table

The with_columns method takes in pairs of column labels and arrays, and creates a new table with each array as a column of the table. Finally, to create a new empty table (with no columns or rows), we simply write

empty_table = Table()
empty_table

We typically start off with empty tables when we need to add rows inside for loops, which we’ll see later.

Accessing Values

Often, it is useful to access only the rows, columns, or values related to our analysis. We’ll look at several ways to cut down our table into smaller, more digestible parts.

Let’s go back to our table of incidents.

Exercise 1

Below, assign a variable named incidents to the data from the 36828-0004-Data.tsv file with the path ../data/incident/, then display the table. (Hint: use the read_table function from the previous section and don’t forget about the parameter delimiter).

# YOUR CODE HERE

incidents = Table.read_table("../data/incident/36828-0004-Data.tsv", delimiter='\t')
incidents

Notice that not all of the rows are displayed. In fact, there are over 10000 rows in the table! By default, we are shown the first 10 rows.

However, let’s say we wanted to grab only the first five rows of this table. We can do this by using the take function; it takes in a list or range of numbers, and creates a new table with rows from the original table whose indices are given in the array or range. Remember that in Python, indices start at 0! Below are a few examples:

incidents.take([1, 3, 5]) # Takes rows with indices 1, 3, and 5 (the 2nd, 4th, and 6th rows)
incidents.take(7) # Takes the row with index 7 (8th row)
incidents.take(np.arange(7)) # Takes the rows with indices 0, 1, ... 6

Similarly, we can also choose to display certain columns of the table. There are two methods to accomplish this, and both methods take in lists of either column indices or column labels:

  • The select method creates a new table with only the columns indicated in the parameters.
  • The drop method creates a new table with all columns except those indicated by the parameters (i.e. the parameters are dropped).

Some examples:

incidents.select(["V4065", "IDPER"]) # Selects only "V4065" and "IDPER" columns
incidents.drop([0, 1]) # Drops the columns with indices 0 and 1
incidents.select([1, 68]).take([1, 2, 3, 5]) # Select only columns with indices 1 and 68, 
                                               # then only the rows with indices 1, 2, 3, 5

Exercise 2

To make sure you understand the take, select, and drop functions, try creating a new Table with whether the incident was reported to the police (page 66 of the codebook) and the lead-in variable for the relationship to the offender (page 69), with only the first 3 rows:

# YOUR CODE HERE

Finally, the where function is similar to the take function in that you choose certain rows; however, rather than specifying the indices of the selected rows, we give two arguments:

  • A column label
  • A condition that each row should match, called the predicate

In other words, we call the where function like so: table_name.where(column_name, predicate).

There are many types of predicates, but some of the more common ones that you are likely to use are:

| Predicate | Example | Result |
| --- | --- | --- |
| are.equal_to | are.equal_to(50) | Find rows with values equal to 50 |
| are.not_equal_to | are.not_equal_to(50) | Find rows with values not equal to 50 |
| are.above | are.above(50) | Find rows with values above (and not equal to) 50 |
| are.above_or_equal_to | are.above_or_equal_to(50) | Find rows with values above 50 or equal to 50 |
| are.below | are.below(50) | Find rows with values below 50 |
| are.between | are.between(2, 10) | Find rows with values above or equal to 2 and below 10 |

Here are some examples of using the where function:

The variable V4526AA (page 70) indicates whether or not the incident is suspected of being a hate crime. A value of 1 corresponds to yes. The below query will find all incidents that are suspected of being a hate crime or a crime of prejudice or bigotry.

incidents.where("V4526AA", are.equal_to(1)) 

The variable V4364 corresponds to the value of property taken. The variable takes values between 0 and 99996. With the following where statement, we’ll find the incidents where the value of property taken is reported to be between \$10000 and \$99996, inclusive.

incidents.where("V4364", are.between_or_equal_to(10000, 99996))
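If it helps to build intuition, a predicate can be thought of as a small function that answers True or False for a single value, and where as a loop that keeps the rows for which the answer is True. The plain-Python sketch below illustrates that idea with made-up rows; the functions above and where here are our own stand-ins and are not how the datascience library is actually implemented.

```python
# Made-up rows standing in for a table; each dictionary is one row.
rows = [
    {"Fruit": "Apple", "Price": 1.0},
    {"Fruit": "Orange", "Price": 0.75},
    {"Fruit": "Banana", "Price": 0.5},
]

def above(threshold):
    """Build a predicate: a function that answers True/False for one value."""
    def check(value):
        return value > threshold
    return check

def where(rows, column, predicate):
    """Keep only the rows whose value in the given column satisfies the predicate."""
    return [row for row in rows if predicate(row[column])]

pricey = where(rows, "Price", above(0.6))
print(pricey)
```

Just like the table method, this stand-in leaves the original rows alone and returns a new, smaller collection.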

Attributes

Using the methods that we have learned, we can now dive into calculating statistics from data in tables. Two useful attributes (variables, not methods!) of tables are num_rows and num_columns. They store the number of rows and the number of columns in a given table, respectively. For example:

num_incidents = incidents.num_rows
print("Number of rows: ", num_incidents)
num_attributes = incidents.num_columns
print("Number of columns: ", num_attributes)

Notice that we do not put () after num_rows and num_columns, as we do when calling methods.

Sorting

It can be very useful to sort our tables according to some column. The sort function does exactly that; it takes the column that you want to sort by. By default, the sort function sorts the table in ascending order of the data in the column indicated; however, you can change this by setting the optional parameter descending=True.

Below is an example using the same variable above, V4364, which is the value of property taken.

monetary_loss = incidents.where("V4364", are.between_or_equal_to(1, 99996)).select(2, 3, 'V4364') 
monetary_loss.sort('V4364') # Sort table by value of property taken in ascending order

The above code will sort the table by the column V4364 from least to greatest. Below, we’ll sort it from greatest to least.

monetary_loss.sort('V4364', descending=True) # Sort table by value of property taken in descending order (highest at top)
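Sorting by a column works much like Python's built-in sorted function with a key. The sketch below uses made-up rows (not NCVS data) to show ascending and descending order; here reverse=True plays the role that descending=True plays for tables.

```python
# Made-up rows standing in for a table; each dictionary is one row.
losses = [
    {"incident": "A", "value": 250},
    {"incident": "B", "value": 40},
    {"incident": "C", "value": 1200},
]

def value_of(row):
    """Return the value we want to sort by."""
    return row["value"]

ascending = sorted(losses, key=value_of)                  # least to greatest
descending = sorted(losses, key=value_of, reverse=True)   # greatest to least

print([row["incident"] for row in ascending])
print([row["incident"] for row in descending])
```

As with the table sort method, sorted returns a new sorted collection and leaves the original order of losses unchanged.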

Summary

As a summary, here are the functions we learned about during this notebook:

| Name | Example | Purpose |
| --- | --- | --- |
| Table | Table() | Create an empty table, usually to extend with data |
| Table.read_table | Table.read_table("my_data.csv") | Create a table from a data file |
| with_columns | tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2)) | Create a copy of a table with more columns |
| column | tbl.column("N") | Create an array containing the elements of a column |
| sort | tbl.sort("N") | Create a copy of a table sorted by the values in a column |
| where | tbl.where("N", are.above(2)) | Create a copy of a table with only the rows that match some predicate |
| num_rows | tbl.num_rows | Compute the number of rows in a table |
| num_columns | tbl.num_columns | Compute the number of columns in a table |
| select | tbl.select("N") | Create a copy of a table with only some of the columns |
| drop | tbl.drop("2*N") | Create a copy of a table without some of the columns |
| take | tbl.take(np.arange(0, 6, 2)) | Create a copy of the table with only the rows whose indices are in the given array |

Some materials in this notebook were taken from Data 8, CS 61A, and DS Modules lessons.