Introduction to Importing Data, Using Tables and Creating Graphs

The Jupyter Notebook

If you run into errors, check the common errors Google doc first.

First of all, note that this page is divided into what are called “cells”. For example, the following cell is a “code cell” where you will write your code. You’ll see a In [ ]: next to each cell for code, which is a counter for the cells you have run. You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.

# this is a code cell

Executing cells

You can execute cells with Ctrl-Enter (which will run the cell and keep the same cell selected), or Shift-Enter (which will run the cell and then select the next cell).

Try running the following cell and see what it prints out:

print("Hello world!")

Hello world!

10 + 10

(10 + 10) / 5

4.0

Now run this cell to import some code we’ll use today, nothing will print out, don’t worry!

import numpy as np
import matplotlib.pyplot as plt
from datascience import *
%matplotlib inline 
plt.style.use("fivethirtyeight")

Importing

In data analytics, there is almost always a file holding your data that already exists. There are thousands of databases online that contain information on topics from all domains. In general, to import data from a file, we write:

Table.read_table("file_name")

Most often, these file names end in .csv to show the data format. .csv format is popular for spreadsheets and can be imported/exported from programs such as Microsoft Excel, OpenOffice Calc, or Google spreadsheets.

An example is shown below using U.S. Census data.

!pwd

/Users/chrispyles/GitHub/modules-textbook/content/psych

Table.read_table("data/nc-est2015-agesex-res.csv")

AGE	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
0	3944153	3944160	3951330	3963087	3926540	3931141	3949775	3978038
1	3978070	3978090	3957888	3966551	3977939	3942872	3949776	3968564
2	4096929	4096939	4090862	3971565	3980095	3992720	3959664	3966583
3	4119040	4119051	4111920	4102470	3983157	3992734	4007079	3974061
4	4063170	4063186	4077551	4122294	4112849	3994449	4005716	4020035
5	4056858	4056872	4064653	4087709	4132242	4123626	4006900	4018158
6	4066381	4066412	4073013	4074993	4097605	4142916	4135930	4019207
7	4030579	4030594	4043046	4083225	4084913	4108349	4155326	4148360
8	4046486	4046497	4025604	4053203	4093177	4095711	4120903	4167887
9	4148353	4148369	4125415	4035710	4063152	4104072	4108349	4133564

... (296 rows omitted)

That’s a lot of information. As you can see from the labels on top, this table shows Biological Sex (0=total, 1=male, 2=female), Age, 2010 Census Information, and predictions for U.S. population for the next five years.

Using Tables

We can make criteria to cut down tables. Accessing only the rows, columns, or values specfic to our purpose makes information easier understood. Analysis and conclusions can be made when data is more digestible.

We need to access the census table above and name it for further use. We assign the table to a variable so that we can reference it later!

census_data = Table.read_table("data/nc-est2015-agesex-res.csv")
census_data

AGE	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
0	3944153	3944160	3951330	3963087	3926540	3931141	3949775	3978038
1	3978070	3978090	3957888	3966551	3977939	3942872	3949776	3968564
2	4096929	4096939	4090862	3971565	3980095	3992720	3959664	3966583
3	4119040	4119051	4111920	4102470	3983157	3992734	4007079	3974061
4	4063170	4063186	4077551	4122294	4112849	3994449	4005716	4020035
5	4056858	4056872	4064653	4087709	4132242	4123626	4006900	4018158
6	4066381	4066412	4073013	4074993	4097605	4142916	4135930	4019207
7	4030579	4030594	4043046	4083225	4084913	4108349	4155326	4148360
8	4046486	4046497	4025604	4053203	4093177	4095711	4120903	4167887
9	4148353	4148369	4125415	4035710	4063152	4104072	4108349	4133564

... (296 rows omitted)

This notebook can calculate how large this table is with two functions: num_rows and num_columns. The general form for these functions are table.num_rows and table.num_columns.

Let’s use these on the table above.

census_data.num_rows

census_data.num_columns

That’s a 306 x 10 table! We can first start to cut down this table using only some columns. Let’s only include biological sex, age and the estimated base for 2010 census data.

There are two methods to make a table with select columns included. We could either use the ‘select’ function or the ‘drop’ function.

select can create a new table with only the columns indicated in the parameters
drop can create a new table with columns NOT indicated in the parameters

Here’s an example of two equal codes: (keep in mind that we assign each new table to a new variable, to make organization easier).

select_census_data = census_data.select("SEX", "AGE", "ESTIMATESBASE2010")
select_census_data

SEX	AGE	ESTIMATESBASE2010
0	0	3944160
0	1	3978090
0	2	4096939
0	3	4119051
0	4	4063186
0	5	4056872
0	6	4066412
0	7	4030594
0	8	4046497
0	9	4148369

... (296 rows omitted)

drop_census_data = census_data.drop("CENSUS2010POP","POPESTIMATE2010","POPESTIMATE2011","POPESTIMATE2012","POPESTIMATE2013","POPESTIMATE2014","POPESTIMATE2015")
drop_census_data

SEX	AGE	ESTIMATESBASE2010
0	0	3944160
0	1	3978090
0	2	4096939
0	3	4119051
0	4	4063186
0	5	4056872
0	6	4066412
0	7	4030594
0	8	4046497
0	9	4148369

... (296 rows omitted)

As you can see underneath the table, there are still 296 rows omitted! Our next step is to only include non-gendered data AKA data where SEX=0, neither male or female specific.

To do this, we need to use a new function where. The general form of this function is:

table_name.where(column_name, predicate)

To cut our table down to only include sex=0, we may use the predicate are.equal_to(). Note that we are assigning the new table to a new variable. We are referencing the table stored in an older variable (select_census_data), and modifying it. That modification is what is stored in the new variable.

new_census_data = select_census_data.where("SEX", are.equal_to(0))
new_census_data

SEX	AGE	ESTIMATESBASE2010
0	0	3944160
0	1	3978090
0	2	4096939
0	3	4119051
0	4	4063186
0	5	4056872
0	6	4066412
0	7	4030594
0	8	4046497
0	9	4148369

... (92 rows omitted)

There are still 92 rows omitted! Let’s take every 10th entry to cut this table down a little more.

To do this we need to use the take function. The take function creates a new table with rows from the original table whose indices(row number) are given. In Python, indices start at 0!

Here’s taking every 10th entry. Inside of the take parentheses is a Python list of numbers from 0 to 90, increasing by 10s. This indicates exactly which rows we want to keep (every 10th row).

census_10_year = new_census_data.take([0,10,20,30,40,50,60,70,80,90])
census_10_year

SEX	AGE	ESTIMATESBASE2010
0	0	3944160
0	10	4172559
0	20	4519556
0	30	4285877
0	40	4383450
0	50	4660457
0	60	3621214
0	70	2043178
0	80	1308608
0	90	435695

Now that sex is all the same, we can drop that column.

final_census_table = census_10_year.drop("SEX")
final_census_table

AGE	ESTIMATESBASE2010
0	3944160
10	4172559
20	4519556
30	4285877
40	4383450
50	4660457
60	3621214
70	2043178
80	1308608
90	435695

Tables Essentials!

For your reference, here’s a table of useful Table functions:

Name	Example	Purpose
`Table`	`Table()`	Create an empty table, usually to extend with data
`Table.read_table`	`Table.read_table("my_data.csv")`	Create a table from a data file
`with_columns`	`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`	Create a copy of a table with more columns
`column`	`tbl.column("N")`	Create an array containing the elements of a column
`sort`	`tbl.sort("N")`	Create a copy of a table sorted by the values in a column
`where`	`tbl.where("N", are.above(2))`	Create a copy of a table with only the rows that match some predicate
`num_rows`	`tbl.num_rows`	Compute the number of rows in a table
`num_columns`	`tbl.num_columns`	Compute the number of columns in a table
`select`	`tbl.select("N")`	Create a copy of a table with only some of the columns
`drop`	`tbl.drop("2*N")`	Create a copy of a table without some of the columns
`take`	`tbl.take(np.arange(0, 6, 2))`	Create a copy of the table with only the rows whose indices are in the given array
`join`	`tbl1.join("shared_column_name", tbl2)`	Join together two tables with a common column name
`are.equal_to()`	`tbl.where("SEX", are.equal_to(0))`	find values equal to that indicated
`are.not_equal_to()`	`tbl.where("SEX", are.not_equal_to(0))`	find values not including the one indicated
`are.above()`	`tbl.where("AGE", are.above(30))`	find values greater to that indicated
`are.below()`	`tbl.where("AGE", are.below(40))`	find values less than that indicated
`are.between()`	`tbl.where("SEX", are.between(18, 60))`	find values between the two indicated

Visualizations

Now that we have a manageable table we can start making visualizations! Due to the numerical nature of the census table above, let’s first try a scatter plot.

To create a scatter plot, we need to use the scatter() function. The general form is:

table.scatter("column for x axis", "column for y axis")

An example is shown below:

final_census_table.scatter("AGE", "ESTIMATESBASE2010") 

png

With this data, we can also make a line plot. To do this, we need to use the plot() function. This works a lot like scatter() where the general form is:

table.plot("x column", "y column")

final_census_table.plot("AGE", "ESTIMATESBASE2010") 

png

Though a bar may be better. Bar graphs follow the same formula as scatter plots and line graphs above, with the general form:

table.bar("x axis", "y axis")

final_census_table.bar("AGE", "ESTIMATESBASE2010") 

png

You can also use the functino barh() instead of bar() in order to flip the bar graph horizontally. Sometimes, this makes for a cleaner visualization.

Merging Tables

We are going to cover one more topic briefly that you will need to use in your project. We are going to look into how to merge two tables that have common information. This technique will be very valuable when the time comes for you to do your own analysis with your own data sets.

We are going to read in a table with information about psychologists. We will call this new table psych1.

psych1 = Table.read_table("example-data/psych1.csv")
psych1

Psychologist	Birth Year
Freud	1856
Skinner	1904
Piaget	1896
Maslow	1908

Adding rows

You may have another table that has the exact same columns and you just want to add the rows to what you already have. Let’s read in another short table with a couple more psychologists:

psych2 = Table.read_table("example-data/psych2.csv")
psych2

Psychologist	Birth Year
Jung	1875
Erikson	190

Great! We see that this second table has the same columns as the first one. Merging these two tables would allow us to consolidate our information. We are going to use the “append” method to append the second table onto the first!

psych_merged = Table.copy(psych1)  # copying over the new_psych table to a new variable for the merged table
psych_merged.append(psych2)
psych_merged

Psychologist	Birth Year
Freud	1856
Skinner	1904
Piaget	1896
Maslow	1908
Jung	1875
Erikson	190

As you can see, we have succesfully merged these two tables together! Now, let us try merging an additional column onto the original new_psych table.

Adding columns

Let’s pretend that we suddenly have access to the favorite foods of each psychologist. Wow! We definitely want to include that information in our table. This means that we need to merge in a new column to the initial table. Let us print the table first, and then we’ll get the other column.

psych_merged

Psychologist	Birth Year
Freud	1856
Skinner	1904
Piaget	1896
Maslow	1908
Jung	1875
Erikson	190

Now we are going to create to read in our new information.

psych_foods = Table.read_table('example-data/favorite_food.csv')
psych_foods

Favorite Food
Pizza
Snickers
Grapes
Escargot
Ice Cream
Apples

Luckily, we are going to assume that each row is in the right order in the column. We are going to use a similar process as before to merge this column! The method we now use is append_column. Otherwise, the format stays the same!

psych_merged_with_food = Table.copy(psych_merged)  # copying over the new_psych table to a new variable 
psych_merged_with_food.append_column("Favorite Food", psych_foods['Favorite Food'])
psych_merged_with_food

Psychologist	Birth Year	Favorite Food
Freud	1856	Pizza
Skinner	1904	Snickers
Piaget	1896	Grapes
Maslow	1908	Escargot
Jung	1875	Ice Cream
Erikson	190	Apples

As you can see, we’ve successfully merged a column to our table too!

Joining on columns

Suppose now that you have more information on the these psycologists (another column), but the rows aren’t in the right order so you can’t just append_column. Luckily, you have their names.

psych_birthplaces = Table.read_table("example-data/birthplaces.csv")
psych_birthplaces

Psychologist	Birth Place
Maslow	New York
Erikson	Germany
Skinner	Pennsylvania
Piaget	Switzerland
Freud	Czech Republic
Jung	Switzerland

Awesome! We love more data, but how can I merge this back into our existing information? We can’t just add a column because the order is different.

Solution: We can use the join method and tell it which column the two tables have in common, and it will match the data to the correct row:

final_table = psych_merged_with_food.join("Psychologist", psych_birthplaces)
final_table

Psychologist	Birth Year	Favorite Food	Birth Place
Erikson	190	Apples	Germany
Freud	1856	Pizza	Czech Republic
Jung	1875	Ice Cream	Switzerland
Maslow	1908	Escargot	New York
Piaget	1896	Grapes	Switzerland
Skinner	1904	Snickers	Pennsylvania

That’s super cool!

SUMMARY

You’ve learned a lot in this module! Let’s look back on the key parts.

To import data from a .csv/.txt file, we write Table.read_table("file_name").
To create our own table, we write Table( ).with_columns("Column Name", array_name, . . .) .
To count number of rows, we use table_name.num_rows.
To count number of columns, we use table_name.num_columns.
To create a new table with only the columns indicated in the parameters, we use table_name.select("COLUMN NAME", ...).
To create a new table without the columns indicated in the parameters, we use table_name.drop("COLUMN NAME", ...).
To create a table with only certain values, we can use table_name.where(column_name, predicate).
To create a new table with indicated rows from the original table, we use table_name.take([index 1, index 2, . . . ]). Remember in Python indices start at 0!
To create a scatter plot, we use table.scatter(column for x axis, column for y axis).
To create a line plot, we use table.plot(x column, y column).
To make a bar graphs, we can use either table.bar(x column, y column) or table.barh(x column, y column).
To make a histogram, we use table.hist(x axis, bins(optional), unit(optional)).
To merge tables, we use either append or append_column.
To merge two tables with a common column name we use the join method.

With just some simple code, we were able to do an incredible amount of data analysis! Play around with the examples until you feel comfortable with the content of this notebook. We will be using notebooks to analyze your own data sets in the future! Please ask if you have questions!

If you need help, please consult the Data Peers!