{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Correlation, regression, and prediction\n", "\n", "*If you run into errors, check the [common errors](https://docs.google.com/document/d/1-LUvfYYI5UtjYiZerCGIBNgzkaJHNxl4530tgh37uYs/edit?usp=sharing) Google doc first.*\n", "\n", "One of the most important and interesting aspects of data science is making predictions about the future. How can we learn about temperatures a few decades from now by analyzing historical data about climate change and pollution? Based on a person's social media profile, what conclusions can we draw about their interests? How can we use a patient's medical history to judge how well they will respond to a treatment?\n", "\n", "Run the cell below to import the code we'll use in this notebook.\n", "The cell produces no output, so simply run it and move on." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "import matplotlib.pyplot as plots\n", "import scipy as sp\n", "%matplotlib inline\n", "import statsmodels.formula.api as smf\n", "plots.style.use('fivethirtyeight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this module, you will look at two **correlated** phenomena and predict unseen data points!\n", "\n", "We will be using data from the online data archive of Prof. Larry Winner of the University of Florida. The file *hybrid* contains data on hybrid passenger cars sold in the United States from 1997 to 2013. 
To analyze the data, we must first **import** it into our Jupyter notebook and **create a table**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/datascience/tables.py:132: FutureWarning: read_table is deprecated, use read_csv instead.\n", " df = pandas.read_table(filepath_or_buffer, *args, **vargs)\n" ] }, { "data": { "text/html": [ "
vehicle | year | msrp | acceleration | mpg | class | \n", "
---|---|---|---|---|---|
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact | \n", "
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact | \n", "
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact | \n", "
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater | \n", "
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact | \n", "
... (148 rows omitted)
" ], "text/plain": [ "FIPS | tyoung | told | D_biep_Young_Good_all | \n", "
---|---|---|---|
10001 | 7.13782 | 7.17363 | 0.462662 | \n", "
10003 | 7.07723 | 6.88274 | 0.439701 | \n", "
10005 | 6.84831 | 6.96089 | 0.445957 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | \n", "
1003 | 7.17904 | 7.04148 | 0.457369 | \n", "
... (3133 rows omitted)
" ], "text/plain": [ "State | FIPS | County | Year | Heart_Attack_Mortality | Stability | \n", "
---|---|---|---|---|---|
Alabama | 1001 | Autauga | 2000 | 220.8 | 1 | \n", "
Alabama | 1001 | Autauga | 2001 | 100.7 | 1 | \n", "
Alabama | 1001 | Autauga | 2002 | 68.2 | 1 | \n", "
Alabama | 1001 | Autauga | 2003 | 66.7 | 1 | \n", "
Alabama | 1001 | Autauga | 2005 | 63.2 | 1 | \n", "
... (34780 rows omitted)
" ], "text/plain": [ "FIPS | tyoung | told | D_biep_Young_Good_all | State | County | Year | Heart_Attack_Mortality | Stability | \n", "
---|---|---|---|---|---|---|---|---|
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2000 | 220.8 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2001 | 100.7 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2002 | 68.2 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2003 | 66.7 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2005 | 63.2 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2006 | 68.3 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2007 | 73.9 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2008 | 104.7 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2009 | 60 | 1 | \n", "
1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2010 | 93.2 | 1 | \n", "
... (34706 rows omitted)
" ], "text/plain": [ "FIPS | tyoung | told | D_biep_Young_Good_all | State | County | Year | Heart_Attack_Mortality | Stability\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2000 | 220.8 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2001 | 100.7 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2002 | 68.2 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2003 | 66.7 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2005 | 63.2 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2006 | 68.3 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2007 | 73.9 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2008 | 104.7 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2009 | 60 | 1\n", "1001 | 7.05085 | 6.79661 | 0.502814 | Alabama | Autauga | 2010 | 93.2 | 1\n", "... (34706 rows omitted)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_data = age.join(\"FIPS\", heart)\n", "joined_data = joined_data.to_df().drop_duplicates()\n", "joined_data = Table.from_df(joined_data)\n", "joined_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's great! Displaying the table gives us a general idea of which columns exist and what kinds of relationships we can try to analyze. \n", "\n", "One thing to notice is that there are a lot of data points! Our visualization and regression may be cleaner if we work with a subset of the data. Let's use the functions from the first notebook to keep only the California rows from 2010." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "FIPS | tyoung | told | D_biep_Young_Good_all | State | County | Year | Heart_Attack_Mortality | Stability | \n", "
---|---|---|---|---|---|---|---|---|
6001 | 6.96069 | 6.78826 | 0.423026 | California | Alameda | 2010 | 45.8 | 1 | \n", "
6005 | 6.575 | 6.85 | 0.446357 | California | Amador | 2010 | 78.6 | 1 | \n", "
6007 | 7.11892 | 7.04696 | 0.442157 | California | Butte | 2010 | 62.7 | 1 | \n", "
6009 | 7.65079 | 7.8254 | 0.460807 | California | Calaveras | 2010 | 99.5 | 1 | \n", "
6013 | 6.96188 | 6.87655 | 0.418506 | California | Contra Costa | 2010 | 48.9 | 1 | \n", "
6015 | 7 | 6.83871 | 0.43266 | California | Del Norte | 2010 | 75.8 | 1 | \n", "
6017 | 6.84016 | 6.85714 | 0.408486 | California | El Dorado | 2010 | 53.9 | 1 | \n", "
6019 | 7.0461 | 7.09315 | 0.424963 | California | Fresno | 2010 | 80.4 | 1 | \n", "
6021 | 6.6 | 7.2 | 0.260103 | California | Glenn | 2010 | 91.7 | 1 | \n", "
6023 | 6.86624 | 6.79618 | 0.433324 | California | Humboldt | 2010 | 94.4 | 1 | \n", "
... (39 rows omitted)
" ], "text/plain": [ "FIPS | tyoung | told | D_biep_Young_Good_all | State | County | Year | Heart_Attack_Mortality | Stability\n", "6001 | 6.96069 | 6.78826 | 0.423026 | California | Alameda | 2010 | 45.8 | 1\n", "6005 | 6.575 | 6.85 | 0.446357 | California | Amador | 2010 | 78.6 | 1\n", "6007 | 7.11892 | 7.04696 | 0.442157 | California | Butte | 2010 | 62.7 | 1\n", "6009 | 7.65079 | 7.8254 | 0.460807 | California | Calaveras | 2010 | 99.5 | 1\n", "6013 | 6.96188 | 6.87655 | 0.418506 | California | Contra Costa | 2010 | 48.9 | 1\n", "6015 | 7 | 6.83871 | 0.43266 | California | Del Norte | 2010 | 75.8 | 1\n", "6017 | 6.84016 | 6.85714 | 0.408486 | California | El Dorado | 2010 | 53.9 | 1\n", "6019 | 7.0461 | 7.09315 | 0.424963 | California | Fresno | 2010 | 80.4 | 1\n", "6021 | 6.6 | 7.2 | 0.260103 | California | Glenn | 2010 | 91.7 | 1\n", "6023 | 6.86624 | 6.79618 | 0.433324 | California | Humboldt | 2010 | 94.4 | 1\n", "... (39 rows omitted)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_data = joined_data.where(\"Year\", are.equal_to(2010))\n", "joined_data = joined_data.where(\"State\", are.equal_to(\"California\"))\n", "joined_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have far fewer points, which should make the visualization cleaner.\n", "\n", "Let's make a simple scatter plot with a fit line to look at the relationship between the columns `D_biep_Young_Good_all` and `Heart_Attack_Mortality`. Remember, all of these functions were introduced in either Notebook 1 or Notebook 2!" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "