
Correlation, regression, and prediction

If you run into errors, check the common errors Google doc first.

One of the most important and interesting aspects of data science is making predictions about the future. How can we learn about temperatures a few decades from now by analyzing historical data about climate change and pollution? Based on a person’s social media profile, what conclusions can we draw about their interests? How can we use a patient’s medical history to judge how well he or she will respond to a treatment?

Run the cell below to import the code we’ll use in this notebook. Don’t worry about getting an output; simply run the cell.

from datascience import *
import numpy as np
import matplotlib.pyplot as plots
import scipy as sp
%matplotlib inline
import statsmodels.formula.api as smf
plots.style.use('fivethirtyeight')

In this module, you will look at two correlated phenomena and predict unseen data points!

We will be using data from the online data archive of Prof. Larry Winner of the University of Florida. The file hybrid.csv contains data on hybrid passenger cars sold in the United States from 1997 to 2013. In order to analyze the data, we must first import it into our Jupyter notebook and create a table.

hybrid = Table.read_table('data/hybrid.csv')  # Imports the data and creates a table
hybrid.show(5)  # Displays the first five rows of the table
vehicle            year   msrp      acceleration   mpg     class
Prius (1st Gen)    1997   24509.7   7.46           41.26   Compact
Tino               2000   35355     8.2            54.1    Compact
Prius (2nd Gen)    2000   26832.2   7.97           45.23   Compact
Insight            2000   18936.4   9.52           53      Two Seater
Civic (1st Gen)    2001   25833.4   7.04           47.04   Compact

... (148 rows omitted)

Column reference: vehicle: model of the car; year: year of manufacture; msrp: manufacturer’s suggested retail price in 2013 dollars; acceleration: acceleration rate in km per hour per second; mpg: fuel economy in miles per gallon; class: the model’s class.

Note: whenever we write an equal sign (=) in Python, we are assigning something to a variable.
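
For example (the names price and car below are just illustrations, not part of the dataset):

price = 24509.70         # assigns the number 24509.70 to the name price
car = 'Prius (1st Gen)'  # assigns a string to the name car
print(car, price)        # prints: Prius (1st Gen) 24509.7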

Let’s visualize some of the data to see if we can spot a possible association!

hybrid.scatter('acceleration', 'msrp') # Creates a scatter plot of two variables in a table

[Image: scatter plot of msrp against acceleration]

As we can see in the scatter plot above, there seems to be a positive association between acceleration and price. That is, cars with greater acceleration tend to cost more on average; conversely, cars that cost more tend to have greater acceleration on average.

What about miles per gallon and price? Do you expect a positive or negative association?

hybrid.scatter('mpg', 'msrp')

[Image: scatter plot of msrp against mpg]

Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables, i.e., the points appear to be clustered around a curve, not around a straight line.

Let’s subset the data so that we’re only looking at SUVs. Use what you learned in the previous notebook to choose only SUVs and then make a scatter plot of mpg and msrp:

# task
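
If you get stuck, one possible approach is sketched below (it assumes the class column labels these cars exactly as 'SUV'; check the column’s values first):

# One possible solution -- assumes the class column uses the label 'SUV'
suvs = hybrid.where('class', are.equal_to('SUV'))  # keep only the SUV rows
suvs.scatter('mpg', 'msrp')                        # scatter plot of msrp against mpg for SUVs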

The correlation coefficient - r

The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. ~Wikipedia

r = 1: the scatter diagram is a perfect straight line sloping upwards.

r = -1: the scatter diagram is a perfect straight line sloping downwards.
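
As a quick illustration with made-up data (the arrays below are just for demonstration; np.corrcoef returns a 2 x 2 correlation matrix, and the off-diagonal entry is r):

x = np.array([1, 2, 3, 4, 5])
print(np.corrcoef(x, 2 * x)[0, 1])    # points on an upward-sloping line:  r should be 1
print(np.corrcoef(x, -2 * x)[0, 1])   # points on a downward-sloping line: r should be -1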

Let’s calculate the correlation coefficient between acceleration and price. We can use the sp.stats.pearsonr function on the two variables (columns here) that we want to correlate:

sp.stats.pearsonr(hybrid['acceleration'], hybrid['msrp'])
(0.6955778996913979, 1.9158000667128373e-23)

This function returns two numbers. The first number is our r value, and the second number is the p-value for our correlation. A p-value under .05 indicates the correlation is statistically significant. Our coefficient here is 0.6955779, and our p-value is far below .05, implying a fairly strong positive correlation.
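
If you want to work with the two values separately, you can unpack them into variables (the names r and p_value are just for clarity):

r, p_value = sp.stats.pearsonr(hybrid['acceleration'], hybrid['msrp'])
print('r =', r)
print('p-value =', p_value)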


Regression

As mentioned earlier, an important tool in data science is to make predictions based on data. The code that we’ve created so far has helped us establish a relationship between our two variables. Once a relationship has been established, it’s time to create a model that predicts unseen data values. To do this we’ll find the equation of the regression line!

The regression line is the best fit line for our data. It’s like an average of where all the points line up. In linear regression, the regression line is a perfectly straight line! Below is a picture showing the best fit line.

[Image: scatter plot of university GPA against high school GPA with a fitted regression line]

As you can infer from the picture, once we find the slope and the y-intercept we can start predicting values! The equation for the above regression to predict university GPA based on high school GPA would look like this:

$UNIGPA_i = \alpha + \beta \, HSGPA_i + \epsilon_i$

The variable we want to predict (or model) is on the left side; the variable we think influences it is on the right side. The $\alpha$ term is the y-intercept, $\beta$ is the slope, and $\epsilon_i$ is the error term that captures the randomness.

We can fit the model by writing the equation, without the $\alpha$ and $\epsilon_i$, in the formula parameter of the function below, and passing our data in the data parameter. Then we just fit the model and ask for a summary. We’ll try a model for:

$MSRP_i = \alpha + \beta \, ACCELERATION_i + \epsilon_i$

mod = smf.ols(formula='msrp ~ acceleration', data=hybrid.to_df())
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   msrp   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.480
Method:                 Least Squares   F-statistic:                     141.5
Date:                Wed, 22 May 2019   Prob (F-statistic):           1.92e-23
Time:                        22:33:36   Log-Likelihood:                -1691.7
No. Observations:                 153   AIC:                             3387.
Df Residuals:                     151   BIC:                             3394.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -2.128e+04   5244.588     -4.058      0.000   -3.16e+04   -1.09e+04
acceleration  5067.6611    425.961     11.897      0.000    4226.047    5909.275
==============================================================================
Omnibus:                       32.737   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               52.692
Skew:                           1.072   Prob(JB):                     3.61e-12
Kurtosis:                       4.915   Cond. No.                         52.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That’s a lot of information! While we should consider everything, we’ll focus on the p-value, the coef, and the R-squared. A p-value of < .05 is generally considered statistically significant. The coef is the increase in the left side variable for a one unit increase in the right side variable: for a 1 unit increase in acceleration, our model predicts an increase of about $5067 in MSRP. But how good is our model? That’s the R-squared, which tells us how much of the variation in the data can be explained by the model. An R-squared of .484 isn’t that bad, but obviously more goes into the MSRP of a car than just acceleration.
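
We can also use the fitted coefficients to make a prediction by hand; a quick sketch (the acceleration value 10 below is just an example):

# predicted msrp = intercept + slope * acceleration
intercept = res.params['Intercept']
slope = res.params['acceleration']
print(intercept + slope * 10)   # the model's predicted msrp for an acceleration of 10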

We can plot this line of “best fit” too:

hybrid.scatter('acceleration', 'msrp', fit_line=True)

[Image: scatter plot of msrp against acceleration with the fitted line]

Covariate

We might want to add another independent variable because we think it could influence our dependent variable. If we think mpg could have an influence and needs to be controlled for, we just add it to the formula with a +. (NB: the example below would exhibit high multicollinearity; we check this just after the summary.)

$MSRP_i = \alpha + \beta_1 \, ACCELERATION_i + \beta_2 \, MPG_i + \epsilon_i$

mod = smf.ols(formula='msrp ~ acceleration + mpg', data=hybrid.to_df())
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   msrp   R-squared:                       0.527
Model:                            OLS   Adj. R-squared:                  0.521
Method:                 Least Squares   F-statistic:                     83.66
Date:                Wed, 22 May 2019   Prob (F-statistic):           3.93e-25
Time:                        22:33:37   Log-Likelihood:                -1685.0
No. Observations:                 153   AIC:                             3376.
Df Residuals:                     150   BIC:                             3385.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     5796.5295   8861.211      0.654      0.514   -1.17e+04    2.33e+04
acceleration  4176.4342    474.195      8.807      0.000    3239.471    5113.398
mpg           -471.9015    127.066     -3.714      0.000    -722.973    -220.830
==============================================================================
Omnibus:                       24.929   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.903
Skew:                           0.922   Prob(JB):                     4.35e-08
Kurtosis:                       4.384   Cond. No.                         282.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
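
As noted above, this model exhibits multicollinearity: the two predictors, acceleration and mpg, are themselves correlated, which makes their individual coefficients harder to interpret. We can check how correlated they are with the same function we used earlier:

sp.stats.pearsonr(hybrid['acceleration'], hybrid['mpg'])  # (r, p-value) for the two predictors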


We are going to walk through a brief example using the datasets that you will be using. You will not be allowed to use this correlation for your project; however, the process for your project will be the same. Our goal is to make the project as seamless as possible for you. That being said, our example will be abbreviated so that you have some freedom to explore with your own data.

First we are going to read in two datasets.

Implicit-Age_IAT.csv gives us the implicit age bias according to FIPS code.

age = Table.read_table('Implicit-Age_IAT.csv')
age.show(5)
FIPS    tyoung    told      D_biep_Young_Good_all
10001   7.13782   7.17363   0.462662
10003   7.07723   6.88274   0.439701
10005   6.84831   6.96089   0.445957
1001    7.05085   6.79661   0.502814
1003    7.17904   7.04148   0.457369

... (3133 rows omitted)

Outcome-Heart-Attack-Mortality.csv gives us the heart attack mortality count by year according to FIPS code.

heart = Table.read_table('Outcome-Heart-Attack-Mortality.csv')
heart.show(5)
State     FIPS   County    Year   Heart_Attack_Mortality   Stability
Alabama   1001   Autauga   2000   220.8                    1
Alabama   1001   Autauga   2001   100.7                    1
Alabama   1001   Autauga   2002   68.2                     1
Alabama   1001   Autauga   2003   66.7                     1
Alabama   1001   Autauga   2005   63.2                     1

... (34780 rows omitted)

Keep in mind that all the functions/syntax that we are using are either in this notebook or in the previous one from lecture.

Now, we are going to join the two tables together. We will be joining on the ‘FIPS’ column because that’s the only one they have in common! That column is a unique identifier for counties. We are also going to drop duplicate rows from the table.

joined_data = age.join("FIPS", heart)
joined_data = joined_data.to_df().drop_duplicates()
joined_data = Table.from_df(joined_data)
joined_data
FIPS   tyoung    told      D_biep_Young_Good_all   State     County    Year   Heart_Attack_Mortality   Stability
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2000   220.8                    1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2001   100.7                    1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2002   68.2                     1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2003   66.7                     1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2005   63.2                     1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2006   68.3                     1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2007   73.9                     1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2008   104.7                    1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2009   60                       1
1001   7.05085   6.79661   0.502814                Alabama   Autauga   2010   93.2                     1

... (34706 rows omitted)

That’s great! By displaying the table, we can get a general idea as to what columns exist, and what kind of relations we can try to analyze.

One thing to notice is that there are a lot of data points! Our visualization and regression may be cleaner if we subset the data. Let’s use the functions from the first notebook to subset the data to California data from 2010.

joined_data = joined_data.where("Year", are.equal_to(2010))
joined_data = joined_data.where("State", are.equal_to("California"))
joined_data
FIPS   tyoung    told      D_biep_Young_Good_all   State        County         Year   Heart_Attack_Mortality   Stability
6001   6.96069   6.78826   0.423026                California   Alameda        2010   45.8                     1
6005   6.575     6.85      0.446357                California   Amador         2010   78.6                     1
6007   7.11892   7.04696   0.442157                California   Butte          2010   62.7                     1
6009   7.65079   7.8254    0.460807                California   Calaveras      2010   99.5                     1
6013   6.96188   6.87655   0.418506                California   Contra Costa   2010   48.9                     1
6015   7         6.83871   0.43266                 California   Del Norte      2010   75.8                     1
6017   6.84016   6.85714   0.408486                California   El Dorado      2010   53.9                     1
6019   7.0461    7.09315   0.424963                California   Fresno         2010   80.4                     1
6021   6.6       7.2       0.260103                California   Glenn          2010   91.7                     1
6023   6.86624   6.79618   0.433324                California   Humboldt       2010   94.4                     1

... (39 rows omitted)

We now have far fewer points, which should make the visualization a bit cleaner.

Let’s make a simple scatter plot with a fit line to look at the relation between the category D_biep_Young_Good_all and Heart_Attack_Mortality. Remember, all of these functions are in either Notebook 1 or Notebook 2!

joined_data.scatter("D_biep_Young_Good_all", "Heart_Attack_Mortality", fit_line = True) 

[Image: scatter plot of Heart_Attack_Mortality against D_biep_Young_Good_all with the fitted line]

As we can see, the correlation, and therefore the linear fit, does not look strong. That means there is no easily quantifiable linear relationship between implicit age bias and heart attack mortality for these counties.

Let’s verify this quantitatively with an r-value calculation!

sp.stats.pearsonr(joined_data['D_biep_Young_Good_all'], joined_data['Heart_Attack_Mortality'])
(-0.2566592892097549, 0.0750451858488917)

Our r-value is moderately negative, and our p-value is fairly low but not < .05, so we would not reject the null hypothesis and call this correlation “significant”. This indicates at most a weak correlation, in line with what our scatter plot suggested.

Let us display a regression summary, just for practice!

mod = smf.ols(formula='Heart_Attack_Mortality ~ D_biep_Young_Good_all + told', data=joined_data.to_df())
res = mod.fit()
print(res.summary())
                              OLS Regression Results                              
==================================================================================
Dep. Variable:     Heart_Attack_Mortality   R-squared:                       0.067
Model:                                OLS   Adj. R-squared:                  0.027
Method:                     Least Squares   F-statistic:                     1.661
Date:                    Wed, 22 May 2019   Prob (F-statistic):              0.201
Time:                            22:33:40   Log-Likelihood:                -215.40
No. Observations:                      49   AIC:                             436.8
Df Residuals:                          46   BIC:                             442.5
Df Model:                               2                                         
Covariance Type:                nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept               104.0577     93.565      1.112      0.272     -84.278     292.394
D_biep_Young_Good_all  -133.0227     79.819     -1.667      0.102    -293.690      27.645
told                      3.0434     11.323      0.269      0.789     -19.749      25.836
==============================================================================
Omnibus:                        3.154   Durbin-Watson:                   1.878
Prob(Omnibus):                  0.207   Jarque-Bera (JB):                2.932
Skew:                           0.585   Prob(JB):                        0.231
Kurtosis:                       2.744   Cond. No.                         269.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

A quick scan above shows an extremely low R-squared and high p-values, both of which imply that our model does not fit the data very well at all.

That’s it! By working through this module, you’ve learned how to visually analyze your data, establish a correlation by calculating the correlation coefficient, set up a regression (with a covariate), and find the regression line!


If you need help, please consult the Data Peers!