This is an exploratory data analysis utilizing some basic statistical concepts:
Linear regression is used to model and predict continuous outcomes under the assumption of normally distributed random errors. There are many different types of regression models, and each is typically defined by the distribution of its prediction errors (called "residuals") and the type of data being modeled.
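For reference, the basic linear model with normally distributed errors can be written as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$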
This project is adapted from Lab 4 in Harvard's CS109 course.
The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.
The packages used in this project are:
- matplotlib
- pandas
- numpy
- statsmodels
- seaborn
# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
boston.data.shape
# Print column names
print(boston.feature_names)
print(boston.DESCR)
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos.head()
bos['PRICE'] = boston.target
bos.head()
bos.describe()
We have data from 506 neighborhoods/townships in the Boston area. Here's a look into the central tendencies and distribution of some key features:
These features have strong relationships to price, exemplified by a small, tight distribution of data around the line of best fit estimated by each plot.
Strong positive correlation: as the average number of rooms increases, the housing price increases.
sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)
plt.title("Relationship between RM and Price")
Strong negative correlation: as the share of the population that is lower status increases, the housing price decreases.
sns.regplot(y="PRICE", x="LSTAT", data=bos, fit_reg = True)
plt.title("Relationship between LSTAT and Price")
Strong negative correlation: the more concentrated NOX is in the air, the lower the price of housing.
sns.regplot(y="PRICE", x="NOX", data=bos, fit_reg = True)
plt.title("Relationship between NOX and Price")
Strong positive correlation: the further the town is from employment centers, the higher the housing price.
sns.regplot(y="PRICE", x="DIS", data=bos, fit_reg = True)
plt.title("Relationship between DIS and Price")
These features also have strong relationships to price, again exemplified by a tight distribution of data around the estimated line of best fit.
As the number of students increases for every teacher, the value of housing decreases.
sns.regplot(y="PRICE", x="PTRATIO", data=bos, fit_reg = True)
plt.title("Relationship between PTRATIO and Price")
As the crime rate increases, the housing price decreases.
sns.regplot(y="PRICE", x="CRIM", data=bos, fit_reg = True)
plt.title("Relationship between Crime Rate and Price")
Taking the log of the data helps to normalize it and reduce skew in the distribution, making patterns more visible and the data more interpretable.
When we look at our Crime Rate vs. Price graph, we see it exhibits exponential decay. This can be corrected by taking the log of the crime rate so that it has a roughly linear relationship with price.
x = np.log(bos.CRIM)
plt.scatter(x, bos.PRICE)
plt.xlabel("Log of Crime Rate")
plt.ylabel("Housing Price")
plt.title("Adjusted Crime Rate vs. Original Prices")
plt.hist(np.log(bos.CRIM))
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()
plt.hist(bos.PTRATIO)
plt.title("PT Ratio")
plt.xlabel("Pupil-Teacher ratio by town ")
plt.ylabel("Frequencey")
plt.show()
plt.hist(bos.RM)
plt.title("RM")
plt.xlabel("Average number of rooms per dwelling ")
plt.ylabel("Frequencey")
plt.show()
plt.hist(bos.LSTAT)
plt.title("LSTAT")
plt.xlabel("% lower status of the population ")
plt.ylabel("Frequencey")
plt.show()
plt.hist(bos.NOX)
plt.title("NOX")
plt.xlabel("NOX in parts per 10 million")
plt.ylabel("Frequencey")
plt.show()
# Work on a copy so the original DataFrame is not modified, and log-transform CRIM
bos1 = bos.copy()
bos1['CRIM'] = np.log(bos1['CRIM'])
# Select CRIM, NOX, RM, DIS, PTRATIO and LSTAT for the correlation matrix
df = bos1.iloc[:, [0,4,5,7,10,12]]
df.corr()
$Y$ = Boston housing prices
and
$X$ = all the other features (or independent variables, predictors),
which will be used to fit a linear regression model and predict Boston housing prices. We will use the least-squares method to estimate the coefficients.
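As a reminder, least squares chooses the coefficients that minimize the sum of squared residuals; in matrix form the estimate is:

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \lVert y - X\beta \rVert^2 = (X^\top X)^{-1} X^\top y$$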
### Fitting a linear regression with statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
m = ols('PRICE ~ PTRATIO + NOX + RM + LSTAT + DIS ',bos1).fit()
print(m.summary())
There is a ton of information in this output. We concentrate on the coefficient table (the middle table).
The p-value for each coefficient (the P>|t| column) is essentially zero, which means that each of our selected features is a statistically significant predictor of PRICE.
In general, the $\hat{\beta_i}, i > 0$ can be interpreted as follows: "A one unit increase in $x_i$ is associated with, on average, a $\hat{\beta_i}$ increase/decrease in $y$ net of all other variables."
On the other hand, the interpretation for the intercept, $\hat{\beta}_0$ is the average of $y$ given that all of the independent variables $x_i$ are 0. Our $\hat{\beta}_0$ is $\$23,000$.
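For reference, the coefficients and 95% confidence intervals discussed below can also be pulled directly from the fitted model (a quick sketch):
# Sketch: print the estimated coefficients and their 95% confidence intervals
# (these should match the coefficient table in m.summary())
print(m.params)
print(m.conf_int())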
Let's look at our three most significant regression coefficients:
- RM: the coefficient of 4.2933 means that, on average, each additional room is associated with an increase of $\$4,300$ in house price net of the other variables. The confidence interval gives us a range of plausible values for this average change, about ($\$3,400, \$5,128$). That's a lot of money for each additional room!
- PTRATIO: the coefficient of -0.9256 means that, on average, each point increase in PTRATIO is associated with a decrease of $\$925$ in house price net of the other variables. The confidence interval gives us a range of plausible values for this average change, about ($\$692, \$1,159$).
- DIS: the coefficient of -0.6926 means that, on average, each unit increase in DIS is associated with a decrease of $\$692$ in house price net of the other variables. The confidence interval gives us a range of plausible values for this average change, about ($\$398, \$998$).

The coefficient of determination, or R-squared, tells us that 70% of the total variance in price can be explained by the linear regression model. This is an important statistic that measures how good our model is at predicting price.
This percentage tells us how much of the variance in $y$ can be explained by the model we have created; the higher the percentage, the better the model fits the data.
At 70%, almost three-quarters of the variability can be attributed to our model, which is high. We have a well-fit model!
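As a sanity check (a small sketch, not part of the original analysis), R-squared can be recomputed by hand from the residuals of the fitted model:
# R^2 = 1 - SS_res / SS_tot; this should match the value reported in m.summary()
ss_res = np.sum(m.resid ** 2)
ss_tot = np.sum((bos1.PRICE - bos1.PRICE.mean()) ** 2)
print(1 - ss_res / ss_tot)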
predicted_prices = m.fittedvalues
plt.scatter(predicted_prices, bos.PRICE)
plt.xlabel("Predicted Prices by Model")
plt.ylabel("Housing Price")
plt.title("Predictions vs. Original Prices")
We create a scatterplot between the predicted prices, available in m.fittedvalues (where m is the fitted model), and the original prices. We evaluate this plot to see how well our regression model predicts price given the data in our bos data set.
A perfect model would give us a scatterplot where all the data lies on the 45-degree line. That would mean that x = y, and every predicted price would equal the actual price.
Below original prices of 20, the predictions are widely spread along the predicted-price axis, meaning the model is less accurate for lower-priced towns than for higher-priced ones. In this region the model tends to overpredict, as much of the data falls below the 45-degree line.
There are some major outliers, below ~10 and above ~30 on predicted prices.
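To make the comparison against the 45-degree line explicit, here is a small sketch that overlays the y = x reference line on the same scatterplot:
# Redraw the predictions vs. actual prices with a y = x reference line
plt.scatter(predicted_prices, bos.PRICE)
plt.plot([0, 50], [0, 50], linestyle='--', color='black')  # perfect-prediction line
plt.xlabel("Predicted Prices by Model")
plt.ylabel("Housing Price")
plt.title("Predictions vs. Original Prices (with 45-degree line)")
plt.show()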
We will take a reduced model using only our strongest predictor, # of Rooms, to predict housing price. We will evaluate this model using:
- Residual Plot
- QQ Plot
- Identification/Elimination of Outliers
- Identification/Elimination of High Leverage Points
- Create a new model after removing these points
### New minimal model:
m = ols('PRICE ~ RM',bos).fit()
print(m.summary())
x = m.fittedvalues
y = m.resid
plt.scatter(x, y)
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Fitted Values vs. Residuals")
This plot checks the model's error assumptions by ensuring there is no pattern between the fitted values and the residuals: the residuals should be scattered randomly around zero with roughly constant variance.
There is a slight negative trend, but no clear pattern in the plot above, so no assumptions appear to have been violated.
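As a complementary, more quantitative check (a sketch, not part of the original notebook), a Breusch-Pagan test can flag non-constant residual variance; a large p-value is consistent with the assumption of constant variance:
from statsmodels.stats.diagnostic import het_breuschpagan
# Null hypothesis: the residuals have constant variance (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(m.resid, m.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)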
sm.qqplot(m.resid)
Q-Q plots take the sample data, sort it in ascending order, and plot it against quantiles calculated from a theoretical distribution. If you were to plot a distribution against itself, you would get a straight line in the Q-Q plot. The straighter the line, the closer the sample is to the theoretical (here, normal) distribution.
Here, we are plotting the quantiles of the residuals against those of a theoretical normal distribution. The line is reasonably straight but curves up and then down, known as a 'hump'. This suggests that the distribution of the residuals may be somewhat wide and flat.
The Residual Q-Q plot tests for normality of the distribution and does a better job at showing outliers and distribution of the data than a residual plot. It is easier to see patterns in the single line of the Q-Q plot than a scatter plot and there is less room for open interpretation.
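If a reference line helps, sm.qqplot accepts a line argument (for example, line='s' fits a line through the standardized quantiles):
# Q-Q plot of the residuals with a fitted reference line for easier comparison
sm.qqplot(m.resid, line='s')
plt.title("Q-Q Plot of Residuals")
plt.show()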
Outliers are marked with red circles:
plt.scatter(-0, 25, s=10000, alpha=0.3, c = 'r' )
plt.scatter(45, 21, s=8000, alpha=0.3, c = 'r' )
predicted_prices = m.fittedvalues
plt.scatter(predicted_prices, bos.PRICE)
plt.xlabel("Predicted Prices by Model")
plt.ylabel("Housing Price")
plt.title("Predictions vs. Original Prices")
Looking at the fitted values vs. prices, there are some outliers for which the model either underpredicted (left red bubble) or overpredicted (right red bubble). This implies that these towns did not fit the model well and were outliers in the original data. These towns, for whatever reason, have some combination of attributes that is not similar to the other towns. They may have higher tax rates even though there is a large percentage of lower-status people, or a low PTRATIO when the model predicts they should have a higher one. There are a multitude of real-world factors, such as legislation and geography, that could contribute to these anomalies.
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(m, ax=ax, criterion="cooks")
A data point has high leverage if it has extreme predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values. For example, with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor.
Observations 367, 365, 364 and 368 are high leverage points.
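Leverage can also be inspected numerically (a sketch using statsmodels' influence diagnostics); with a single predictor, the towns with the most extreme RM values get the largest hat values:
# Leverage (hat values) for each observation in the minimal PRICE ~ RM model
influence = m.get_influence()
leverage = pd.Series(influence.hat_matrix_diag, index=bos.index)
print(leverage.sort_values(ascending=False).head())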
bos['Fitted_Values'] = m.fittedvalues
#m = ols('PRICE ~ RM',bos).fit()
#print(m.summary())
# Getting Index for Underpredicted Outliers
bos[bos.Fitted_Values < 5].head(2)
# The points highlighted in the graph above are at 365, 367
# Getting Index for Overpredicted Outliers
bos[bos.Fitted_Values > 40].tail(1)
# The point highlighted in the graph above is at 364
#Outliers and Leverage points to be removed:
bos.loc[[364,365, 367, 368]]
#Model without Outliers
bos1 = bos.drop([364,365, 367, 368])
m_new = ols('PRICE ~ RM',bos1).fit()
print(m_new.summary())
#Compare to the model with the outliers
m_old = ols('PRICE ~ RM',bos).fit()
print(m_old.summary())
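A quick side-by-side of the fit statistics makes the effect of dropping those points easier to see (a small sketch):
# Compare R-squared before and after removing the outliers and leverage points
print("R-squared with outliers:   ", round(m_old.rsquared, 3))
print("R-squared without outliers:", round(m_new.rsquared, 3))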