Part 1: What are the Airbnb listing trends in the U.S. market?
Part 2: How does the Zillow Home Value Index (ZHVI) relate to Airbnb listing metrics, in particular listing price?
Part 3: Given the ZHVI and the listing trends, develop a model that predicts listing price from these two sets of features.
Airbnb Data Source: http://insideairbnb.com/
Zillow Data Source: http://www.zillow.com/research/data/
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import *
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
cmap = sns.diverging_palette(220, 10, as_cmap=True)
def read_data(location):
location = location[['id', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
'availability_365', 'number_of_reviews', 'review_scores_rating',
'review_scores_accuracy', 'review_scores_value']]
return location
def get_stats(location):
x = ['accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
'availability_365', 'number_of_reviews', 'review_scores_rating',
'review_scores_accuracy', 'review_scores_value']
location = location.loc[:, x]
location_stats = location.describe()
location_stats = concat([location_stats.ix[0:4], location_stats.ix[7:]])
return location_stats
def reorder(location):
new = location.set_index('location', append = True).unstack(0)
return new
austin = read_csv('austin.csv')
austin = read_data(austin)
austin['price'] = austin['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
boston = read_csv('boston.csv')
boston = read_data(boston)
boston['price'] = boston['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
dc = read_csv('dc.csv')
dc = read_data(dc)
dc['price'] = dc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
denver = read_csv('denver.csv')
denver = read_data(denver)
denver['price'] = denver['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
la = read_csv('la.csv')
la = read_data(la)
la['price'] = la['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
nashville = read_csv('nashville.csv')
nashville = read_data(nashville)
nashville['price'] = nashville['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
nyc = read_csv('nyc.csv')
nyc = read_data(nyc)
nyc['price'] = nyc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
portland = read_csv('portland.csv')
portland = read_data(portland)
portland['price'] = portland['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
sandiego = read_csv('sandiego.csv')
sandiego = read_data(sandiego)
sandiego['price'] = sandiego['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
seattle = read_csv('seattle.csv')
seattle = read_data(seattle)
seattle['price'] = seattle['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
sf = read_csv('sf.csv')
sf = read_data(sf)
sf['price'] = sf['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
neworleans = read_csv('new_orleans.csv')
neworleans = read_data(neworleans)
neworleans['price'] = neworleans['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
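As a side note, the str(x)[1:] slice above drops the leading '$' but leaves any thousands separators in place, so a price such as '$1,200.00' will not convert cleanly and ends up as NaN. A more defensive loader might look like the sketch below (assuming a reasonably recent pandas and the same per-city CSV filenames; load_city is a hypothetical helper, and the rest of the notebook keeps the original approach):
# Sketch: load one city with a more defensive price parse.
# read_data() is the column-selection helper defined above.
import pandas as pd
def load_city(path):
    df = read_data(pd.read_csv(path))
    # strip '$' and ',' before converting; values that still fail become NaN instead of raising
    cleaned = df['price'].astype(str).str.replace(r'[$,]', '', regex=True)
    df['price'] = pd.to_numeric(cleaned, errors='coerce')
    return df
# e.g. austin = load_city('austin.csv'), and similarly for the other cities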
A look into the Airbnb listing features provided for each city by the data:
Creating a dataframe to house statistical summaries of the listing data so that we can analyze listing trends.
austin_s = get_stats(austin)
austin_s['location'] = 'Austin, TX'
austin_s = reorder(austin_s)
boston_s = get_stats(boston)
boston_s['location'] = 'Boston, MA'
boston_s = reorder(boston_s)
dc_s = get_stats(dc)
dc_s['location'] = 'Washington, DC'
dc_s = reorder(dc_s)
denver_s = get_stats(denver)
denver_s['location'] = 'Denver, CO'
denver_s = reorder(denver_s)
la_s = get_stats(la)
la_s['location'] = 'Los Angeles-Long Beach-Anaheim, CA'
la_s = reorder(la_s)
nashville_s = get_stats(nashville)
nashville_s['location'] = 'Nashville, TN'
nashville_s = reorder(nashville_s)
nyc_s = get_stats(nyc)
nyc_s['location'] = 'New York, NY'
nyc_s = reorder(nyc_s)
portland_s = get_stats(portland)
portland_s['location'] = 'Portland, OR'
portland_s = reorder(portland_s)
sandiego_s = get_stats(sandiego)
sandiego_s['location'] = 'San Diego, CA'
sandiego_s = reorder(sandiego_s)
seattle_s = get_stats(seattle)
seattle_s['location'] = 'Seattle, WA'
seattle_s = reorder(seattle_s)
sf_s = get_stats(sf)
sf_s['location'] = 'San Francisco, CA'
sf_s = reorder(sf_s)
neworleans_s = get_stats(neworleans)
neworleans_s['location'] = 'New Orleans, LA'
neworleans_s = reorder(neworleans_s)
statistics = concat([austin_s, boston_s, dc_s, denver_s, la_s, nashville_s, nyc_s, portland_s, sandiego_s, seattle_s, sf_s, neworleans_s])
statistics
Columns included in statistical summary:
statistics.columns.values
Looking at descriptive statistics in each city to explore listing trends.
New York and Los Angeles are outliers, with far more listings than the other cities. This high supply of listings may suggest that they are the most popular travel destinations. San Francisco comes in third and San Diego fourth.
As expected, the smaller and less frequented cities, Portland, Nashville and Denver, have the fewest listings.
x = {'Austin, TX': len(austin), 'Boston, MA' : len(boston),
'Washington, DC' : len(dc), 'Denver, CO' : len(denver),
'Los Angeles-Long Beach-Anaheim, CA': len(la), 'Nashville, TN': len(nashville),
'New York, NY': len(nyc), 'Portland, OR' : len(portland),
'San Diego, CA' : len(sandiego), 'Seattle, WA' : len(seattle),
'San Francisco, CA' : len(sf), 'New Orleans, LA' : len(neworleans) }
df = DataFrame(x, index=[0]).stack()
df = DataFrame(df).sort([0], ascending=[False]).reset_index(0)
df = df[0]
df = DataFrame(df)
df.columns = ['Count of Listings']
df
df['Count of Listings'].plot(kind = 'bar', cmap = cmap)
plt.title('Count of Listings')
While the raw listing counts are interesting, it is hard to draw conclusions about the density of Airbnb listings in each city, because larger cities naturally have more listings.
So we normalize the listing counts by dividing them by the population of each metro area in millions.
# Creating DataFrame of population in each metro area in millions from US Census 2015
pop = read_csv('pop_cities_census.csv').convert_objects(convert_numeric=True)
pop.columns = ['City', 'Population']
pop = pop.set_index('City')
#Merging with listing data and normalizing
df1 = concat([df, pop], axis= 1)
df1 = df1.astype('float')
df1['Normalized by Population'] = df1['Count of Listings'] / df1['Population']
df1.sort(['Normalized by Population'], ascending=[False]).reset_index(0)
The demand for, or popularity of, listings in a city can be estimated by the average number of booked days in the upcoming calendar year.
The supply of listings in a city can be estimated by the normalized number of listings in a city.
Let's look at the relationship between these two variables:
z = [(365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True), df1['Normalized by Population']]
df2 = concat(z, axis = 1)
df2
x = df2['Normalized by Population']
y = df2['mean']
n = (df2.index.tolist())
fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 9)
plt.xlabel('Listing Count (Normalized)', fontsize=14)
plt.ylabel('Booked Days in the next Calendar Year', fontsize=14)
plt.title('Popularity and Normalized Listing Count', fontsize=14, fontweight='bold')
Austin, San Francisco and Nashville have the most expensive listings on average, while Denver, Seattle and Portland are the cheapest places to stay.
DataFrame(statistics['price']['mean'].convert_objects(convert_numeric=True)).sort(['mean'], ascending=[False])
Let's look at price in the context of another variable, normalized listing count. There is a moderately strong positive correlation, with a correlation coefficient of around 0.5, indicating that as the normalized number of listings increases, price tends to increase as well.
z = [statistics['price']['mean'].convert_objects(convert_numeric=True), df1['Normalized by Population']]
df2 = concat(z, axis = 1)
x = df2['Normalized by Population']
y = df2['mean']
n = (df2.index.tolist())
fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 10)
plt.xlabel('Listing Count (Normalized)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Normalized Listing Count', fontsize=14, fontweight='bold')
# Correlation between the two variables
df2.corr()
The first metric we will analyze is the average number of days that a listing is booked over the next year. The higher the number of booked days, the more popular the city. By this measure, the most popular travel destinations are San Francisco, New York and Boston. These are large cities, so this relationship makes sense.
Interestingly, Portland and Denver come in fourth and fifth. These cities are characterized by few listings and low prices, and users may be drawn to them by those low prices.
A second metric for gauging the popularity of a city's listings is the mean number of reviews per listing. The more a city's listings are reviewed on average, the more they tend to be booked. The orderings of the cities on the two lists are similar.
Seattle jumps from near the bottom of the list on booked days to near the top on number of reviews. So although Seattle is not as popular with travelers as the other cities in the data set, it is well reviewed, which may indicate the superior value of Seattle's Airbnb listings.
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True)
x = DataFrame(x).sort(['mean'], ascending=[False])
x.columns = ['Avg. Booked Days (next 365)']  # 365 - availability measures booked days, not availability
x
x = DataFrame(statistics['number_of_reviews']['mean'].convert_objects(convert_numeric=True)).sort(['mean'], ascending=[False])
x.columns = ['Avg. # of Reviews']
x
There seems to be a positive correlation between price and the popularity of listings, measured as the average number of days a city's listings are booked. However, this trend only emerges once we eliminate Austin as an outlier. The first graph below contains the Austin data point; the second removes it.
The more popular the listings (the higher the number of booked days), the higher the price. This suggests that cities that are more popular with travelers cost more to stay in: demand drives up the price.
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True)
y = statistics['price']['mean'].convert_objects(convert_numeric=True)
n = (x.reset_index()).location.tolist()
fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap , s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 10)
plt.xlabel('Average Number of Days booked in next 365 days', fontsize=14)
plt.ylabel('Average Price', fontsize=14)
plt.title('Average Price and Popularity of Listings', fontsize=14, fontweight='bold')
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True).drop('Austin, TX')
y = statistics['price']['mean'].convert_objects(convert_numeric=True).drop('Austin, TX')
n = (x.reset_index()).location.tolist()
fig, ax = plt.subplots(figsize=(9, 7))
ax.scatter(x, y, c=y,cmap = cmap , s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 15)
plt.xlabel('Average Number of Days booked in next 365 days', fontsize=18)
plt.ylabel('Average Price', fontsize=18)
plt.title('Average Price and Popularity of Listings', fontsize=18, fontweight='bold')
# Correlation between the two variables
df = concat ([x,y], axis =1 )
df.corr()
The data here is very dispersed and suggests that ratings and price do not have a clear relationship.
This reveals an interesting quality of Airbnb listings relative to the traditional hotel industry. Paying more or less for a listing says little about the quality of the stay. In the traditional hotel industry, quality is largely driven by price (hotel vs. motel): the more one pays, the better the stay tends to be. Airbnb users, by contrast, can pay very little and have a great experience (Portland) or the reverse (Boston, for example, has a low average rating for its average price).
x = statistics['review_scores_rating']['mean'].convert_objects(convert_numeric=True)
y = statistics['price']['mean'].convert_objects(convert_numeric=True)
n = (x.reset_index()).location.tolist()
fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 10)
plt.xlabel('Average Rating', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Average Rating', fontsize=14, fontweight='bold')
df1 = statistics.iloc[:,[2,7,12,17,22,27,32,37,42,47]].reset_index().drop('location',axis = 1).fillna(0)
df1.columns = df1.columns.droplevel(1)
df1.columns = ['acc', 'bt', 'bd', 'pr', 'min', '365', '#rev', 'rat', 'rev_acc', 'val']  # 'rev_acc' avoids duplicating the 'acc' label used for accommodates
#df_zillow = df_zillow['ZHVI_10Year']
#df1 = concat([df1, df_zillow], axis = 1).fillna(0)
coor= df1.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(df1.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df1.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=2, yticklabels=5,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
The ideal home price index would be based off sale prices for the same set of homes in each time period so there was never an issue of the sales mix being different across periods. This approach of using a constant basket of goods is widely used, common examples being a commodity price index and a consumer price index. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we can’t observe prices on the same set of homes in all time periods because not all homes are sold in every time period.
The innovation that Zillow developed is a way of approximating this ideal home price index by leveraging the valuations Zillow creates on all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. Because of this fact, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes even if they didn’t sell in that time period.
Using this methodology, a comprehensive and robust benchmark of home value trends can be computed that is immune to the changing mix of properties sold in different periods.
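As a toy illustration (entirely synthetic numbers, not Zillow's actual computation), compare an index built from estimated values of all homes with a median of actual sale prices when the mix of homes sold shifts between periods:
# Toy example: four homes, each appreciating 5% between two periods.
# Cheaper homes happen to sell in period 1, pricier homes in period 2.
homes = DataFrame({
    'zestimate_t1': [200, 300, 400, 900],   # estimated values in period 1 ($000s)
    'zestimate_t2': [210, 315, 420, 945],   # estimated values in period 2 ($000s)
    'sold_t1':      [True, True, False, False],
    'sold_t2':      [False, False, True, True]})
print('All-home median:  t1=%.1f  t2=%.1f'
      % (homes['zestimate_t1'].median(), homes['zestimate_t2'].median()))
print('Sold-only median: t1=%.1f  t2=%.1f'
      % (homes.loc[homes['sold_t1'], 'zestimate_t1'].median(),
         homes.loc[homes['sold_t2'], 'zestimate_t2'].median()))
The all-home median rises about 5% (350 to 367.5), tracking the true appreciation, while the sold-only median jumps from 250 to 682.5 purely because a different set of homes sold.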
ZHVI = read_csv('Metro_Zhvi_Summary_AllHomes (1).csv').set_index('RegionName')
x = ['Austin, TX', 'Boston, MA', 'Washington, DC', 'Denver, CO', 'Los Angeles-Long Beach-Anaheim, CA', 'Nashville, TN', 'New York, NY', 'Portland, OR', 'San Diego, CA', 'Seattle, WA', 'San Francisco, CA', 'New Orleans, LA']
y = ['Zhvi', 'MoM', 'QoQ', 'YoY', '5Year', '10Year']
ZHVI = ZHVI.loc[x , y ]
Putting all the data into one DataFrame for correlation analysis:
df_zillow = ZHVI
df_zillow.columns = ['ZHVI', 'ZHVI_MoM', 'ZHVI_QoQ', 'ZHVI_YoY', 'ZHVI_5Year', 'ZHVI_10Year']
df_zillow
The average price of a listing seems to increase as the median home value increases.
Nashville and Austin are two outliers, with low median home values and high average listing prices.
df = concat([statistics['price']['mean'], df_zillow['ZHVI']], axis = 1).dropna()
x = df['ZHVI']
y = df['mean']
n = (df.index).tolist()
fig, ax = plt.subplots(figsize=(12, 10))
ax.scatter(x, y, c=y,cmap = cmap, s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 16)
plt.xlabel('Median Home Price', fontsize=20)
plt.ylabel('Average Price ABNB Listing', fontsize=20)
plt.title('Home Price(ZHVI) and Airbnb Listing Price', fontsize=20, fontweight='bold')
df = df.drop(['Nashville, TN', 'Austin, TX'])
df.corr()
A little over half of the cities follow the pattern of average listing price increasing with home value growth. However, LA, Seattle, Denver and Portland have had high growth in home values over the last five years yet have maintained relatively low listing prices.
df = concat([statistics['price']['mean'], df_zillow['ZHVI_5Year']], axis = 1).dropna()
x = df['ZHVI_5Year'] *100
y = df['mean']
n = (df.index).tolist()
fig, ax = plt.subplots(figsize=(12, 10))
ax.scatter(x, y, c=y,cmap = cmap , s =200)
for i, txt in enumerate(n):
ax.annotate(txt, (x[i],y[i]), fontsize = 16)
plt.xlabel('5 Year Home Price Growth ', fontsize=20)
plt.ylabel('Average Price ABNB Listing', fontsize=20)
plt.title('Home Price Growth, 5 Years and ABNB Listing Price', fontsize=20, fontweight='bold')
df2 = df_zillow
df2.columns = ['Z', 'M', 'Q', 'Y', '5', '10']
coor= df1.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(df2.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df2.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=2, yticklabels=5,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
df1 = DataFrame(statistics.iloc[:,17])
df = concat([df1, df2], axis =1)
coor= df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(df.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df.corr(), mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=2, yticklabels=5,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Let's look at the features we have from the Airbnb data set and the Zillow data set and plot them against price to determine which features, or independent variables, should be used to predict price. We are looking for a linear relationship, which may have to be achieved through transformation.
def read_data(location):
location = location[['zipcode', 'accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
'availability_365', 'number_of_reviews', 'review_scores_rating',
'review_scores_accuracy', 'review_scores_value']].dropna(axis = 0)
location = location.set_index('zipcode')
return location
1) We re-read the Airbnb data and create a DataFrame containing information on every listing in each city, with the index set to the zip code.
2) We then merge the real estate data and the Airbnb listing data on zip code and assess our features.
austin = read_csv('austin.csv')
austin = read_data(austin)
austin['price'] = austin['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
boston = read_csv('boston.csv')
boston = read_data(boston)
boston['price'] = boston['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
dc = read_csv('dc.csv')
dc = read_data(dc)
dc['price'] = dc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
denver = read_csv('denver.csv')
denver = read_data(denver)
denver['price'] = denver['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
la = read_csv('la.csv')
la = read_data(la)
la['price'] = la['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
nashville = read_csv('nashville.csv')
nashville = read_data(nashville)
nashville['price'] = nashville['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
nyc = read_csv('nyc.csv')
nyc = read_data(nyc)
nyc['price'] = nyc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
portland = read_csv('portland.csv')
portland = read_data(portland)
portland['price'] = portland['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
sandiego = read_csv('sandiego.csv')
sandiego = read_data(sandiego)
sandiego['price'] = sandiego['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
seattle = read_csv('seattle.csv')
seattle = read_data(seattle)
seattle['price'] = seattle['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
sf = read_csv('sf.csv')
sf = read_data(sf)
sf['price'] = sf['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
#DataFrame with all listings and Airbnb features:
df_abnb = concat([austin, boston, dc, denver, la, nashville, nyc, portland, sandiego, seattle, sf])
#DataFrame with Zillow Feature: Zillow Home Value Index
ZHVI = read_csv('Zip_Zhvi_Summary_AllHomes.csv').set_index('RegionName')
ZHVI = ZHVI.ix[:,[7,11,12]]
# Merge on Zipcode
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])
df1.head()
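One caveat when joining on zip code is that the two indexes may not line up cleanly: Airbnb zip codes can be strings (sometimes ZIP+4, e.g. '78704-1234') while Zillow's RegionName is an integer, and leading zeros (e.g. Boston's 02134) are lost when zips are read as numbers. A defensive normalization sketch (normalize_zip is a hypothetical helper, not used elsewhere in this notebook):
# Normalize both indexes to 5-character zip strings before merging.
def normalize_zip(z):
    z = str(z).split('-')[0].split('.')[0]   # drop ZIP+4 suffix and any float artifact
    return z.zfill(5) if z.isdigit() else None
df_abnb_z = df_abnb.copy()
df_abnb_z.index = df_abnb_z.index.map(normalize_zip)
ZHVI_z = ZHVI.copy()
ZHVI_z.index = ZHVI_z.index.map(normalize_zip)
df1_zip = merge(df_abnb_z, ZHVI_z, left_index=True, right_index=True).dropna()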
Let's plot the independent variables versus price to determine which features have the strongest linear relationship with price and which we may have to transform to achieve linearity with price.
We transform Availability and # of Reviews using logs to achieve a linear trend with price.
Both have exponential relationships with price, which are corrected by taking the log.
x = np.log(df1['availability_365'])
y = df1['price']
fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim(0,1000)
plt.xlim(-.01,1.75)
plt.xlabel('Log of Availability', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Availability', fontsize=14, fontweight='bold')
x1 = np.log(df1['number_of_reviews'])
y1 = df1['price']
fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'r', alpha = .5)
plt.ylim(0,1000)
plt.xlim(-.01,1.75)
plt.xlabel('Log of # of Reviews', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and # of Reviews', fontsize=14, fontweight='bold')
x2 = (df1['Zhvi'])
y2 = df1['price']
fig, ax = plt.subplots()
ax.scatter(x2, y2, c= 'r', alpha = .5)
plt.ylim (0,1000)
plt.xlim (200000,1400000)
plt.xlabel('ZHVI (Median Home Price)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Median Home Value (ZHVI)', fontsize=14, fontweight='bold')
x = (df1['10Year'])
y = df1['price']
fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim (0,1000)
plt.xlim(-.02,.06)
plt.xlabel('10 Year Change in ZHVI (Median Home Price)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and 10 Year Median Home Value (ZHVI)', fontsize=14, fontweight='bold')
x1 = (df1['5Year'])
y1 = df1['price']
fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'm', alpha = .5)
plt.xlabel('5 Year Change in ZHVI (Median Home Price)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and 5 Year Median Home Value (ZHVI)', fontsize=14, fontweight='bold')
plt.ylim (0,1000)
plt.xlim(.045,.15)
x = (df1['reviewtotal'])
y = df1['price']
fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim (0,1000)
plt.xlim(20 ,122)
plt.xlabel('Sum of Ratings', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Sum of Ratings', fontsize=14, fontweight='bold')
x1 = np.log(df1['bedrooms'])
y1 = df1['price']
fig, ax = plt.subplots()
ax.scatter(x1, y1, c = 'r', alpha = .5)
plt.ylim (0,1000)
plt.xlim(-.1 ,2)
plt.xlabel('# of Bedrooms', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('# of Bedrooms and Price', fontsize=14, fontweight='bold')
x1 = df1['bedrooms']
x2 = df1['bathrooms']
x3 = df1['accommodates']
y1 = df1['price']
fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'c', alpha = .5)
ax.scatter(x2, y1, c='r', alpha = .5)
ax.scatter(x3, y1, c='m', alpha = .5)
plt.ylim (-2,1000)
plt.xlim(-.1 ,17)
plt.xlabel('# of Bedrooms, Bathrooms, Accommodates', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('# of Bedrooms/Bathroom/Accommodates and Price', fontsize=14, fontweight='bold')
We are going to create a linear regression model to predict the price of an Airbnb listing in the US based on the listing data and home value data in these cities.
The Airbnb features being used are: log of the number of reviews, log of availability over the next 365 days, the combined review score (rating + accuracy + value), accommodates, and bathrooms.
The Zillow home value data being used are: the ZHVI (median home value) and its 5-year and 10-year growth rates, matched to listings by zip code.
#The merged features dataframe:
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])
#Transforming:
df1['number_of_reviews'] = np.log(df1['number_of_reviews'])
df1['availability_365'] = np.log(df1['availability_365'])
#Final Features dataframe
features = df1[[ 'Zhvi', 'number_of_reviews', 'availability_365', 'reviewtotal', '10Year', '5Year', 'accommodates', 'bathrooms']].replace([np.inf, -np.inf], np.nan).fillna(0)
Another requirement for a linear model is that the independent variables, or features, not be strongly related to one another, since multicollinearity confounds the model. To check for this, we use a correlation matrix. Relationships between variables are measured on a scale from -1 to 1: the larger the absolute value, the stronger the relationship, with the sign giving its direction. On this basis, bedrooms was removed as a feature because it is very strongly correlated with bathrooms, accommodates and ZHVI.
The final correlation matrix shows no strong correlations among the independent variables:
features.corr()
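As an optional complementary check (a sketch assuming statsmodels is installed; it is not used elsewhere in this notebook), variance inflation factors can flag multicollinearity that pairwise correlations alone can miss:
# VIF for each feature: rule of thumb, values above ~5-10 indicate problematic collinearity.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
exog = sm.add_constant(features)
vifs = Series([variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
              index=features.columns)
print(vifs)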
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
X = features
Y = DataFrame(df1['price']).fillna(0)
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)
clf = LinearRegression()
clf.fit(X_train, Y_train)
print 'R-squared on Training Data:', r2_score(Y_train, clf.predict(X_train))
print 'R-squared on Testing Data:', r2_score(Y_test, clf.predict(X_test))
from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, clf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, clf.predict(X_train))
coef_df = concat([DataFrame(features.columns), DataFrame(clf.coef_).transpose()], axis = 1)
coef_df.columns = ['Feature', 'Regression Coefficient']
coef_df
The R-squared, or coefficient of determination, measures how close the training and test data are to the fitted regression line; in other words, how well the model predicts the price of a listing from our selected features.
Here the coefficient of determination tells us that roughly 56.9% of the total variance in listing price is explained by the linear regression model. At about 56%, a little over half of the variability is captured. This is reasonable, but we may be able to do better with other regression algorithms, a different set of features, or a different range of data. The test set performed similarly, which is another good sign. The high MSE is a concern, but one that can be addressed as we refine the model.
Listed below the R-squared values are the regression coefficients that define the linear relationships in this model. The coefficient for each feature describes how much the listing price changes for a one-unit increase in that feature, holding all others fixed. The 10-year and 5-year changes in median home value appear to be the most powerful predictors: they have a large impact on listing price net of all other features. For example, a one percent increase in home value over 5 years increases the predicted price of a listing by about 88 dollars, net of all other features.
This is a good starting point. We will use the mean squared error along with the R-squared to compare accuracy and fit as we try to improve the model. There is definitely room for improvement, especially in the MSE and R-squared.
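To make the "one-unit increase, all else held fixed" reading concrete, a small sketch (using the clf, features and X_test already defined above) compares the model's prediction before and after nudging a single feature; the change matches that feature's coefficient:
# Nudge one feature of a held-out listing and compare predictions.
example = X_test.iloc[[0]].copy()
base_price = clf.predict(example).ravel()[0]
bumped = example.copy()
bumped['accommodates'] = bumped['accommodates'] + 1   # one more guest accommodated
delta = clf.predict(bumped).ravel()[0] - base_price
idx = list(features.columns).index('accommodates')
print('Predicted change for +1 accommodates: %.2f' % delta)
print('Coefficient for accommodates:         %.2f' % clf.coef_.ravel()[idx])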
import matplotlib.patches as mpatches
plt.figure(figsize=(10,6))
plt.scatter(clf.predict(X_train), clf.predict(X_train) - Y_train, c='c', s=40, alpha=0.5)
plt.scatter(clf.predict(X_test), clf.predict(X_test) - Y_test, c='r', s=40, alpha = 0.65)
plt.hlines(y = 0, xmin=-10, xmax = 750)  # span the full x-range set below
#plt.title('Residual Plot using training (blue) and test (green) data')
plt.title('Residuals vs Fitted Values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')
blue_patch = mpatches.Patch(color='c', label='Train')
green_patch = mpatches.Patch(color ='r', label='Test')
plt.legend(handles=[blue_patch, green_patch])
plt.xlim(-10, 750)
plt.ylim(-800,400)
# Looking at the distribution of listing prices to justify disregarding the heteroscedasticity
sns.distplot(Y, kde = False)
plt.title('Distribution of Price')
plt.xlabel('Price')
The residuals here are the differences between the observed listing prices and the prices predicted by the model. Residuals should be distributed randomly above and below the axis, clustering toward the middle without any clear pattern. These qualities are indicative of a good linear model with normally distributed errors, and normality of the errors is an assumption of linear regression.
The model appears to be heteroscedastic: the residuals grow larger as the predicted price increases. The model is better at predicting prices for average-priced listings, around 300 dollars a night or lower. For those values, the residuals are evenly and randomly distributed above and below the axis with no pattern, which is an indication of a good model.
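One way to check this impression more formally is a Breusch-Pagan test on the training residuals (a sketch assuming statsmodels is available; a small p-value is evidence that the residual variance depends on the features):
# Breusch-Pagan test for heteroscedasticity of the fitted linear model.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
residuals = np.ravel(Y_train) - np.ravel(clf.predict(X_train))
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, sm.add_constant(X_train))
print('Breusch-Pagan LM p-value: %.4g' % lm_pvalue)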
We may be able to achieve better predictions and a higher R-squared if we segment the cities by a strong variable such as availability. Grouping similar listings together should yield stronger relationships between the features and price.
x = ((statistics['availability_365']['mean'] / 365)*100).convert_objects(convert_numeric=True)
DataFrame(x).sort(['mean'], ascending=[True])
Highly Popular Cities: San Francisco, NYC, Boston, Portland and San Diego
#DataFrame with Airbnb data on highly popular cities:
df_abnb = concat([sf, nyc, boston, portland, sandiego])
#DataFrame with Zillow Feature: Zillow Home Value Index
ZHVI = read_csv('Zip_Zhvi_Summary_AllHomes.csv').set_index('RegionName')
ZHVI = ZHVI.ix[:,[7,11,12]]
# Merge on Zipcode
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])
#Transforming:
df1['number_of_reviews'] = np.log(df1['number_of_reviews'])
df1['availability_365'] = np.log(df1['availability_365'])
#Final Features dataframe
features = df1[[ 'Zhvi', 'number_of_reviews', 'availability_365', 'reviewtotal', '10Year', '5Year', 'accommodates', 'bathrooms']].replace([np.inf, -np.inf], np.nan).fillna(0)
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
X = features
Y = DataFrame(df1['price']).fillna(0)
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)
clf = LinearRegression()
clf.fit(X_train, Y_train)
print 'R-squared on Training Data:', r2_score(Y_train, clf.predict(X_train))
print 'R-squared on Testing Data:', r2_score(Y_test, clf.predict(X_test))
from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, clf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, clf.predict(X_train))
coef_df = concat([DataFrame(features.columns), DataFrame(clf.coef_).transpose()], axis = 1)
coef_df.columns = ['Feature', 'Coefficient']
coef_df
The R-squared tells us what share of the variance in the target can be explained by the model; the higher the percentage, the better the model fits the data. This model's coefficient of determination tells us that roughly 64.13% of the total variance in the price of a listing in these popular cities is explained by the linear regression model.
At 64%, almost two thirds of the variability is captured, much higher than in our original model using all cities. This suggests that by grouping the most popular cities we have built a stronger linear model: listing prices in the most popular cities respond similarly to changes in the features, producing a better linear fit. The MSE is also reduced, another indication that this subset yields a better model.
This is a good step toward a better model through manipulating the data set; however, we may be able to increase the fit further by using another algorithm.
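Since a single train/test split can be noisy, a quick k-fold cross-validation gives a less split-dependent read on the fit (a sketch; cross_val_score lives in sklearn.model_selection on current versions, or under sklearn.cross_validation in the older version used in this notebook):
# 5-fold cross-validated R-squared for the popular-cities linear model.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X, np.ravel(Y), cv=5, scoring='r2')
print('R-squared per fold: %s' % (np.round(cv_scores, 3),))
print('Mean R-squared:     %.3f' % cv_scores.mean())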
import matplotlib.patches as mpatches
plt.figure(figsize=(10,6))
plt.scatter(clf.predict(X_train), clf.predict(X_train) - Y_train, c='c', s=40, alpha=0.5)
plt.scatter(clf.predict(X_test), clf.predict(X_test) - Y_test, c='r', s=40, alpha = 0.65)
plt.hlines(y = 0, xmin=-10, xmax = 750)  # span the full x-range set below
#plt.title('Residual Plot using training (blue) and test (green) data')
plt.title('Residuals vs Fitted Values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')
blue_patch = mpatches.Patch(color='c', label='Train')
green_patch = mpatches.Patch(color ='r', label='Test')
plt.legend(handles=[blue_patch, green_patch])
plt.xlim(-10,750)
plt.ylim(-800, 400)
This model appears less heteroscedastic than the previous one. The residuals sit in a tighter band around the axis and still show no pattern, indicating that we have retained approximately normal errors in this subset of the data.
Next, we test a random forest regression on the popular-cities data to see if it results in a better-fitting model.
A random forest "bootstraps" the training data (draws many random samples with replacement), fits a regression tree to each sample, and averages the trees' predictions; unlike the previous approach, the resulting model is not linear.
Let's see how it performs on our popular-cities data:
from sklearn.ensemble import RandomForestRegressor
X = features
Y = DataFrame(df1['price']).fillna(0)
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)
rf = RandomForestRegressor(random_state = 1)
rf.fit(X_train, Y_train)
print "R-squared for training data: ", rf.score(X_train,Y_train)
print "R-squared for test data: ", rf.score(X_test,Y_test)
from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, rf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, rf.predict(X_train))
The fit has definitely improved. The model now has an R-squared of over 70%, much higher than both linear models, and the mean squared errors are drastically reduced, from around 5700 to 3700.
plt.scatter(rf.predict(X_test), Y_test.as_matrix())
plt.title('Random Forest: Predicted vs. Actual Price')
plt.xlabel('Predicted Listing Price')
plt.ylabel('Actual Listing Price')
plt.xlim(0,1000)
x = df1.price
sns.distplot(x, kde=False)
plt.title('Distribution of Price')
We plot predicted versus actual prices for the test data to evaluate the model. The main effect apparent in this graph is a funneling effect: as price increases, the model does not predict as well. This funneling mainly affects listings priced above roughly 400 dollars per night; below that, the model predicts listing prices well.
Looking at the distribution of price for the data set used to build the model, the majority of listings are priced below this benchmark, meaning that most listings fall in a range the model can predict accurately.
We therefore conclude that this model is best used for "average"-priced listings and accept the funneling effect as a known limitation.
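As a final check on what drives the forest's predictions, its feature importances can be inspected (a sketch using the rf and features objects fitted above; sort_values requires pandas 0.17+):
# Relative importance of each feature in the fitted random forest.
importances = Series(rf.feature_importances_, index=features.columns).sort_values()
print(importances)
importances.plot(kind='barh')
plt.title('Random Forest Feature Importances')
plt.xlabel('Relative Importance')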