Insights into US Airbnb Listing Metrics using Zillow Home Value Index and Real Estate Data


Research Paper summarizing the process and findings of this project can be found here:

https://github.com/nsamrao/Airbnb/blob/master/Airbnb%20Analysis%20Capstone%20Final%20Paper.pdf



This exploration will be conducted in three parts:

Part 1: What are the Airbnb listing trends in the U.S. Market?

Part 2: The Zillow Home Value Index (ZHVI) And Airbnb Metrics, Listing Price

Part 3: Given ZHVI and listing trends, develop a model that can predict price from these two sets of features.

Airbnb Data Source: http://insideairbnb.com/

Zillow Data Source: http://www.zillow.com/research/data/

In [138]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import *
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

cmap = sns.diverging_palette(220, 10, as_cmap=True)
In [139]:
def read_data(location):
    location = location[['id', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
       'availability_365', 'number_of_reviews', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_value']]
    return location

def get_stats(location):
    x = ['accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
       'availability_365', 'number_of_reviews', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_value']
    location = location.loc[:, x]
    location_stats = location.describe()
    location_stats = concat([location_stats.ix[0:4], location_stats.ix[7:]])
    return location_stats

def reorder(location):
    new = location.set_index('location', append = True).unstack(0)
    return new

The Airbnb Data

a) Reading and Cleaning:

In [140]:
austin = read_csv('austin.csv')
austin = read_data(austin)
austin['price'] = austin['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

boston = read_csv('boston.csv')
boston = read_data(boston)
boston['price'] = boston['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

dc = read_csv('dc.csv')
dc = read_data(dc)
dc['price'] = dc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

denver = read_csv('denver.csv')
denver = read_data(denver)
denver['price'] = denver['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

la = read_csv('la.csv')
la = read_data(la)
la['price'] = la['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

nashville = read_csv('nashville.csv')
nashville = read_data(nashville)
nashville['price'] = nashville['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

nyc = read_csv('nyc.csv')
nyc = read_data(nyc)
nyc['price'] = nyc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

portland = read_csv('portland.csv')
portland = read_data(portland)
portland['price'] = portland['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

sandiego = read_csv('sandiego.csv')
sandiego = read_data(sandiego)
sandiego['price'] = sandiego['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

seattle = read_csv('seattle.csv')
seattle = read_data(seattle)
seattle['price'] = seattle['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

sf = read_csv('sf.csv')
sf = read_data(sf)
sf['price'] = sf['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

neworleans = read_csv('new_orleans.csv')
neworleans = read_data(neworleans)
neworleans['price'] = neworleans['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
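As an aside, the twelve per-city blocks above (and the per-city statistics blocks below) differ only in the file and city name, so they could be collapsed into a loop. A minimal sketch, assuming the same CSV filenames and a current pandas where convert_objects is replaced by to_numeric:

import pandas as pd

city_files = {'austin': 'austin.csv', 'boston': 'boston.csv', 'dc': 'dc.csv',
              'denver': 'denver.csv', 'la': 'la.csv', 'nashville': 'nashville.csv',
              'nyc': 'nyc.csv', 'portland': 'portland.csv', 'sandiego': 'sandiego.csv',
              'seattle': 'seattle.csv', 'sf': 'sf.csv', 'neworleans': 'new_orleans.csv'}

def load_city(path):
    # Read a listings file, keep the columns used in this analysis, and
    # strip the leading '$' (and any thousands separators) from price.
    city = read_data(pd.read_csv(path))
    city['price'] = pd.to_numeric(
        city['price'].astype(str).str.lstrip('$').str.replace(',', ''),
        errors='coerce')
    return city

cities = {name: load_city(path) for name, path in city_files.items()}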

A look into the Airbnb listing features provided for each city by the data:

b) Getting Statistics on Airbnb Data:

Creating a dataframe to house statistical summaries of the listing data so that we can analyze listing trends.

In [141]:
austin_s = get_stats(austin)
austin_s['location'] = 'Austin, TX'
austin_s = reorder(austin_s)

boston_s = get_stats(boston)
boston_s['location'] = 'Boston, MA'
boston_s = reorder(boston_s)

dc_s = get_stats(dc)
dc_s['location'] = 'Washington, DC'
dc_s = reorder(dc_s)

denver_s = get_stats(denver)
denver_s['location'] = 'Denver, CO'
denver_s = reorder(denver_s)

la_s = get_stats(la)
la_s['location'] = 'Los Angeles-Long Beach-Anaheim, CA'
la_s = reorder(la_s)

nashville_s = get_stats(nashville)
nashville_s['location'] = 'Nashville, TN'
nashville_s = reorder(nashville_s)

nyc_s = get_stats(nyc)
nyc_s['location'] = 'New York, NY'
nyc_s = reorder(nyc_s)

portland_s = get_stats(portland)
portland_s['location'] = 'Portland, OR'
portland_s = reorder(portland_s)

sandiego_s = get_stats(sandiego)
sandiego_s['location'] = 'San Diego, CA'
sandiego_s = reorder(sandiego_s)

seattle_s = get_stats(seattle)
seattle_s['location'] = 'Seattle, WA'
seattle_s = reorder(seattle_s)

sf_s = get_stats(sf)
sf_s['location'] = 'San Francisco, CA'
sf_s = reorder(sf_s)

neworleans_s = get_stats(neworleans)
neworleans_s['location'] = 'New Orleans, LA'
neworleans_s = reorder(neworleans_s)

statistics = concat([austin_s, boston_s, dc_s, denver_s, la_s, nashville_s, nyc_s, portland_s, sandiego_s, seattle_s, sf_s, neworleans_s])
statistics
Out[141]:
accommodates bathrooms ... review_scores_accuracy review_scores_value
count max mean min std count max mean min std ... count max mean min std count max mean min std
location
Austin, TX 5835.0 16.0 4.388175 1.0 2.689056 5789.0 8.0 1.479617 0.0 0.774630 ... 3776.0 10.0 9.645657 2.0 0.757809 3778.0 10.0 9.416093 2.0 0.892991
Boston, MA 3585.0 16.0 3.041283 1.0 1.778929 3571.0 6.0 1.221647 0.0 0.501487 ... 2762.0 10.0 9.431571 2.0 0.931863 2764.0 10.0 9.168234 2.0 1.011116
Washington, DC 3723.0 16.0 3.195004 1.0 2.012216 3696.0 8.0 1.256358 0.0 0.585539 ... 2848.0 10.0 9.540730 2.0 0.790103 2851.0 10.0 9.323395 2.0 0.836415
Denver, CO 2444.0 16.0 3.668167 1.0 2.274085 2496.0 8.0 1.376002 0.0 0.671603 ... 1982.0 10.0 9.724521 2.0 0.662544 1982.0 10.0 9.637740 2.0 0.701155
Los Angeles-Long Beach-Anaheim, CA 26080.0 16.0 3.384279 1.0 2.425537 25960.0 8.0 1.375000 0.0 0.820447 ... 19636.0 10.0 9.555765 2.0 0.845363 19634.0 10.0 9.403535 2.0 0.887206
Nashville, TN 3277.0 16.0 5.105584 1.0 3.182650 3276.0 8.0 1.566850 0.0 0.860740 ... 2761.0 10.0 9.775444 4.0 0.559754 2763.0 10.0 9.660152 2.0 0.642419
New York, NY 39553.0 16.0 2.802265 1.0 1.792188 39371.0 8.0 1.122895 0.0 0.383082 ... 30443.0 10.0 9.470157 2.0 0.883192 30459.0 10.0 9.253915 2.0 0.927049
Portland, OR 3360.0 16.0 3.385119 1.0 2.086094 3342.0 6.0 1.226062 0.0 0.515245 ... 2938.0 10.0 9.763445 2.0 0.592368 2938.0 10.0 9.604493 2.0 0.675567
San Diego, CA 6608.0 21.0 4.183263 1.0 2.849992 6597.0 10.0 1.473473 0.0 0.858805 ... 4466.0 10.0 9.595611 2.0 0.797768 4468.0 10.0 9.455685 2.0 0.835499
Seattle, WA 3818.0 16.0 3.349398 1.0 1.977599 3802.0 8.0 1.259469 0.0 0.590369 ... 3160.0 10.0 9.636392 2.0 0.698031 3162.0 10.0 9.452245 2.0 0.750259
San Francisco, CA 8619.0 16.0 3.247360 1.0 2.124207 8571.0 8.0 1.308832 0.0 0.655959 ... 6265.0 10.0 9.589306 2.0 0.804510 6266.0 10.0 9.302905 2.0 0.864396
New Orleans, LA 4514.0 16.0 4.427780 1.0 2.801228 4502.0 8.0 1.360840 0.0 0.717643 ... 3754.0 10.0 9.660629 2.0 0.696694 3755.0 10.0 9.502264 2.0 0.756209

12 rows × 50 columns

Columns included in statistical summary:

In [142]:
statistics.columns.values
Out[142]:
array([('accommodates', 'count'), ('accommodates', 'max'),
       ('accommodates', 'mean'), ('accommodates', 'min'),
       ('accommodates', 'std'), ('bathrooms', 'count'),
       ('bathrooms', 'max'), ('bathrooms', 'mean'), ('bathrooms', 'min'),
       ('bathrooms', 'std'), ('bedrooms', 'count'), ('bedrooms', 'max'),
       ('bedrooms', 'mean'), ('bedrooms', 'min'), ('bedrooms', 'std'),
       ('price', 'count'), ('price', 'max'), ('price', 'mean'),
       ('price', 'min'), ('price', 'std'), ('minimum_nights', 'count'),
       ('minimum_nights', 'max'), ('minimum_nights', 'mean'),
       ('minimum_nights', 'min'), ('minimum_nights', 'std'),
       ('availability_365', 'count'), ('availability_365', 'max'),
       ('availability_365', 'mean'), ('availability_365', 'min'),
       ('availability_365', 'std'), ('number_of_reviews', 'count'),
       ('number_of_reviews', 'max'), ('number_of_reviews', 'mean'),
       ('number_of_reviews', 'min'), ('number_of_reviews', 'std'),
       ('review_scores_rating', 'count'), ('review_scores_rating', 'max'),
       ('review_scores_rating', 'mean'), ('review_scores_rating', 'min'),
       ('review_scores_rating', 'std'),
       ('review_scores_accuracy', 'count'),
       ('review_scores_accuracy', 'max'),
       ('review_scores_accuracy', 'mean'),
       ('review_scores_accuracy', 'min'),
       ('review_scores_accuracy', 'std'), ('review_scores_value', 'count'),
       ('review_scores_value', 'max'), ('review_scores_value', 'mean'),
       ('review_scores_value', 'min'), ('review_scores_value', 'std')], dtype=object)

Part 1: Exploring Airbnb Data -- Listing Trends in U.S. Cities

Looking at descriptive statistics in each city to explore listing trends.

Count of Listings per City: Which cities have the most listings?

New York and Los Angeles are outliers, with far more listings than the other cities. This high supply of listings may suggest that they are the more popular travel destinations. San Francisco comes in third and San Diego fourth.

As expected, the smaller and less frequented cities (Portland, Nashville and Denver) have the fewest listings.

In [143]:
x = {'Austin, TX': len(austin), 'Boston, MA' : len(boston), 
     'Washington, DC' : len(dc), 'Denver, CO' : len(denver), 
     'Los Angeles-Long Beach-Anaheim, CA': len(la), 'Nashville, TN': len(nashville), 
     'New York, NY': len(nyc), 'Portland, OR' : len(portland), 
     'San Diego, CA' : len(sandiego), 'Seattle, WA' : len(seattle), 
     'San Francisco, CA' : len(sf), 'New Orleans, LA' : len(neworleans) }

df = DataFrame(x, index=[0]).stack()
df = DataFrame(df).sort([0], ascending=[False]).reset_index(0)
df = df[0]
df = DataFrame(df)
df.columns = ['Count of Listings']
df
Out[143]:
Count of Listings
New York, NY 39553
Los Angeles-Long Beach-Anaheim, CA 26080
San Francisco, CA 8619
San Diego, CA 6608
Austin, TX 5835
New Orleans, LA 4514
Seattle, WA 3818
Washington, DC 3723
Boston, MA 3585
Portland, OR 3360
Nashville, TN 3277
Denver, CO 2505
In [144]:
df['Count of Listings'].plot(kind = 'bar', cmap = cmap)
plt.title('Count of Listings')
Out[144]:
<matplotlib.text.Text at 0x1297cc3d0>

Normalizing Count of Listings by Population

While the raw listing numbers are interesting, it is hard to draw conclusions about the density of Airbnb listings in each city because larger cities naturally have more listings.

So we will normalize the count of listings by dividing it by the population of each metro area in millions. For example, New Orleans' 4,514 listings divided by a metro population of roughly 1.26 million gives about 3,574 listings per million residents.

In [145]:
# Creating DataFrame of population in each metro area in millions from US Census 2015
pop = read_csv('pop_cities_census.csv').convert_objects(convert_numeric=True)
pop.columns = ['City', 'Population']
pop = pop.set_index('City')

#Merging with listing data and normalizing
df1 = concat([df, pop], axis= 1)
df1 = df1.astype('float')
df1['Normalized by Population'] = df1['Count of Listings'] / df1['Population']
df1.sort(['Normalized by Population'], ascending=[False]).reset_index(0)
Out[145]:
City Count of Listings Population Normalized by Population
0 New Orleans, LA 4514.0 1.262888 3574.347052
1 Austin, TX 5835.0 2.000860 2916.246014
2 San Diego, CA 6608.0 3.299521 2002.714940
3 New York, NY 39553.0 20.182305 1959.786060
4 Los Angeles-Long Beach-Anaheim, CA 26080.0 13.340068 1955.012523
5 San Francisco, CA 8619.0 4.656132 1851.107314
6 Nashville, TN 3277.0 1.830345 1790.372853
7 Portland, OR 3360.0 2.389228 1406.311997
8 Seattle, WA 3818.0 3.733580 1022.611006
9 Denver, CO 2505.0 2.814330 890.087516
10 Boston, MA 3585.0 4.774321 750.892116
11 Washington, DC 3723.0 6.097684 610.559681

Estimating 'Supply' and 'Demand' of Listings: Popularity and Normalized Count of Listings

The demand for, or popularity of, listings in a city can be estimated by the average number of booked days in the upcoming calendar year.

The supply of listings in a city can be estimated by the normalized number of listings in a city.

Let's look at the relationship between these two variables:

In [146]:
z = [(365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True), df1['Normalized by Population']]
df2 = concat(z, axis = 1)
df2
Out[146]:
mean Normalized by Population
Austin, TX 95.261354 2916.246014
Boston, MA 185.653556 750.892116
Denver, CO 169.988024 890.087516
Los Angeles-Long Beach-Anaheim, CA 142.578604 1955.012523
Nashville, TN 145.797986 1790.372853
New Orleans, LA 151.740806 3574.347052
New York, NY 207.526408 1959.786060
Portland, OR 170.481250 1406.311997
San Diego, CA 163.023002 2002.714940
San Francisco, CA 221.500406 1851.107314
Seattle, WA 120.227344 1022.611006
Washington, DC 129.297609 610.559681
In [147]:
z = [(365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True), df1['Normalized by Population']]
df2 = concat(z, axis = 1)

x = df2['Normalized by Population']
y = df2['mean']

n = (df2.index.tolist())

fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 9)
    
    
plt.xlabel('Listing Count (Normalized)', fontsize=14)
plt.ylabel('Booked Days in the next Calendar Year', fontsize=14)
plt.title('Popularity and Normalized Listing Count', fontsize=14, fontweight='bold')
Out[147]:
<matplotlib.text.Text at 0x125d48a90>

Average Price per City: Where is it the most and least expensive to stay, on average, at an Airbnb listing?

Austin, San Francisco and Nashville have the most expensive listings on average, while Denver, Seattle and Portland are the cheapest places to stay, on average.

In [148]:
DataFrame(statistics['price']['mean'].convert_objects(convert_numeric=True)).sort(['mean'], ascending=[False])
Out[148]:
mean
location
Austin, TX 231.503207
San Francisco, CA 208.376430
Nashville, TN 188.836252
San Diego, CA 183.789750
New Orleans, LA 172.031998
Boston, MA 169.072768
Los Angeles-Long Beach-Anaheim, CA 146.567610
Washington, DC 142.323251
New York, NY 140.057827
Denver, CO 128.392957
Seattle, WA 127.747708
Portland, OR 110.508041

Price and Normalized Listing Count

Let's look at price in the context of another variable, the normalized listing count. There is a moderately strong correlation, with a correlation coefficient of around 0.5. This positive relationship indicates that as the normalized number of listings increases, the price tends to increase as well.

In [149]:
z = [statistics['price']['mean'].convert_objects(convert_numeric=True), df1['Normalized by Population']]
df2 = concat(z, axis = 1)

x = df2['Normalized by Population']
y = df2['mean']

n = (df2.index.tolist())

fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 10)
    
    
plt.xlabel('Listing Count (Normalized)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Normalized Listing Count', fontsize=14, fontweight='bold')
Out[149]:
<matplotlib.text.Text at 0x1682f4750>
In [150]:
# Correlation between the two variables
df2.corr()
Out[150]:
mean Normalized by Population
mean 1.000000 0.542062
Normalized by Population 0.542062 1.000000

The first metric we will analyze is the average number of days that a listing is booked in the next year. The higher the number of booked days, the more popular the city. The most popular travel destinations for users are San Francisco, New York and Boston. These are large cities, so this relationship makes sense.

Interestingly, Portland and Denver come in fourth and fifth. These cities are characterized by a low number of listings and low prices. Users may be drawn to them because of their low prices.

A second metric that can be used to gauge the popularity of listings in a particular city is the mean # of reviews per city. The more a city's listings are reviewed on average, the more they are booked on average. There are similarities in the order of the cities in the two lists.

Seattle jumps from near the bottom of the list on booked days to near the top on # of reviews. So although Seattle isn't a popular destination relative to the other cities in the data set, it is well reviewed. This may be an indication of the superior value of Seattle's Airbnb listings.

In [151]:
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True)
x = DataFrame(x).sort(['mean'], ascending=[False])
x.columns = ['Booked Days']  # 365 minus mean availability
x
Out[151]:
Booked Days
location
San Francisco, CA 221.500406
New York, NY 207.526408
Boston, MA 185.653556
Portland, OR 170.481250
Denver, CO 169.988024
San Diego, CA 163.023002
New Orleans, LA 151.740806
Nashville, TN 145.797986
Los Angeles-Long Beach-Anaheim, CA 142.578604
Washington, DC 129.297609
Seattle, WA 120.227344
Austin, TX 95.261354
In [152]:
x = DataFrame(statistics['number_of_reviews']['mean'].convert_objects(convert_numeric=True)).sort(['mean'], ascending=[False])
x.columns = ['Avg. # of Reviews']
x
Out[152]:
Avg. # of Reviews
location
Portland, OR 32.643750
Nashville, TN 25.443699
New Orleans, LA 24.719539
Seattle, WA 22.223415
San Francisco, CA 19.693352
Boston, MA 19.044630
Denver, CO 18.544511
Los Angeles-Long Beach-Anaheim, CA 16.848735
Washington, DC 15.306742
New York, NY 14.531742
San Diego, CA 14.052663
Austin, TX 10.794687

Price and Availability of Listings: What is the relationship between the popularity of a listing and its price?

There seems to be a positive correlation between the price and the popularity of a listing, i.e. the average number of days a city's listings are booked. However, this trend only emerges once we eliminate Austin as an outlier. The first graph displayed below contains the Austin data point, and the second removes it.

The more popular a listing (the higher the number of booked days), the higher the price. This suggests that cities that are more popular with travelers cost more to stay in: demand drives up the price.

In [153]:
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True)
y = statistics['price']['mean'].convert_objects(convert_numeric=True)

n = (x.reset_index()).location.tolist()

fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap , s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 10)
    
    
plt.xlabel('Average Number of Days booked in next 365 days', fontsize=14)
plt.ylabel('Average Price', fontsize=14)
plt.title('Average Price and Popularity of Listings', fontsize=14, fontweight='bold')
Out[153]:
<matplotlib.text.Text at 0x1239e7390>

After Dropping Austin:

In [154]:
x = (365 - statistics['availability_365']['mean']).convert_objects(convert_numeric=True).drop('Austin, TX')
y = statistics['price']['mean'].convert_objects(convert_numeric=True).drop('Austin, TX')
n = (x.reset_index()).location.tolist()

fig, ax = plt.subplots(figsize=(9, 7))
ax.scatter(x, y, c=y,cmap = cmap , s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 15)
    
plt.xlabel('Average Number of Days booked in next 365 days', fontsize=18)
plt.ylabel('Average Price', fontsize=18)
plt.title('Average Price and Popularity of Listings', fontsize=18, fontweight='bold')
Out[154]:
<matplotlib.text.Text at 0x132ffdf90>
In [155]:
# Correlation between the two variables
df = concat ([x,y], axis =1 )
df.corr()
Out[155]:
mean mean
mean 1.000000 0.340591
mean 0.340591 1.000000

Price and Average Rating: How does the average quality of a listing correlate with the average price?

The data here is very dispersed and suggests that ratings and price do not have a clear relationship.

This reveals an interesting quality of Airbnb listings relative to the traditional hotel industry. Whether a user pays more or less for a listing has little to do with the quality of their stay. In the traditional hotel industry, the quality of a stay is largely driven by price (hotel vs. motel): usually the more one pays, the better the stay. Airbnb users, by contrast, can pay minimally and have a great experience (Portland) or the other way around (Boston, for example, has a low average rating for its average price).

In [156]:
x = statistics['review_scores_rating']['mean'].convert_objects(convert_numeric=True)
y = statistics['price']['mean'].convert_objects(convert_numeric=True)

n = (x.reset_index()).location.tolist()

fig, ax = plt.subplots()
ax.scatter(x, y, c=y,cmap = cmap, s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 10)
    
    
plt.xlabel('Average Rating', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price and Average Rating', fontsize=14, fontweight='bold')
Out[156]:
<matplotlib.text.Text at 0x124e51290>
In [157]:
df1 = statistics.iloc[:,[2,7,12,17,22,27,32,37,42,47]].reset_index().drop('location',axis = 1).fillna(0)
df1.columns = df1.columns.droplevel(1)
# Short labels: accommodates, bathrooms, bedrooms, price, minimum_nights, availability_365,
# number_of_reviews, rating, accuracy, value
df1.columns = ['acc', 'bt', 'bd', 'pr', 'min', '365', '#rev', 'rat', 'acu', 'val']
#df_zillow = df_zillow['ZHVI_10Year']
#df1 = concat([df1, df_zillow], axis = 1).fillna(0)

coor= df1.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(df1.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df1.corr(), mask=mask, cmap=cmap, vmax=.3,
            square=True, xticklabels=2, yticklabels=5,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[157]:
<matplotlib.axes._subplots.AxesSubplot at 0x122366750>

The Zillow Real Estate Data

What is the Zillow Home Value Index?

From the methodology for determining the ZHVI: http://www.zillow.com/research/zhvi-methodology-6032/

The ideal home price index would be based off sale prices for the same set of homes in each time period so there was never an issue of the sales mix being different across periods. This approach of using a constant basket of goods is widely used, common examples being a commodity price index and a consumer price index. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we can’t observe prices on the same set of homes in all time periods because not all homes are sold in every time period.

The innovation that Zillow developed is a way of approximating this ideal home price index by leveraging the valuations Zillow creates on all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. Because of this fact, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes even if they didn’t sell in that time period.

Using this methodology, a comprehensive and robust benchmark of home value trends can be computed which is immune to the changing mix of properties that sell in different periods of time.
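The key point is that the index tracks the median estimated value across all homes in a region each period, rather than the median sale price of whichever homes happened to sell. A minimal illustrative sketch of that idea, with hypothetical data and column names (not Zillow's actual code):

import pandas as pd

# Hypothetical table: one Zestimate per home per month.
zestimates = pd.DataFrame({
    'home_id':   [1, 2, 3, 1, 2, 3],
    'month':     ['2016-01', '2016-01', '2016-01', '2016-02', '2016-02', '2016-02'],
    'zestimate': [300000, 450000, 520000, 305000, 452000, 530000],
})

# ZHVI-style index: the median estimated value across ALL homes each month,
# so the changing mix of homes that actually sold does not move the index.
zhvi_like = zestimates.groupby('month')['zestimate'].median()
zhvi_like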

Zillow Home Value Index (Median Home Value):

In [158]:
ZHVI = read_csv('Metro_Zhvi_Summary_AllHomes (1).csv').set_index('RegionName')


x = ['Austin, TX', 'Boston, MA', 'Washington, DC', 'Denver, CO', 'Los Angeles-Long Beach-Anaheim, CA', 'Nashville, TN', 'New York, NY', 'Portland, OR', 'San Diego, CA', 'Seattle, WA', 'San Francisco, CA', 'New Orleans, LA']
y = ['Zhvi', 'MoM', 'QoQ', 'YoY', '5Year', '10Year']

ZHVI = ZHVI.loc[x , y ]

Putting all the data into one DataFrame for correlation analysis:

In [159]:
df_zillow = ZHVI
df_zillow.columns = ['ZHVI', 'ZHVI_MoM', 'ZHVI_QoQ', 'ZHVI_YoY', 'ZHVI_5Year', 'ZHVI_10Year']
df_zillow
Out[159]:
ZHVI ZHVI_MoM ZHVI_QoQ ZHVI_YoY ZHVI_5Year ZHVI_10Year
RegionName
Austin, TX 258600.0 0.002714 0.012133 0.083822 0.082968 0.039407
Boston, MA 405200.0 0.007209 0.019115 0.059623 0.053598 0.010531
Washington, DC 375900.0 0.004275 0.011572 0.026488 0.037900 -0.010914
Denver, CO 350400.0 0.007476 0.023066 0.101887 0.102926 0.040802
Los Angeles-Long Beach-Anaheim, CA 584700.0 0.005157 0.014752 0.059815 0.083031 -0.002961
Nashville, TN 202400.0 0.012000 0.031600 0.114537 0.066053 0.027840
New York, NY 398000.0 0.005558 0.023136 0.057386 0.029412 -0.011078
Portland, OR 349500.0 0.009241 0.029455 0.146277 0.101233 0.020504
San Diego, CA 524900.0 0.004017 0.015084 0.064058 0.086299 -0.000266
Seattle, WA 409900.0 0.008364 0.023215 0.122091 0.093059 0.011144
San Francisco, CA 821800.0 0.005137 0.015445 0.053725 0.113260 0.017017
New Orleans, LA NaN NaN NaN NaN NaN NaN

Part 2: Listing Trends and the Real Estate Metrics

How does the median home price of a city influence the price of an Airbnb listing?

The average price of a listing seems to increase as the median home value increases.

Nashville and Austin are two outliers, with low median home values and high average listing prices.

In [168]:
df = concat([statistics['price']['mean'], df_zillow['ZHVI']], axis = 1).dropna()

x = df['ZHVI']
y = df['mean']


n = (df.index).tolist()

fig, ax = plt.subplots(figsize=(12, 10))
ax.scatter(x, y, c=y,cmap = cmap, s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 16)
    
    
plt.xlabel('Median Home Price', fontsize=20)
plt.ylabel('Average Price ABNB Listing', fontsize=20)
plt.title('Home Price(ZHVI) and Airbnb Listing Price', fontsize=20, fontweight='bold')
Out[168]:
<matplotlib.text.Text at 0x122b023d0>

Dropping our two outliers to determine the correlation:

In [161]:
df = df.drop(['Nashville, TN', 'Austin, TX'])
df.corr()
Out[161]:
mean ZHVI
mean 1.000000 0.811609
ZHVI 0.811609 1.000000

How does the change in home price over five years correlate with listing price? Are listings in emerging, high-growth cities priced at a higher value?

A little over half of the cities follow the expected relationship, with average listing price increasing as home value growth increases. However, LA, Seattle, Denver and Portland have had high growth in home values over the last five years yet have maintained relatively low listing prices.

In [167]:
df = concat([statistics['price']['mean'], df_zillow['ZHVI_5Year']], axis = 1).dropna()

x = df['ZHVI_5Year'] *100
y = df['mean']

n = (df.index).tolist()

fig, ax = plt.subplots(figsize=(12, 10))
ax.scatter(x, y, c=y,cmap = cmap , s =200)

for i, txt in enumerate(n):
    ax.annotate(txt, (x[i],y[i]), fontsize = 16)
    
plt.xlabel('5 Year Home Price Growth ', fontsize=20)    
plt.ylabel('Average Price ABNB Listing', fontsize=20)
plt.title('Home Price Growth, 5 Years and ABNB Listing Price', fontsize=20, fontweight='bold')
Out[167]:
<matplotlib.text.Text at 0x11b30a850>

Correlation Matrix for ZHVI: How do ZHVI metrics relate to one another?

In [87]:
df2 = df_zillow
df2.columns = ['Z', 'M', 'Q', 'Y', '5', '10']
coor = df2.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(df2.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df2.corr(), mask=mask, cmap=cmap, vmax=.3,
            square=True, xticklabels=2, yticklabels=5,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b616d50>

Correlation Matrix for ZHVI and Price: How do Listing Price and Home Value relate?

In [88]:
df1 = DataFrame(statistics.iloc[:,17])
df = concat([df1, df2], axis =1)
coor= df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(df.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(df.corr(), mask=mask, cmap=cmap, vmax=.3,
            square=True, xticklabels=2, yticklabels=5,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x122b8dcd0>

Part 3: Regression Model for Predicting Average Price -- Using the ZHVI and Airbnb Listing Data

a) Putting together Airbnb and Zillow features to predict Price

Let's look at the features we have from the Airbnb data set and the Zillow data set and plot them against price to determine which features, or independent variables, should be used to predict price. We are looking for a linear relationship, which may have to be achieved through transformation.

In [89]:
def read_data(location):
    location = location[['zipcode', 'accommodates', 'bathrooms', 'bedrooms', 'price', 'minimum_nights',
       'availability_365', 'number_of_reviews', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_value']].dropna(axis = 0)
    location = location.set_index('zipcode')
    return location

We are going to join the Zillow data and the Airbnb data on the zip code of each listing, to achieve more granularity when looking at the real estate data for each listing.

1) We re-read the Airbnb data and create a dataframe containing information on every listing in each city, with the index set to the zip code.

2) We then merge the real estate data and Airbnb listing data on zip code and assess our features.

In [90]:
austin = read_csv('austin.csv')
austin = read_data(austin)
austin['price'] = austin['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

boston = read_csv('boston.csv')
boston = read_data(boston)
boston['price'] = boston['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

dc = read_csv('dc.csv')
dc = read_data(dc)
dc['price'] = dc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

denver = read_csv('denver.csv')
denver = read_data(denver)
denver['price'] = denver['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

la = read_csv('la.csv')
la = read_data(la)
la['price'] = la['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

nashville = read_csv('nashville.csv')
nashville = read_data(nashville)
nashville['price'] = nashville['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

nyc = read_csv('nyc.csv')
nyc = read_data(nyc)
nyc['price'] = nyc['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

portland = read_csv('portland.csv')
portland = read_data(portland)
portland['price'] = portland['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

sandiego = read_csv('sandiego.csv')
sandiego = read_data(sandiego)
sandiego['price'] = sandiego['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

seattle = read_csv('seattle.csv')
seattle = read_data(seattle)
seattle['price'] = seattle['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)

sf = read_csv('sf.csv')
sf = read_data(sf)
sf['price'] = sf['price'].map(lambda x: str(x)[1:]).convert_objects(convert_numeric=True)
In [91]:
#DataFrame with all listings and Airbnb features:
df_abnb = concat([austin, boston, dc, denver, la, nashville, nyc, portland, sandiego, seattle, sf])

#DataFrame with Zillow Feature: Zillow Home Value Index
ZHVI = read_csv('Zip_Zhvi_Summary_AllHomes.csv').set_index('RegionName')
ZHVI = ZHVI.ix[:,[7,11,12]]

# Merge on Zipcode
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])

With the two feature sets merged on zip code, we now have one dataframe with all the Airbnb and Zillow features (independent variables) for predicting price:

In [92]:
df1.head()
Out[92]:
accommodates bathrooms bedrooms price minimum_nights availability_365 number_of_reviews review_scores_rating review_scores_accuracy review_scores_value Zhvi 5Year 10Year reviewtotal
22204.0 2.0 0.0 1.0 24.0 1 339 3 87.0 9.0 9.0 476100 0.036329 0.005587 105.0
33558.0 2.0 1.0 0.0 66.0 2 364 1 100.0 8.0 8.0 336000 0.070773 -0.007617 116.0
78701.0 4.0 1.0 1.0 150.0 3 365 8 90.0 8.0 9.0 432500 0.074260 0.045489 107.0
78701.0 2.0 1.0 1.0 600.0 1 263 1 100.0 10.0 8.0 432500 0.074260 0.045489 118.0
78701.0 3.0 1.0 1.0 175.0 3 127 4 95.0 10.0 10.0 432500 0.074260 0.045489 115.0

b) Price and Features: Linearity of Features with Dependent Variable, Price

In order to create a linear model, we have to establish that the features, or independent variables, used in the model have a linear relationship with price, the dependent variable.

Let's plot the independent variables versus price to determine which features have the strongest linear relationship with price and which we may have to transform to achieve linearity with price.

Price and Availability, Price and # of Reviews

We transform Availability and # of Reviews using logs to achieve a linear trend with price.

Both have exponential relationships with price, which are corrected by taking the log.
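One practical wrinkle with the log transform: availability_365 and number_of_reviews can be zero, and log(0) is -inf (the notebook handles this later by replacing infinities with 0 when building the feature matrix). A common alternative, shown here only as a sketch of another option, is log(1 + x):

# log1p maps 0 to 0 and avoids -inf, so no post-hoc replacement is needed.
x_alt = np.log1p(df1['availability_365'])
x1_alt = np.log1p(df1['number_of_reviews'])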

In [93]:
x = np.log(df1['availability_365'])
y = df1['price']

fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim(0,1000)
plt.xlim(-.01,1.75)

plt.xlabel('Log of Availability', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and Availability', fontsize=14, fontweight='bold')

x1 = np.log(df1['number_of_reviews'])
y1 = df1['price']

fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'r', alpha = .5)
plt.ylim(0,1000)
plt.xlim(-.01,1.75)

plt.xlabel('Log of # of Reviews', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and # of Reviews', fontsize=14, fontweight='bold')
Out[93]:
<matplotlib.text.Text at 0x11bfa7290>

Price and ZHVI (Median Home Value) & Price and the 5 and 10 Year Change in ZHVI

In [94]:
x2 = (df1['Zhvi'])
y2 = df1['price']

fig, ax = plt.subplots()
ax.scatter(x2, y2, c= 'r', alpha = .5)
plt.ylim (0,1000)
plt.xlim (200000,1400000)

plt.xlabel('ZHVI (Median Home Price)', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and Median Home Value (ZHVI)', fontsize=14, fontweight='bold')


x = (df1['10Year'])
y = df1['price']

fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim (0,1000)
plt.xlim(-.02,.06)

plt.xlabel('10 Year Change in  ZHVI (Median Home Price)', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and 10 Year Median Home Value (ZHVI)', fontsize=14, fontweight='bold')


x1 = (df1['5Year'])
y1 = df1['price']

fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'm', alpha = .5)

plt.xlabel('5 Year Change in ZHVI (Median Home Price)', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and 5 Year Median Home Value (ZHVI)', fontsize=14, fontweight='bold')
plt.ylim (0,1000)
plt.xlim(.045,.15)
Out[94]:
(0.045, 0.15)
In [95]:
x = (df1['reviewtotal'])
y = df1['price']

fig, ax = plt.subplots()
ax.scatter(x, y, c= 'c', alpha = .5)
plt.ylim (0,1000)
plt.xlim(20 ,122)

plt.xlabel('Sum of Ratings', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('Price and Sum of Ratings', fontsize=14, fontweight='bold')

x1 = np.log(df1['bedrooms'])
y1 = df1['price']

fig, ax = plt.subplots()
ax.scatter(x1, y1, c = 'r', alpha = .5)
plt.ylim (0,1000)
plt.xlim(-.1 ,2)

plt.xlabel('Log of # of Bedrooms', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('# of Bedrooms and Price', fontsize=14, fontweight='bold')
Out[95]:
<matplotlib.text.Text at 0x15ff58090>
In [96]:
x1 = df1['bedrooms']
x2 = df1['bathrooms']
x3 = df1['accommodates']

y1 = df1['price']


fig, ax = plt.subplots()
ax.scatter(x1, y1, c= 'c', alpha = .5)
ax.scatter(x2, y1, c='r', alpha = .5)
ax.scatter(x3, y1, c='m', alpha = .5)
plt.ylim (-2,1000)
plt.xlim(-.1 ,17)

plt.xlabel('# of Bedrooms, Bathrooms, Accommodates', fontsize=14)   
plt.ylabel('Price', fontsize=14)
plt.title('# of Bedrooms/Bathroom/Accommodates and Price', fontsize=14, fontweight='bold')
Out[96]:
<matplotlib.text.Text at 0x120e39710>

c) The Model

Linear Model using all cities and all features:

We are going to create a linear regression model to predict the price of an Airbnb listing in the US based on the listing data and home value data for these cities:

  • Austin, TX; Boston, MA; Washington, DC; Denver, CO; Los Angeles, CA; Nashville, TN; NYC, NY; Portland, OR; San Diego, CA; Seattle, WA and San Francisco, CA

The Airbnb Features being used are:

  • Log of # of Reviews, Log of Availability in the next 365 days, Sum of Ratings (Overall + Accuracy + Value Rating), Accommodates, Number of Bathrooms

The Zillow Home Value Data Used:

  • ZHVI (Median Home Price)
  • ZHVI 5 Year Change
  • ZHVI 10 Year Change
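Putting these features together, the fitted model takes the standard ordinary-least-squares form (written out only for clarity; the β's are the regression coefficients reported below, in the same order as the feature columns in the code):

price ≈ β0 + β1·Zhvi + β2·log(# of reviews) + β3·log(availability_365) + β4·reviewtotal + β5·(10 Year change) + β6·(5 Year change) + β7·accommodates + β8·bathrooms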
In [97]:
#The merged features dataframe:
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])

#Transforming:
df1['number_of_reviews'] = np.log(df1['number_of_reviews'])
df1['availability_365'] = np.log(df1['availability_365'])

#Final Features dataframe
features = df1[[ 'Zhvi', 'number_of_reviews', 'availability_365', 'reviewtotal', '10Year', '5Year', 'accommodates', 'bathrooms']].replace([np.inf, -np.inf], np.nan).fillna(0)

Checking for correlation between the independent variables:

Another requirement for a linear model is that the independent variables, or features, do not have strong relationships with one another, since such multicollinearity confounds the model. To check for this, we use a correlation matrix. Relationships between the variables are measured on a scale from -1 to 1: the higher the absolute value, the stronger the relationship, with the sign giving its direction. On this basis, bedrooms was removed as a feature, as it had very strong correlations with bathrooms, accommodates and ZHVI.

The final correlation matrix shows no strong correlations between the independent variables (a variance inflation factor check is also sketched after the matrix below):

In [98]:
features.corr()
Out[98]:
Zhvi number_of_reviews availability_365 reviewtotal 10Year 5Year accommodates bathrooms
Zhvi 1.000000 -0.006757 -0.015610 -0.104761 -0.326128 -0.329153 0.038161 0.069861
number_of_reviews -0.006757 1.000000 0.181412 0.067784 0.032080 0.145784 -0.094315 -0.146591
availability_365 -0.015610 0.181412 1.000000 -0.029610 0.057907 -0.047759 0.061082 0.023357
reviewtotal -0.104761 0.067784 -0.029610 1.000000 0.102987 0.059869 -0.055349 -0.009890
10Year -0.326128 0.032080 0.057907 0.102987 1.000000 0.530245 0.021943 -0.017753
5Year -0.329153 0.145784 -0.047759 0.059869 0.530245 1.000000 -0.093868 -0.108207
accommodates 0.038161 -0.094315 0.061082 -0.055349 0.021943 -0.093868 1.000000 0.615938
bathrooms 0.069861 -0.146591 0.023357 -0.009890 -0.017753 -0.108207 0.615938 1.000000
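As a complementary check on multicollinearity, we can compute the variance inflation factor (VIF) for each feature; values well below 5 are usually considered acceptable. A minimal sketch, assuming statsmodels is installed (this check is not part of the original notebook):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant so the VIFs are computed for a model with an intercept.
X_vif = sm.add_constant(features)
vif_table = DataFrame({'Feature': X_vif.columns,
                       'VIF': [variance_inflation_factor(X_vif.values, i)
                               for i in range(X_vif.shape[1])]})
vif_table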

Model Evaluation:

In [99]:
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = features
Y = DataFrame(df1['price']).fillna(0)

X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)

clf = LinearRegression()
clf.fit(X_train, Y_train)

print 'R-squared on Training Data:', r2_score(Y_train, clf.predict(X_train))
print 'R-squared on Testing Data:', r2_score(Y_test, clf.predict(X_test))

from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, clf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, clf.predict(X_train))

coef_df = concat([DataFrame(features.columns), DataFrame(clf.coef_).transpose()], axis = 1)
coef_df.columns = ['Feature', 'Regression Coefficient']
coef_df
R-squared on Training Data: 0.525853257769
R-squared on Testing Data: 0.53073024056
Mean squared error on the test set:  9487.47915781
Mean squared error on the training set:  8920.79898258
Out[99]:
Feature Regression Coefficient
0 Zhvi 0.000055
1 number_of_reviews -11.993403
2 availability_365 4.461033
3 reviewtotal 1.010557
4 10Year 749.316461
5 5Year -682.824401
6 accommodates 24.896625
7 bathrooms 55.915694

The r-squared, or coefficient of determination, measures how close the training and test data are to the fitted regression line; in other words, how well the model predicts the price of a listing from our selected features.

The coefficient of determination tells us that roughly 53% of the total variance in the price of a listing can be explained by the linear regression model. With a little over half of the variability attributable to the model, this is a reasonable start, but we may be able to achieve better results using other algorithms, a different set of features, or a different range of data. The test data performed very similarly to the training data, which is another good sign. The high MSE is a concern, but something that can be addressed as the model is refit.

Listed below the r-squared values are the regression coefficients for each feature, which define the linear relationships in this model. The coefficient for a feature describes how much the listing price changes for a one-unit increase in that feature, net of all others. The 10 Year and 5 Year changes in median home value have the largest coefficients; since these features are expressed as fractions, a one-percentage-point (0.01) increase in the 10 Year change in home value corresponds to an increase of roughly 7.5 dollars in listing price (749.3 × 0.01), net of all other features.

This is a good starting point for developing the model. We will use the mean squared error along with the r-squared to compare accuracy and linear fit as we try to improve the model. There is definitely room for improvement, especially in MSE and r-squared.
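Because a single 80/20 split can be noisy, a k-fold cross-validated r-squared gives a more stable estimate of how the model generalizes. A minimal sketch, reusing the deprecated cross_validation module already imported above (in newer scikit-learn this lives in sklearn.model_selection as cross_val_score):

# 5-fold cross-validated R-squared for the same features and linear model.
scores = cross_validation.cross_val_score(LinearRegression(), X, Y.values.ravel(),
                                          scoring='r2', cv=5)
print 'Cross-validated R-squared (5 folds):', scores.mean()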

Residual plot:

In [100]:
import matplotlib.patches as mpatches

plt.figure(figsize=(10,6))
plt.scatter(clf.predict(X_train), clf.predict(X_train) - Y_train, c='c', s=40, alpha=0.5)
plt.scatter(clf.predict(X_test), clf.predict(X_test) - Y_test, c='r', s=40, alpha = 0.65)
plt.hlines(y = 0, xmin=-30, xmax = 50)
#plt.title('Residual Plot using training (blue) and test (green) data')
plt.title('Residuals vs Fitted Values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')

blue_patch = mpatches.Patch(color='c', label='Train')
green_patch = mpatches.Patch(color ='r', label='Test')
plt.legend(handles=[blue_patch, green_patch])
plt.xlim(-10, 750)
plt.ylim(-800,400)
Out[100]:
(-800, 400)
In [101]:
# Looking at the distribution of listing prices to justify disregarding heteroscedasticity.
sns.distplot(Y, kde = False)
plt.title('Distribution of Price')
plt.xlabel('Price')
Out[101]:
<matplotlib.text.Text at 0x138f82790>

The residuals here describe the difference between the observed (actual) listing price and the listing price predicted by the model. The residuals should be distributed randomly above and below the axis, tending to cluster towards the middle without any clear pattern. These qualities are indicative of a good linear model with normally distributed data; normality matters because it is an assumption behind the linear model.

The model appears to be heteroscedastic, meaning that the spread of the residuals changes with the fitted values. The model is better at predicting prices for average-priced listings, at around 300 dollars a night or lower. For those values, the residuals are evenly and randomly distributed above and below the axis with no pattern, an indication of a good model.
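A more formal check than eyeballing the plot is the Breusch-Pagan test, whose null hypothesis is constant residual variance; a small p-value points to heteroscedasticity. A minimal sketch, assuming statsmodels is available and reusing clf, X_train and Y_train from above:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test on the training residuals.
residuals = Y_train.values.ravel() - clf.predict(X_train).ravel()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, sm.add_constant(X_train))
print 'Breusch-Pagan p-value:', lm_pvalue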


We may be able to achieve better prediction and a higher r-squared if we segment the cities by a strong variable such as availability. Grouping similar listings together should result in a stronger fit.

In [102]:
x = ((statistics['availability_365']['mean'] / 365)*100).convert_objects(convert_numeric=True)
DataFrame(x).sort(['mean'], ascending=[True])
Out[102]:
mean
location
San Francisco, CA 39.314957
New York, NY 43.143450
Boston, MA 49.136012
Portland, OR 53.292808
Denver, CO 53.427939
San Diego, CA 55.336164
New Orleans, LA 58.427176
Nashville, TN 60.055346
Los Angeles-Long Beach-Anaheim, CA 60.937369
Washington, DC 64.575997
Seattle, WA 67.061002
Austin, TX 73.900999

Highly Popular Cities: San Francisco, NYC, Boston, Portland and San Diego

In [103]:
#DataFrame with Airbnb data on highly popular cities:
df_abnb = concat([sf, nyc, boston, portland, sandiego])

#DataFrame with Zillow Feature: Zillow Home Value Index
ZHVI = read_csv('Zip_Zhvi_Summary_AllHomes.csv').set_index('RegionName')
ZHVI = ZHVI.ix[:,[7,11,12]]

# Merge on Zipcode
df1 = merge(df_abnb, ZHVI, left_index = True, right_index = True).dropna()
df1['reviewtotal'] = (df1['review_scores_rating'] + df1['review_scores_accuracy'] + df1['review_scores_value'])

#Transforming:
df1['number_of_reviews'] = np.log(df1['number_of_reviews'])
df1['availability_365'] = np.log(df1['availability_365'])

#Final Features dataframe
features = df1[[ 'Zhvi', 'number_of_reviews', 'availability_365', 'reviewtotal', '10Year', '5Year', 'accommodates', 'bathrooms']].replace([np.inf, -np.inf], np.nan).fillna(0)
In [104]:
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = features
Y = DataFrame(df1['price']).fillna(0)

X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)

clf = LinearRegression()
clf.fit(X_train, Y_train)

print 'R-squared on Training Data:', r2_score(Y_train, clf.predict(X_train))
print 'R-squared on Testing Data:', r2_score(Y_test, clf.predict(X_test))

from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, clf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, clf.predict(X_train))

coef_df = concat([DataFrame(features.columns), DataFrame(clf.coef_).transpose()], axis = 1)
coef_df.columns = ['Feature', 'Coefficient']
coef_df
R-squared on Training Data: 0.602207540928
R-squared on Testing Data: 0.635108031873
Mean squared error on the test set:  4971.91779762
Mean squared error on the training set:  5593.61301005
Out[104]:
Feature Coefficient
0 Zhvi 0.000092
1 number_of_reviews -6.778284
2 availability_365 1.534499
3 reviewtotal 0.802976
4 10Year -337.233077
5 5Year 93.552689
6 accommodates 25.971007
7 bathrooms 38.935938

The r-squared tells us how much of the variance in y can be explained by the model; the higher it is, the better the model fits the data. This model's coefficient of determination tells us that roughly 64% of the total variance in the price of a listing from a popular city can be explained by the linear regression model on the test set (about 60% on the training set).

At roughly 64%, almost two thirds of the variability can be attributed to our model. This is considerably higher than our original linear model using all cities, and suggests that by grouping the most popular cities we have created a stronger linear model: listing prices in the most popular cities respond similarly to changes in the features and so give a better linear fit. The MSE is also reduced, another indication that taking this subset has produced a better model.

This is a good starting point for improving the model by manipulating the data set; however, we may be able to increase the fit further by using another algorithm.

In [105]:
import matplotlib.patches as mpatches

plt.figure(figsize=(10,6))
plt.scatter(clf.predict(X_train), clf.predict(X_train) - Y_train, c='c', s=40, alpha=0.5)
plt.scatter(clf.predict(X_test), clf.predict(X_test) - Y_test, c='r', s=40, alpha = 0.65)
plt.hlines(y = 0, xmin=-30, xmax = 50)
#plt.title('Residual Plot using training (blue) and test (green) data')
plt.title('Residuals vs Fitted Values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')

blue_patch = mpatches.Patch(color='c', label='Train')
green_patch = mpatches.Patch(color ='r', label='Test')
plt.legend(handles=[blue_patch, green_patch])
plt.xlim(-10,750)
plt.ylim(-800, 400)
Out[105]:
(-800, 400)

This model seems to be less heteroscedastic than the previous one. The range of the residuals is tighter around the axis and still displays no pattern, an indication that we have retained normality in this subset of the data.

e) Using the Random Forest Regression Model

Let's test the Random Forest regression model on our popular-cities data to see if it results in a better-fitting model.

The Random Forest model 'bootstraps' the training data (drawing many random samples with replacement), fits a regression tree to each sample, and averages the trees' predictions. The result is an ensemble of regression trees that models the behavior of the data; unlike the previous models, it is not a linear model.
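To make the bootstrap-and-average idea concrete, here is a minimal hand-rolled sketch of bagged regression trees (illustrative only; RandomForestRegressor additionally samples a random subset of features at each split):

from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(X_train, y_train, X_test, n_trees=50):
    # Fit one regression tree per bootstrap sample and average their predictions.
    X, y = np.asarray(X_train), np.asarray(y_train).ravel()
    predictions = []
    for i in range(n_trees):
        idx = np.random.choice(len(X), size=len(X), replace=True)  # bootstrap sample
        tree = DecisionTreeRegressor(random_state=i)
        tree.fit(X[idx], y[idx])
        predictions.append(tree.predict(np.asarray(X_test)))
    return np.mean(predictions, axis=0)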

Let's see how it works with our popular cities data:

In [106]:
from sklearn.ensemble import RandomForestRegressor

X = features
Y = DataFrame(df1['price']).fillna(0)

X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size =0.2)

rf = RandomForestRegressor(random_state = 1)
rf.fit(X_train, Y_train)

print "R-squared for training data: ", rf.score(X_train,Y_train) 
print "R-squared for test data: ", rf.score(X_test,Y_test) 

from sklearn.metrics import mean_squared_error
print 'Mean squared error on the test set: ', mean_squared_error(Y_test, rf.predict(X_test))
print 'Mean squared error on the training set: ', mean_squared_error(Y_train, rf.predict(X_train))
R-squared for training data:  0.92539920659
R-squared for test data:  0.593655463849
Mean squared error on the test set:  4767.65869054
Mean squared error on the training set:  1083.81173287

The training fit has definitely increased: the model now has a training r-squared of over 92%, far higher than both linear models, and the training MSE drops from around 5,600 to roughly 1,100. On the test set the gains are more modest: the MSE falls from about 4,970 to 4,770, while the test r-squared (about 59%) is slightly below that of the linear model, so the large gap between training and test performance suggests some overfitting.

In [107]:
plt.scatter(rf.predict(X_test), Y_test.as_matrix())
plt.title('Random Forest: Predicted vs. Actual Price')
plt.xlabel('Predicted Listing Price')
plt.ylabel('Actual Listing Price')
plt.xlim(0,1000)
Out[107]:
(0, 1000)
In [108]:
x = df1.price
sns.distplot(x, kde=False)
plt.title('Distribution of Price')
Out[108]:
<matplotlib.text.Text at 0x160035d50>

We plot the predicted price of the test data versus the actual price to evaluate the model. The main effect apparent in this graph is a funneling effect: as price increases, the model does not predict as well. The funneling mostly affects listings where the price per night is above roughly 400 dollars; below this price, the model does a good job of predicting listing prices.

When we look at the distribution of price for the dataset used in creating the model, we see that a majority of listings are priced below this benchmark. This means that a majority of listings fall in a range that the model can predict accurately.

Therefore we establish that this model is best used for 'average' priced listings and disregard the funneling effect.
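One way to back this up quantitatively is to re-score the model only on test listings priced under roughly 400 dollars a night; a minimal sketch reusing rf, X_test and Y_test from above (the 400 dollar cutoff is an assumption read off the plot, not a value from the original analysis):

# Restrict the test set to 'average' priced listings and re-score the model.
y_true = np.asarray(Y_test).ravel()
mask = y_true < 400

print 'Share of test listings under $400:', mask.mean()
print 'R-squared (under $400):', rf.score(X_test[mask], y_true[mask])
print 'Mean squared error (under $400):', mean_squared_error(y_true[mask], rf.predict(X_test[mask]))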