This is my first project in data science. I would like to thank Zillow for providing the data to the aspiring data scientists like us. Also i have referenced Lisa & Jason for forecasting the Time Series data. Those information’s are invaluable.
Import & Clean Data¶
Let us read the State Time Series data and do some basic analysis. The data is available from 2010 onwards, so let us remove the previous data from the data set.
import pandas as pd
df = pd.read_csv('zecon/State_time_series.csv')
df.Date = pd.to_datetime(df.Date)
df = df[df['Date'] >= '01-01-2010']
df.head()
Exploratory Data Analysis (EDA)¶
Let us explore statewise data. Whether there is any trend for a specific state or what are the top 5 states?
Top 7 States of Median Price Per Sqft¶
This is the top 7 states of Median Listring Price per Sqft All Homes
dfallhomes = df.groupby('RegionName', as_index=False)['MedianListingPricePerSqft_AllHomes'].\
mean().dropna().sort_values('MedianListingPricePerSqft_AllHomes', ascending=False)
dfallhomes.head(7)
Bottom 5 states of Median Listing Price per Sqft All Homes¶
List the bottom 5 states from the results
dfallhomes.tail(5)
Trend in State Time Series¶
The most important EDA on time series data is to identify trend, seasonality & correlation. Let us check whether there is any trend in the data with the Top state data.
%matplotlib inline
import matplotlib.pyplot as plt
statelist = ['Hawaii', 'DistrictofColumbia', 'California', 'Massachusetts', 'NewYork', 'Colorado']
stateseries = pd.DataFrame(df[(df['RegionName'].\
isin(statelist))][['Date','RegionName','MedianListingPricePerSqft_AllHomes']].\
dropna().\
groupby(['Date', 'RegionName'])['RegionName','MedianListingPricePerSqft_AllHomes'].mean().unstack())
stateseries.plot(figsize=(15,8), linewidth=3)
plt.show()
from the above graph, we can see that
- There are trends for Hawaii & District of Columbia, even though it is up & down but there is a strong upward trend on these two states. This also shows that there is a seasonality in the trends. But the investment is costly in these two states.
- There is no point in investing the homes in New York, it's almost stationary.
- The similar trend of top 2 states is reflecting in Colorado also. This seems to be the best investment for homes as it is steadily increasing over the years among the top 5.
Seasonality in trends¶
Let us check whether there is any seasonality in the trends. This is important for the predictions.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set()
season = df
season['Date'] = df.Date
season['Year'] = df['Date'].dt.year
season['Month'] = df['Date'].dt.month
spivot = pd.pivot_table(season, index='Month', columns = 'Year', values = 'MedianListingPricePerSqft_AllHomes', aggfunc=np.mean)
spivot.plot(figsize=(20,10), linewidth=3)
plt.show()
from above graph, we can see that
- The price start decreasing in year 2010 and continued till 2012. The price in year 2013 is almost equivalent to 2010.
- The best time to sell the house in a year is from June to October. The price is peak in these months consistently in all the years even during the down trend. Alternatively the best time to buy the house is in December & January.
- The price drops may be due to holiday season or some other reason is a problem for us to solve another day.
Correlation¶
I always have a doubt whether various bedroom types have any relation during the trends. Let's find out.
%matplotlib inline
import matplotlib.pyplot as plt
brtypes = df.groupby('Date')['Date','MedianListingPricePerSqft_1Bedroom', 'MedianListingPricePerSqft_2Bedroom','MedianListingPricePerSqft_3Bedroom','MedianListingPricePerSqft_4Bedroom','MedianListingPricePerSqft_5BedroomOrMore'].\
mean().dropna()
pd.plotting.autocorrelation_plot(brtypes);
plt.show()
The above graph shows there is a positive correlation for all the bedroom types, but still it is not clear how each bedroom types are correlated. Let us find out.
brtypes.corr()
The above table confirms the correlation and to be more specific let us remove the seasonality from the data and see. This is called order of correlation.
brtypes.diff().corr()
First order difference in correlation still has better correlation between bedroom types. You can see 1 Bedroom & 2 Bedroom are highly correlated than 1 bedroom & 5 bedroom.
Forecast¶
We will do the forecast with Median Listing price per sq ft All Homes. Let us see the trend first.
allhomes = df.groupby('Date')['Date','MedianListingPricePerSqft_AllHomes'].mean().dropna()
allhomes.plot(figsize=(10,8))
plt.show()
In the above graph, you can see the clear trend but also there are seasonality in the trend. The forecast for the time series should be stationary otherwise the predictions may not correct.
ARIMA forecast model¶
One of the common model used to forecast time series data is ARIMA. It stands for Autoregressive integrated moving average. One of the parameters are p, d & q. As you know the data has seasonality and let us use Seasonal ARIMA, SARIMAX to forecast the mode. There is a separate process to to identify the optmimum parameters, I did a grid search on GPU machine and it stopped after 700+ iterations.
import warnings
from pandas import datetime
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
import itertools
warnings.filterwarnings("ignore")
#evaluate SARIMAX model for given order & seasonal order
def evaluate_sarimax_model(dataset, param, seasonal_param):
#train test split
train_size = int(len(dataset) * 0.60)
train, test = dataset[0:train_size], dataset[train_size:]
try:
mod = sm.tsa.statespace.SARIMAX(dataset,
order = param,
seasonal_order = seasonal_param,
enforce_stationarity = False,
enforce_invertibility = False)
results = mod.fit()
pred = results.get_prediction(start = test.iloc(train_size)[0].name, dynamic = False)
pred_ci = pred.conf_int()
error = mean_squared_error(test, pred.predicted_mean)
return error
print('SARIMAX results for {} {} AIC {} MSE {}'.format(param, seasonal_param, results.aic, mse))
except:
return float("inf")
def evaluate_models(dataset, p_values, d_values, q_values, s_value):
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], s_value) for x in list(itertools.product(p, d, q))]
best_score = float("inf")
best_param = best_seasonal_param = None
for param in pdq:
for seasonal_param in seasonal_pdq:
try:
mse = evaluate_sarimax_model(dataset, param, seasonal_param)
#print(best_score, mse)
if mse < best_score:
best_score, best_param, best_seasonal_param = mse, param, seasonal_param
print('SARIMAX results for {} {} MSE {}'.format(param, seasonal_param, mse))
except:
continue
print('Best results are mse {} order {} seasonal order {}'.format(best_score, best_param, best_seasonal_param))
The output of the above code is suppressed as the above process may run for hours based on the capacity of the system. This is the code used to do the grid search for hyper parameter tuning.
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
mod = sm.tsa.statespace.SARIMAX(allhomes,
order = (2, 0, 4),
seasonal_order = (3, 1, 2, 12),
enforce_stationarity = False,
enforce_invertibility = False)
results = mod.fit()
results.plot_diagnostics(figsize=(15,12))
plt.show()
From the above graph, we can see that
- The histogram has minor difference with KDE
- Linear regression can be improved
- There are still positive correlation, this can be optimized further.
Validate the Model¶
Let us validate the model by Train Test & Split.
train_size = int(len(allhomes) * 0.60)
train, test = allhomes[0:train_size], allhomes[train_size:]
pred = results.get_prediction(start = test.iloc(train_size)[0].name, dynamic = False)
pred_ci = pred.conf_int()
ax = allhomes.plot(label='actual', figsize=(10,8))
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=0.7, color='red')
plt.legend()
plt.show()
The predictions above is not perfect but it is still better compare to other parameters. Let us see the Mean Squared Error of the model.
from sklearn.metrics import mean_squared_error
error = mean_squared_error(test, pred.predicted_mean)
print('MSE {}'.format(error))
The mean squared error is 0.171, so it means the forecast can be till improved. Now let us predict for the future.
pred_uc = results.get_forecast(steps=24)
pred_ci = pred_uc.conf_int()
ax = allhomes.plot(label = 'Actual', figsize=(15,8))
pred_uc.predicted_mean.plot(ax=ax, label='Forecasted')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:,0],
pred_ci.iloc[:,1],
color='k', alpha=0.25)
ax.set_xlabel('Date')
ax.set_ylabel('MedianListingPricePerSqft_AllHomes')
plt.legend()
plt.show()
In the above graph you can see the forecast for next 24 months and the confidence interval is also better and not too much variance.