I analyzed the California Housing dataset using Python and machine learning models to estimate house prices. The dataset was introduced in the 1997 paper "Sparse Spatial Autoregressions" by R. Kelley Pace and Ronald Barry, published in Statistics and Probability Letters. The data itself comes from the 1990 U.S. Census, with variables collected at the block group level.
I started by importing all the packages that will be needed, loading the data from the CSV file, and getting a summary of the data.
import pandas as pd
import numpy as np
# Data Visualization
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import matplotlib.image as mpimg # To import maps and set them as background
import seaborn as sns
import plotly.express as px
# Machine Learning
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import mean_squared_error
import scipy.stats as stats
blocks = pd.read_csv('housing.csv')
blocks.head()
blocks.info()
This dataset is almost perfectly clean, with the exception of 207 missing values in total_bedrooms.
blocks.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'median_house_value', 'ocean_proximity'],
dtype='object')
Looking at the blocks with missing values in total_bedrooms.
blocks_na = blocks[blocks.total_bedrooms.isna()]
blocks_na.head()
blocks_na.ocean_proximity.value_counts()
<1H OCEAN 102
INLAND 55
NEAR OCEAN 30
NEAR BAY 20
Name: ocean_proximity, dtype: int64
blocks[blocks.duplicated()]
There are no duplicates in the dataset.
blocks.describe()
blocks.describe(include = 'O') #To describe categorical columns
blocks.ocean_proximity.value_counts()
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
blocks.total_rooms.value_counts().head()
1527.00 18
1613.00 17
1582.00 17
2127.00 16
1703.00 15
Name: total_rooms, dtype: int64
blocks.total_bedrooms.value_counts().head()
280.00 55
331.00 51
345.00 50
393.00 49
343.00 49
Name: total_bedrooms, dtype: int64
Exploratory Visualization
sns.pairplot(blocks)
# Creating a histogram for each numerical feature
blocks.hist(bins = 50, figsize = (20,15))
plt.title("Histogram of all numerical features")
plt.figtext(0.5, 0.01, "Data analyzed by Ashek Ag Mohamed", wrap=True, horizontalalignment='center', fontsize=12)
plt.show()
Data Cleaning and Creation of Additional Features
I will drop the missing values here. However, a better approach would be to replace them with the median value of the category they belong to, instead of removing roughly 1% of the dataset.
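For reference, a minimal sketch of that alternative (not applied here), filling each missing total_bedrooms with the median of its ocean_proximity group:
blocks_filled = blocks.copy() # Hypothetical alternative: grouped-median imputation instead of dropna
blocks_filled['total_bedrooms'] = blocks_filled.groupby('ocean_proximity')['total_bedrooms'].transform(lambda s: s.fillna(s.median()))
blocks_filled.total_bedrooms.isna().sum() # Expected to be 0 after the grouped imputation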
blocks.dropna(inplace = True)
blocks['rooms_per_household'] = blocks.total_rooms.div(blocks.households)
blocks.head()
# Largest rooms per household values
blocks.rooms_per_household.nlargest(10)
1914 141.909091
1979 132.533333
12447 62.422222
1913 61.812500
11862 59.875000
1912 56.269231
9676 52.848214
11707 52.690476
2395 50.837838
1240 47.515152
Name: rooms_per_household, dtype: float64
# Smallest rooms per household values
blocks.rooms_per_household.nsmallest(10)
5916 0.846154
8219 0.888889
3126 1.000000
14818 1.130435
17820 1.130435
4552 1.260870
4550 1.378486
4587 1.411290
4602 1.465753
12484 1.550409
Name: rooms_per_household, dtype: float64
blocks['pop_per_household'] = blocks.population.div(blocks.households)
blocks['bedrooms_per_room'] = blocks.total_bedrooms.div(blocks.total_rooms)
Exploratory Data Analysis
blocks.median_house_value.hist(bins=80, figsize = (15,8))
plt.show()
blocks.corr()
blocks.corr().to_csv("block_corr.csv")
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| longitude | 1 | -0.92466 | -0.1082 | 0.044568 | 0.069608 | 0.099773 | 0.05531 | -0.01518 | -0.04597 | -0.02754 |
| latitude | -0.92466 | 1 | 0.011173 | -0.0361 | -0.06698 | -0.10878 | -0.07104 | -0.07981 | -0.14416 | 0.106389 |
| housing_median_age | -0.1082 | 0.011173 | 1 | -0.36126 | -0.32045 | -0.29624 | -0.30292 | -0.11903 | 0.105623 | -0.15328 |
| total_rooms | 0.044568 | -0.0361 | -0.36126 | 1 | 0.93038 | 0.857126 | 0.918484 | 0.19805 | 0.134153 | 0.133798 |
| total_bedrooms | 0.069608 | -0.06698 | -0.32045 | 0.93038 | 1 | 0.877747 | 0.979728 | -0.00772 | 0.049686 | 0.001538 |
| population | 0.099773 | -0.10878 | -0.29624 | 0.857126 | 0.877747 | 1 | 0.907222 | 0.004834 | -0.02465 | -0.07221 |
| households | 0.05531 | -0.07104 | -0.30292 | 0.918484 | 0.979728 | 0.907222 | 1 | 0.013033 | 0.065843 | -0.0806 |
| median_income | -0.01518 | -0.07981 | -0.11903 | 0.19805 | -0.00772 | 0.004834 | 0.013033 | 1 | 0.688075 | 0.326895 |
| median_house_value | -0.04597 | -0.14416 | 0.105623 | 0.134153 | 0.049686 | -0.02465 | 0.065843 | 0.688075 | 1 | 0.151948 |
| rooms_per_household | -0.02754 | 0.106389 | -0.15328 | 0.133798 | 0.001538 | -0.07221 | -0.0806 | 0.326895 | 0.151948 | 1 |
blocks.corr().median_house_value.sort_values(ascending = False)
median_house_value 1.000000
median_income 0.688355
rooms_per_household 0.151344
total_rooms 0.133294
housing_median_age 0.106432
households 0.064894
total_bedrooms 0.049686
pop_per_household -0.023639
population -0.025300
longitude -0.045398
latitude -0.144638
bedrooms_per_room -0.255880
Name: median_house_value, dtype: float64
blocks.median_income.hist(bins=80, figsize = (15,8))
plt.show()
sns.set(font_scale = 1.5)
sns.jointplot(data = blocks, x = 'median_income', y = 'median_house_value', kind = 'kde', height = 10) #kernel density estim.
plt.show()
blocks.plot(kind = 'scatter', x = 'longitude', y = 'latitude',
s = blocks.population/100, label = 'Population', figsize = (15,10),
c = 'median_house_value', cmap = 'coolwarm',
colorbar = True, alpha = 0.4, fontsize = 15, sharex = False
)
plt.ylabel('Latitude', fontsize = 14)
plt.xlabel('Longitude', fontsize = 14)
plt.legend(fontsize = 16)
plt.show()
Importing the California Map
california_img = mpimg.imread('california.png') # Image stored as a numpy array.
plt.figure(figsize = (15,10))
plt.imshow(california_img, extent = [-124.55, -113.80, 32.45, 42.05]) # Including coordinates
plt.show()
Combining the Map with previous plots
blocks.plot(kind = 'scatter', x = 'longitude', y = 'latitude',
s = blocks.population/100, label = 'Population', figsize = (25,15),
c = 'median_house_value', cmap = 'coolwarm',
colorbar = True, alpha = 0.4, fontsize = 15, sharex = False
)
plt.imshow(california_img, extent = [-124.55, -113.80, 32.45, 42.05], alpha = 0.5,
cmap = plt.get_cmap('jet'))
plt.ylabel('Latitude', fontsize = 14)
plt.xlabel('Longitude', fontsize = 14)
plt.legend(fontsize = 16)
plt.show()
Looking at houses near the ocean
prox = blocks.ocean_proximity.unique()
prox
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
dtype=object)
blocks_loc = blocks[blocks.ocean_proximity == prox[3]].copy() # Slice for the NEAR OCEAN category (prox[3])
blocks_loc.plot(kind = 'scatter', x = 'longitude', y = 'latitude',
s = blocks_loc.population/100, label = 'Population', figsize = (25,15),
c = 'median_house_value', cmap = 'coolwarm',
colorbar = True, alpha = 0.4, fontsize = 20, sharex = False
)
plt.imshow(california_img, extent = [-124.55, -113.80, 32.45, 42.05], alpha = 0.5,
cmap = plt.get_cmap('jet'))
plt.ylabel('Latitude', fontsize = 14)
plt.xlabel('Longitude', fontsize = 14)
plt.legend(fontsize = 16)
plt.show()
Advanced Exploratory Data Analysis
blocks.median_income.hist(bins=50, figsize = (15,10))
plt.title('Median Income')
plt.show()
blocks['income_cat'] = pd.qcut(blocks.median_income, q = [0, 0.25, 0.5, 0.75, 0.95, 1],
labels = ['Low', "Below_Average", 'Above_Average', 'High', 'Very_High']) # Categorising income into 5 groups based on the percentile range each value falls into
blocks.income_cat.value_counts(normalize = True)
Above_Average 0.250037
Low 0.250037
Below_Average 0.249988
High 0.199922
Very_High 0.050017
Name: income_cat, dtype: float64
plt.figure(figsize = (12,8))
sns.set(font_scale = 1.5, palette = 'viridis')
sns.countplot(data = blocks, x = 'income_cat', hue = 'ocean_proximity')
plt.legend(loc = 1, fontsize = 14)
plt.show()
plt.figure(figsize = (13,7))
sns.countplot(x = 'ocean_proximity', hue = 'income_cat', data = blocks, palette = 'viridis')
plt.xlabel('Ocean Proximity', fontsize = 15)
plt.show()
plt.figure(figsize = (12,8))
sns.set(font_scale = 1.5)
sns.barplot(data = blocks, x = 'income_cat', y = 'median_house_value', dodge = True)
plt.show()
This makes sense: the median house value increases as the residents' median income increases.
plt.figure(figsize = (12,8))
sns.set(font_scale = 1.5)
sns.barplot(data = blocks, x = 'ocean_proximity', y = 'median_house_value', dodge = True)
plt.show()
matrix = blocks.groupby(['income_cat', 'ocean_proximity']).median_house_value.mean().unstack().drop(columns = ['ISLAND'])
matrix.astype('int')
plt.figure(figsize = (12,8))
sns.set(font_scale = 1.4)
sns.heatmap(matrix.astype('int'), cmap = 'Reds', annot = True, fmt = 'd', vmin = 90000, vmax = 470000)
plt.show()
Feature Engineering
Some feature engineering is needed before a machine learning model can be fitted for forecasting.
label = blocks.median_house_value.copy() # What to forecast through the model
features = blocks.drop(columns = ['median_house_value']) # Everything else
features.select_dtypes('float')
Numerical features on different scales can negatively impact our machine learning model. Therefore, it is best to standardize them using z-scores.
feat1 = features.select_dtypes('float').apply(lambda x: stats.zscore(x))
pd.options.display.float_format = '{:.2f}'.format # Viewing only two decimal points.
feat1.agg(['mean', 'std'])
The data is standardized because all features have a mean of 0 and a standard deviation of 1.
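An equivalent way to obtain the same z-scores, and one that plugs more easily into scikit-learn pipelines, is StandardScaler; a minimal sketch assuming the same features frame:
from sklearn.preprocessing import StandardScaler
num_cols = features.select_dtypes('float').columns
scaler = StandardScaler() # Standardizes to mean 0 and unit variance, like stats.zscore
feat1_alt = pd.DataFrame(scaler.fit_transform(features[num_cols]), columns = num_cols, index = features.index)
feat1_alt.agg(['mean', 'std']) # Means ~0, standard deviations ~1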
Some machine learning models cannot handle categorical data, so we need to transform it into numerical values.
features.ocean_proximity.value_counts()
<1H OCEAN 9034
INLAND 6496
NEAR OCEAN 2628
NEAR BAY 2270
ISLAND 5
Name: ocean_proximity, dtype: int64
# We use one-hot encoding - binary values 0 and 1 - to transform the categorical feature into numerical columns
dummies = pd.get_dummies(features.ocean_proximity)
dummies.head()
# Concatenating our various data
features = pd.concat([feat1, dummies, blocks.income_cat], axis = 1)
The dummy columns are redundant: the value of the last column can be determined from the values of the others. This multicollinearity poses a problem for certain machine learning algorithms, such as linear models, and would require dropping one of the columns. The Random Forest model we use here is not affected by this multicollinearity.
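For completeness, a minimal sketch of the variant a linear model would call for, dropping one dummy column at encoding time (hypothetical here, since the Random Forest keeps the full set):
dummies_lin = pd.get_dummies(blocks.ocean_proximity, drop_first = True) # Drops the first category to break the multicollinearity
dummies_lin.head() # One fewer column than the full dummy set above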
Splitting the Data into Training and Test Set
test_size = 0.2
# Randomly splitting our dataset
X_test = features.sample(frac = test_size, random_state = 123) # Random 20% sample of the rows (frac = test_size)
X_test.income_cat.value_counts(normalize = True)
Above_Average 0.25
Below_Average 0.25
Low 0.25
High 0.20
Very_High 0.05
Name: income_cat, dtype: float64
features.income_cat.value_counts(normalize = True)
Above_Average 0.25
Low 0.25
Below_Average 0.25
High 0.20
Very_High 0.05
Name: income_cat, dtype: float64
X_test is representative of the dataset, at least when considering the income_cat feature.
X_test.index
Int64Index([14354, 12908, 19545, 12188, 14786, 9941, 3179, 4650, 15550, 17190,
...
3992, 10261, 10862, 10863, 13864, 10262, 3614, 19296, 5826, 15383],
dtype='int64', length=4087)
X_train = features.loc[~features.index.isin(X_test.index)].copy()
X_train = X_train.sample(frac = 1, random_state = 123) # Working with a shuffled data frame
# Don't need the income_cat column for our model
X_train.drop(columns = ['income_cat'], inplace = True)
X_test.drop(columns = ['income_cat'], inplace = True)
y_train = label.loc[X_train.index]
y_test = label.loc[X_test.index]
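An alternative split that preserves the income_cat proportions by construction is scikit-learn's train_test_split with stratification; a minimal sketch under the same 80/20 split and random seed:
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    features.drop(columns = ['income_cat']), label,
    test_size = 0.2, random_state = 123,
    stratify = features.income_cat) # Each split keeps the income_cat proportions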
Training the ML Model (Random Forest Regressor)
forest_reg = RandomForestRegressor(random_state = 42, n_estimators = 500,
max_features = 'sqrt', max_depth = 75, min_samples_split = 2)
forest_reg.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=75, max_features='sqrt', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=None, oob_score=False,
random_state=42, verbose=0, warm_start=False)
forest_reg.score(X_train, y_train)
0.9758470860678036
This score is very high and may indicate overfitting.
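One way to check for overfitting is to cross-validate on the training set; a minimal sketch with 5 folds (assuming scikit-learn >= 0.22 for the 'neg_root_mean_squared_error' scorer; refitting the 500-tree forest five times is slow):
from sklearn.model_selection import cross_val_score
cv_rmse = -cross_val_score(forest_reg, X_train, y_train,
                           scoring = 'neg_root_mean_squared_error', cv = 5)
cv_rmse.mean() # Out-of-fold RMSE, typically noticeably higher than the training RMSE computed below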
pred = forest_reg.predict(X_train)
pred
array([238374.608, 246813.8 , 74191.4 , ..., 184930.402, 117595.4 , 187186.008])
forest_mse = mean_squared_error(y_train, pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
18023.671210966968
We can conclude that our model fits the training dataset quite well.
Testing the model with the Test Set
forest_reg.score(X_test, y_test)
0.825152593253362
pred2 = forest_reg.predict(X_test)
pred2
array([224965.808, 129246.8 , 67903. , ..., 222406.402, 322985.02 , 268692.8 ])
forest_mse2 = mean_squared_error(y_test, pred2)
forest_rmse2 = np.sqrt(forest_mse2)
forest_rmse2
47348.34022326726
comp = pd.DataFrame(data = {'True_V':y_test, 'Pred':pred2})
comp
ae = comp.True_V.sub(comp.Pred).abs()
ae
14354 123165.81
12908 83753.20
19545 9103.00
12188 75950.02
14786 13871.60
...
10262 24362.20
3614 7963.20
19296 6506.40
5826 3114.98
15383 22807.20
Length: 4087, dtype: float64
mae = ae.mean()
mae
31722.632696354325
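The same value can be computed directly with scikit-learn:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, pred2) # Matches the manual mean absolute error above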
Feature Importance
forest_reg.feature_importances_
array([8.43251997e-02, 7.63824246e-02, 4.20571593e-02, 2.28712182e-02,
2.01285036e-02, 2.28577613e-02, 1.96337714e-02, 2.80813190e-01,
6.50088158e-02, 9.89248218e-02, 9.71199460e-02, 1.91328297e-02,
1.36700488e-01, 2.65441330e-04, 5.35959721e-03, 8.41883258e-03])
feature_imp = pd.Series(data = forest_reg.feature_importances_, index = X_train.columns).sort_values(ascending = False)
feature_imp
median_income 0.28
INLAND 0.14
pop_per_household 0.10
bedrooms_per_room 0.10
longitude 0.08
latitude 0.08
rooms_per_household 0.07
housing_median_age 0.04
total_rooms 0.02
population 0.02
total_bedrooms 0.02
households 0.02
<1H OCEAN 0.02
NEAR OCEAN 0.01
NEAR BAY 0.01
ISLAND 0.00
dtype: float64
feature_imp.sort_values().plot.barh(figsize = (12,8))
plt.show()
We could drop low-importance features such as ISLAND, NEAR BAY, NEAR OCEAN, <1H OCEAN, households, and total_bedrooms and rerun the model for potentially better results.
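A minimal sketch of that follow-up, retraining on the reduced feature set (the choice of dropped columns is taken from the importance plot above, and any improvement would need to be verified against the scores already reported):
low_imp = ['ISLAND', 'NEAR BAY', 'NEAR OCEAN', '<1H OCEAN', 'households', 'total_bedrooms'] # Low-importance columns from the plot above
forest_reg_small = RandomForestRegressor(random_state = 42, n_estimators = 500,
                                         max_features = 'sqrt', max_depth = 75, min_samples_split = 2)
forest_reg_small.fit(X_train.drop(columns = low_imp), y_train)
forest_reg_small.score(X_test.drop(columns = low_imp), y_test) # Compare against the 0.825 test score above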
metrics.explained_variance_score(y_test, pred2)
0.8252854674155055
Our Random Forest model explains about 82.5% of the variance in the test set's house values.
Next, I will fit a Linear Regression model and compare the results.
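As a preview, a minimal sketch of that comparison using the already-imported LinearRegression (for a linear model, one redundant dummy column would ideally be dropped first, as noted in the feature engineering section):
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg.score(X_test, y_test) # R^2 on the test set, to compare with the Random Forest
np.sqrt(mean_squared_error(y_test, lin_reg.predict(X_test))) # Test RMSE for the linear model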