Machine Learning and Statistics Project 2019
The Boston dataset was first published in 1978 in the paper "Hedonic Housing Prices and the Demand for Clean Air" by Harrison and Rubinfeld. Its 506 entries contain aggregated data on 14 features of homes in the Boston area, including the per capita crime rate (CRIM), the average number of rooms per dwelling (RM), the proportion of owner-occupied units built prior to 1940 (AGE), and more. The dataset is widely used in machine learning papers that address regression problems.
- Use descriptive statistics and plots to describe the Boston House Prices dataset.
- Use inferential statistics to analyse whether there is a significant difference in median house prices between houses that are along the Charles river and those that aren’t. You should explain and discuss your findings within the notebook.
- Use keras to create a neural network that can predict the median house price based on the other variables in the dataset.
I recommend using nbviewer to view this file: https://nbviewer.jupyter.org/github/RitRa/MachineLearning-project/blob/master/Machine%20Learning%20and%20Statistics%20Project%202019.ipynb
This project concerns the Boston House Prices dataset and the Python packages SciPy, Keras, and Jupyter.
I recommend installing Jupyter using the Anaconda distribution to run this project.
Dataset used
```python
from sklearn.datasets import load_boston

boston_df = load_boston()
```

Libraries used in this Jupyter Notebook include:
- Pandas: an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- NumPy: the fundamental package for scientific computing with Python.
- Matplotlib: a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Seaborn: a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- researchpy: produces Pandas DataFrames that contain relevant statistical testing information that is commonly required for academic research.

  ```
  conda install -c researchpy researchpy
  ```

- statsmodels: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

  ```
  conda install -c anaconda statsmodels
  conda install -c conda-forge ipywidgets
  ```

- Keras: a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
  ```
  conda install -c conda-forge keras
  ```

```python
import numpy as np
import pandas as pd

# charts
import seaborn as sns
import matplotlib.pyplot as plt

# for creating a folder for plots
import os

# statistical analysis
import researchpy as rp
import statsmodels.api as sm

# interactive widgets for charts
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# machine learning
from keras.models import Sequential
from keras.layers import Dense
```

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
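The preview above indexes a pandas DataFrame named `boston` that includes the target as a MEDV column, whereas `load_boston()` returns a scikit-learn Bunch object. The conversion cell is not shown in this file; a minimal sketch of what it presumably looks like (the names `boston` and `MEDV` are assumptions taken from the later cells):

```python
# Build a DataFrame from the Bunch: the 13 feature columns plus the target as MEDV
boston = pd.DataFrame(boston_df.data, columns=boston_df.feature_names)
boston['MEDV'] = boston_df.target
```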
describe() gives us a quick overview of the dataset.

```python
summary = boston.describe()
summary = summary.transpose()
summary.head(14)
```

| | count | mean | std | min | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 |
| B | 506.0 | 356.674032 | 91.294864 | 0.32000 | 375.377500 | 391.44000 | 396.225000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 |
Let's look at the correlation between the variables in the dataset:

- Positive correlation: both variables change in the same direction (light colour in the heatmap).
- Neutral correlation: no relationship in the change of the variables.
- Negative correlation: the variables change in opposite directions (dark colour in the heatmap).
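The heatmap itself is not embedded in this file; a minimal sketch of how it could be produced with Seaborn (assuming the `boston` DataFrame built above):

```python
# Correlation matrix of all columns, including the MEDV target
corr = boston.corr()

# Heatmap of the correlations: light cells are strongly positive, dark cells strongly negative
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f')
plt.title('Correlation between the Boston housing variables')
plt.show()
```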
From the correlation heatmap:
- MEDV and RM are positively correlated (0.69): as the average number of rooms increases, the price of the house also increases.
- MEDV and LSTAT (% lower status of the population) are negatively correlated (-0.76).
- MEDV and PTRATIO (pupil-teacher ratio by town) are negatively correlated (-0.52).
- MEDV and INDUS (proportion of non-retail business acres per town) are negatively correlated (-0.6).
Let's plot these for more detail:
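Those plots are not embedded in this file either; a minimal sketch of scatter plots of MEDV against the most strongly correlated variables (again assuming the `boston` DataFrame from above):

```python
# Scatter plots of MEDV against the variables highlighted by the heatmap
fig, axes = plt.subplots(1, 4, figsize=(20, 4), sharey=True)
for ax, col in zip(axes, ['RM', 'LSTAT', 'PTRATIO', 'INDUS']):
    ax.scatter(boston[col], boston['MEDV'], alpha=0.5)
    ax.set_xlabel(col)
axes[0].set_ylabel('MEDV')
plt.show()
```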

Use inferential statistics to analyse whether there is a significant difference in median house prices between houses that are along the Charles river and those that aren’t.
Visualise the spread of the data with box plots of houses along the Charles River versus those that are not; a sketch is shown below.
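The cells that construct `riverhouse_df` and `other_df` are not included in this file; judging by the descriptives below, MEDV was also rescaled from $1000s to dollars and some duplicate rows were removed first. A minimal sketch of the split and box plots, leaving that preprocessing aside and assuming the `boston` DataFrame from above:

```python
# CHAS is a dummy variable: 1 if the tract bounds the Charles River, 0 otherwise
riverhouse_df = boston[boston['CHAS'] == 1]
other_df = boston[boston['CHAS'] == 0]

# Box plots of median house price for river and non-river houses
sns.boxplot(x='CHAS', y='MEDV', data=boston)
plt.xlabel('Bounds Charles River (0 = no, 1 = yes)')
plt.ylabel('MEDV (in $1000s)')
plt.show()
```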
Using a t-test we can find out whether there is a statistically significant difference between the two samples. I will use the researchpy library for this section.
```python
descriptives, results = rp.ttest(other_df['MEDV'], riverhouse_df['MEDV'])
descriptives
```

| Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|
| MEDV | 461.0 | 21488.503254 | 7898.848164 | 367.886036 | 20765.557728 | 22211.448780 |
| MEDV | 29.0 | 23979.310345 | 7024.161328 | 1304.354013 | 21307.462269 | 26651.158421 |
| combined | 490.0 | 21635.918367 | 7865.301063 | 355.318083 | 20937.779775 | 22334.056960 |
This gives us a good overview of the two samples. We can see that there is not a large difference in the means of the two groups, and the standard deviations are also similar. The second DataFrame returned, results, contains the t-test itself:
| | Independent t-test | results |
|---|---|---|
| 0 | Difference (MEDV - MEDV) = | -2490.8071 |
| 1 | Degrees of freedom = | 488.0000 |
| 2 | t = | -1.6571 |
| 3 | Two side test p value = | 0.0981 |
| 4 | Difference > 0 p value = | 0.0491 |
| 5 | Difference < 0 p value = | 0.9509 |
| 6 | Cohen's d = | -0.3172 |
| 7 | Hedge's g = | -0.3168 |
| 8 | Glass's delta = | -0.3153 |
| 9 | r = | 0.0748 |
Results: since the p-value (0.0981) is greater than 0.05, there is no statistically significant difference between the two groups. Prior to removing the duplicates, the results showed a p-value of less than 0.05.
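The same comparison can be reproduced with SciPy as a cross-check; a minimal sketch (assuming the `other_df`/`riverhouse_df` split above, and the pooled-variance test indicated by the 488 degrees of freedom in the table):

```python
from scipy import stats

# Independent two-sample t-test on median house prices (equal variances assumed)
t_stat, p_value = stats.ttest_ind(other_df['MEDV'], riverhouse_df['MEDV'])
print(f"t = {t_stat:.4f}, two-sided p-value = {p_value:.4f}")
```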
We will use Keras to create a neural network that can predict the median house price based on the other variables in the dataset. Neural networks are a set of algorithms, modelled loosely on the human brain, that are designed to recognise patterns.
```python
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
```

We will only take the variables that had a strong correlation with MEDV; MEDV will be the variable we seek to predict.
```python
# features = boston.iloc[:, 0:13]
features = boston[['RM', 'LSTAT', 'PTRATIO', 'INDUS']]

# target is the price, boston[['MEDV']]
prices = boston.iloc[:, 13]
```

A best practice is to normalise the data before feeding it to the neural network, as this tends to give better results.
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
features = scaler.fit_transform(features)

# reshape the target to a column vector and scale it to the [0, 1] range as well
prices = prices.values.reshape(-1, 1)
prices = scaler.fit_transform(prices)
```

Now we need to split the data into two subsets; the training subset will be used to fit the neural network.
```python
from sklearn.model_selection import train_test_split

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.20, random_state=50)
```

With our training and test data set up, we are now ready to build our model.
We create the model using 1 input layer, 2 hidden layers and 1 output layer.
As this is a regression problem, the loss function we use is mean squared error, and the metrics against which we evaluate the performance of the model are mean absolute error and accuracy. The output layer should also use a linear activation.
```python
from keras.callbacks import ModelCheckpoint

model = Sequential()
n_cols = X_train.shape[1]

# The input layer with ReLU activation
model.add(Dense(128, kernel_initializer='normal', input_shape=(n_cols,), activation='relu'))

# The hidden layers
model.add(Dense(256, kernel_initializer='normal', activation='relu'))
model.add(Dense(128, kernel_initializer='normal', activation='relu'))

# The output layer: a single linear unit for the predicted price
model.add(Dense(1, kernel_initializer='normal', activation='linear'))

print(model.summary())

model.compile(loss='mse', optimizer='adam', metrics=['mean_absolute_error', 'accuracy'])
```
```python
from keras.callbacks import History

history = History()

# Fit the model, holding out 20% of the training data for validation
hist = model.fit(X_train, y_train, validation_split=0.20, epochs=100, batch_size=6, callbacks=[history])
```
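The training curves and evaluation output that followed in the notebook are not embedded in this file; a minimal sketch of how the fitted model could be inspected (using the variables defined above):

```python
import matplotlib.pyplot as plt

# Plot training vs validation loss across the epochs
plt.plot(hist.history['loss'], label='training loss')
plt.plot(hist.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()

# Evaluate on the held-out test set: loss, mean absolute error, accuracy
print(model.evaluate(X_test, y_test))
```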