Predicting Boston Housing Prices with Linear Regression
Explore a practical machine learning project using linear regression to predict Boston housing prices. Learn how to prepare the dataset, train a linear regression model, make predictions, and evaluate the model's performance. Discover the key features influencing housing prices and gain insights into applying linear regression for real-world prediction tasks.
Introduction
This project uses the Boston Housing dataset to build a linear regression model that predicts house prices. The dataset contains 506 instances, each described by 13 features and one target value (medv). We'll walk through data preparation, model training, prediction, and evaluation.
The Boston Housing Dataset
The dataset includes the following features (attributes):
Boston Housing Dataset Features
Feature Name | Description |
---|---|
crim | Per capita crime rate by town. |
zn | Proportion of residential land zoned for lots over 25,000 sq.ft. |
indus | Proportion of non-retail business acres per town. |
chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise). |
nox | Nitric oxides concentration (parts per 10 million). |
rm | Average number of rooms per dwelling. |
age | Proportion of owner-occupied units built prior to 1940. |
dis | Weighted distances to five Boston employment centres. |
rad | Index of accessibility to radial highways. |
tax | Full-value property-tax rate per \$10,000. |
ptratio | Pupil-teacher ratio by town. |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
lstat | % lower status of the population. |
medv | Median value of owner-occupied homes in \$1000's. (Target Variable) |
(Data source: Carnegie Mellon University StatLib library)
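Before modeling, it can help to confirm that a loaded table matches the description above. The following is a minimal sketch of such a check, not part of the original project: the helper name `summarize_housing_frame` and the three-row `sample` frame are illustrative assumptions, and the rows are made-up values rather than real Boston data. It also assumes lowercase column names, as listed in the table.

```python
import pandas as pd

# Column names as listed in the feature table (assumed to be lowercase in the file).
EXPECTED_COLUMNS = [
    "crim", "zn", "indus", "chas", "nox", "rm", "age",
    "dis", "rad", "tax", "ptratio", "black", "lstat", "medv",
]


def summarize_housing_frame(df: pd.DataFrame) -> dict:
    """Return basic sanity-check facts about a housing DataFrame (hypothetical helper)."""
    return {
        "n_rows": len(df),
        # Every column except the target; assumes 'medv' is present.
        "n_features": df.shape[1] - 1,
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "n_missing_values": int(df.isna().sum().sum()),
    }


# Made-up rows for illustration only; these are not real observations.
sample = pd.DataFrame(
    [
        [0.1, 0.0, 5.0, 0, 0.50, 6.0, 60.0, 4.0, 1, 300, 15.0, 390.0, 5.0, 24.0],
        [0.2, 12.5, 7.0, 0, 0.52, 6.4, 70.0, 5.0, 2, 310, 17.0, 395.0, 9.0, 21.0],
        [0.3, 0.0, 8.0, 1, 0.55, 7.1, 50.0, 3.5, 3, 280, 18.0, 380.0, 4.0, 33.0],
    ],
    columns=EXPECTED_COLUMNS,
)

summary = summarize_housing_frame(sample)
print(summary)
```

On the full dataset, the same helper should report 506 rows and 13 features if the file matches the description.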
Data Loading and Preparation
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Load the dataset (replace with your file path)
boston = pd.read_csv("boston_housing.csv")

# Separate features (X) and target (y)
X = boston.drop('medv', axis=1)
y = boston['medv']

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
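To see what `test_size=0.2` means for a dataset of this size, the sketch below runs the same split on placeholder arrays with the dataset's shape. The names `X_demo` and `y_demo` are illustrative, and the arrays hold zeros rather than real data; only the resulting shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape (506 rows, 13 features).
# The values are arbitrary; this only illustrates the split sizes.
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("train:", X_tr.shape, "test:", X_te.shape)
```

With 506 rows, scikit-learn rounds the test fraction up, which gives 102 test rows and 404 training rows. Fixing `random_state` makes the shuffle reproducible between runs.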
Model Training and Prediction
```python
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
```
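A fitted `LinearRegression` exposes its learned weights through `coef_` and `intercept_`, which is one way to look at which features influence the prediction. The sketch below shows the idea on synthetic stand-in data with a known relationship; the feature names `rooms` and `crime` and all values are made up for illustration and are not taken from the Boston dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: two made-up features with a known, noise-free relationship.
X_demo = pd.DataFrame({
    "rooms": rng.normal(6.0, 1.0, 200),
    "crime": rng.normal(3.0, 1.0, 200),
})
y_demo = 5.0 * X_demo["rooms"] - 2.0 * X_demo["crime"] + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name and sort by absolute size.
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(coefs)
print("intercept:", demo_model.intercept_)
```

The same pattern should work with the trained `model` and `X_train.columns`. Note that coefficient sizes depend on each feature's units, so comparing them across features is only meaningful when the features are on comparable scales, for example after standardization.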
Model Evaluation
```python
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")

# Plot the results (optional)
plt.scatter(y_test, y_pred, color='blue')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs. Predicted House Prices")
plt.show()
```
```
Mean Squared Error (MSE): 21.57659368987862
Mean Absolute Error (MAE): 3.191472667461123
```

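MSE and MAE both measure prediction error, and lower values are better. Because medv is recorded in \$1000s, an MAE of about 3.19 means the predictions miss the actual median value by roughly \$3,190 on average. The MSE of about 21.58 is in squared units; its square root, the root mean squared error, is about 4.65, or roughly \$4,650. In the scatter plot, points that lie closer to the diagonal where predicted price equals actual price indicate more accurate predictions.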
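Two further metrics are often reported alongside MSE and MAE: the root mean squared error (RMSE), which is in the same units as the target, and the coefficient of determination (R²). The sketch below computes them by hand with NumPy to make the formulas explicit. The helper name `regression_report` and the three sample values are illustrative assumptions, not results from the model above.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Compute MSE, RMSE, and R^2 from actual and predicted values (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))  # same units as the target

    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "r2": r2}


# Made-up values for illustration only.
report = regression_report([20.0, 30.0, 40.0], [22.0, 29.0, 37.0])
print(report)
```

For the project itself, calling `regression_report(y_test, y_pred)` should give the same MSE as `mean_squared_error`, and scikit-learn's `r2_score` can be used to cross-check the R² value.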