How to Build a Linear Regression Model from Scratch

Building a Linear Regression model from scratch involves implementing the fundamental mathematical principles behind it, without using machine learning libraries like sklearn. Below is a step-by-step guide:



1. Understanding Linear Regression

Linear Regression aims to model the relationship between a dependent variable Y and one or more independent variables X using a linear equation:

Y = mX + b

where:

  • m (slope) determines the direction and steepness of the line.
  • b (intercept) is the value of Y when X = 0.

For multiple variables (features), the equation extends to:

Y = W_1 X_1 + W_2 X_2 + ... + W_n X_n + b

where:

  • W represents the weights (coefficients).
  • X represents the independent variables (features).
  • b is the bias term.

The goal of training is to find the best values for m (or W) and b.
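As an aside, the multi-feature equation maps directly onto a single vectorized NumPy expression. The sketch below is illustrative only (the shapes and weight values are made up) and is separate from the single-feature implementation that follows:

import numpy as np

# Hypothetical example: 5 samples, 3 features (shapes chosen purely for illustration)
X = np.random.rand(5, 3)               # feature matrix, shape (n, k)
W = np.array([[2.0], [-1.0], [0.5]])   # one weight per feature, shape (k, 1)
b = 4.0                                # bias term
Y_pred = X @ W + b                     # vectorized W_1*X_1 + ... + W_k*X_k + b, shape (n, 1)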


2. Implementing Linear Regression from Scratch

We will use Gradient Descent to optimize our model parameters.
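Concretely, Gradient Descent starts from an initial guess and repeatedly nudges m and b in the direction that lowers the cost. With the MSE cost defined in Step 4, the standard update rules (written in the same plain notation as above) are:

dm = (1/n) * sum((y_pred - y) * X)
db = (1/n) * sum(y_pred - y)
m  = m - learning_rate * dm
b  = b - learning_rate * db

These are exactly the updates implemented in Step 5.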

Step 1: Import Libraries

import numpy as np
import matplotlib.pyplot as plt

Step 2: Generate Sample Data

# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # 100 values drawn uniformly from [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)  # Linear relation y = 4 + 3X + noise

# Plot the data
plt.scatter(X, y, color='blue')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Generated Data")
plt.show()

Step 3: Initialize Parameters

# Initialize model parameters
m = 0  # Slope
b = 0  # Intercept
learning_rate = 0.01  # Step size for each gradient descent update
epochs = 1000  # Number of gradient descent iterations

Step 4: Define the Cost Function (Mean Squared Error)

def compute_cost(X, y, m, b):
    n = len(y)
    y_pred = m * X + b
    cost = (1/(2*n)) * np.sum((y_pred - y) ** 2)  # Half the MSE; the 1/2 cancels when differentiating
    return cost
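As a quick sanity check before training (assuming X and y from Step 2 and the initial parameters from Step 3), the starting cost can be printed:

initial_cost = compute_cost(X, y, m, b)
print(f"Initial cost: {initial_cost}")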

Step 5: Implement Gradient Descent

def gradient_descent(X, y, m, b, learning_rate, epochs):
    n = len(y)
    cost_history = []  # Store cost at each step

    for i in range(epochs):
        y_pred = m * X + b
        error = y_pred - y
        
        # Compute gradients
        dm = (1/n) * np.sum(error * X)  # Derivative w.r.t. m
        db = (1/n) * np.sum(error)  # Derivative w.r.t. b
        
        # Update parameters
        m -= learning_rate * dm
        b -= learning_rate * db

        # Store cost
        cost = compute_cost(X, y, m, b)
        cost_history.append(cost)

        # Print cost at intervals
        if i % 100 == 0:
            print(f"Epoch {i}: Cost = {cost}")

    return m, b, cost_history

Step 6: Train the Model

m_final, b_final, cost_history = gradient_descent(X, y, m, b, learning_rate, epochs)
print(f"Final Parameters: m = {m_final}, b = {b_final}")

Step 7: Visualizing the Regression Line

plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, m_final * X + b_final, color='red', label="Predicted Line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title("Linear Regression Fit")
plt.show()

Step 8: Predict on New Data

def predict(X_new, m, b):
    return m * X_new + b

X_new = np.array([[1.5]])
y_pred = predict(X_new, m_final, b_final)
print(f"Prediction for X={X_new[0][0]}: y={y_pred[0][0]}")

Summary

  1. Generated Data: Created sample data with a linear relationship.
  2. Defined Cost Function: Used Mean Squared Error (MSE).
  3. Implemented Gradient Descent: Optimized parameters m and b.
  4. Trained Model: Found the best line using multiple iterations.
  5. Visualized Results: Plotted the regression line.
  6. Made Predictions: Used the trained model for new inputs.

Next Steps

  • Multiple Linear Regression: Extend to multiple features.
  • Polynomial Regression: Model non-linear relationships with polynomial features.
  • Regularization: Implement Lasso or Ridge Regression.
  • Optimization: Use advanced optimizers like Adam.

Would you like an implementation for multiple regression as well? 🚀