Stephen-Austine/Linear_Regression_Wave_Dataset

Linear Regression on Wave Dataset

This repository contains a Python implementation of a Simple Linear Regression model. Using the synthetic wave dataset from the mglearn library, the project demonstrates how to model the relationship between a single feature and a continuous target variable.

Overview

The goal of this script is to illustrate the fundamental steps of supervised learning:

  1. Data Synthesis: Creating a non-linear "wave" pattern.
  2. Visualization: Understanding data distribution before modeling.
  3. Data Splitting: Partitioning data into training and testing sets to evaluate generalization.
  4. Modeling: Fitting a Linear Regression line to the data.
  5. Evaluation: Using the R^2 (coefficient of determination) metric to assess performance.

Prerequisites & Installation

To run this code, you will need Python 3.x and the following libraries:

pip install scikit-learn mglearn matplotlib

Project Structure

1. Data Generation & Visualization

The script uses mglearn.datasets.make_wave to generate a synthetic dataset with one input feature and a continuous target. Because the data follows a noisy, wave-shaped trend, it makes it easy to see where a straight line captures the overall direction and where it misses the curvature.
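To give a feel for the shape of the data without installing mglearn, here is a minimal sketch of a wave-like generator. The helper name make_wave_like, its seed argument, and the exact formula are assumptions made for illustration; this is not the mglearn source, and the script itself calls mglearn.datasets.make_wave.

```python
import numpy as np


def make_wave_like(n_samples=40, seed=42):
    """Hypothetical stand-in for a wave-shaped dataset (not the mglearn code)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-3, 3, size=n_samples)        # one input feature
    trend = np.sin(4 * x) + x                     # linear trend plus a sine "wave"
    y = (trend + rng.normal(size=n_samples)) / 2  # add Gaussian noise, then rescale
    return x.reshape(-1, 1), y                    # X is 2D (n_samples, 1); y is 1D


X, y = make_wave_like(n_samples=40)
print(X.shape, y.shape)  # (40, 1) (40,)
```

The feature matrix is reshaped to two dimensions because scikit-learn estimators expect X with shape (n_samples, n_features), even when there is only one feature.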

2. The Machine Learning Pipeline

  • Split: The data is divided into 80% training and 20% testing data. Setting random_state=42 fixes the shuffle, so the same rows land in each set on every run and the results are reproducible.

  • Training: LinearRegression() fits the model by ordinary least squares: it chooses the slope and intercept that minimize the Mean Squared Error (MSE) on the training data.

  • Mathematical Representation: The model follows the simple linear equation:

    y = wx + b

    Where w is the Coefficient (slope) and b is the Intercept.
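The split and training steps above can be sketched as follows. The data here is a made-up noisy linear trend used only so the example runs on its own; it is not the wave dataset, so the printed numbers will differ from the script's output. The closed-form calculation is included as a cross-check on what fit() computes for a single feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data only: a noisy linear trend standing in for the wave dataset.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.5, size=40)

# 80/20 split; a fixed random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
w, b = model.coef_[0], model.intercept_

# Closed-form least-squares solution for one feature, for comparison:
# w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2), b = mean_y - w * mean_x
x = X_train[:, 0]
x_centered = x - x.mean()
w_manual = np.sum(x_centered * (y_train - y_train.mean())) / np.sum(x_centered ** 2)
b_manual = y_train.mean() - w_manual * x.mean()

print(f"sklearn: w={w:.3f}, b={b:.3f}")
print(f"manual:  w={w_manual:.3f}, b={b_manual:.3f}")
```

Both lines should print the same w and b, since LinearRegression solves the same least-squares problem that the closed-form expressions describe.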

3. Performance Metrics

The model is evaluated using the R^2 Score:

  • Training Score: Indicates how well the model fits the data it was trained on.
  • Testing Score: Indicates how well the model predicts unseen data.
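For reference, R^2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. A minimal sketch of that formula, using a hypothetical helper r2_score_manual and toy numbers chosen for illustration:

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)                 # exact predictions: 1.0
baseline = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # always the mean: 0.0
worse = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions: -3.0
print(perfect, baseline, worse)
```

The third case shows that R^2 is not bounded below by zero: predictions that are worse than the mean give a negative score.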

Expected Output

When you run the script, you can expect the following results:

Console Output

Training data shape: (32, 1), (32,)
Testing data shape: (8, 1), (8,)
Slope (Coefficient): 0.459...
Intercept: -0.017...
Training R^2 Score: 0.67
Testing R^2 Score: 0.66

Visualization

A Matplotlib window will display a scatter plot of the wave dataset, showing the relationship between the input feature and the target values.


Technical Documentation

| Component | Description |
| --- | --- |
| Library | sklearn.linear_model.LinearRegression |
| Dataset | mglearn.datasets.make_wave (n_samples=40) |
| Test Size | 20% |
| Model Parameters | coef_ (Weight), intercept_ (Bias) |
| Metric | R^2 (Coefficient of Determination) |

Key Functions Used:

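  • train_test_split(): Holds out a test set so the model is scored on data it never saw during training. This does not prevent overfitting by itself, but it makes overfitting visible as a gap between the training and testing scores.
  • model.fit(): The "learning" phase where the model calculates w and b from the training data.
  • model.score(): Returns the R^2 score. A score of 1.0 is a perfect fit, 0.0 means the model performs no better than always predicting the mean of the target, and the score can be negative when the model does worse than that.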

Contributing

Feel free to fork this repository, experiment with the n_samples parameter, or try applying a PolynomialFeatures transformation to see if you can improve the R^2 score! Please follow me before you start, and mention me in your changes so that I can see them and learn from your code.
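As a starting point for the PolynomialFeatures idea, here is a minimal sketch. The data is a made-up curved relationship used only so the example runs on its own, and degree=2 is an arbitrary choice; neither comes from this repository's script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data only: a curved relationship with noise, not the wave dataset.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=40)

# Plain straight-line fit.
linear = LinearRegression().fit(X, y)

# Expand x into [1, x, x^2] and fit a linear model on the expanded features.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
poly_r2 = poly.score(X, y)
print(f"linear R^2 (training data):     {linear_r2:.2f}")
print(f"polynomial R^2 (training data): {poly_r2:.2f}")
```

On the training data the polynomial model can never score lower than the straight line, because the straight line is a special case of it. The comparison that matters is on held-out test data, where a higher degree can just as easily hurt as help.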
