This repository contains a Python implementation of a Simple Linear Regression model. Using the synthetic wave dataset from the mglearn library, the project demonstrates how to model the relationship between a single feature and a continuous target variable.
The goal of this script is to illustrate the fundamental steps of supervised learning:
- Data Synthesis: Creating a non-linear "wave" pattern.
- Visualization: Understanding data distribution before modeling.
- Data Splitting: Partitioning data into training and testing sets to evaluate generalization.
- Modeling: Fitting a Linear Regression line to the data.
- Evaluation: Using the R^2 (coefficient of determination) metric to assess performance.
To run this code, you will need Python 3.x and the following libraries:
pip install scikit-learn mglearn matplotlib
The script utilizes mglearn.datasets.make_wave, which generates a synthetic 1D dataset. This is ideal for visualizing how a linear model attempts to capture trends in data that might have slight curvature.
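To get a feel for the shape of such data without installing mglearn, the snippet below builds a wave-like dataset using only the standard library. It is a minimal sketch: the function name make_wave_like, the formula (a linear trend plus a sine ripple plus Gaussian noise), and the value range are assumptions made for illustration, not mglearn's actual generator.

```python
import math
import random


def make_wave_like(n_samples=40, seed=42):
    """Hypothetical stand-in for a wave-shaped 1D regression dataset."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same data
    X, y = [], []
    for _ in range(n_samples):
        x = rng.uniform(-3.0, 3.0)
        # Assumed shape: linear trend + sine ripple + Gaussian noise, scaled down.
        target = (math.sin(4 * x) + x + rng.gauss(0.0, 1.0)) / 2
        X.append([x])  # one feature per sample, i.e. shape (n_samples, 1)
        y.append(target)
    return X, y


X, y = make_wave_like()
print(len(X), len(X[0]), len(y))
```

Each row of X holds a single feature value, which matches the (n_samples, 1) layout that scikit-learn estimators expect.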
- Split: The data is divided using an 80/20 split. Setting random_state=42 ensures that the results are reproducible.
- Training: The LinearRegression() model finds the optimal parameters (slope and intercept) by minimizing the Mean Squared Error (MSE).
- Mathematical Representation: The model follows the simple linear equation:

  y = wx + b

  where w is the Coefficient (slope) and b is the Intercept.
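For a single feature, the w and b that minimize the squared error have a closed form: w is the covariance of x and y divided by the variance of x, and b is chosen so the line passes through the point of means. The sketch below illustrates that calculation in plain Python. The helper name fit_simple_linear and the toy data are assumptions for illustration, and this is not how scikit-learn implements LinearRegression internally.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for one feature (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx            # slope
    b = mean_y - w * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return w, b


# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_simple_linear(xs, ys)
print(w, b)  # 2.0 1.0
```

On noisy data such as the wave dataset, the same formulas return the line with the smallest mean squared error rather than an exact fit.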
The model is evaluated using the R^2 Score:
- Training Score: Indicates how well the model fits the data it was trained on.
- Testing Score: Indicates how well the model predicts unseen data.
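The R^2 score compares the model's squared error with that of a baseline that always predicts the mean of the targets: R^2 = 1 - SS_res / SS_tot. The following is a minimal sketch of that formula. The helper name r_squared and the toy values are assumptions for illustration, not part of the script.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean-baseline error
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # 1.0 (perfect predictions)
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0 (always predicting the mean)
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```

A training score far above the testing score suggests overfitting, while two similar, moderate scores suggest the model is too simple for the data.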
When you run the script, you can expect the following results:
Training data shape: (32, 1), (32,)
Testing data shape: (8, 1), (8,)
Slope (Coefficient): 0.459...
Intercept: -0.017...
Training R^2 Score: 0.67
Testing R^2 Score: 0.66
A Matplotlib window will display a scatter plot of the wave dataset, showing the relationship between the input feature and the target values.
| Component | Description |
|---|---|
| Library | sklearn.linear_model.LinearRegression |
| Dataset | mglearn.datasets.make_wave (n_samples=40) |
| Test Size | 20% |
| Model Parameters | coef_ (Weight), intercept_ (Bias) |
| Metric | R^2 (Coefficient of Determination) |
- train_test_split(): Holds out a test set so that overfitting can be detected on data the model never saw during training.
- model.fit(): The "learning" phase where the model calculates w and b.
- model.score(): Returns the R^2 score. A score of 1.0 is a perfect fit, 0.0 indicates the model performs no better than predicting the mean, and negative values indicate it performs worse than that.
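Conceptually, a shuffled hold-out split shuffles the sample indices with a fixed seed and sets the last portion aside for testing. The sketch below illustrates that idea with the standard library only. The helper shuffle_split and its rounding rule are assumptions for illustration. It is not scikit-learn's implementation, and it will not reproduce the exact rows that train_test_split selects.

```python
import random


def shuffle_split(X, y, test_size=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # same seed -> same split every run
    n_test = round(len(X) * test_size)
    split_at = len(indices) - n_test
    train_idx, test_idx = indices[:split_at], indices[split_at:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# 40 toy samples with one feature each, as in the dataset size used here.
X = [[float(i)] for i in range(40)]
y = [float(i) for i in range(40)]
X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(len(X_train), len(X_test))  # 32 8
```

Because features and targets are indexed with the same shuffled indices, each X row stays paired with its y value on both sides of the split.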
Feel free to fork this repository, experiment with the n_samples parameter, or try applying a PolynomialFeatures transformation to see if you can improve the R^2 score! Please follow me before you do, and don't forget to mention me so that I can see your changes and perhaps learn from your code.