# Linear Regression

![](/files/-LgUe-Y3FbFfCrEU3M7l)

## What is it?

Simply put, linear regression builds a model of a line that fits the data samples with the least loss.

To do so, the model has to figure out the relation, if one exists, between the independent values (x) and the dependent values (y). The two could rise together, move in opposite directions, or have no relation at all.

As with other machine learning problems, it is impossible to predict something with no errors. Our goal is therefore to build a model with the least possible loss, where the loss is computed from the difference between the actual and the predicted values.

One naive loss function can be $$loss = \sum\_{i=1}^N(y\_i - \hat y\_i)$$ where $$y\_i$$ is the actual value, $$\hat y\_i$$ is the predicted value and $$N$$ is the number of samples. \
In this post, we will use the Mean Squared Error function $$MSE = \frac{1}{N}\lVert \hat y - y \rVert\_2^2$$.
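To see why the naive loss is a poor choice, here is a minimal sketch with made-up numbers (the arrays `y` and `y_hat` are hypothetical, not taken from the dataset). Every prediction is off by one, yet the signed errors cancel out, while the squared errors in MSE do not.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([2.0, 1.0, 4.0, 3.0])  # every prediction is off by 1

# Naive loss: positive and negative errors cancel each other out
naive_loss = np.sum(y - y_hat)

# MSE: squaring makes every error non-negative before averaging
mse_loss = np.mean((y_hat - y) ** 2)

print(naive_loss)  # 0.0
print(mse_loss)    # 1.0
```

A naive loss of zero would suggest a perfect model here, which is clearly not the case.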

## Data Exploration

Let's say we would like to know the relation between the height and weight of a person.

We would expect that the taller a person is, the more they weigh. Let's find out if this is true.

The data we are going to use is the [weight-height dataset](https://www.kaggle.com/mustafaali96/weight-height) on Kaggle, uploaded by Mustafa Ali.

```python
data[data['Gender'] == 'Male'].head(2)
```

|   | Gender | Height    | Weight     |
| - | ------ | --------- | ---------- |
| 0 | Male   | 73.847017 | 241.893563 |
| 1 | Male   | 68.781904 | 162.310473 |

```python
data[data['Gender'] == 'Female'].head(2)
```

|      | Gender | Height    | Weight     |
| ---- | ------ | --------- | ---------- |
| 5000 | Female | 58.910732 | 102.088326 |
| 5001 | Female | 65.230013 | 141.305823 |

```python
data.shape
```

```
(10000, 3)
```

We have 10,000 data samples with gender, height and weight features.

```python
data.describe()
```

|       | Height       | Weight       |
| ----- | ------------ | ------------ |
| count | 10000.000000 | 10000.000000 |
| mean  | 66.367560    | 161.440357   |
| std   | 3.847528     | 32.108439    |
| min   | 54.263133    | 64.700127    |
| 25%   | 63.505620    | 135.818051   |
| 50%   | 66.318070    | 161.212928   |
| 75%   | 69.174262    | 187.169525   |
| max   | 78.998742    | 269.989699   |

![](/files/-LgUfAdO3EIj_eXduJkv)

It seems that our assumption is right: the weight increases as the height does. Also, by the looks of it, we could ignore gender and treat the samples as one bigger group, since a single line would still fit pretty decently.

If we zoom out and view the height and weight samples (of male and female), it looks like this.

![](/files/-LgUfFO_CydCsfh5Dex9)

Clearly, a line that goes through the origin cannot fit the samples, so the model needs an intercept (b) in addition to a slope (w).
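As a rough illustration of this point, the sketch below compares a line forced through the origin with a line that has an intercept. The samples are synthetic stand-ins (the slope `7.0`, intercept `-300.0` and noise level are assumptions made for this example, not values from the Kaggle data), and `np.polyfit` is used only as a quick least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the height/weight samples (all values are assumptions)
x = rng.normal(66.0, 4.0, size=500)
y = 7.0 * x - 300.0 + rng.normal(0.0, 10.0, size=500)

# Line through the origin, y = w * x: least-squares solution for w
w_origin = np.sum(x * y) / np.sum(x * x)
mse_origin = np.mean((y - w_origin * x) ** 2)

# Line with an intercept, y = w * x + b
w, b = np.polyfit(x, y, deg=1)
mse_intercept = np.mean((y - (w * x + b)) ** 2)

print('MSE through origin :', mse_origin)
print('MSE with intercept :', mse_intercept)
```

The line with an intercept can never do worse, because setting b to zero recovers the other model.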

## Code

### Using Gradient Descent

As mentioned in the [Gradient Descent](https://tlr.gitbook.io/data-science/machine-learning-basics/gradient-descent) post, we first have to choose which loss function we are going to use and define its partial derivatives.

Let's reuse the code from that post and try running gradient descent.

```python
import numpy as np

# Loss function: mean squared error of the line y = x * w + b
def mse(y, x, w, b):

    return np.mean((y - (x * w + b))**2)

# Partial derivative of the loss with respect to w
def partial_w(y, x, w, b):

    return -2 * np.mean((y - (x * w + b)) * x)

# Partial derivative of the loss with respect to b
def partial_b(y, x, w, b):

    return -2 * np.mean(y - (x * w + b))
```
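Before running gradient descent, the partial derivatives can be sanity-checked against a numerical approximation. The sketch below repeats the three functions so that it runs on its own; the toy samples and the values of `w`, `b` and `eps` are hypothetical choices for the check.

```python
import numpy as np

def mse(y, x, w, b):
    return np.mean((y - (x * w + b)) ** 2)

def partial_w(y, x, w, b):
    return -2 * np.mean((y - (x * w + b)) * x)

def partial_b(y, x, w, b):
    return -2 * np.mean(y - (x * w + b))

# Hypothetical toy samples and parameters, used only for the check
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)
w, b, eps = 0.5, -0.2, 1e-6

# Central differences approximate the slope of the loss along each parameter
numeric_dw = (mse(y, x, w + eps, b) - mse(y, x, w - eps, b)) / (2 * eps)
numeric_db = (mse(y, x, w, b + eps) - mse(y, x, w, b - eps)) / (2 * eps)

print(np.isclose(numeric_dw, partial_w(y, x, w, b)))
print(np.isclose(numeric_db, partial_b(y, x, w, b)))
```

If either comparison fails, the analytic derivative most likely has a sign or factor mistake.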

```python
x = data['Height']
y = data['Weight']

w = b = 0

learning_rate = 1e-3

loss = []

for i in range(1000):

    dw = partial_w(y, x, w, b)
    db = partial_b(y, x, w, b)

    w = w - dw * learning_rate
    b = b - db * learning_rate

    if i % 100 == 0:

        l = mse(y, x, w, b)

        print('Loss :', l)

        loss.append(l)
```

```
Loss : 1631769.5829055535
Loss : 1.218227932815399e+185
Loss : inf
Loss : inf


/home/han/anaconda3/envs/py/lib/python3.7/site-packages/ipykernel_launcher.py:15: RuntimeWarning: invalid value encountered in double_scalars
  from ipykernel import kernelapp as app


Loss : nan
Loss : nan
Loss : nan
Loss : nan
Loss : nan
Loss : nan
```

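We see that the loss goes to infinity and then becomes nan. This usually happens when the x and y values are not small: the gradients become huge relative to the learning rate, so every update overshoots the minimum and the loss keeps growing.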
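The divergence can also be estimated up front. For a quadratic loss such as MSE, gradient descent is only stable when the learning rate is below 2 divided by the largest eigenvalue of the Hessian. The sketch below is a rough check under assumptions: it uses synthetic heights drawn with the mean and standard deviation from the `describe()` table rather than the real samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heights resembling the describe() output (an assumption,
# not the real dataset)
x = rng.normal(66.37, 3.85, size=10_000)

# Hessian of the MSE with respect to (w, b); it depends only on x
hessian = 2 * np.array([[np.mean(x ** 2), np.mean(x)],
                        [np.mean(x),      1.0]])

# Largest learning rate for which plain gradient descent stays stable
max_stable_lr = 2 / np.linalg.eigvalsh(hessian).max()

print(max_stable_lr)         # on the order of 2e-4
print(1e-3 < max_stable_lr)  # False
```

Under these assumptions, the learning rate of 1e-3 used above is several times too large for the unscaled heights, which is consistent with the exploding loss.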

There are two common ways to rescale the data and fix this: one is normalization and the other is standardization.

$$
normalization = \frac{x - min\_x}{max\_x - min\_x}
$$

$$
standardization = \frac{x - \mu\_x}{\sigma\_x}
$$

Let's use both and compare.

```python
def gradient_descent(x, y, verbose=True, epochs=10000):

    losses = []

    w = b = 0

    iter_ver = epochs*.1

    for i in range(epochs):

        dw = partial_w(y, x, w, b)
        db = partial_b(y, x, w, b)

        w = w - dw * learning_rate
        b = b - db * learning_rate

        if (i+1) % iter_ver == 0:

            loss = mse(y, x, w, b)

            losses.append(loss)

            if verbose:

                print(f'Epoch : {i+1} Loss : {loss}')

    return w, b, losses
```

```python
x = data['Height']
y = data['Weight']

norm_x = (x - x.min()) / (x.max() - x.min())
norm_y = (y - y.min()) / (y.max() - y.min())

std_x = (x - x.mean()) / x.std()
std_y = (y - y.mean()) / y.std()

norm_w, norm_b, norm_losses = gradient_descent(norm_x, norm_y, verbose=False)
std_w, std_b, std_losses = gradient_descent(std_x, std_y, verbose=False)
```

![](/files/-LgUfKJ6rz4d7lTFBsx2)

We see that standardization converged faster than normalization. As shown, the speed of convergence depends on which scaling method we choose. However, this does not mean that the choice is arbitrary. Some cases (or models) prefer normalization over standardization and vice versa.

One example is when we work with an SVM model. In this case, standardization will be better to maximize the margin between two classes. More details will be in another post.
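As for why standardization converged faster here, one possible explanation is the condition number of the Hessian of the loss: the closer it is to 1, the more evenly gradient descent makes progress on w and b. The sketch below is only an illustration on synthetic heights (an assumption, not the real samples), and `condition_number` is a helper introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heights resembling the dataset (an assumption)
x = rng.normal(66.37, 3.85, size=10_000)

def condition_number(x):
    # Hessian of the MSE with respect to (w, b)
    hessian = 2 * np.array([[np.mean(x ** 2), np.mean(x)],
                            [np.mean(x),      1.0]])
    eigenvalues = np.linalg.eigvalsh(hessian)
    # Ratio of the largest to the smallest curvature of the loss surface
    return eigenvalues.max() / eigenvalues.min()

norm_x = (x - x.min()) / (x.max() - x.min())
std_x = (x - x.mean()) / x.std()

print('normalized   :', condition_number(norm_x))  # noticeably larger than 1
print('standardized :', condition_number(std_x))   # close to 1
```

Standardized inputs have zero mean and unit variance, so the loss surface is close to a round bowl, while normalized inputs leave w and b strongly coupled.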

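Since we trained on standardized samples, we have to standardize new samples in the same way, using the mean and standard deviation of the training data, before we predict. The prediction is also in standardized units, so it has to be converted back to the original scale.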
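A minimal sketch of such a prediction could look like the following. The function `predict_weight` and all numbers passed to it are hypothetical; in practice the statistics would come from `x.mean()`, `x.std()`, `y.mean()`, `y.std()` of the training data and the weights from `gradient_descent`.

```python
import numpy as np

def predict_weight(height, std_w, std_b, x_mean, x_std, y_mean, y_std):
    # Scale the new input with the statistics of the training data
    std_height = (height - x_mean) / x_std
    # Predict in standardized units
    std_pred = std_height * std_w + std_b
    # Undo the standardization of y to return to the original unit
    return std_pred * y_std + y_mean

# Hypothetical values for illustration only
pred = predict_weight(70.0, std_w=0.9, std_b=0.0,
                      x_mean=66.0, x_std=4.0, y_mean=160.0, y_std=32.0)
print(pred)  # about 188.8

# Arrays work as well, since every step is element-wise
print(predict_weight(np.array([62.0, 66.0]), 0.9, 0.0, 66.0, 4.0, 160.0, 32.0))
```

Note that the statistics must not be recomputed on the new samples, otherwise the model would see inputs on a different scale than it was trained on.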

## Misc.

The linear regression we used is Ordinary Least Squares, but there are other kinds of linear regression as well, such as

1. Weighted Least Squares
2. Generalized Least Squares
3. Ridge Regression
4. Lasso Regression
5. Elastic Net Regression

There are also other forms not mentioned here. The last three are regularized regressions, which will be covered in a separate post.

It is also possible to have a regression model whose fitted line is actually not a straight line!

For example, let's say we have the following samples.

![](/files/-LgUfP366ONArNWs9ENX)

If we use the straight-line model from above, we will get a line just like this.

![](/files/-LgUfRVfsEE-SEal0Zeh)

The equation used to generate the samples is $$y = x \* \theta\_1 + \sin(x \* \theta\_2)$$ where $$\theta\_1$$ and $$\theta\_2$$ are our new weights.

Since we have two different weights, the loss has one partial derivative for each of them. With MSE as the loss, the equations are

$$
\frac{\partial MSE}{\partial \theta\_1} = \frac{-2}{N}\sum\_{i=1}^N x\_i \left(y\_i - \theta\_1 x\_i - \sin(\theta\_2 x\_i)\right)
$$

$$
\frac{\partial MSE}{\partial \theta\_2} = \frac{-2}{N}\sum\_{i=1}^N \left(y\_i - \theta\_1 x\_i - \sin(\theta\_2 x\_i)\right) x\_i \cos(\theta\_2 x\_i)
$$

```python
def model(x, theta):
    return x*theta[0] + np.sin(x*theta[1])

def grad_dt1(x, y, theta):

    return -2 * np.mean(x * (y - theta[0] * x - np.sin(theta[1] * x)) )

def grad_dt2(x, y, theta):

    return -2 * np.mean( (y - (theta[0] * x + np.sin(theta[1] * x))) * (x * np.cos(theta[1] * x)) )

def nonlinear_gd(x, y, theta, learning_rate=0.01):

    iter_num = 3000

    for i in range(iter_num):

        dt1 = grad_dt1(x, y, theta)
        dt2 = grad_dt2(x, y, theta)

        dtheta = np.array([dt1, dt2])
        theta = theta - learning_rate * dtheta

    return theta
```
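The two gradient functions can be checked numerically in the same spirit as any hand-derived derivative. The sketch below is self-contained and makes a few assumptions: the samples are synthetic, generated with theta = (1, 1), the helper `loss` is introduced for the check, and the gradients are rewritten in terms of `model` but compute the same quantities as above.

```python
import numpy as np

def model(x, theta):
    return x * theta[0] + np.sin(x * theta[1])

def loss(x, y, theta):
    # MSE of the nonlinear model (helper added for this check)
    return np.mean((y - model(x, theta)) ** 2)

def grad_dt1(x, y, theta):
    return -2 * np.mean(x * (y - model(x, theta)))

def grad_dt2(x, y, theta):
    return -2 * np.mean((y - model(x, theta)) * (x * np.cos(theta[1] * x)))

# Hypothetical samples generated with theta = (1, 1) plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = model(x, np.array([1.0, 1.0])) + rng.normal(scale=0.1, size=200)

theta = np.array([0.4, 0.7])  # arbitrary point at which to compare gradients
eps = 1e-6
numeric = np.zeros(2)
for k in range(2):
    step = np.zeros(2)
    step[k] = eps
    # Central difference along the k-th weight
    numeric[k] = (loss(x, y, theta + step) - loss(x, y, theta - step)) / (2 * eps)

analytic = np.array([grad_dt1(x, y, theta), grad_dt2(x, y, theta)])
print(np.allclose(numeric, analytic, atol=1e-5))
```

Because this loss is not convex in the second weight, a correct gradient does not guarantee that gradient descent reaches the best weights from every starting point.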

```python
theta = np.array([0, 0])
theta = nonlinear_gd(x, y, theta, learning_rate=0.01)
theta
```

```
array([0.99936438, 1.03895846])
```

![](/files/-LgUfTk8lngXL8C4m54T)

Although it requires us to know which model was used to generate the samples, it is possible to fit a curve to nonlinear data.

## Conclusion

This post only deals with basic linear regression, without any regularization such as Lasso, Ridge or Elastic Net. There are also many versions of it besides Ordinary Least Squares. These topics will be covered in later posts.

You can find the full code [here](https://github.com/hsong1101/ML-Tutorials/blob/master/Supervised/Linear%20Regression/LinearRegression_FullCode.ipynb).

Thank you all for reading and if you find any errors or typos or have any suggestions, please let me know.

