Cross Validation

Intro

Every time we build a model, we have to check whether it actually works. Testing it on the same data that was used to build it is a bad idea: to see whether a model generalizes, it should be evaluated on data it has never seen during training. That is why, before doing anything else, we split the whole data set into training and testing sets.

A tempting workflow is to build a model, evaluate it on the testing set, tweak it, and repeat until the score looks good enough. The problem is that this is also bad practice. Tuning against the testing set over and over makes the model over-fit to that particular data, so it may perform poorly when genuinely new data comes in. This is why the testing set should only be used once, at the very end of training and evaluation. For the repeated checks we instead carve a validation set out of the training data, which is exactly what cross validation does; the pseudocode below sketches the k-fold version.

cross_validation(x, y, k):

    errors = []

    for each of the k folds:

        validation = one chunk of x
        v_label = the matching chunk of y

        training = the remaining k-1 chunks of x
        t_label = the remaining k-1 chunks of y

        train the model on (training, t_label)
        predict on validation and compute the loss
        append the loss to errors

    return the mean of errors

Coding

We can easily do cross validation with an existing library (scikit-learn). But since the idea is simple and the implementation is easy too, let's try implementing it on our own first, and then see how to use the library. Again, we will use linear regression with the housing price data.

Import Libraries, Load, and Prep Data

import numpy as np
import pandas as pd
from math import ceil
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

X = train
y = train.loc[:, ['SalePrice']]

# set split size to 0.8; train_test_split returns the feature splits first, then the label splits
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
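As a quick sanity check (optional), we can confirm the 80/20 split by printing the shapes:

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)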

Now, let's define the necessary functions.

# Root Mean Squared Error
def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat)**2))

def get_index(x, size):
    # Build pairs of (training indices, validation indices),
    # walking through the rows in chunks of `size`.
    index = []
    validation = []
    r = set(range(len(x)))

    for i in range(len(x)):
        validation.append(i)

        # once a chunk is full (or we hit the last row), everything
        # outside the chunk becomes the training portion
        if (i + 1) % size == 0 or i == len(x) - 1:
            training = list(r - set(validation))
            index.append([np.array(training), np.array(validation)])
            validation = []

    return index
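As a quick side check (purely illustrative, not part of the pipeline), we can look at what get_index produces for ten rows with a chunk size of 4: three (training, validation) index pairs, each row falling into exactly one validation chunk, with the last chunk shorter than the others.

toy = list(range(10))
for training, validation in get_index(toy, 4):
    print('training:', training, 'validation:', validation)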

def cross_validation(x, y, k=5):

    # each validation chunk holds roughly len(x) / k rows
    size = ceil(len(x) / k)

    # list of [training indices, validation indices] pairs
    index = get_index(x, size)
    error = np.array([])

    model = LinearRegression()

    for training, validation in index:

        x_tr = x.iloc[training]
        y_tr = y.iloc[training]

        x_val = x.iloc[validation]
        y_val = y.iloc[validation]

        model.fit(x_tr, y_tr)
        pred = model.predict(x_val)

        err = rmse(y_val, pred)
        error = np.append(error, err)

    return error.mean()

Now that we have all the functions necessary for cross validation, let's compare a few candidate feature sets for the model. I've set up five different lists of features below and, using the cross_validation function we just implemented, we will see which feature set produces the least error.

features = [['LotArea', 'YearBuilt', 'FullBath'],
            ['LotArea', 'GarageArea', 'PoolArea'],
            ['GarageArea', 'FullBath', 'HalfBath'],
            ['LotArea', 'YearBuilt', 'GarageArea'],
            ['LotArea', 'GarageArea', 'FullBath']]
errors = []

for f in features:
    errors.append(cross_validation(X[f], y))

least_error = min(errors)
best_feature_index = errors.index(least_error)
best_feature_set = features[best_feature_index]
for f, loss in zip(features, errors):
    print('Features: {} with Loss: {}'.format(f, loss))

print('\nBest Features: {}'.format(best_feature_set))
Features: ['LotArea', 'YearBuilt', 'FullBath'] with Loss: 60177.62086438129
Features: ['LotArea', 'GarageArea', 'PoolArea'] with Loss: 61322.452532656185
Features: ['GarageArea', 'FullBath', 'HalfBath'] with Loss: 54635.92474394878
Features: ['LotArea', 'YearBuilt', 'GarageArea'] with Loss: 57546.42623587614
Features: ['LotArea', 'GarageArea', 'FullBath'] with Loss: 55443.39583203288

Best Features: ['GarageArea', 'FullBath', 'HalfBath']

It turns out that the combination of 'GarageArea' with the number of full and half baths ('FullBath', 'HalfBath') results in the least error. With this information, we have selected the features we will use to train and evaluate the model.

The main reason we use a validation set is to avoid touching the testing set multiple times, which would lead to over-fitting. Without it, we would end up improving the model by fitting it to the testing set rather than to genuinely unseen data. In every case, the testing set should only be used once, at the end of the whole process.
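To make that concrete, here is a minimal sketch of that single final step, assuming we stick with plain LinearRegression and reuse the X_train / X_test split and the best_feature_set found above:

# train once on the training portion using the chosen features
final_model = LinearRegression()
final_model.fit(X_train[best_feature_set], y_train)

# the one and only evaluation on the held-out testing set
pred = final_model.predict(X_test[best_feature_set])
print('Test RMSE:', rmse(y_test.values, pred))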

Since cross validation is simple and easy to implement, I've done it myself above, but scikit-learn already provides an implementation for it, such as KFold shown in the following.

from sklearn.model_selection import KFold

This class replaces the get_index function we created above. It can be substituted as shown next.

def cross_validation(x, y, k=5):

    # n_splits is the number of folds to split the data into
    kf = KFold(n_splits=k)

    error = np.array([])

    model = LinearRegression()

    # kf.split(x) yields the training and validation indices,
    # just like the get_index function we wrote above
    for training, validation in kf.split(x):

        x_tr = x.iloc[training]
        y_tr = y.iloc[training]

        x_val = x.iloc[validation]
        y_val = y.iloc[validation]

        model.fit(x_tr, y_tr)
        pred = model.predict(x_val)

        err = rmse(y_val, pred)
        error = np.append(error, err)

    return error.mean()

Ending

As I've mentioned above, it is very important to use the testing set only as the last step, to check the overall performance of a model; otherwise we risk over-fitting to it.

We can use cross validation to see which combinations of features perform better than others, and in doing so we can drop any feature whose absence barely affects the model, reducing the dimensionality of the data. With the best set of features in hand, we can then try out different hyper-parameters to improve the model further.
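As a rough illustration of the hyper-parameter idea (Ridge regression and the alpha values here are stand-ins chosen for the sketch, not part of the model used above), we could reuse KFold on the best feature set:

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

for alpha in [0.1, 1.0, 10.0]:
    fold_errors = []

    for training, validation in kf.split(X[best_feature_set]):
        # fit a ridge model with the current alpha on the training chunks
        model = Ridge(alpha=alpha)
        model.fit(X[best_feature_set].iloc[training], y.iloc[training])

        # score it on the held-out validation chunk
        pred = model.predict(X[best_feature_set].iloc[validation])
        fold_errors.append(rmse(y.iloc[validation].values, pred))

    print('alpha: {} with Loss: {}'.format(alpha, np.mean(fold_errors)))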

scikit-learn also provides other cross-validation utilities besides KFold that you can explore further.
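For example, cross_val_score runs the whole fit-and-score loop in a single call. A minimal sketch (the 'neg_root_mean_squared_error' scorer is available in recent versions of scikit-learn and comes back negated, so we flip the sign):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X[best_feature_set],
                         y.values.ravel(), cv=5,
                         scoring='neg_root_mean_squared_error')
print(-scores.mean())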

Thanks again for reading!
