Cross Validation
Every time we build a model, we have to check whether it is functioning properly. But testing on the same data that was used to build the model is a bad idea: a model should be evaluated on data it has never seen during training. That is why, before doing anything else, we split the whole data set into training and testing sets.
Once we build a model, it is tempting to test it on the testing set over and over again until it becomes accurate enough to use. The problem is that this is also bad practice. Doing so causes the model to over-fit the testing data, so when genuinely new data comes in, it may perform poorly. This is why the testing set should only be used once throughout the entire training and testing process.
If we cannot use the testing set, how can we check whether a model is working well? We cannot use the training set either. For this we have something called cross validation, and the idea is simple. We divide the training set into k subsets (folds), train a model on all folds but one, and get an error value on the remaining fold. Then we initialize a new model, train it on a different combination of folds, and test it on the one left out. We repeat this process until every fold has been used as the held-out set exactly once. Now we have k different error values; their mean becomes the final performance score (or error value) of the model. Cross validation also lets us check which subsets of features produce high or low error, so we can select the ones with low error. The following is pseudo-code for cross validation.
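```
# k-fold cross validation (a sketch of the procedure described above)
split the training data into k folds
for i = 1 .. k:
    initialize a fresh model
    train it on every fold except fold i
    error[i] = evaluate the trained model on fold i
score = mean(error[1], ..., error[k])
```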
We can easily do cross validation with an existing library (scikit-learn). But since the idea is simple and the implementation is easy as well, let's try implementing it on our own first, and then see how to use the library. Again, we will use linear regression with housing price data.
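As a starting point, here is a minimal sketch of loading the data and setting the testing set aside. The file name 'train.csv' and the target column 'SalePrice' are assumptions (they match the Kaggle House Prices data set); adjust them to your copy of the data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the housing price data. The file name and the 'SalePrice' target
# column are assumptions based on the Kaggle House Prices data set.
data = pd.read_csv('train.csv')
y = data['SalePrice']
X = data.drop(columns=['SalePrice'])

# Set the testing set aside once; it will only be touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```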
Now, let's define the necessary functions.
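A sketch of what these might look like: get_index is the fold-splitting helper referred to later in the post, and the cross validation loop uses RMSE as the error value. The internals here are my assumptions; the original listing may have differed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def get_index(n_rows, k):
    """Split the row positions 0..n_rows-1 into k roughly equal folds."""
    indices = np.arange(n_rows)
    np.random.shuffle(indices)
    return np.array_split(indices, k)

def cross_validation(X, y, features, k=5):
    """Return the mean validation error (RMSE) over k folds.

    X: training DataFrame, y: target Series, features: list of column names.
    """
    folds = get_index(len(X), k)
    errors = []
    for i in range(k):
        # Fold i is the validation set; all other folds form the training set.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

        model = LinearRegression()  # a fresh model for each combination of folds
        model.fit(X.iloc[train_idx][features], y.iloc[train_idx])

        pred = model.predict(X.iloc[val_idx][features])
        errors.append(np.sqrt(mean_squared_error(y.iloc[val_idx], pred)))
    return np.mean(errors)
```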
Now that we have all the functions necessary for cross validation, let's pick a few sets of features to feed into the model and compare their performance. I've set up five different lists of features as below, and using the cross validation functions we just implemented, we will see which feature set produces the least error.
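The exact five lists from the original post aren't recoverable here, so the ones below are illustrative; the column names come from the Kaggle housing data, and the comparison simply prints the mean error from the cross_validation sketch above.

```python
# Five candidate feature sets (illustrative; the original post's exact
# lists may differ).
feature_sets = [
    ['LotArea'],
    ['LotArea', 'YearBuilt'],
    ['GrLivArea', 'TotalBsmtSF'],
    ['OverallQual', 'GrLivArea'],
    ['GarageArea', 'FullBath', 'HalfBath'],
]

for features in feature_sets:
    score = cross_validation(X_train, y_train, features, k=5)
    print(f'{features}: mean RMSE = {score:.2f}')
```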
It turns out that the set consisting of 'GarageArea' together with the numbers of full and half baths results in the least error. With this information, we have selected the features with which to train and evaluate the model.
The main reason we use validation sets is to avoid using the testing set multiple times, which could result in over-fitting. Without them, we would end up improving the model by fitting it to the testing set rather than to the data in general. In every case, the testing set should only be used once, at the end of the whole process.
Since cross validation is simple and easy to implement, I've done so myself, but scikit-learn already provides it, for example as KFold in the following.
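For reference, a minimal look at how KFold behaves (here on a toy array of ten row positions):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# split() yields a (train indices, validation indices) pair for each fold.
for train_idx, val_idx in kf.split(np.arange(10)):
    print(train_idx, val_idx)
```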
KFold replaces the get_index function we created above; the substitution looks like the following.
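A sketch of that substitution, rewriting the earlier cross_validation function so that KFold generates the fold indices (names reused from the sketches above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validation(X, y, features, k=5):
    """Mean validation RMSE over k folds, with KFold replacing get_index."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = []
    for train_idx, val_idx in kf.split(X):
        model = LinearRegression()
        model.fit(X.iloc[train_idx][features], y.iloc[train_idx])
        pred = model.predict(X.iloc[val_idx][features])
        errors.append(np.sqrt(mean_squared_error(y.iloc[val_idx], pred)))
    return np.mean(errors)
```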
As I've mentioned above, it is very important to use the testing set only as the last step, to check the overall performance of a model; otherwise it can lead to over-fitting.
We can use cross validation to see which combinations of features produce better performance than others, and in doing so eliminate any feature whose absence has little impact on the model, reducing the dimensionality of the data. We can then try out different hyper-parameters with the best set of features to improve the model further.
There are also other utilities related to cross validation in scikit-learn besides KFold, such as cross_val_score, that you can explore further.
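For instance, cross_val_score runs the whole k-fold loop in one call. The sketch below reuses the training data and the winning feature list from above; note that scikit-learn's scoring convention is "higher is better", so the mean squared error comes back negated.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# One score per fold, using negated MSE as the metric.
scores = cross_val_score(
    LinearRegression(),
    X_train[['GarageArea', 'FullBath', 'HalfBath']],
    y_train,
    cv=5,
    scoring='neg_mean_squared_error',
)
print(-scores.mean())  # average MSE across the five folds
```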
Thanks again for reading!