
So both versions are actually done and used quite often. You can either just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or, having chosen the degree of the polynomial you want to fit, you can go back and retrain the model on the entire 100 percent of your data. Both of these are commonly done.
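To make the hold-out procedure concrete, here is a minimal NumPy sketch of choosing a polynomial degree with a 70-30 split. The synthetic data, the candidate degrees, and the final retraining step are illustrative assumptions, not details from the lecture:

```python
import numpy as np

# Illustrative synthetic data; the true function and noise level are assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(100)

split = int(0.7 * len(x))            # train on 70 percent, hold out 30 percent
x_train, y_train = x[:split], y[:split]
x_val, y_val = x[split:], y[split:]

best_degree, best_err = None, np.inf
for degree in range(1, 9):                        # candidate polynomial degrees
    coeffs = np.polyfit(x_train, y_train, degree)      # fit on the 70 percent
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if val_err < best_err:
        best_degree, best_err = degree, val_err

# Option 1: keep the hypothesis already trained on 70 percent of the data.
# Option 2 (shown): retrain the chosen degree on the entire 100 percent.
final_coeffs = np.polyfit(x, y, best_degree)
```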

One issue with hold-out cross validation is that it sort of wastes data. In many machine-learning applications we have very little data, or every training example was painfully acquired at great cost. Sometimes your data comes from medical experiments, and each training example represents a sick patient and some amount of physical human pain. So if you say, "Well, I'm going to hold out 30 percent of your data set just to select my model," that sometimes causes unhappiness, and you may prefer not to leave out 30 percent of your data just to do model selection.

So there are a couple of other variations on hold-out cross validation that sometimes make slightly more efficient use of the data. One is called k-fold cross validation. Here's the idea: I'm going to take all of my data S; so imagine I draw this box S to denote the entirety of the data I have. I'll then divide it into k pieces; it's five pieces in what I've drawn. Then what I'll do is repeatedly train on k minus one of the pieces and test on the remaining piece, and then average over the k results.

Said another way, I'll hold out just 1/5 of my data, train on the remaining 4/5, and test on that first piece. Then I'll hold out the second 1/5, train on the remaining pieces, and test on that; then remove the third piece, train on the other 4/5, and so on. I'm going to do this five times, then take the five error measures I have and average them. That average gives me an estimate of the generalization error of my model; okay? And then, again, when you do k-fold cross validation, usually you go back and retrain the model you selected on the entirety of your training set. I drew five pieces here because that was easier to draw, but k equals 10 is very common; I should say k equals 10 is the fairly common choice, so people usually do 10-fold cross validation.
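Here is a minimal NumPy sketch of the k-fold procedure just described; the function name and the up-front shuffling step are my own illustrative choices, not from the lecture:

```python
import numpy as np

def kfold_cv_error(x, y, degree, k=5):
    """Estimate the generalization error of a degree-`degree` polynomial
    fit by k-fold cross validation. (Function name is illustrative.)"""
    idx = np.random.default_rng(0).permutation(len(x))   # shuffle once
    folds = np.array_split(idx, k)                       # k roughly equal pieces
    errors = []
    for i in range(k):
        val = folds[i]                                   # hold out the i-th piece
        train = np.concatenate(folds[:i] + folds[i + 1:])  # train on the other k-1
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errors)                               # average the k error measures
```

You would call this once per candidate degree, pick the degree with the lowest average error, and then, as just described, retrain that model on the entirety of the data.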

The advantage of this over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out 1/10 of your data, rather than, say, 30 percent of it. I should say that in simple hold-out cross validation a 70-30 split is fairly common; sometimes a 2/3-1/3 split is used as well. If you use k-fold cross validation, k equals 5 is used, or more commonly k equals 10, which is the most common choice. The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model you now need to train it ten times instead of just once; so you'd run logistic regression ten times per model, rather than just once. So this is computationally more expensive, but k equals 10 works great.
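In practice you rarely write the folds by hand; as a hedged sketch, assuming scikit-learn is available, the whole degree-selection loop with 10-fold cross validation might look like this (the data and candidate degrees are again illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same illustrative data as in the earlier sketches, not the lecture's.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)

# cv=10 means each candidate model is trained ten times: that is exactly
# the extra computational cost of 10-fold cross validation discussed above.
cv_mse = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()          # scikit-learn reports negated MSE

best_degree = min(cv_mse, key=cv_mse.get)    # then retrain on all the data
```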

Source: OpenStax, Machine Learning. OpenStax CNX, Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4