In the original linear regression algorithm, to make a prediction at a query point $x$ (i.e., to evaluate $h(x)$), we would:

1. Fit $\theta$ to minimize $\sum_i \left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
In contrast, the locally weighted linear regression algorithm does the following:

1. Fit $\theta$ to minimize $\sum_i w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
Here, the $w^{(i)}$'s are non-negative valued weights. Intuitively, if $w^{(i)}$ is large for a particular value of $i$, then in picking $\theta$, we'll try hard to make $\left(y^{(i)} - \theta^T x^{(i)}\right)^2$ small. If $w^{(i)}$ is small, then the error term $\left(y^{(i)} - \theta^T x^{(i)}\right)^2$ will be pretty much ignored in the fit.
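As an aside (a standard fact about weighted least squares, not a derivation given in this section): if we collect the $x^{(i)}$'s as the rows of a design matrix $X$, the targets into a vector $\vec{y}$, and the weights into a diagonal matrix $W$ with $W_{ii} = w^{(i)}$, the fitting step above can be written as
$$J(\theta) = (X\theta - \vec{y})^T W (X\theta - \vec{y}),$$
and, assuming $X^T W X$ is invertible, setting the gradient with respect to $\theta$ to zero gives the closed-form minimizer
$$\theta = \left(X^T W X\right)^{-1} X^T W \vec{y}.$$
This is the weighted analogue of the normal equations, and it is what a direct implementation would solve afresh at each query point.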
A fairly standard choice for the weights is
$$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right).$$
If $x$ is vector-valued, this is generalized to be $w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^T (x^{(i)} - x)}{2\tau^2}\right)$, or $w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x)}{2}\right)$, for an appropriate choice of $\tau$ or $\Sigma$.
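To get a feel for how quickly these weights decay (a small worked calculation, not from the original text): with the scalar version of the formula, a training example at distance $|x^{(i)} - x| = \tau$ from the query point gets weight $e^{-1/2} \approx 0.61$, one at distance $2\tau$ gets $e^{-2} \approx 0.14$, and one at distance $3\tau$ gets $e^{-4.5} \approx 0.011$, so examples more than a few bandwidths away contribute essentially nothing to the fit.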
Note that the weights depend on the particular point $x$ at which we're trying to evaluate $h(x)$. Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher “weight” to the (errors on) training examples close to the query point $x$. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians, and in particular the $x^{(i)}$'s are not random variables, normally distributed or otherwise.) The parameter $\tau$ controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework.
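Below is a minimal sketch of locally weighted linear regression in Python with NumPy, assuming the Gaussian-shaped weights above and the closed-form weighted normal equations mentioned earlier; the function name lwlr_predict and the synthetic data are illustrative choices of mine, not something prescribed by these notes.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query with locally weighted linear regression.

    X   : (n, d) array of training inputs (raw features; an intercept
          column of ones is appended below).
    y   : (n,) array of training targets.
    tau : bandwidth parameter controlling how fast the weights fall off.
    Returns the scalar prediction theta^T x_query.
    """
    n = X.shape[0]
    # Add an intercept term to the training inputs and the query point.
    X_aug = np.hstack([np.ones((n, 1)), X])
    x_aug = np.hstack([1.0, np.ravel(x_query)])

    # Gaussian-shaped weights: w^(i) = exp(-||x^(i) - x||^2 / (2 tau^2)).
    diffs = X - np.ravel(x_query)
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))

    # Solve the weighted normal equations (X^T W X) theta = X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X_aug.T @ W @ X_aug,   # assumes this matrix
                            X_aug.T @ W @ y)       # is invertible
    return float(x_aug @ theta)

# Tiny usage example on noisy synthetic 1-D data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = np.linspace(0, 10, 50).reshape(-1, 1)
    y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(50)
    print(lwlr_predict(np.array([3.0]), X_train, y_train, tau=0.5))
```

Note that the fit is redone for every query point, since the weights depend on $x$; this is exactly the behavior the next paragraph refers to when it calls the algorithm non-parametric.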
Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term “non-parametric” (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
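To make the contrast concrete, here is a small illustrative snippet (my own example, using hypothetical synthetic data) showing that ordinary linear regression needs only $\theta$ at prediction time, whereas locally weighted regression must keep the training set:

```python
import numpy as np

# Parametric: ordinary linear regression keeps only theta after fitting.
X = np.hstack([np.ones((50, 1)), np.linspace(0, 10, 50).reshape(-1, 1)])
y = 2.0 * X[:, 1] + 1.0
theta = np.linalg.solve(X.T @ X, X.T @ y)   # fit once; X and y can be discarded
print(theta @ np.array([1.0, 3.0]))         # a prediction needs only theta

# Non-parametric: locally weighted linear regression (e.g., the
# lwlr_predict sketch above) must be handed (X, y) at every query,
# because the weights, and hence the fit, are recomputed per query point.
```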
Let's now talk about the classification problem. This is just like the regression problem, except that the values $y$ we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which $y$ can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.