
We return to the topic of classification, and we assume an input (feature) space $\mathcal{X}$ and a binary output (label) space $\mathcal{Y} = \{0, 1\}$. Recall that the Bayes classifier (which minimizes the probability of misclassification) is defined by

$$f^*(x) = \begin{cases} 1, & P(Y = 1 \mid X = x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}$$

Throughout this section, we will denote the conditional probability function by

$$\eta(x) := P(Y = 1 \mid X = x).$$

Plug-in classifiers

One way to construct a classifier from the training data $\{X_i, Y_i\}_{i=1}^n$ is to estimate $\eta(x)$ and then plug the estimate into the form of the Bayes classifier. That is, obtain an estimate

$$\hat{\eta}_n(x) = \eta\big(x; \{X_i, Y_i\}_{i=1}^n\big)$$

and then form the "plug-in" classification rule

$$\hat{f}(x) = \begin{cases} 1, & \hat{\eta}_n(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}$$

The function $\eta(x)$ is generally more complicated than the ultimate (binary-valued) classification rule, as we can see by comparing

$$\eta: \mathcal{X} \to [0,1] \qquad \text{vs.} \qquad f: \mathcal{X} \to \{0,1\}.$$
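To make the recipe concrete, here is a minimal Python sketch (our own illustration, not part of the original lecture): any estimator of $\eta$ becomes a classifier by thresholding at $1/2$. The logistic-form estimate and its parameters below are made-up values chosen purely for the demo.

```python
import numpy as np

def plug_in_classifier(eta_hat, X):
    """Plug-in rule: f_hat(x) = 1 if eta_hat(x) >= 1/2, else 0.

    eta_hat : callable mapping an (n, d) array to estimates of P(Y=1 | X=x).
    """
    return (eta_hat(X) >= 0.5).astype(int)

# Hypothetical estimate of eta, e.g. from a logistic-regression fit
# (weights w and bias b are invented for illustration only).
w, b = np.array([2.0, -1.0]), 0.25
eta_hat = lambda X: 1.0 / (1.0 + np.exp(-(X @ w + b)))

X_new = np.random.default_rng(0).random((5, 2))
print(plug_in_classifier(eta_hat, X_new))   # array of 0/1 labels
```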

Therefore, in this sense plug-in methods solve a more complicated problem than necessary. However, plug-in methods can perform well, as demonstrated by the next result.

Theorem

Plug-in classifier

Let $\tilde{\eta}$ be an approximation to $\eta$, and consider the plug-in rule

$$f(x) = \begin{cases} 1, & \tilde{\eta}(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}$$

Then,

$$R(f) - R^* \le 2\,\mathbb{E}\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big]$$

where

$$R(f) = P(f(X) \neq Y), \qquad R^* = R(f^*) = \inf_f R(f).$$

Proof. Consider any $x \in \mathbb{R}^d$. In proving the optimality of the Bayes classifier $f^*$ in Lecture 2, we showed that

$$P\big(f(x) \neq Y \mid X = x\big) - P\big(f^*(x) \neq Y \mid X = x\big) = \big(2\eta(x) - 1\big)\Big(\mathbb{1}_{\{f^*(x) = 1\}} - \mathbb{1}_{\{f(x) = 1\}}\Big),$$

which is equivalent to

$$P\big(f(x) \neq Y \mid X = x\big) - P\big(f^*(x) \neq Y \mid X = x\big) = \big|2\eta(x) - 1\big|\,\mathbb{1}_{\{f^*(x) \neq f(x)\}},$$

since $f^*(x) = 1$ whenever $2\eta(x) - 1 > 0$; hence, when $f^*(x) \neq f(x)$, the difference of indicators has the same sign as $2\eta(x) - 1$. Thus,

$$\begin{aligned} P(f(X) \neq Y) - R^* &= \int_{\mathbb{R}^d} 2\,\big|\eta(x) - 1/2\big|\,\mathbb{1}_{\{f^*(x) \neq f(x)\}}\, p_X(x)\, dx \\ &\le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\,\mathbb{1}_{\{f^*(x) \neq f(x)\}}\, p_X(x)\, dx \\ &\le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\, p_X(x)\, dx \\ &= 2\,\mathbb{E}\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big], \end{aligned}$$

where $p_X(x)$ is the marginal density of $X$, and

where the first inequality follows from the fact

$$f(x) \neq f^*(x) \;\Longrightarrow\; \big|\eta(x) - \tilde{\eta}(x)\big| \ge \big|\eta(x) - 1/2\big|$$

(if $f$ and $f^*$ disagree at $x$, then $\eta(x)$ and $\tilde{\eta}(x)$ lie on opposite sides of $1/2$, so $\tilde{\eta}(x)$ is at least as far from $\eta(x)$ as $1/2$ is), and the second inequality is simply a result of the fact that $\mathbb{1}_{\{f^*(x) \neq f(x)\}}$ is either 0 or 1.

Pictorial illustration of $|\eta(x) - \tilde{\eta}(x)| \ge |\eta(x) - 1/2|$ when $f(x) \neq f^*(x)$. Note that the inequality $$P(f(X) \neq Y) - R^* \le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\,\mathbb{1}_{\{f^*(x) \neq f(x)\}}\, p_X(x)\, dx$$ shows that the excess risk is at most twice the integral over the set where $f^*(x) \neq f(x)$. The difference $|\eta(x) - \tilde{\eta}(x)|$ may be arbitrarily large away from this set without affecting the error rate of the classifier. This illustrates the fact that estimating $\eta$ well everywhere (i.e., regression) is unnecessary for the design of a good classifier (we only need to determine where $\eta$ crosses the $1/2$ level). In other words, "classification is easier than regression."

The theorem shows us that a good estimate of $\eta$ can produce a good plug-in classification rule. By a "good" estimate, we mean an estimator $\tilde{\eta}$ that is close to $\eta$ in expected $L_1$-norm.
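As a sanity check on the theorem (a toy construction of our own, not from the source), take $X$ uniform on $[0,1]$ with $\eta(x) = x$, and a deliberately shifted approximation $\tilde{\eta}(x) = \min(x + 0.2,\, 1)$. A short Monte Carlo estimate confirms that the excess risk sits well below $2\,\mathbb{E}[|\eta(X) - \tilde{\eta}(X)|]$, illustrating both the bound and its slack:

```python
import numpy as np

rng = np.random.default_rng(0)

eta = lambda x: x                               # true P(Y=1 | X=x), X ~ Uniform[0,1]
eta_tilde = lambda x: np.minimum(x + 0.2, 1.0)  # deliberately biased approximation

x = rng.random(1_000_000)                       # Monte Carlo sample of X
f_star = eta(x) >= 0.5                          # Bayes rule: 1{x >= 1/2}
f = eta_tilde(x) >= 0.5                         # plug-in rule: 1{x >= 0.3}

# Conditional risk of a rule g at x: eta(x) if g(x) = 0, else 1 - eta(x).
risk = lambda g: np.mean(np.where(g, 1.0 - eta(x), eta(x)))

excess = risk(f) - risk(f_star)                 # exact value: 0.04
bound = 2.0 * np.mean(np.abs(eta(x) - eta_tilde(x)))  # exact value: 0.36
print(f"excess risk {excess:.3f} <= bound {bound:.3f}")
```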

The histogram classifier

Let's assume that the (input) features are randomly distributed over the unit hypercube $\mathcal{X} = [0,1]^d$ (note that by scaling and shifting, any set of bounded features can be made to satisfy this assumption), and assume that the (output) labels are binary, i.e., $\mathcal{Y} = \{0, 1\}$. A histogram classifier is based on a partition of the hypercube $[0,1]^d$ into $M$ smaller cubes of equal size.

Partition of hypercube in 2 dimensions

Consider the unit square $[0,1]^2$ and partition it into $M$ subsquares of equal area (assuming $M$ is a perfect square, so that $\sqrt{M}$ is an integer). Denote the subsquares by $\{Q_j\}_{j=1}^M$.

Example: the hypercube $[0,1]^2$ partitioned into $M$ equally sized subsquares.

Define the following piecewise-constant estimator of $\eta(x)$:

$$\hat{\eta}_n(x) = \sum_{j=1}^{M} \hat{P}_j\, \mathbb{1}_{\{x \in Q_j\}},$$

where

$$\hat{P}_j = \frac{\sum_{i=1}^{n} \mathbb{1}_{\{X_i \in Q_j,\; Y_i = 1\}}}{\sum_{i=1}^{n} \mathbb{1}_{\{X_i \in Q_j\}}}.$$

As in our previous denoising examples, we expect that the bias of $\hat{\eta}_n$ will decrease as $M$ increases, while its variance will increase with $M$, since each cell then contains fewer training points.
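Below is a minimal NumPy sketch of the histogram classifier (our own illustration; the class name, the empty-cell default of 0, and the toy data are all our assumptions). It partitions $[0,1]^d$ into $M = m^d$ equal subcubes, computes $\hat{P}_j$ in each cell from the training data, and thresholds at $1/2$:

```python
import numpy as np

class HistogramClassifier:
    """Histogram classifier on [0,1]^d with m bins per axis (M = m**d cells).

    P_hat_j is the fraction of labels equal to 1 among training points in
    cell Q_j; the plug-in rule thresholds it at 1/2. Cells containing no
    training points default to label 0 (an arbitrary choice of ours).
    """

    def __init__(self, m, d):
        self.m, self.d = m, d
        self.ones = np.zeros((m,) * d)    # counts of Y_i = 1 per cell
        self.total = np.zeros((m,) * d)   # counts of X_i per cell

    def _cell(self, X):
        # Map each point in [0,1]^d to the multi-index of its subcube.
        return tuple(np.clip((X * self.m).astype(int), 0, self.m - 1).T)

    def fit(self, X, y):
        idx = self._cell(X)
        np.add.at(self.ones, idx, y)
        np.add.at(self.total, idx, 1)
        return self

    def predict(self, X):
        idx = self._cell(X)
        with np.errstate(invalid="ignore"):
            p_hat = self.ones[idx] / self.total[idx]  # eta_hat_n per cell
        return (np.nan_to_num(p_hat) >= 0.5).astype(int)

# Toy demo: eta(x) = x1, so the Bayes rule is 1{x1 >= 1/2}.
rng = np.random.default_rng(1)
X = rng.random((2000, 2))
y = (rng.random(2000) < X[:, 0]).astype(int)
clf = HistogramClassifier(m=8, d=2).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))
```

Raising $m$ gives smaller cells (lower bias) but fewer training points per cell (higher variance), which is exactly the tradeoff noted above.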


Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3