Machine Learning

CSCI 4622
Fall 2019

Prof. Claire Monteleoni

Today: Lecture 10

•  Introduction to Ensemble Methods
– Random Forests
– Voted Perceptron
– Boosting (if time)

With slide credits: S. Dasgupta, T. Jaakkola, J. Shotton, C. Tan, and A. Torralba
Review: (Binary) Decision Trees

[Figure: a binary decision tree. An input v is routed through split nodes
(threshold tests) down to a leaf node, which stores a distribution over
class labels c.]

Credit: Shotton, ICCV 09 Tutorial
Speeding up Decision Tree Learning

[Animation: candidate random splits of a 2-D (x, y) point set]

It is cumbersome to test all possible splits

Try several random splits

Keep the split that best separates data
•  Reduces uncertainty

Recurse

Animation: Shotton, ICCV 09 Tutorial
Random Decision Tree

[Final animation frame: repeatedly trying several random splits, keeping the
best, and recursing yields a random decision tree]
Random Decision Trees
•  How many features and thresholds to try?
–  just one: “extremely randomized” [Geurts et al. 06]
–  few: fast training, may under-fit
–  many: slower training, may over-fit
•  When to stop growing the tree?
–  maximum depth
–  minimum entropy gain
–  threshold changes in class distribution
–  pruning
•  A forest is an ensemble of several decision trees
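The split-selection heuristic above ("try several random splits, keep the best") can be sketched in a few lines. This is an illustrative sketch, not the tutorial's code: the helper names (`entropy`, `best_random_split`) are made up, and entropy reduction (information gain) stands in for "reduces uncertainty".

```python
import math
import random

def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_random_split(X, y, n_trials=10, rng=random):
    """Try n_trials random (feature, threshold) splits; keep the one with
    the largest information gain (entropy reduction)."""
    base = entropy(y)
    best = None  # (gain, feature, threshold)
    for _ in range(n_trials):
        f = rng.randrange(len(X[0]))
        t = rng.choice(X)[f]                 # threshold from a random point
        left  = [yi for xi, yi in zip(X, y) if xi[f] < t]
        right = [yi for xi, yi in zip(X, y) if xi[f] >= t]
        if not left or not right:
            continue
        # expected entropy after the split, weighted by branch size
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - after
        if best is None or gain > best[0]:
            best = (gain, f, t)
    return best
```

With `n_trials = 1` this is the "extremely randomized" extreme; growing `n_trials` trades training speed for split quality, as the bullets above note.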

Decision Forests

Classification: [Figure: a forest of trees t1 … tT; each tree routes the
input v through its split nodes to a leaf node holding a distribution over
class c, and the per-tree distributions are combined]

Shotton, ICCV 09 Tutorial
Learning a Forest
•  Divide training examples into T subsets St
–  improves generalization
–  reduces memory requirements & training time

•  Train each decision tree, t, on subset St
–  same decision tree learning as before
•  Easy to parallelize
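The subset-per-tree scheme above can be sketched as follows. This is a hypothetical sketch with made-up names: `train_tree` is a trivial majority-class stand-in for real decision-tree learning, and a thread pool illustrates "easy to parallelize".

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_tree(subset):
    """Stand-in for decision-tree learning on one subset: predicts the
    subset's majority class (a real forest would grow a full tree here)."""
    majority = Counter(y for _, y in subset).most_common(1)[0][0]
    return lambda x: majority

def train_forest(data, T, rng):
    """Divide the training examples into T disjoint subsets S_t and train
    one tree per subset; the per-tree training runs in parallel."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    subsets = [shuffled[t::T] for t in range(T)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_tree, subsets))

def forest_predict(trees, x):
    """Classify by majority vote over the trees in the forest."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]
```

Because each tree only ever sees its own subset, the T training jobs share no state, which is what makes the scheme trivially parallel and memory-friendly.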

Ensemble methods

An ensemble classifier combines a set of weak “base”
classifiers into a “strong” ensemble classifier.
•  “boosted” performance
•  more robust against overfitting

•  Decision Forests, Random Forests [Breiman ‘01], Bagging
•  Voted-Perceptron
•  Boosting
•  …
Perceptron: nonseparable data
What if data is not linearly separable?
In this case: almost linearly separable… how will the
perceptron perform?
Batch perceptron
Batch algorithm:

w = 0
while some (xi, yi) is misclassified:
    w = w + yi xi

Nonseparable data: this algorithm will never converge.
How can this be fixed?

Dream: somehow find the separator that misclassifies the
fewest points… but this is NP-hard (in fact, even NP-hard
to solve approximately).
Fixing the batch perceptron
Idea 1: only go through the data once, or a fixed number of times, K

w = 0
for k = 1 to K:
    for i = 1 to m:
        if (xi, yi) is misclassified:
            w = w + yi xi
At least this stops!

Problem: the final w might not be good.
E.g. right before terminating, the algorithm might perform an update on
an outlier!
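Idea 1 can be sketched directly from the pseudocode above; function and variable names are illustrative, and the bias term is assumed to be folded into x as a constant feature.

```python
def perceptron_fixed_passes(data, K):
    """Run the perceptron update for exactly K passes over the data, so
    the algorithm stops even on non-separable data.
    data: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(K):
        for x, y in data:
            # misclassified (or on the boundary): y * <w, x> <= 0
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    """Sign of the inner product <w, x>."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1
```

On separable data this behaves like the ordinary perceptron; on non-separable data it simply stops after K passes, with the last-update caveat discussed above.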
Voted-perceptron
Idea 2: keep around intermediate hypotheses, and have them
“vote” [Freund and Schapire, 1998]

n = 1
w1 = 0, c1 = 0
for k = 1 to K:
    for i = 1 to m:
        if (xi, yi) is misclassified:
            wn+1 = wn + yi xi
            cn+1 = 1
            n = n + 1
        else:
            cn = cn + 1

At the end, a collection of linear separators w1, w2, w3, …, along
with survival times: cn = amount of time that wn survived.
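A minimal sketch of the voted-perceptron bookkeeping (names are illustrative): each retired hypothesis is stored with its survival count, and prediction is the weighted majority vote of the stored separators.

```python
def voted_perceptron(data, K):
    """Keep every intermediate hypothesis w_n with its survival count c_n
    (Freund & Schapire). data: (x, y) pairs with y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * d
    c = 0
    hypotheses = []                     # list of (w_n, c_n)
    for _ in range(K):
        for x, y in data:
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                hypotheses.append((w, c))       # retire the current w
                w = [wj + y * xj for wj, xj in zip(w, x)]
                c = 1
            else:
                c += 1
    hypotheses.append((w, c))
    return hypotheses

def voted_predict(hypotheses, x):
    """Weighted majority vote: sign( sum_n c_n * sign(<w_n, x>) )."""
    def sign(z):
        return 1 if z > 0 else -1       # boundary counted as -1
    total = sum(c * sign(sum(wj * xj for wj, xj in zip(w, x)))
                for w, c in hypotheses)
    return sign(total)
```

Note that the survival counts add up to the total number of update steps (K passes times m points), so hypotheses that survived many steps dominate the vote.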
Voted-perceptron
Idea 2: keep around intermediate hypotheses, and have them
“vote” [Freund and Schapire, 1998]

At the end, a collection of linear separators w1, w2, w3, …, along
with survival times: cn = amount of time that wn survived.

This cn is a good measure of the reliability (or confidence) of wn.

To classify a test point x, use a weighted majority vote:

     sign( Σn cn sign(wn · x) )
Voted-perceptron
Problem: may need to keep around a lot of wn vectors.

Solutions:

(i)  Find “representatives” among the w vectors.
(ii)  Alternative prediction rule:
      just keep track of a running average, wavg = Σn cn wn / Σn cn,
      and predict sign(wavg · x).
[Plot: IRIS data, features 3 and 4; goal: separate + from o/x.
100 rounds, 1595 updates (5 errors); final hypothesis makes 5 errors for voting.]
Interesting questions
Modify the (voted) perceptron algorithm to:

[1] Find a linear separator with large margin

[2] “Give up” on troublesome points after a while
Voting margin and generalization
•  If we can obtain a large (positive) voting margin across the
training examples, we will have better generalization guarantees
(discussed later)

[Plot: fraction of points with positive margin, with margin > 0.05,
and with margin > 0.1, as a function of iteration]
Boosting
•  A simple algorithm for learning robust ensemble
classifiers
–  Freund & Schapire, 1995
–  Friedman, Hastie, Tibshirani, 1998

•  Easy to implement, no external optimization tools
needed.
Boosting
•  Defines a classifier using an additive model:

     F(x) = Σm αm fm(x)

   F: strong classifier; fm: weak classifier; αm: weight; x: data point
•  Defines a classifier using an additive model:

     F(x) = Σm αm fm(x)

•  We need to define a family of weak classifiers
–  E.g. linear classifiers, decision trees, or even decision stumps
(threshold on one axis-parallel dimension)
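A decision stump, the simplest weak classifier mentioned above, can be fit to weighted data by exhaustive search over (feature, threshold, sign) triples. A sketch with made-up names, assuming labels in {−1, +1}:

```python
def fit_stump(X, y, w):
    """Fit a decision stump h(x) = s if x[f] >= t else -s, minimizing the
    weighted 0-1 error on examples (X, y) with weights w."""
    best = None   # (weighted error, feature, threshold, sign)
    for f in range(len(X[0])):
        for t in sorted(set(x[f] for x in X)):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (s if xi[f] >= t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    err, f, t, s = best
    return (f, t, s), err / sum(w)

def stump_predict(stump, x):
    f, t, s = stump
    return s if x[f] >= t else -s
```

The brute-force search is O(#features × #thresholds) per round, which is why stumps are a popular weak-learner family for boosting.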
Each data point xt has a class label yt ∈ {+1, −1} and a weight wt;
we initialize all wt = 1.
Boosting
•  Run sequentially on a batch of n data points

[Figure: the data points x1, x2, …, xt]
Example using linear separators
Weak learners from the family of lines

Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.

This linear separator has error rate 50%
Torralba, ICCV 05 Short Course
Example using linear separators

This one seems to be the best, call it f1
This is a ‘weak classifier’: its error rate is slightly less than 50%.
Torralba, ICCV 05 Short Course
Example using linear separators

We update the weights:

     wt ← wt exp{−yt fm(xt)}

–  Re-weight the points such that the previous weak classifier now has 50% error
–  Iterate: find a weak classifier for this new problem
Torralba, ICCV 05 Short Course
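The re-weighting claim can be checked numerically. In AdaBoost the update also includes the classifier weight αm, i.e. wt ← wt exp{−αm yt fm(xt)}; with αm = ½ ln((1 − εm)/εm), the previous weak classifier's weighted error becomes exactly 50%, as stated above. A sketch (names illustrative):

```python
import math

def reweight(w, y, preds, alpha):
    """AdaBoost reweighting: w_i <- w_i * exp(-alpha * y_i * f(x_i)),
    then renormalize so the weights sum to 1."""
    w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
    z = sum(w)
    return [wi / z for wi in w]

def weighted_error(w, y, preds):
    """Total weight on the misclassified points."""
    return sum(wi for wi, yi, pi in zip(w, y, preds) if yi != pi)

# toy run: a weak classifier that gets 1 of 5 points wrong
y     = [ 1,  1, -1, -1, -1]
preds = [ 1,  1, -1, -1,  1]          # mistake on the last point
w = [0.2] * 5
eps = weighted_error(w, y, preds)     # weighted error before reweighting
alpha = 0.5 * math.log((1 - eps) / eps)
w = reweight(w, y, preds, alpha)      # now the same classifier errs at 50%
```

After the update, the mistake's weight has grown and the correct points' weights have shrunk until the old classifier is no better than chance on the new weighting, forcing the next round to learn something new.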
Example using linear separators

The strong (non-linear) ensemble classifier is built as a weighted combination
of all the weak (linear) classifiers f1, f2, f3, f4.
Torralba, ICCV 05 Short Course
Boosting
•  AdaBoost (Freund and Schapire, 1995)
•  Real AdaBoost (Friedman et al, 1998)
•  LogitBoost (Friedman et al, 1998)
•  Gentle AdaBoost (Friedman et al, 1998)
•  BrownBoost (Freund, 2000)
•  FloatBoost (Li et al, 2002)
•  …
Mostly differ in choice of loss function and how
it is minimized. Torralba, ICCV 05 Short Course
Loss functions: motivation
• Want a smooth upper bound on 0-1 training error.
[Plot: training loss and 0-1 training error as a function of iteration]
Loss functions
[Plot: loss as a function of the margin yF(x), for margins from −1.5 to 2.
Curves: misclassification (0-1) error, squared error, and exponential loss.]
Torralba, ICCV 05 Short Course
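The curves in the plot are easy to reproduce as functions of the margin yF(x); note that for y ∈ {−1, +1}, the squared error (y − F(x))² equals (1 − yF(x))². A sketch showing that the exponential loss is a smooth upper bound on the 0-1 error:

```python
import math

def zero_one(margin):
    """Misclassification (0-1) error; a zero margin counts as an error."""
    return 1.0 if margin <= 0 else 0.0

def exp_loss(margin):
    """Exponential loss exp(-yF(x)), the loss minimized by AdaBoost."""
    return math.exp(-margin)

def squared(margin):
    """Squared error written in terms of the margin: (1 - yF(x))^2."""
    return (1.0 - margin) ** 2

margins = [-1.5, -0.5, 0.0, 0.5, 1.5]
```

The squared error is also an upper bound near margin 0 but, unlike the exponential loss, it penalizes margins larger than 1, which is one motivation for the exponential choice in boosting.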
Boosting
Sequential procedure. At each step we add a weak classifier fm(x; θm)
(input x, desired output y, parameters θm) to minimize the residual loss:

     Σi L( yi , Fm−1(xi) + fm(xi; θm) )

For more details: Friedman, Hastie, Tibshirani. “Additive Logistic Regression: a Statistical View of Boosting” (1998)
How to set the ensemble weights?
•  Prediction on a new data point x is typically of the form:

     F(x) = Σm=1..k αm fm(x)

•  How to set the values αm?
•  Depends on the algorithm. E.g. in AdaBoost:

     αm = ½ ln( (1 − εm) / εm )

where εm is the training error of fm on the (currently)
weighted data set.
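Putting the pieces together, AdaBoost with decision stumps fits in a few dozen lines. This is a minimal sketch, not a reference implementation: names are made up, εm is clamped away from zero to keep αm finite, and the stump search is brute force.

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump h(x) = s if x[f] >= t else -s."""
    best = None
    for f in range(len(X[0])):
        for t in set(x[f] for x in X):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (s if xi[f] >= t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    err, f, t, s = best
    h = lambda x, f=f, t=t, s=s: s if x[f] >= t else -s
    return h, err / sum(w)

def adaboost(X, y, k):
    """k rounds of AdaBoost; returns x -> sign(sum_m alpha_m f_m(x))."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                       # list of (alpha_m, f_m)
    for _ in range(k):
        h, eps = fit_stump(X, y, w)
        eps = max(eps, 1e-12)           # guard against a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # reweight: w_i <- w_i * exp(-alpha * y_i * h(x_i)), then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * f(x) for a, f in ensemble) > 0 else -1
```

Each round uses exactly the αm formula above, and the exponential reweighting is the update shown in the linear-separator example.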
Boosting example
Bias-Variance Error Decomposition
Assume the data (x, y) is drawn i.i.d. from D, s.t.:

     y = f(x) + ε

where:

     E[ε] = 0,   Var(ε) = σ²

Then, for any data point (x0, y0),

     E[(y0 − h(x0))²] = σ² + Var[h(x0)] + (Bias[h(x0)])²

where:

     Var[h(x0)] = E[h(x0)²] − E[h(x0)]²
     Bias[h(x0)] = E[h(x0)] − f(x0)

σ² is the irreducible error (can’t be changed by choice of h).

NOTE: All expectations are w.r.t. the random training set h is learned from, NOT x0.
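The decomposition can be verified numerically by treating h(x0) as a random variable over training sets. A sketch using made-up parameters: rather than actually retraining h, its value at x0 is simulated directly as a Gaussian with mean μh and std σh, which is enough to check the identity by Monte Carlo.

```python
import random

rng = random.Random(0)
sigma = 1.0                  # noise std, so the irreducible error is sigma^2
f_x0 = 2.0                   # true value f(x0)
mu_h, sd_h = 2.3, 0.5        # h(x0) over random training sets:
                             # mean mu_h (bias 0.3), std sd_h (variance 0.25)
n = 200_000

h  = [rng.gauss(mu_h, sd_h) for _ in range(n)]      # predictions at x0
y0 = [f_x0 + rng.gauss(0.0, sigma) for _ in range(n)]  # noisy labels at x0

# left-hand side: expected squared error at x0
mse = sum((yi - hi) ** 2 for yi, hi in zip(y0, h)) / n

# right-hand side: sigma^2 + Var[h(x0)] + Bias[h(x0)]^2
mean_h = sum(h) / n
var_h = sum((hi - mean_h) ** 2 for hi in h) / n
bias_sq = (mean_h - f_x0) ** 2
decomposed = sigma ** 2 + var_h + bias_sq
```

With these parameters both sides come out near σ² + 0.25 + 0.09 = 1.34, matching the decomposition up to Monte Carlo error.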
Understanding boosting
• There are four different kinds of errors in the boosting
procedure that we can try to understand
-  weighted error that the base learner achieves at each
iteration
-  weighted error of the base learner relative to just updated
weights (i.e., trying the same base learner again)
-  training error of the ensemble as a function of the number of
boosting iterations
-  generalization error of the ensemble
