Ensemble Methods
Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model.
An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions.
Ensembles tend to yield better results when there is a significant diversity among the models.
Ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.
Ensembling reduces variance and bias, the two sources of error that cause large differences between predicted and actual results: bagging mainly attacks variance, while boosting mainly attacks bias.
Types of ensembles:
1) Bayes optimal classifier
2) Bagging
3) Boosting
4) Bayesian parameter averaging
5) Bayesian model combination
6) Bucket of models
7) Stacking
1) Bayes optimal classifier (or Optimal Bayes classifier):
The Optimal Bayes classifier chooses the class that has the greatest a posteriori probability of occurrence (so-called maximum a posteriori estimation, or MAP). It can be shown that, of all classifiers, the Optimal Bayes classifier is the one with the lowest probability of misclassifying an observation, i.e. the lowest probability of error. So if we know the posterior distribution, using the Bayes classifier is as good as it gets.
The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space, with each hypothesis weighted by its posterior probability given the training data.
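In symbols (notation not given in the original post, but standard), with hypothesis space H, training data T, and set of classes C, the Bayes optimal prediction for a new example is

```latex
y = \operatorname*{arg\,max}_{c_j \in C} \; \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)
```

i.e. every hypothesis votes for each class, weighted by how well it explains the training data and by its prior.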
2) Bagging:
BAGGing gets its name because it combines Bootstrapping and Aggregation to form one ensemble model. Given a sample of data, multiple bootstrapped subsamples are drawn. A Decision Tree is fitted on each of the bootstrapped subsamples. After each subsample's Decision Tree has been formed, their predictions are aggregated (by majority vote for classification, or by averaging for regression) to form a single, more robust predictor.
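As a minimal sketch of bagged decision trees (scikit-learn, the synthetic data, and the hyperparameters below are my own illustrative choices, not part of the original post):

```python
# Bagging sketch: bootstrapped subsamples, one decision tree per subsample,
# predictions aggregated by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "a sample of data".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# BaggingClassifier's default base estimator is a decision tree; each tree is
# fitted on a bootstrapped subsample and their votes are aggregated.
bagging = BaggingClassifier(n_estimators=100, max_samples=0.8,
                            bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))
```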
3) Boosting:
Boosting trains the predictors sequentially. The first predictor is learned on the whole data set, while each following predictor is learned on the training set reweighted according to the performance of the previous one. Boosting starts by classifying the original data set with equal weights given to each observation. If the first learner predicts some observations incorrectly, it gives higher weight to those misclassified observations. Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached. Boosting has shown better predictive accuracy than bagging, but it also tends to over-fit the training data. Bagging, by contrast, trains its models in parallel and combines them with a simple, unweighted aggregation.
AdaBoost -- At each iteration, adaptive boosting changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases those of the correctly predicted instances. The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (the so-called alpha weight). The better it performs, the more it contributes to the strong learner.
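A hedged sketch of adaptive boosting with scikit-learn's AdaBoostClassifier (the dataset and hyperparameters are illustrative assumptions, not from the post):

```python
# AdaBoost sketch: each round reweights the training instances so the next
# weak learner (a depth-1 decision stump by default) focuses on the examples
# the previous learners got wrong; learners are combined by their alpha weights.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```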
Gradient Boosting -- Gradient boosting doesn't modify the sample distribution. Instead of training on a new sample distribution, the weak learner trains on the remaining errors (the so-called pseudo-residuals) of the strong learner. It is another way to give more importance to the difficult instances. At each iteration, the pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then, the contribution of the weak learner (the so-called multiplier) to the strong one isn't computed according to its performance on the new sample distribution but using a gradient descent optimization process. The computed contribution is the one that minimizes the overall error of the strong learner.
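A comparable sketch with scikit-learn's GradientBoostingClassifier (again, data and hyperparameters are illustrative assumptions):

```python
# Gradient boosting sketch: each new shallow tree is fitted to the
# pseudo-residuals of the current ensemble, and its contribution is
# scaled by the learning rate.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print("gradient boosting accuracy:", gb.score(X_test, y_test))
```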
4) Bayesian parameter averaging:
Bayesian parameter averaging (BPA) is an ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law.
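As a rough formula (my own notation, not from the post): if S is the set of sampled hypotheses, BPA approximates the Bayes optimal prediction by

```latex
P(c_j \mid x, T) \;\approx\; \sum_{h \in S} P(c_j \mid x, h)\, P(h \mid T)
```

where each sampled hypothesis is weighted by its posterior probability P(h | T), obtained via Bayes' law.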
5) Bayesian model combination:
Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average (with statistical significance) than BMA and bagging.
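A toy sketch of the core BMC idea (my own simplified illustration, not a faithful BMC implementation; the models, data, and number of sampled weightings are all assumptions): sample ensemble weightings from a uniform Dirichlet, score each on held-out data, and combine them in proportion to how well they explain it.

```python
# BMC toy sketch: sample candidate ensemble weightings from a uniform
# Dirichlet, weight each sampled ensemble by how well it explains held-out
# data, and average the weightings accordingly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000),
          GaussianNB(),
          DecisionTreeClassifier(max_depth=5, random_state=0)]
probas = [m.fit(X_train, y_train).predict_proba(X_val) for m in models]

rng = np.random.default_rng(0)
weightings = rng.dirichlet(alpha=np.ones(len(models)), size=500)

# Log-likelihood of the validation labels under each sampled weighting.
log_liks = []
for w in weightings:
    mix = sum(wi * p for wi, p in zip(w, probas))
    log_liks.append(np.log(mix[np.arange(len(y_val)), y_val]).sum())
log_liks = np.array(log_liks)

# Posterior-style weights over the sampled ensembles (softmax of the
# log-likelihoods), then average the weightings under those weights.
post = np.exp(log_liks - log_liks.max())
post /= post.sum()
final_weights = post @ weightings
print("combined model weights:", final_weights)
```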
6) Bucket of models:
A "bucket of models" is an ensemble technique in which a model-selection algorithm is used to choose the best model for each problem. When tested on only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any single model in the set.
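A minimal sketch of a bucket of models (the candidate models, data, and cross-validation setup are illustrative assumptions): cross-validate each candidate and keep whichever scores best for this particular problem.

```python
# Bucket-of-models sketch: pick the best model for this problem by
# cross-validated score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bucket = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in bucket.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```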
7) Stacking:
Stacking works in two phases. First, we use multiple base classifiers to predict the class. Second, a new learner (a meta-learner) is used to combine their predictions with the aim of reducing the generalization error.
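A minimal stacking sketch with scikit-learn's StackingClassifier (the particular base learners and meta-learner here are illustrative choices, not from the post):

```python
# Stacking sketch: level-0 base classifiers feed their predictions to a
# level-1 meta-learner (here a logistic regression) that combines them.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```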
Random Forest Models:
Random Forest models can be thought of as BAGGing with a slight tweak. When deciding where to split and how to make decisions, BAGGed Decision Trees have the full set of features at their disposal. Therefore, although the bootstrapped samples may be slightly different, the trees will largely split on the same strong features throughout each model. By contrast, Random Forest models decide where to split based on a random selection of features. Rather than splitting on similar features at each node throughout, Random Forest models introduce a level of differentiation because each tree will split based on different features. This differentiation yields a more diverse, less correlated ensemble to aggregate over, producing a more accurate predictor.
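A minimal sketch contrasting a Random Forest with the bagging example above (hyperparameters are illustrative; max_features controls the random feature subset considered at each split):

```python
# Random forest sketch: like bagging with trees, but each split only considers
# a random subset of the features (max_features), which decorrelates the trees
# and usually improves the aggregated prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```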