Welcome to this fourth video in Lucidate’s series on Ensemble methods.
Hit '>Play' above for a comparative performance analysis of three tree-based machine learning models.
In previous videos we have discussed decision trees: what they are for, how to build them and how to interpret their output.
We’ve looked at an extension to decision trees called random forests. Random forests - as the name suggests - produce a large number of diversified trees from a training dataset. A technique called ‘bagging’ (bootstrap aggregating) trains each tree on a random sample of the training data, drawn with replacement, which is what diversifies the trees. All trees in a forest make errors, but they don’t all make the same errors, so in aggregate the forest makes better decisions than any single tree alone.
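For anyone who wants to experiment, here is a minimal Python sketch of the idea using scikit-learn. It runs on synthetic stand-in data rather than the loan default dataset from the series, and the estimator count is an illustrative choice, not the video’s settings:

    # Bagging in practice: each of the 100 trees is trained on a bootstrap
    # sample of the training data, so individual trees err differently and
    # their aggregate vote is more robust than any single tree.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))  # accuracy on held-out data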
This approach is developed further with AdaBoost. Boosting is another technique for producing diversified, uncorrelated machine learning models. But rather than resampling the training data at random, the training set is re-weighted to emphasise samples where previous models either gave an incorrect classification or disagreed with one another.
Where the trees in a random forest are built in parallel, AdaBoost models are built in series: each new model “learns” from the mistakes of the models before it. AdaBoost typically uses very simple ‘weak’ learners - often single-split decision stumps - which helps to avoid excessive overfitting.
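The same idea in scikit-learn, continuing from the sketch above (the stump depth and estimator count are illustrative, not the video’s settings):

    # AdaBoost in practice: the base learner is a one-split decision stump;
    # each successive stump is fitted with sample weights that emphasise the
    # training examples earlier stumps misclassified.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    stump = DecisionTreeClassifier(max_depth=1)  # a deliberately weak learner
    # note: scikit-learn versions before 1.2 call this parameter base_estimator
    ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    print(ada.score(X_test, y_test))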
In this video we will compare the performance of these three approaches on our loan default dataset. As we are comparing performance we will make extensive use of the ‘Confusion Matrix’ and the metrics derived from it: accuracy, precision, recall (also called sensitivity) and the F1 score.
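Continuing the sketch, all of these metrics fall out of the confusion matrix with a few scikit-learn calls (shown here for the forest’s predictions; the same calls apply to any of the three models):

    # The confusion matrix for a binary classifier unpacks into true negatives,
    # false positives, false negatives and true positives.
    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    y_pred = forest.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    print((tp + tn) / (tp + tn + fp + fn))  # accuracy, straight from the matrix
    print(precision_score(y_test, y_pred))  # tp / (tp + fp)
    print(recall_score(y_test, y_pred))     # sensitivity (recall): tp / (tp + fn)
    print(f1_score(y_test, y_pred))         # harmonic mean of precision and recall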
Links to videos:
Decision Trees: https://youtu.be/5swHVbJNWpw
AdaBoost: https://youtu.be/n8ywyA2_nVE
Random Forests: https://youtu.be/jz1ZBTrIx5Y & https://youtu.be/4SjkF13Sl_0
Evaluation Metrics & the confusion matrix: https://youtu.be/a2oZwdwo0M0
