Random Forest: Introduction & Implementation in Python
A beginner-friendly introduction to Random Forest and an introductory implementation in Python.
As always, let’s start with a question.
Have you ever been in a situation where you needed the opinion of more than one person to make a decision?
For example, we often read many reviews and seek opinions from family and friends before buying a laptop.
Here, the decisions and recommendations from different sources allow us to make an informed decision to buy a laptop.
In Machine Learning, Random Forest works similarly.
- Decisions from multiple Decision Trees are considered to make a final decision.
- Here is an article introducing Decision Trees (if you haven’t read it yet).
Random Forest was introduced by Breiman (2001).
Random Forests are based on the intuition that “It’s better to get a second opinion when you want to make a decision.”
Fig. Random Forest in a Nutshell.
By the end of this article, you will:
- Understand what a Random Forest is.
- Know what Ensemble Learning is.
- Be able to use a Random Forest for Predictions.
But before we dive in, let me ask you a couple of questions:
- Do you know what Random is in Random Forest?
- Do you know what Ensemble Learning is?
- What is Bootstrap Aggregation?
- What is Horizontal and Vertical Split?
- Can we use Random Forest for both Classification and Regression Tasks?
Don’t worry if you do not have answers to these questions yet.
By the end of this article, you will understand all of them, if not more.
From Decision Trees to Random Forest
A decision tree alone does a fairly good job of prediction. However, the performance of the model can be increased by using multiple decision trees, which mitigates some of the Decision Tree’s problems.
When we use multiple models (a group of models) for a single prediction task, we call it an ensemble model. The umbrella term this falls under is Ensemble Learning.
The group of models uses different techniques to contribute to the single prediction task. Two of the most widely used methods are Bagging and Boosting.
We will focus on the Bagging technique because Random Forest uses this technique.
Now, when a group of decision trees is used to create an ensemble model using the Bagging technique, we will have multiple trees — like a Forest. Each tree will be trained on a Random subset of our data. When we combine those terms, we get a Random Forest.
Note: The Bagging technique is a combination of Bootstrap and Aggregate. In statistics, Bootstrapping means randomly sampling (choosing) a subset of data with replacement. Aggregating means combining the results, e.g., taking the mean, median, or mode.
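To make the Bootstrap and Aggregate idea concrete, here is a minimal NumPy sketch (the numbers are made up purely for illustration): we draw a few bootstrap samples from a tiny array and then aggregate a statistic from each sample.
import numpy as np
rng = np.random.default_rng(42)
# A tiny, made-up set of observations (e.g., house prices)
data = np.array([200, 220, 250, 270, 300, 310, 350])
# Bootstrap: draw 3 samples of the same size, with replacement
bootstrap_samples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]
# Aggregate: combine the statistic (here, the mean) from each sample
sample_means = [sample.mean() for sample in bootstrap_samples]
print("Means of the bootstrap samples:", sample_means)
print("Aggregated (bagged) estimate:", np.mean(sample_means))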
In a Random Forest, each tree makes a decision (Prediction); we call this voting. When it is time to make a final prediction, the prediction that the majority of trees make will be chosen.
Here, three Decision Trees voted for buying a house, and one decision tree voted against. Since the majority of the votes were for buying a house, we choose to buy the house. We can initialize a Random Forest in Python as:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
regressor = RandomForestRegressor() # for a regression task
classifier = RandomForestClassifier() # for a classification task
In Random Forest, each of those individual Decision Trees is trained on the randomly selected training data.
- But, the randomly selected data is different for each decision tree.
- If we have 10 decision trees, we will have 10 subsets from the original dataset.
- Those 10 decision trees will be individually trained on those data.
- And finally, aggregated to make the final prediction.
All of this can be done in Python with scikit-learn as:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10) # for a regression task
classifier = RandomForestClassifier(n_estimators=10) # for a classification task
Here, n_estimators sets the number of decision trees in Random Forest. Since each tree requires its own subset of original data, we will have 10 subsets of data.
We now know that there will be subsets of data for each individual tree, so let’s see how the subset is selected.
The subset is created by selecting features and observations Vertically and Horizontally.
Vertically — A random subset of Features is selected.
Horizontally — A random subset of Observations is selected.
Here is a figure to explain this.
For any decision tree in the forest, a Random number of features and a Random number of observations will be selected and used to train that particular individual decision tree. Here, for another decision tree, different sets of Features and Observations are selected.
The idea behind this is to create diversity among the decision trees. By using random features and observations, no two decision trees will learn exactly the same pattern, which helps create diversity among the predictors (decision trees).
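To make the vertical and horizontal split concrete, here is a minimal NumPy/pandas sketch (the column names and values are made up); it only illustrates the idea of drawing a random view of the data for one tree. scikit-learn handles this internally, so you never have to do it yourself.
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
# A made-up dataset: 6 observations x 4 features (column names are hypothetical)
df = pd.DataFrame(rng.integers(1, 10, size=(6, 4)), columns=["rooms", "area", "age", "distance"])
# Horizontal split: bootstrap a random subset of observations (rows, with replacement)
row_idx = rng.choice(df.index, size=4, replace=True)
# Vertical split: pick a random subset of features (columns, without replacement)
col_idx = rng.choice(df.columns, size=2, replace=False)
# The data one individual tree would be trained on
tree_view = df.loc[row_idx, col_idx]
print(tree_view)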
In scikit-learn, we have two parameters that control this: max_features (for features) and max_samples (for observations).
By default, one decision tree will select a maximum of sqrt(total features) for the classification task. This means that if we have 100 features, then one decision tree will see a maximum of 10 features for a classification task.
However, the default for a regression task is 1.0, which means all of the features are used.
The default values for classification and regression can be confusing for beginners. Just remember one thing: if the value is a float (e.g., 1.0), it is treated as a fraction of the features, so 1.0 means 100% of the features will be selected.
We can set max_features=0.2, and it will select a maximum of 20 features (out of 100).
We calculate that as max(1, 0.2 * 100) = max(1, 20) = 20.
# for a classification task
classifier = RandomForestClassifier(n_estimators=100, max_features='sqrt')
# for a regression task
regressor = RandomForestRegressor(n_estimators=100, max_features=0.2)
For the number of observations, we can tweak the max_samples parameter.
classifier = RandomForestClassifier(max_samples=0.5) # for a classification task
regressor = RandomForestRegressor(max_samples=0.5) # for a regression task
Here, max_samples=0.5 means each tree will have a bootstrapped sample of 50% observations.
If we have 500 observations, each tree will have a bootstrapped sample of 250 observations to train.
Here is an amazing article on the Bootstrapping method and how to create a bootstrap sample.
Please go through the documentation of RandomForestClassifier and RandomForestRegressor in the scikit-learn documentation to see the other parameters you can set.
Since Random Forest is just a wrapper around multiple decision trees, we can also reuse the parameters we used for a single decision tree; a combined example follows the list below.
- criterion=’entropy’, criterion=’squared_error’
- max_depth = 3
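Putting it all together, here is a small end-to-end sketch on a synthetic dataset (generated with scikit-learn's make_classification, so the data and numbers are arbitrary) that combines the forest-level parameters with the tree-level ones listed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Synthetic classification data (arbitrary sizes, just for illustration)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Forest-level parameters (n_estimators, max_features, max_samples)
# combined with tree-level parameters (criterion, max_depth)
classifier = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    max_samples=0.5,
    criterion="entropy",
    max_depth=3,
    random_state=42,
)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print("Test accuracy:", classifier.score(X_test, y_test))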
Amazing!!!
Finally, after training our Random Forest model, we will have a result like this:
Among the 10 Decision Trees, 7 voted for buying the house. So the final decision will be to buy the house.
How to Measure the Results of Random Forests?
Well, this is the same as the Decision Tree.
It depends on what type of problem we are solving.
If we are solving a Regression problem, then we should use metrics such as MSE, MAE, RMSE, etc.
If we are solving a Classification problem, then Accuracy, Precision, Recall, and F1 Score are the metrics we should use.
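As a quick sketch, these metrics can be computed with sklearn.metrics like this (the true values and predictions below are toy arrays, purely for illustration):
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error
# Classification metrics: toy true labels vs. toy predictions
y_true_clf = np.array([1, 0, 1, 1, 0, 1])
y_pred_clf = np.array([1, 0, 0, 1, 0, 1])
print("Accuracy :", accuracy_score(y_true_clf, y_pred_clf))
print("Precision:", precision_score(y_true_clf, y_pred_clf))
print("Recall   :", recall_score(y_true_clf, y_pred_clf))
print("F1 Score :", f1_score(y_true_clf, y_pred_clf))
# Regression metrics: toy true values vs. toy predictions
y_true_reg = np.array([250.0, 300.0, 180.0, 410.0])
y_pred_reg = np.array([240.0, 310.0, 200.0, 400.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MSE :", mse)
print("MAE :", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", np.sqrt(mse))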
Implementation Examples:
Here is the Kaggle Notebook example of using a Random Forest for a Regression Problem.
Here is the Kaggle Notebook example of using a Random Forest for a Classification Problem.
Random Forest fixes a couple of limitations of the Decision Tree:
- Random Forests are less prone to overfitting
- Small changes in the dataset don’t necessarily affect the result of Random Forest.
- Reduces the bias towards imbalanced data as well.
Let me go back to what I promised at the beginning. By now, you should:
- Understand what a Random Forest is.
- Know what Ensemble Learning is,
- Be able to use a Random Forest for Predictions.
And I hope you are now able to answer these questions:
- Do you know what Random is in Random Forest?
- Do you know what Ensemble Learning is?
- What is Bootstrap Aggregation?
- What is Horizontal and Vertical Split?
- Can we use Random Forest for both Classification and Regression Tasks?
What’s Next
Bagging is not the only ensemble technique we have. Let’s talk about Boosting, another ensemble learning technique, in our next article.
This is the 5th article in the series: Forming a strong foundation.
Here is the list of previous articles:
References and More Materials:
Bento, C. (2022, January 10). Random Forests Algorithm explained with a real-life example and some Python code. Medium. https://towardsdatascience.com/random-forests-algorithm-explained-with-a-real-life-example-and-some-python-code-affbfa5a942c
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/a:1010933404324
R, S. E. (2024, June 11). Understand Random Forest algorithm with examples (Updated 2024). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/
A guide to Random Forest. (n.d.). E2E Networks. https://www.e2enetworks.com/blog/random-forest-algorithm-in-machine-learning-a-guide
Meltzer, R. (2023, August 31). What is Random Forest? [Beginner’s Guide + Examples]. CareerFoundry. https://careerfoundry.com/en/blog/data-analytics/what-is-random-forest/
11.2.1 — Bootstrapping methods. (n.d.). Penn State STAT 500. https://online.stat.psu.edu/stat500/book/export/html/619