
Random forests
Random forests are one of the most commonly used supervised learning algorithms. While they can be used for both classification and regression tasks, we're going to focus on the former. Random forests are an example of an ensemble method, which works by aggregating the outputs of multiple models in order to construct a stronger-performing model. You'll sometimes hear this described as combining many weak learners to create a strong learner.
Setting up a random forest classifier in Python is quite simple with the help of scikit-learn. First, we import the modules and set up our data. We do not have to perform any data cleaning here, as the Iris dataset comes pre-cleaned.
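The data-loading step itself isn't shown here, but the indexing used below assumes a pandas DataFrame named data whose first four columns are the flower measurements and whose fifth column is the species label. A minimal sketch of one way to set this up, using scikit-learn's built-in copy of the dataset, looks like this:
import pandas as pd
from sklearn.datasets import load_iris
# load the Iris dataset into a DataFrame: four measurement columns followed by the label
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target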
Before training a machine learning algorithm, we need to split our data into training and testing sets. We want the model to learn how to predict on unseen data, not just memorize the patterns in the data it has already seen. To do this, we split our dataset into two parts, typically using 75% of the data for training and 25% for testing. The train_test_split function from sklearn creates the training and testing sets for our model automatically:
from sklearn.model_selection import train_test_split
# the first four columns hold the features; the fifth column holds the label
features = data.iloc[:, 0:4]
labels = data.iloc[:, 4]
# hold out 25% of the rows as a test set; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=50)
Random forests build on the basic structure of the decision tree and are trained using the bagging method (bootstrap aggregating). Bagging randomly samples subsets of the training data with replacement, which helps reduce the variance that can arise from a single model fitting too closely to one particular dataset. Each of these subsets is then used to train a decision tree, and the outputs of those trees are aggregated into a final prediction.
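To make the bootstrap idea concrete, the lines below sketch how a single bootstrap sample could be drawn with scikit-learn's resample utility. The forest performs this sampling internally, so the snippet is purely illustrative:
from sklearn.utils import resample
# draw one bootstrap sample: the same size as the training set, sampled with replacement
x_boot, y_boot = resample(x_train, y_train, replace=True, random_state=0)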
Random forests take the basic bagging technique further by introducing random feature selection into the mix: at each split, a tree only considers a random subset of the features. This randomness is, in fact, the key to random forests. Because each tree sees a different bootstrap sample and a different subset of candidate features at each split, the trees become less correlated with one another, which reduces the variance of the ensemble at the cost of a small increase in bias. The overall error of the forest depends on the correlation between any two trees, as well as on the error rate of the individual trees.
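In scikit-learn, the size of this random feature subset is controlled by the max_features parameter of RandomForestClassifier. The sketch below simply shows the parameter set explicitly; for classification, a square-root-sized subset is the usual choice:
from sklearn.ensemble import RandomForestClassifier
# consider a random subset of sqrt(n_features) candidate features at each split
rf_sqrt = RandomForestClassifier(n_estimators=1000, max_features='sqrt')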
In Python, to create a random forest, we first instantiate the classifier, setting the n_estimators parameter, which tells the algorithm how many trees we want in our forest; and lastly, we fit the model:
from sklearn.ensemble import RandomForestClassifier
# build a forest of 1000 trees and train it on the training set
rf_classifier = RandomForestClassifier(n_estimators=1000)
rf_classifier.fit(x_train, y_train)
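Once the forest is fitted, we can sanity-check it on the held-out test set. The following lines are a small addition to the example above, using scikit-learn's accuracy_score metric:
from sklearn.metrics import accuracy_score
# predict labels for the unseen test data and compare them to the true labels
y_pred = rf_classifier.predict(x_test)
print(accuracy_score(y_test, y_pred))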
Upon fitting the forest, the algorithm is applied to the data. Let's take a look at what's going on behind the scenes:
- The algorithm draws a random bootstrap sample of the training data for each tree in the forest. In this case, we defined 1000 trees.
- At each split within a tree, the algorithm randomly selects a subset of the features and uses a splitting metric to test the predictive power of those features.
- This process iteratively continues until the trees are fully grown.
- The predictions of all of the trees are aggregated into a final prediction.
For classification tasks, the trees in this algorithm grow and split based on one of two metrics: Gini impurity or information gain.
- Gini impurity: A metric that measures how often a randomly chosen sample from a node would be misclassified if it were labeled according to the distribution of classes in that node; the more impure a node is, the more evenly the classes are mixed and the less confidently one class can be predicted over the others. Gini impurity can be biased towards certain variables.
- Information gain: The reduction in entropy, a measure of randomness and uncertainty, achieved by a split. It is slightly slower to compute than Gini impurity because it involves a logarithm.
If we look at the mathematical formulas for both of these, we can see where they differ:
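Written out in their standard form, with p_i denoting the proportion of samples of class i in a node and C the number of classes:
- Gini impurity: G = 1 - \sum_{i=1}^{C} p_i^2
- Entropy: H = -\sum_{i=1}^{C} p_i \log_2 p_i
Information gain is then the decrease in entropy achieved by a split, and the logarithm in the entropy term is what makes it slightly slower to compute than Gini impurity's simple sum of squared probabilities.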
In the majority of cases, your choice of Gini impurity versus entropy will not noticeably affect the performance of the model. For regression trees, the algorithm instead seeks to minimize the variance within each split.
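In scikit-learn, the Gini-versus-entropy choice is exposed through the criterion parameter of RandomForestClassifier, which defaults to 'gini'; switching to entropy is a one-line change, as sketched below:
from sklearn.ensemble import RandomForestClassifier
# split nodes using entropy/information gain instead of the default Gini impurity
rf_entropy = RandomForestClassifier(n_estimators=1000, criterion='entropy')
rf_entropy.fit(x_train, y_train)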
For more information on random forests, or to see more code examples, refer to the code examples and exercises at the end of this chapter, which also cover other common supervised algorithms. Besides the logistic regression and random forest models that we've discussed, other supervised learning algorithms include:
- Linear regression
- Naive Bayes
- Support vector machines (SVMs)