Random Forest#

The random forest is an ensemble of decision trees used for classification and regression tasks.
The trees work together so that the ensemble predicts more accurately, and with lower variance, than any single tree.
A key aspect of the algorithm is incorporating randomness to generate diverse decision trees. This is achieved in two primary ways:
Bagging: training each tree on a bootstrap sample, i.e., examples drawn at random from the training set with replacement (so some examples repeat).
Random feature selection: the max_features hyperparameter restricts each node split to a random subset of the features.
By combining many such randomized trees, the forest produces predictions that are more accurate and more stable than those of any individual tree.
The final prediction is made as follows:
For classification, soft voting is used: the class probabilities predicted by the individual trees are averaged, and the class with the highest average probability is chosen.
For regression, the trees' predicted values are simply averaged.
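A minimal from-scratch sketch of these two ideas, for illustration only (not scikit-learn's actual implementation): each tree is fit on a bootstrap sample, the per-split feature randomness comes from max_features='sqrt', and the ensemble soft-votes by averaging predict_proba.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # bagging: draw a bootstrap sample (with replacement) of the training rows
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features='sqrt' adds the per-split feature randomness described above
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))
# soft voting: average the per-tree class probabilities, then take the argmax
proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
print('hand-rolled ensemble accuracy:', (proba.argmax(axis=1) == y_test).mean())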

Random Forest Classifier#
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
RandomForestClassifier(random_state=0)
rf.score(X_train, y_train)
1.0
rf.score(X_test, y_test)
0.972027972027972
n_estimators#
This hyperparameter specifies the number of decision trees in the forest (100 by default).
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
RandomForestClassifier(n_estimators=10, random_state=0)
rf.score(X_train, y_train)
1.0
rf.score(X_test, y_test)
0.951048951048951
for ne in [10, 20, 40, 50, 100]:
    rf = RandomForestClassifier(n_estimators=ne, max_depth=2, random_state=0)
    rf.fit(X_train, y_train)
    print(f'Number of Trees: {ne} --- Training Score: {rf.score(X_train, y_train):.2f} --- Test Score: {rf.score(X_test, y_test):.2f}')
Number of Trees: 10 --- Training Score: 0.95 --- Test Score: 0.97
Number of Trees: 20 --- Training Score: 0.96 --- Test Score: 0.97
Number of Trees: 40 --- Training Score: 0.97 --- Test Score: 0.97
Number of Trees: 50 --- Training Score: 0.97 --- Test Score: 0.97
Number of Trees: 100 --- Training Score: 0.96 --- Test Score: 0.96
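Instead of judging the sweep on a single train/test split, the same candidate values can be cross-validated; a minimal sketch using GridSearchCV (the grid values are simply the ones tried above):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# 5-fold cross-validated sweep over the same candidate values as the loop above
grid = GridSearchCV(RandomForestClassifier(max_depth=2, random_state=0),
                    param_grid={'n_estimators': [10, 20, 40, 50, 100]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, f'{grid.best_score_:.2f}')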
Feature Importances#
The feature_importances_ attribute reports each feature's contribution to the model, measured as the total impurity reduction the feature achieves across all trees (averaged over the trees and normalized).
The sum of the importances for all features equals 1.
rf.feature_importances_
array([3.03146148e-02, 5.10500369e-03, 6.17712997e-02, 5.35757452e-02,
1.98074648e-03, 7.26183129e-03, 6.82845741e-02, 1.15938370e-01,
1.33644616e-03, 2.47844515e-04, 3.15481379e-02, 3.82948809e-03,
1.07409521e-02, 3.77591122e-02, 0.00000000e+00, 1.17547311e-04,
1.22963830e-03, 4.54687870e-03, 0.00000000e+00, 4.11727249e-04,
1.33369060e-01, 1.89391783e-03, 1.85689644e-01, 6.42969350e-02,
8.69161951e-03, 8.48533691e-03, 2.85407256e-02, 1.22046376e-01,
2.51496314e-03, 8.47146445e-03])
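As a quick sanity check on the sum-to-one claim, and to pair each importance with its feature name, a small sketch (rf is the last forest fitted above):
import numpy as np
names = load_breast_cancer().feature_names
print(np.sum(rf.feature_importances_))  # sums to 1 (up to floating-point rounding)
# the five most important features, largest first
for i in np.argsort(rf.feature_importances_)[::-1][:5]:
    print(f'{names[i]}: {rf.feature_importances_[i]:.3f}')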
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
plt.bar(load_breast_cancer().feature_names, rf.feature_importances_)
plt.xticks(rotation=90);
Random Forest Regressor#
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
RandomForestRegressor(random_state=0)
rf.score(X_train, y_train)
0.972774690316785
rf.score(X_test, y_test)
0.793952082698899
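Because every tree sees only a bootstrap sample of the training rows, the rows it never saw (its out-of-bag rows) provide a built-in validation estimate. A short sketch using scikit-learn's oob_score option (the variable name rf_oob is ours):
from sklearn.ensemble import RandomForestRegressor
# R^2 estimated on out-of-bag samples, with no separate validation set needed
rf_oob = RandomForestRegressor(oob_score=True, random_state=0)
rf_oob.fit(X_train, y_train)
print(rf_oob.oob_score_)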
n_estimators#
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
RandomForestRegressor(n_estimators=10, random_state=0)
rf.score(X_train, y_train)
0.9602653623018315
rf.score(X_test, y_test)
0.7750266107715603
for ne in [10, 20, 40, 50, 100]:
    rf = RandomForestRegressor(n_estimators=ne, random_state=0)
    rf.fit(X_train, y_train)
    print(f'Number of Trees: {ne} --- Training Score: {rf.score(X_train, y_train):.2f} --- Test Score: {rf.score(X_test, y_test):.2f}')
Number of Trees: 10 --- Training Score: 0.96 --- Test Score: 0.78
Number of Trees: 20 --- Training Score: 0.97 --- Test Score: 0.78
Number of Trees: 40 --- Training Score: 0.97 --- Test Score: 0.79
Number of Trees: 50 --- Training Score: 0.97 --- Test Score: 0.79
Number of Trees: 100 --- Training Score: 0.97 --- Test Score: 0.79
for ne in [10, 20, 40]:
    for md in [2, 3, 4, 5, 10]:
        rf = RandomForestRegressor(n_estimators=ne, max_depth=md, random_state=0)
        rf.fit(X_train, y_train)
        print(f'Number of Trees: {ne} --- Max Depth: {md} --- Training Score: {rf.score(X_train, y_train):.2f} --- Test Score: {rf.score(X_test, y_test):.2f}')
Number of Trees: 10 --- Max Depth: 2 --- Training Score: 0.47 --- Test Score: 0.45
Number of Trees: 10 --- Max Depth: 3 --- Training Score: 0.57 --- Test Score: 0.54
Number of Trees: 10 --- Max Depth: 4 --- Training Score: 0.64 --- Test Score: 0.60
Number of Trees: 10 --- Max Depth: 5 --- Training Score: 0.68 --- Test Score: 0.64
Number of Trees: 10 --- Max Depth: 10 --- Training Score: 0.86 --- Test Score: 0.76
Number of Trees: 20 --- Max Depth: 2 --- Training Score: 0.47 --- Test Score: 0.45
Number of Trees: 20 --- Max Depth: 3 --- Training Score: 0.57 --- Test Score: 0.54
Number of Trees: 20 --- Max Depth: 4 --- Training Score: 0.63 --- Test Score: 0.60
Number of Trees: 20 --- Max Depth: 5 --- Training Score: 0.68 --- Test Score: 0.64
Number of Trees: 20 --- Max Depth: 10 --- Training Score: 0.87 --- Test Score: 0.76
Number of Trees: 40 --- Max Depth: 2 --- Training Score: 0.47 --- Test Score: 0.44
Number of Trees: 40 --- Max Depth: 3 --- Training Score: 0.57 --- Test Score: 0.54
Number of Trees: 40 --- Max Depth: 4 --- Training Score: 0.64 --- Test Score: 0.60
Number of Trees: 40 --- Max Depth: 5 --- Training Score: 0.68 --- Test Score: 0.64
Number of Trees: 40 --- Max Depth: 10 --- Training Score: 0.87 --- Test Score: 0.76
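Note that with max_depth=2 the forest underfits (test score around 0.45) no matter how many trees it contains, while allowing a depth of 10 lifts the test score to about 0.76: here tree depth, not tree count, is the limiting factor.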
Feature Importances#
rf.feature_importances_
array([0.60126682, 0.05132849, 0.03458775, 0.01762482, 0.01736841,
0.1352393 , 0.07189581, 0.0706886 ])
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
plt.bar(fetch_california_housing().feature_names, rf.feature_importances_)
plt.xticks(rotation=90);