Random Forest — Method and Application (Python)

Advantages of RF:

  1. Little time is needed for optimization (the default params are usually good enough)
  2. Robust to outliers and correlated variables
  3. Handles continuous variables by splitting them into segments


How RF works:

  1. Create a bootstrapped dataset (sample rows with replacement)
  2. Create a decision tree using the bootstrapped dataset
    But only consider a random subset of variables at each split,
    i.e. at each split, a random subset of the variables is drawn
    and the best split is chosen among those only
  3. Repeat steps 1 and 2 to grow e.g. 100 trees
  4. Prediction:
    classify an observation into the class that gets the most votes across the 100 trees
  5. How to validate? OOB (out-of-bag): the observations left out of a tree's bootstrapped dataset
    i. Run each OOB observation through the trees that did not see it and check whether they classify it correctly; majority vote wins
    ii. The proportion of OOB observations that are incorrectly classified is the OOB error
  6. Now that we know how to validate RF, we can use it to choose the number of variables to consider in step 2: try different numbers of variables and compare the OOB error
    Note: typically start with sqrt(num of var) and try a few values above and below it.
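Steps 1 to 5 above can be sketched by hand. This is only a minimal illustration, assuming synthetic data from make_classification and scikit-learn's DecisionTreeClassifier as the base tree; the names oob_votes and forest_pred are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration (an assumption, not from the notes)
X, y = make_classification(n_samples=500, n_features=9, n_informative=4,
                           random_state=0)
n_obs, n_trees = len(X), 100
rng = np.random.default_rng(0)

# oob_votes[i, c] counts trees that did NOT train on row i and voted class c
oob_votes = np.zeros((n_obs, 2), dtype=int)
trees = []

for t in range(n_trees):
    # Step 1: bootstrapped dataset (rows drawn with replacement)
    boot_idx = rng.integers(0, n_obs, size=n_obs)
    # Step 2: tree that considers sqrt(num of var) random variables per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[boot_idx], y[boot_idx])
    trees.append(tree)
    # Step 5: rows never drawn are out-of-bag for this tree
    oob_mask = np.ones(n_obs, dtype=bool)
    oob_mask[boot_idx] = False
    oob_idx = np.flatnonzero(oob_mask)
    oob_votes[oob_idx, tree.predict(X[oob_idx])] += 1

# Step 4: forest prediction = majority vote over all trees
all_votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_obs)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

# OOB error: share of rows whose OOB majority vote is wrong
has_vote = oob_votes.sum(axis=1) > 0
oob_error = np.mean(oob_votes.argmax(axis=1)[has_vote] != y[has_vote])
print("OOB error:", round(float(oob_error), 3))
```

Note that scikit-learn's RandomForestClassifier averages the predicted class probabilities of the trees rather than counting hard votes, so its numbers can differ slightly from this sketch.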

Steps for RF Application:

  • a. Build Random Forest:
    1. create dummies for categorical vars
    2. split into test and train
    3. train model
    4. get accuracy and confusion matrix
    5. check for overfitting
      • (by comparing the OOB error rate on train with the error rate on test)
    6. if OOB error and test error are similar, not overfitting
    7. (for Fraud) build ROC and look for a possible cut-off point
      • max((1 - class1 error) - class0 error), i.e. true positive rate - false positive rate
  • b. Plot variable importance for insights into each variable
    1. Plot PDP (Partial Dependence Plots) for insights into each level of each variable
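The ROC cut-off in step 7 could look roughly like this. A minimal sketch, reading the criterion as maximizing true positive rate minus false positive rate; the imbalanced synthetic dataset and all variable names here are assumptions for illustration, not the data from these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for a fraud dataset (assumption)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Predicted probability of class 1 on the test set
prob_1 = rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, prob_1)

# (1 - class1 error) - class0 error is the same quantity as TPR - FPR
best = int(np.argmax(tpr - fpr))
cutoff = thresholds[best]
print("cut-off:", cutoff, "TPR:", tpr[best], "FPR:", fpr[best])

# Classify with the chosen cut-off instead of the default 0.5
pred_at_cutoff = (prob_1 >= cutoff).astype(int)
```

For fraud, the chosen cut-off is usually well below 0.5, since catching more class 1 cases is worth a few more false alarms.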

a. Build Random Forest:

# useful packages
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
#dummy variables for the categorical ones
data_dummy = pd.get_dummies(data, drop_first=True)
#split into train and test to avoid overfitting
train, test = train_test_split(data_dummy , test_size = 0.34)
#build the model
rf = RandomForestClassifier(n_estimators=100, max_features=3, oob_score=True)
rf.fit(train.drop('outcome_column', axis=1), train['outcome_column'])
#let's print OOB accuracy and confusion matrix
print("OOB accuracy is", rf.oob_score_, "\n",
      "OOB Confusion Matrix", "\n",
      pd.DataFrame(confusion_matrix(train['outcome_column'],
                                    rf.oob_decision_function_[:, 1].round(),
                                    labels=[0, 1])))
#and let's print test accuracy and confusion matrix
print("Test accuracy is", rf.score(test.drop('outcome_column', axis=1),
                                   test['outcome_column']), "\n",
      "Test Set Confusion Matrix", "\n",
      pd.DataFrame(confusion_matrix(test['outcome_column'],
                                    rf.predict(test.drop('outcome_column', axis=1)),
                                    labels=[0, 1])))

Since the OOB accuracy (train) and the test accuracy are similar, we are not overfitting.
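The model above fixes max_features=3. Choosing that number by comparing OOB error, starting around sqrt(num of var), might be sketched like this; the synthetic data and the candidate range of plus or minus 2 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 variables, only for illustration (assumption)
X, y = make_classification(n_samples=1000, n_features=16, n_informative=5,
                           random_state=0)

start = int(np.sqrt(X.shape[1]))  # sqrt(num of var) = 4 here
candidates = [start - 2, start - 1, start, start + 1, start + 2]

oob_error = {}
for m in candidates:
    rf_m = RandomForestClassifier(n_estimators=100, max_features=m,
                                  oob_score=True, random_state=0)
    rf_m.fit(X, y)
    # oob_score_ is the OOB accuracy, so the OOB error is its complement
    oob_error[m] = 1 - rf_m.oob_score_

best_m = min(oob_error, key=oob_error.get)
print(oob_error)
print("max_features with lowest OOB error:", best_m)
```

Differences between neighbouring values are often small, which fits the point that the defaults are usually good enough.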

Question 1: If the response variable is continuous, how do we define accuracy?
Answer: Change the standard for what counts as a correct prediction.

If the prediction is within 25% of the actual value, we say the model predicted right.
E.g. if a given person's salary is 100K,
we consider the prediction correct if it is off by less than 25K.
We can look at this as a sort of accuracy when the label is continuous.

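# rf here must be a regressor (e.g. RandomForestRegressor) fit on a continuous outcome
accuracy_25pct = ((rf.predict(test.drop('outcome_column', axis=1)) /
                   test['outcome_column'] - 1).abs() < 0.25).mean()
print("We are within 25% of the actual outcome in ",
      round(accuracy_25pct * 100), "% of the cases", sep="")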
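
Question 2: How do we know the model is actually learning something?
Answer: Check how widely the outcome itself is spread. If its deciles cover a wide range and the model is still within 25% most of the time, it is learning (if so, insights generated from RF will be fairly reliable, and very likely directionally true).

# deciles of the outcome in the training set
print(np.percentile(train['outcome_column'], np.arange(0, 100, 10)))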
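One hedged way to make the "is it learning?" check concrete is to compare the model's within-25% accuracy against a naive guess that always predicts the training median. The naive baseline, the made-up salary-like data (columns years and level) and the helper within_25pct are all assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Made-up salary-like data in K (assumption, not the data from these notes)
rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({"years": rng.uniform(0, 30, n),
                     "level": rng.integers(0, 4, n)})
data["outcome_column"] = (40 + 3 * data["years"] + 25 * data["level"]
                          + rng.normal(0, 5, n))

train, test = train_test_split(data, test_size=0.34, random_state=0)
X_train, y_train = train.drop("outcome_column", axis=1), train["outcome_column"]
X_test, y_test = test.drop("outcome_column", axis=1), test["outcome_column"]

rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
rf_reg.fit(X_train, y_train)

def within_25pct(pred, actual):
    """Share of predictions that are within 25% of the actual value."""
    return float((np.abs(pred / actual - 1) < 0.25).mean())

# Deciles show how widely the outcome is spread
print(np.percentile(y_train, np.arange(0, 100, 10)))

model_acc = within_25pct(rf_reg.predict(X_test), y_test)
naive_acc = within_25pct(np.full(len(y_test), y_train.median()), y_test)
print("model:", round(model_acc, 2), "naive median guess:", round(naive_acc, 2))
```

If the model is not clearly ahead of the naive guess, the insights drawn from it should be treated with caution.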

b. Variable Importance (check each variable)

  1. Var Imp is useful to decide whether a rebuild is needed.
    Is the most important var also the least actionable? (e.g. total pages visited by user)
    • if so, drop that variable from the data and refit RF
  2. Continuous variables tend to rank high in RF because they offer many possible split points. If a categorical variable is at the top of the importance list, it is a really important variable.
  3. If two variables are likely correlated, check with the Pearson correlation. Once a tree has split on one of them, the other adds little new information, so RF is robust to correlated variables; that is one reason it is so popular. Their importance may be shared between the two, though.
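
# Var Imp
# rf is the random forest model we previously built
feat_importances = pd.Series(rf.feature_importances_,
                             index=train.drop('outcome_column', axis=1).columns)
# plot the importances, most important variable on top
feat_importances.sort_values().plot(kind='barh')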

# PDP for column1, which has 3 levels ('level1', 'level2', 'level3')
# note: check that your installed pdpbox version still provides pdp_isolate
# feature must list the dummy columns of column1 exactly as they are named in train
from pdpbox import pdp, info_plots
pdp_iso = pdp.pdp_isolate(model=rf,
                          dataset=train.drop(['outcome_column'], axis=1),
                          model_features=list(train.drop(['outcome_column'], axis=1)),
                          feature=['level1', 'level2', 'level3'])
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='column1')
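To see what a PDP actually computes, here is a sketch that does it by hand without pdpbox: force every row to one level of column1, predict, and average the predicted probability. The synthetic data, the noise column x and the helper pdp_for_level are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the outcome is more likely for level3 than for level1 (assumption)
rng = np.random.default_rng(0)
n = 1500
raw = pd.DataFrame({
    "column1": rng.choice(["level1", "level2", "level3"], size=n),
    "x": rng.normal(size=n),  # pure noise variable
})
base_rate = raw["column1"].map({"level1": 0.1, "level2": 0.4, "level3": 0.8})
raw["outcome_column"] = (rng.uniform(size=n) < base_rate).astype(int)

# Same preparation as in the notes: dummies with the first level dropped
X = pd.get_dummies(raw.drop("outcome_column", axis=1), drop_first=True).astype(float)
y = raw["outcome_column"]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

dummy_cols = [c for c in X.columns if c.startswith("column1_")]

def pdp_for_level(level):
    """Average predicted P(class 1) when every row is forced to `level`."""
    X_mod = X.copy()
    X_mod[dummy_cols] = 0.0          # all dummies at 0 means the dropped level1
    col = "column1_" + level
    if col in X_mod.columns:
        X_mod[col] = 1.0
    return float(rf.predict_proba(X_mod)[:, 1].mean())

pdp_values = pd.Series({lvl: pdp_for_level(lvl)
                        for lvl in ["level1", "level2", "level3"]})
print(pdp_values.sort_values(ascending=False))
```

The dropped first level shows up as the case where all dummies are 0, which is why it still gets its own PDP value.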

# Pearson Corr
from scipy.stats import pearsonr
print("Correlation between A and B is:", 
      round(pearsonr(data.A, data.B)[0],2))
