Random Forest — Business Insights

Draw Business Insights from RF

1. Variable importance (Var Imp):

  • Look at the ranking of the important variables. If the top ones are the least actionable, meaning the company cannot change them, drop them and rebuild the RF.
  • Check whether the top variables are continuous or categorical:
    • Continuous variables tend to show up at the top of RF variable importance plots, because impurity-based importance favors features with many possible split points.
    • If a categorical variable shows up at the top, it usually means it’s really important.
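One way to check whether a continuous variable's high rank is partly an artifact is to compare the forest's built-in impurity-based importance with permutation importance on held-out data. The sketch below is a minimal illustration on synthetic data: the feature names (`noise_score`, `is_member`) and the data-generating process are made up for the example, and permutation importance is an extra check, not something these notes prescribe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: a continuous column of pure noise and an informative binary flag
noise_score = rng.normal(size=n)
is_member = rng.integers(0, 2, size=n)

# Outcome follows the binary flag, with 10% of the labels flipped
flip = rng.random(n) < 0.10
outcome = np.where(flip, 1 - is_member, is_member)

X = np.column_stack([noise_score, is_member])
feature_names = ["noise_score", "is_member"]
X_train, X_test, y_train, y_test = train_test_split(
    X, outcome, test_size=0.3, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance: computed from the training data only
impurity_imp = dict(zip(feature_names, rf.feature_importances_))

# Permutation importance: drop in test accuracy when one column is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = dict(zip(feature_names, perm.importances_mean))

for name in feature_names:
    print(f"{name}: impurity={impurity_imp[name]:.3f}, permutation={perm_imp[name]:.3f}")
```

If a feature ranks high on impurity-based importance but sits near zero on permutation importance, its position in the plot is probably inflated.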

2. Partial dependence plot (PDP):

  • For categorical features with multiple levels:
    • Always remember there is a base level (the level dropped during dummy encoding) that is not plotted.
    • If all plotted levels are strongly positive, they all have higher outcome values than the base level, so the base level has the lowest outcome value.
  • For binary features:
    • The plot is usually straightforward.
  • For continuous features:
    • Check the trend and look for a cut point that divides it into segments.
    • E.g. people with an income (feature) above 70k tend to have a higher success rate (outcome value).
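The "check the trend, then make a division" step for a continuous feature can be sketched by computing the partial dependence by hand: force the feature to each value on a grid, average the model's predictions, and compare the two sides of a candidate cut point. Everything in this example is an assumption for illustration: the synthetic data, the `income` and `age` columns, the 70k cut point, and the helper `partial_dependence_1d`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 3000

# Hypothetical data: success is more likely above 70k income; age is irrelevant
income = rng.uniform(20_000, 120_000, size=n)
age = rng.uniform(18, 70, size=n)
p_success = np.where(income > 70_000, 0.6, 0.2)
outcome = (rng.random(n) < p_success).astype(int)

X = np.column_stack([income, age])
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0)
rf.fit(X, outcome)

def partial_dependence_1d(model, X, col, grid):
    """Average predicted P(class 1) when column `col` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

grid = np.linspace(30_000, 110_000, 9)  # 30k, 40k, ..., 110k
pd_income = partial_dependence_1d(rf, X, col=0, grid=grid)

for value, avg in zip(grid, pd_income):
    print(f"income={value:>9,.0f}  avg predicted success={avg:.2f}")

# Summarize the division: compare the curve on either side of the cut point
below = pd_income[grid <= 70_000].mean()
above = pd_income[grid > 70_000].mean()
print(f"below 70k: {below:.2f}, above 70k: {above:.2f}")
```

On real data the cut point is read off the plotted curve instead of being known in advance.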

3. Build a simple DT to check 2 or 3 important segments

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

# Separate the features from the target once and reuse them
X_train = train.drop(['outcome'], axis=1)
y_train = train['outcome']

# Shallow tree; class_weight makes class 1 count 10x as much as class 0
tree = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 10}, min_impurity_decrease=0.001)
tree.fit(X_train, y_train)

# Visualize it: write the tree to a .dot file, then render that same file
export_graphviz(tree, out_file="tree_conversion.dot", feature_names=list(X_train.columns), proportion=True, rotate=True)
s = Source.from_file("tree_conversion.dot")
s.view()
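If Graphviz is not installed, a shallow tree like this can also be inspected as plain text with `export_text` from `sklearn.tree`. The sketch below uses synthetic data in place of `train`; the column names `visits` and `age` and the rule that generates `outcome` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-in for `train`: outcome is driven by `visits`, not by `age`
visits = rng.integers(0, 20, size=n)
age = rng.integers(18, 70, size=n)
outcome = (visits > 12).astype(int)

X = np.column_stack([visits, age])
feature_names = ["visits", "age"]

# Same tree settings as in the snippet above
tree = DecisionTreeClassifier(
    max_depth=2, class_weight={0: 1, 1: 10}, min_impurity_decrease=0.001
)
tree.fit(X, outcome)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each branch in the printed rules corresponds to one candidate segment.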
  • Each node shows up to 4 values (leaf nodes have no split):
    • The split condition
    • Gini index:
      • Represents the impurity of the node; for two classes, 0.5 is the worst value
      • 0 means perfect classification, the best possible value
      • It’s the probability that a randomly chosen sample in the node would be labeled incorrectly if it were labeled at random according to the class distribution in the node
      • The weighted average Gini impurity decreases as we move down the tree
    • Samples:
      • Proportion of all observations that fall into that node (a percentage because proportion=True), the higher the better
      • A high value means the node is important because it captures many people
    • Value:
      • Proportion of class 0 and class 1 observations in the node (weighted by class_weight when it is set)
  • If a variable appears in the tree at every level:
    • It probably carries information about the other features as well.
    • Note: often one variable is far more important than the others. This tends to happen because it’s highly correlated with the other variables. Try to get to the bottom of the relationship between the most important variable and the others, or remove that feature and see which variables start to matter.
    • Plot the important feature against the outcome to investigate how the outcome is influenced by that feature.
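The Gini values shown in each node can be reproduced by hand as 1 minus the sum of squared class proportions. The sketch below is a small illustration in plain Python; the function names and the class counts are made up for the example.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def weighted_gini(children):
    """Sample-weighted average Gini impurity over a list of child nodes."""
    total = sum(sum(counts) for counts in children)
    return sum(sum(counts) / total * gini_impurity(counts) for counts in children)

parent = [50, 50]                  # evenly mixed node: worst case for two classes
left, right = [40, 10], [10, 40]   # hypothetical split of the parent

print(gini_impurity(parent))         # evenly mixed node
print(gini_impurity([100, 0]))       # pure node
print(weighted_gini([left, right]))  # weighted average after the split
```

The evenly mixed node gives 0.5, the pure node gives 0, and the weighted average after the split is about 0.32, lower than the parent's 0.5, which is the decrease described above.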
