**Advantages of RF:**

- Little time is needed for tuning (the default parameters are usually good enough)
- Robust to outliers and correlated variables
- For continuous variables, it is able to split them into segments

**Method:**

- Create a bootstrapped dataset (sample with replacement)
- Create a decision tree using the bootstrapped dataset, but only use a random subset of variables at each split

  i.e. at each split, randomly draw a subset of the variables and consider only those as candidates for the split
- Repeat the steps above to build 100 trees
- Prediction: classify an observation into the class that gets the most votes across the 100 trees
- How to validate? Use the OOB (out-of-bag) samples, i.e. the observations left out of a tree's bootstrapped dataset

  i. Run each OOB observation through the trees that were built without it and see whether they classify it correctly; majority vote wins

  ii. The proportion of OOB samples that were incorrectly classified is the OOB error
- Now that we know how to validate RF, we can use this to choose the number of variables to consider in step 2: test different numbers of variables and compare the OOB error

*Note: typically start with sqrt(number of variables) and try a few values above and below it.*
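The tuning loop described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the synthetic dataset from `make_classification`, the candidate range around sqrt(number of variables), and the names `candidates`, `oob_errors`, and `best_m` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                           random_state=0)

# Start at sqrt(number of variables) and try a few values above and below it.
start = int(np.sqrt(X.shape[1]))
candidates = [m for m in range(start - 2, start + 3) if 1 <= m <= X.shape[1]]

oob_errors = {}
for m in candidates:
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    # oob_score_ is the OOB accuracy, so the OOB error is its complement.
    oob_errors[m] = 1 - rf.oob_score_

best_m = min(oob_errors, key=oob_errors.get)
for m, err in oob_errors.items():
    print(f"max_features={m}: OOB error={err:.3f}")
print("Lowest OOB error at max_features =", best_m)
```

Because the OOB error comes for free from the bootstrap, no separate validation split is needed for this comparison.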

### Steps for RF Application:

**a. Build Random Forest:**

- create dummies for categorical variables
- split into test and train
- train the model
- get accuracy and confusion matrix
- check for overfitting (by comparing the OOB error on train with the error on test)
  - if the OOB error and the test error are similar, the model is not overfitting
- (for fraud) build the ROC curve and look for a possible cut-off point
  - max(1 - class 1 error (false negative rate) - class 0 error (false positive rate))

**b. Plot variable importance for insights of each variable**

- Plot PDP (Partial Dependence Plots)
  - for insights on each level of each variable
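The cut-off rule in step a can be sketched with `sklearn.metrics.roc_curve`. This is a hedged example under stated assumptions: the imbalanced synthetic data is a stand-in for a fraud dataset, and `class1_error`, `class0_error`, and `score` are illustrative names, not part of any library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data as a stand-in for a fraud dataset (assumption).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_test, proba)
class1_error = 1 - tpr  # share of actual class 1 predicted as class 0
class0_error = fpr      # share of actual class 0 predicted as class 1

# Pick the threshold that maximizes 1 - class1_error - class0_error.
score = 1 - class1_error - class0_error
best = int(np.argmax(score))
print("Best cut-off:", thresholds[best])
print("class 1 error:", round(class1_error[best], 3),
      "class 0 error:", round(class0_error[best], 3))
```

Maximizing this quantity weights both error types equally; if a missed fraud case costs more than a false alarm, a different weighting would be a reasonable choice.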

#### a. Build Random Forest:

```
# useful packages
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# dummy variables for the categorical ones
data_dummy = pd.get_dummies(data, drop_first=True)
np.random.seed(1234)

# split into train and test to avoid overfitting
train, test = train_test_split(data_dummy, test_size=0.34)

# build the model
rf = RandomForestClassifier(n_estimators=100, max_features=3, oob_score=True)
rf.fit(train.drop('outcome_column', axis=1), train['outcome_column'])

# let's print OOB accuracy and confusion matrix
print(
    "OOB accuracy is", rf.oob_score_, "\n",
    "OOB Confusion Matrix", "\n",
    pd.DataFrame(confusion_matrix(train['outcome_column'],
                                  rf.oob_decision_function_[:, 1].round(),
                                  labels=[0, 1]))
)

# and let's print test accuracy and confusion matrix
print(
    "Test accuracy is",
    rf.score(test.drop('outcome_column', axis=1), test['outcome_column']), "\n",
    "Test Set Confusion Matrix", "\n",
    pd.DataFrame(confusion_matrix(test['outcome_column'],
                                  rf.predict(test.drop('outcome_column', axis=1)),
                                  labels=[0, 1]))
)
```

If the OOB accuracy (computed on train) and the test accuracy are similar, we are not overfitting.

###### Question 1: If the response variable is continuous, how to define accuracy?

###### Answer: Change the accuracy standard

If the prediction is within 25% of the actual value, we say the model is predicting correctly.

E.g. if a given person's salary is 100K, we consider the prediction correct if it is within 25K of that value, i.e. between 75K and 125K.

We can look at this as a sort of accuracy when the label is continuous.

```
accuracy_25pct = ((rf.predict(test.drop('outcome_column',
axis=1))/test['outcome_column']-1).abs()<.25).mean()
print("We are within 25% of the actual outcome in ",
accuracy_25pct.round(2)*100, "% of the cases", sep="")
```

###### Question 2: How to know whether the model is actually learning things?

###### Answer: (if it is, insights generated from RF will be fairly reliable, and at least directionally true)

```
# deciles of the outcome
print(np.percentile(train['outcome_column'], np.arange(0, 100, 10)))
```
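One way to make this check concrete is to compare the model against a naive baseline on the same within-25% measure. The sketch below is an assumption about how such a comparison could look, not the method prescribed by these notes: the synthetic regression data, the median baseline, and the helper `within_25pct` are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic continuous outcome as a stand-in for real data (assumption).
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
y = y - y.min() + 50.0  # shift to strictly positive so relative error is defined

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0)

def within_25pct(pred, actual):
    """Share of predictions that land within 25% of the actual value."""
    return float(np.mean(np.abs(pred / actual - 1) < 0.25))

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_acc = within_25pct(rf.predict(X_test), y_test)

# Naive baseline: always predict the median of the training outcome.
baseline_pred = np.full_like(y_test, np.median(y_train))
baseline_acc = within_25pct(baseline_pred, y_test)

print("Deciles of the outcome:", np.percentile(y_train, np.arange(0, 100, 10)).round(1))
print("RF within-25% accuracy:", round(rf_acc, 2))
print("Median-baseline within-25% accuracy:", round(baseline_acc, 2))
```

If the model is learning, its score should sit clearly above the baseline; if the two are close, the apparent accuracy mostly reflects how concentrated the outcome is rather than anything the model picked up.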

#### b. Variable Importance (check each variable)

##### Note:

- Variable importance is useful for deciding whether a rebuild is needed.
  - Is the most important variable also the least actionable (e.g. total pages visited by a user)?
  - If so, drop that variable from the data and refit the RF.
- Continuous variables tend to rank as very important in RF, so if a categorical variable sits at the top of the importance list, it is a really important variable.
- If two variables are likely correlated, check with the Pearson correlation. RF tends not to pick up the same information twice, so it is fairly robust to correlated variables, which is one reason it is so popular.

```
# Var Imp
# rf is the random forest model we previously built
feat_importances = pd.Series(rf.feature_importances_,
                             index=train.drop('outcome_column', axis=1).columns)
feat_importances.sort_values().plot(kind='barh')

# PDP for column1, which has 3 levels ('level1', 'level2', 'level3')
from pdpbox import pdp, info_plots

pdp_iso = pdp.pdp_isolate(
    model=rf,
    dataset=train.drop(['outcome_column'], axis=1),
    model_features=list(train.drop(['outcome_column'], axis=1)),
    feature=['level1', 'level2', 'level3'],
    num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='column1')

# Pearson Corr
from scipy.stats import pearsonr

print("Correlation between A and B is:", round(pearsonr(data.A, data.B)[0], 2))
```
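To show what a partial dependence plot actually computes, here is a small hand-rolled sketch that does not depend on `pdpbox`. It is an illustration under assumptions: the synthetic data, the column names `x0` to `x4`, and the helper `partial_dependence_1d` are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up column names (assumption for illustration).
X_arr, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(5)])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average predicted probability of class 1 when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return pd.Series(averages, index=grid, name=feature)

# Evaluate on 10 quantiles of x0 so the grid stays inside the observed range.
grid = np.quantile(X["x0"], np.linspace(0.05, 0.95, 10))
pdp_x0 = partial_dependence_1d(rf, X, "x0", grid)
print(pdp_x0.round(3))
```

Each point is the model's average prediction with one variable held at a fixed value and all others left as observed, which is why the plot reads as the marginal effect of that variable.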