Characteristics of Random Forests -- Excerpt Notes

Excerpted from "Random Forests explained intuitively"

Link: https://www.datasciencecentral.com/profiles/blogs/random-forests-explained-intuitively


Why is it called random then?

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  1. At the row level: each decision tree is trained on a random sample of the training data (say 10%), i.e., on 100 rows chosen at random from the 1,000 rows. Because every tree sees a different random subset of rows, the trees differ from one another in the predictions they make.

  2. At the column level: the second level of randomness is introduced at the column level. Not all columns are passed to each decision tree. Say we want only 10% of the columns sent to each tree: 3 randomly selected columns out of the 30 go to each tree. For the first decision tree, perhaps columns C1, C2, and C4 are chosen; the next tree gets C4, C5, and C10, and so on. Both levels of sampling are sketched in code after this list.
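A minimal sketch of these two levels, assuming NumPy and the 10% fractions used above. Note that standard random forest implementations actually bootstrap rows with replacement and re-draw candidate columns at every split rather than once per tree; this follows the article's simplified per-tree description.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_rows, n_cols = 1000, 30        # dataset shape from the example above
row_frac = col_frac = 0.10       # 10% of rows and 10% of columns per tree

for tree_id in range(3):
    # Row-level randomness: 100 of the 1,000 rows, chosen at random
    rows = rng.choice(n_rows, size=int(n_rows * row_frac), replace=False)
    # Column-level randomness: 3 of the 30 columns, chosen at random
    cols = rng.choice(n_cols, size=int(n_cols * col_frac), replace=False)
    names = ", ".join(f"C{c + 1}" for c in sorted(cols))
    print(f"tree {tree_id}: {len(rows)} rows, columns {names}")
```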

When is a random forest a poor choice relative to other algorithms?

  1. Random forests don't train well on smaller datasets, as they fail to pick up on the pattern. To simplify, say we know that 1 pen costs INR 1, 2 pens cost INR 2, and 3 pens cost INR 3. In this case, linear regression will easily estimate the cost of 4 pens, but a random forest will fail to come up with a good estimate (see the extrapolation sketch after this list).

  2. There is a problem of interpretability with random forests. You can't see or understand the relationship between the response and the independent variables: a random forest is a predictive tool, not a descriptive one. You do get variable importances, but these may not suffice in analyses where the objective is to understand the relationship between the response and the independent features.

  3. The time taken to train random forests can be very long, since you train multiple decision trees. Also, in the case of a categorical variable, the time complexity increases exponentially: for a categorical column with n levels, a random forest tries 2^(n-1) - 1 candidate binary splits to find the best split point (enumerated in the snippet after this list). However, with the power of H2O we can now train random forests pretty fast. You may want to read about H2O at H2O in R explained.

  4. In the case of a regression problem, the range of values the response variable can take is determined by the values already present in the training data. Unlike linear regression, decision trees, and hence random forests, can't predict values outside the range seen in training.
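A minimal sketch of the pens example from point 1 (which is the same limitation as point 4), assuming scikit-learn; the toy data is just the linear pen prices above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Training data: 1 pen -> INR 1, 2 pens -> INR 2, 3 pens -> INR 3
X_train = np.array([[1], [2], [3]])
y_train = np.array([1, 2, 3])

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

X_new = np.array([[4]])       # 4 pens: outside the training range
print(lin.predict(X_new))     # ~4.0: the linear trend extrapolates
print(rf.predict(X_new))      # capped near 3: a tree's leaves can only
                              # return values seen in the training targets
```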
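And a small standard-library snippet enumerating the 2^(n-1) - 1 binary splits of a categorical column mentioned in point 3; binary_splits is a hypothetical helper written purely for illustration:

```python
from itertools import combinations

def binary_splits(levels):
    """Enumerate all ways to split a set of categorical levels into two
    non-empty groups, where swapping the two groups doesn't count twice."""
    splits = []
    # Pin the first level to the "left" group so mirrored splits aren't repeated.
    first, rest = levels[0], levels[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = (first,) + combo
            right = tuple(l for l in levels if l not in left)
            if right:  # both sides must be non-empty
                splits.append((left, right))
    return splits

levels = ("A", "B", "C", "D")   # n = 4 levels
print(len(binary_splits(levels)))  # 7 == 2**(4 - 1) - 1
```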

What are the advantages of using random forest?

  1. Since we are averaging multiple decision trees, the bias remains the same as that of a single decision tree, while the variance decreases, so we reduce the chances of overfitting (illustrated in the sketch after this list). I have explained bias and variance intuitively at The curse of bias and variance.

  2. When all you care about is predictions and you want a quick and dirty way out, random forests come to the rescue. You don't have to worry much about the assumptions of the model or linearity in the dataset.
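A minimal sketch of that variance reduction, assuming scikit-learn and a made-up noisy sine dataset: a single unpruned tree memorizes the training noise (near-zero train error, high test error), while averaging many trees shrinks the test error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=500)  # noisy signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Same family of models, so similar bias; averaging lowers the variance.
for name, model in [("single tree", tree), ("forest", forest)]:
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```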
