Random forest is an ensemble learning algorithm of the Bagging type. By combining multiple weak classifiers and aggregating their outputs by voting or averaging, the overall model achieves high accuracy and good generalization performance. Its effectiveness comes mainly from "random" and "forest": the randomness makes it resistant to over-fitting, and the forest of trees makes it more accurate.
Bagging is an ensemble technique that draws k new data sets from the original data set by sampling with replacement and trains a classifier on each. The trained classifiers are then used to classify new samples, and the outputs of all classifiers are combined by majority voting or averaging; the category with the most votes is the final label. This type of algorithm can effectively reduce variance.
Bootstrap method: bootstrap resampling draws a fixed number of samples from the training set, but each sample is put back after it is drawn. In other words, a sample that has already been drawn may be drawn again later.
OOB: in each round of Bagging's random sampling, approximately 36.8% of the training set is not drawn into the sampling set. This portion of uncollected data is called out-of-bag data (OOB for short). It does not participate in fitting the model, so it can be used to test the model's generalization ability.
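To make the bootstrap and OOB ideas concrete, here is a minimal sketch, not from the original text, using NumPy with an arbitrary sample size: it draws one bootstrap sample and checks that the fraction of rows never drawn is close to (1 − 1/n)^n ≈ 1/e ≈ 36.8%.

```python
# Minimal sketch (illustrative assumption): one bootstrap round and its OOB fraction.
import numpy as np

rng = np.random.RandomState(42)
n = 10_000                       # arbitrary training-set size for the demo
indices = np.arange(n)

# Draw a bootstrap sample: n draws with replacement from the n original rows.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Rows never drawn are the out-of-bag samples for this tree.
oob_mask = ~np.isin(indices, bootstrap_idx)
print(oob_mask.mean())           # empirically close to 0.368
print((1 - 1 / n) ** n)          # theoretical value, approx 1/e = 0.3679
```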
Randomness: for the Bagging algorithm, samples are generally collected randomly via bootstrap. Each tree draws the same number of samples, generally smaller than the original sample size, so the content of each sampled set is different. Through this bootstrap method, K classification trees are generated to form a random forest, which provides the randomness in the samples.
The aggregation strategy for Bagging's output is also relatively simple. For classification problems, a simple voting method is usually used, and the category receiving the most votes is the final model output. For regression problems, the simple averaging method is usually used: the regression results of the T weak learners are arithmetically averaged to obtain the final model output.
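The following is a small sketch of the two aggregation strategies described above; the prediction matrices are made-up numbers, not from the original text, and serve only to show majority voting versus arithmetic averaging over T weak learners.

```python
# Minimal sketch (assumed toy data): aggregate the outputs of T weak learners.
import numpy as np

# Hypothetical class predictions of T = 5 learners for 4 samples (rows = learners).
class_preds = np.array([
    [0, 1, 1, 2],
    [0, 1, 0, 2],
    [1, 1, 1, 2],
    [0, 0, 1, 2],
    [0, 1, 1, 1],
])

# Hypothetical regression predictions of T = 3 learners for 3 samples.
reg_preds = np.array([
    [2.1, 3.0, 5.2],
    [1.9, 3.4, 4.8],
    [2.0, 2.9, 5.1],
])

# Classification: for each sample, the label receiving the most votes wins.
majority_vote = np.array([np.bincount(votes).argmax() for votes in class_preds.T])
print(majority_vote)             # [0 1 1 2]

# Regression: simple arithmetic average of the learners' outputs.
print(reg_preds.mean(axis=0))    # [2.0, 3.1, 5.033...]
```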
Weak classifier: first, RF uses the CART decision tree as its weak learner. In other words, a random forest is simply the Bagging method with CART decision trees as the weak learners.
Randomness: at the same time, when building each tree, only a randomly selected subset of the features is considered at each split rather than all of them; by default, the square root of the total number of features m is taken. An ordinary CART tree considers all features when modeling. Therefore, not only are the samples random, but the randomness of the features is also guaranteed.
Sample size: compared with the general Bagging algorithm, RF chooses to collect the same number of samples as the number N of training set samples.
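As one possible reading of these two points in code, here is a hedged sketch using scikit-learn's RandomForestClassifier on a synthetic data set; the data set, parameter values, and the mapping to scikit-learn parameters are my own assumptions, not something stated in the original text.

```python
# Minimal sketch (assumed scikit-learn usage): sample and feature randomness in RF.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # K trees in the forest
    bootstrap=True,        # draw rows with replacement for each tree
    max_samples=None,      # None -> each bootstrap sample has N rows
    max_features="sqrt",   # roughly sqrt(m) candidate features per split
    random_state=0,
).fit(X, y)

print(rf.score(X, y))
```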
Characteristics: thanks to the randomness, RF is very effective at reducing the variance of the model, so a random forest generally does not require additional pruning and still achieves good generalization ability and resistance to over-fitting (low variance). Of course, its fit to the training set will be slightly worse, i.e. the bias of the model will be somewhat larger (higher bias), but only relatively so.
In the original paper on random forests, it was shown that the random forest error rate depends on two things:
(1) The correlation between any two trees in the forest: increasing the correlation increases the forest error rate.
(2) The strength of each individual tree in the forest (a tree with a low error rate is a strong classifier): increasing the strength of the individual trees (more accurate classification) decreases the forest error rate.
The weak classifier of random forest is the CART tree; the CART decision tree is also known as the classification and regression tree.
When the dependent variable of the data set is continuous, the tree is a regression tree, and the mean of the observations in a leaf node is used as the predicted value; when the dependent variable is discrete, the tree is a classification tree, which handles classification problems well. Note, however, that CART builds a binary tree: each non-leaf node produces exactly two branches. Consequently, when a non-leaf node splits on a multi-level (more than two levels) discrete variable, that variable may be used several times; and when a non-leaf node splits on a continuous variable, the decision tree also treats it as if it were discrete (that is, it divides among the finite observed values).
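The "leaf mean" behaviour of a regression tree can be checked directly. The sketch below is my own illustration with synthetic data and arbitrary parameters; it uses scikit-learn's DecisionTreeRegressor to verify that every sample routed to the same leaf receives the same prediction, namely the mean of the training targets in that leaf.

```python
# Minimal sketch (assumed example): a regression tree predicts the leaf mean.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# A shallow regression tree: each leaf covers an interval of x.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Samples mapped to the same leaf all get the same prediction,
# equal to the mean of the training targets that fell into that leaf.
leaf_ids = tree.apply(X)
mask = leaf_ids == leaf_ids[0]
print(np.allclose(tree.predict(X[mask]), y[mask].mean()))  # True
```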
Currently, the more popular criteria for feature selection are information gain, gain ratio, the Gini index, and the chi-square test. Here we mainly introduce feature selection based on the Gini index, because the CART decision tree used by random forest selects features based on the Gini index.
The criterion for selecting a split with the Gini index is that each child node reaches the highest purity, i.e. all observations falling in the child node belong to the same category; at that point the Gini index is smallest, the purity is highest, and the uncertainty is smallest. For a general decision tree with K classes in total, if the probability that a sample belongs to class k is p_k, then the Gini index of the probability distribution is:

Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²
The larger the Gini index, the greater the uncertainty; the smaller the Gini index, the smaller the uncertainty, and the cleaner and more thorough the resulting data split.
For the CART tree, since it is a binary tree, this can be written as:

Gini(p) = 2p(1 − p)

where p is the probability that a sample belongs to the first class.
When we traverse each split point of each feature, using feature A = a we divide D into two parts, namely D1 (the samples satisfying A = a) and D2 (the samples not satisfying A = a). Then, under the condition of feature A = a, the Gini index of D is:

Gini(A, D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
Gini(D): represents the uncertainty of set D.
Gini(A, D): represents the uncertainty of the set D after being divided by A=a.
Each CART decision tree in the random forest keeps traversing all possible split points of its feature subset, finds the feature and split point with the smallest Gini index, and divides the data set into two subsets, repeating until the stopping condition is met.
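The split search just described can be sketched in a few lines. This is my own simplified illustration (one continuous feature, a threshold split A ≤ a instead of the equality form above, and a tiny made-up data set), not the library implementation: it computes Gini(D), the weighted Gini(A, D) for each candidate threshold, and keeps the smallest.

```python
# Minimal sketch (own illustration): scoring CART-style splits with the Gini index.
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions p_k in D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def gini_after_split(feature: np.ndarray, labels: np.ndarray, a: float) -> float:
    """Gini(A, D): weighted Gini of the subsets D1 (A <= a) and D2 (A > a)."""
    left, right = labels[feature <= a], labels[feature > a]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(feature: np.ndarray, labels: np.ndarray):
    """Traverse candidate thresholds of one feature, keep the minimum-Gini split."""
    candidates = np.unique(feature)[:-1]     # splitting at the max leaves D2 empty
    scores = [gini_after_split(feature, labels, a) for a in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Tiny hypothetical example: one feature, binary labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0,   0,   0,   1,   1,   1])
print(gini(y))            # 0.5
print(best_split(x, y))   # threshold 3.0 with Gini 0.0, a perfect split
```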
First of all, as mentioned in the introduction to Bagging, the features used by each tree are randomly drawn from all m features, which in itself already reduces the risk and tendency of over-fitting. The model will not be determined by any specific feature or combination of features, and the added randomness prevents the model's fitting ability from growing without bound.
Second, unlike an ordinary decision tree, RF modifies how the decision tree is built. For an ordinary decision tree, we select the optimal feature among all m features at a node to split the left and right subtrees. Each tree in RF, however, considers only a subset of the features and selects the optimal feature among these few to split the left and right subtrees. This widens the effect of randomness and further enhances the model's generalization ability.
Assume that each tree considers msub features. The smaller msub is, the worse the model fits the training set and the larger the bias becomes, but the generalization ability is stronger and the model variance decreases; the larger msub is, the opposite holds. In practice, msub is treated as a hyperparameter: by turning on OOB verification or using cross-validation, it is tuned until a suitable value is found.
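One possible way to do this tuning, sketched here with scikit-learn's OOB score (the data set, the candidate msub values, and the use of max_features for msub are my assumptions for illustration, not prescribed by the text):

```python
# Minimal sketch (assumed setup): compare OOB scores across candidate msub values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)

for msub in (2, 5, 10, 20, 30):
    rf = RandomForestClassifier(
        n_estimators=200,
        max_features=msub,   # msub candidate features per split
        oob_score=True,      # evaluate on the out-of-bag samples
        random_state=0,
    ).fit(X, y)
    print(msub, round(rf.oob_score_, 3))
```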
Advantages:
(1) Due to the use of ensemble methods, its accuracy is better than that of most individual algorithms.
(2) It performs well on the test set; thanks to the two sources of randomness (random samples and random features), random forest is not prone to overfitting.
(3) In industry, thanks to the two sources of randomness, random forest has a certain robustness to noise, which gives it an advantage over some other algorithms.
(4) Due to the combination of trees, random forest can handle nonlinear data; it is itself a nonlinear classification (fitting) model.
(5) It can handle very high-dimensional data (many features) without feature selection and adapts well to data sets: it can handle both discrete and continuous data, and the data set does not need to be normalized.
(6) The training speed is fast, and it can be applied to large-scale data sets.
(7) Thanks to the out-of-bag (OOB) data, an unbiased estimate of the true error can be obtained during model training without sacrificing training data.
(8) During training, interactions between features can be detected and the importance of each feature can be derived, which has certain reference value; see the sketch after this list.
(9) Since the trees can be generated independently and simultaneously, the method is easy to parallelize.
(10) Due to its simple implementation, high accuracy, and strong resistance to over-fitting, it is suitable as a benchmark model when facing nonlinear data.
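As a companion to points (7) and (8), here is a minimal sketch, on synthetic data of my own choosing, showing the two fitted-forest attributes scikit-learn exposes for these purposes: the OOB score and the impurity-based feature importances.

```python
# Minimal sketch (assumed example): OOB generalization estimate and feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=10, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)

print(rf.oob_score_)             # generalization estimate without a hold-out set
print(rf.feature_importances_)   # impurity-based importance of each feature
```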
Disadvantages:
(1) Random forest does not perform as well on regression problems as on classification, because it cannot give a truly continuous output. When performing regression, a random forest cannot predict beyond the range of the training data, which may lead to overfitting on certain noisy data. (PS: random forests have been shown to overfit on some noisy classification or regression problems.)
(2) For many statistical modelers, a random forest feels like a black box: you have little control over the inner workings of the model and can only try different parameters and random seeds.
(3) There may be many similar decision trees, masking the real results.
(4) For small data or low-dimensional data (data with few features), it may not produce good classification results. (Handling high-dimensional data, missing features, and imbalanced data are the strengths of random forest.)
(5) Although it runs faster than boosting, it is much slower than a single decision tree.
(1) Compared with linear models, features are not required to be linear. For example, logistic regression has difficulty handling categorical features, while the tree model, being a collection of decision trees, handles such situations easily.
(2) Due to how the algorithm is constructed, these models easily handle high-dimensional data and scenarios with large amounts of training data.
Extremely randomized trees (extra trees) are a variant of random forest. The principle is almost exactly the same as RF; the only differences are:
(1) For the training set of each decision tree, RF uses bootstrap random sampling to draw the sampling set used to train each tree, whereas extra trees generally do not use random sampling; each decision tree uses the original training set.
(2) After the candidate features are selected, an RF decision tree chooses an optimal split value based on a criterion such as the Gini index or the mean squared error, which is the same as a traditional decision tree. Extra trees are more radical: they randomly select a value of the feature as the split point.
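In scikit-learn terms, these two differences roughly correspond to ExtraTreesClassifier defaulting to bootstrap=False (each tree sees the full training set) and drawing split thresholds at random, versus RandomForestClassifier bootstrapping rows and searching for the best threshold. The comparison below is a hedged sketch on synthetic data of my own choosing, not a benchmark.

```python
# Minimal sketch (assumed example): RandomForest vs. ExtraTrees on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)  # bootstrap=True by default
et = ExtraTreesClassifier(n_estimators=200, random_state=0)    # bootstrap=False by default

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(et, X, y, cv=5).mean())
```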