Computer Vision - Typical Target Detection Algorithm (Fast R-CNN Algorithm) (5)

Target detection is widely used in practice: we need to detect the location and category of objects in digital images. This requires building a model whose input is an image and whose output marks the location of every object in the image together with the category it belongs to. Before the deep learning wave, progress in detection accuracy was slow, and improving it with traditional methods based on hand-crafted features was quite difficult. The strong performance of AlexNet, a convolutional neural network (CNN), in the ImageNet classification competition drew scholars to transfer CNNs to other tasks, including target detection. In recent years many target detection algorithms have appeared.


How should we understand the target detection algorithm Fast R-CNN?


To overcome the problems of SPP-Net, in 2015 Girshick et al. proposed the Fast R-CNN [31] algorithm, based on bounding-box regression and a multi-task classification loss. The algorithm simplifies the SPP layer into a single-scale ROI Pooling layer: each candidate region of the whole image is pooled to a fixed size on the shared feature map, and the fully connected layers are accelerated with SVD decomposition. After the ROI Pooling layer the network produces two output vectors, the Softmax classification scores and the bounding-box regression window. It replaces the SVM with Softmax and proposes a multi-task loss function, merging the deep network and the SVM classification stage into one, i.e., combining the classification problem and the bounding-box regression problem.

Detailed explanation of the algorithm:

The flow chart of Fast R-CNN is as follows. The network has two inputs: the image and the corresponding region proposals. The region proposals are obtained by the selective search method and are not shown in the flow chart. A regressor is trained for each category, and only non-background region proposals need to be regressed.

ROI pooling: The function of ROI Pooling is to extract a fixed-size feature map, for region proposals of different sizes, from the feature map output by the last convolutional layer. Simply put, it can be regarded as a simplified, single-scale version of SPPNet. Because the fully connected layers require inputs of the same size, region proposals of different sizes cannot be mapped directly onto the feature map as output; a size transformation is needed. In the paper, the VGG16 network uses H = W = 7: a region proposal of size h×w is mapped onto the feature map output by the last convolutional layer, the mapped region is divided into an H×W grid, and the maximum value within each grid cell is taken as that cell's output. Thus, regardless of the size of the feature map before ROI pooling, the feature map obtained after ROI pooling is always H×W.
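As a concrete illustration, here is a minimal NumPy sketch of this pooling step. The helper name `roi_pool` is hypothetical (the paper's implementation is in Caffe), and the ROI coordinates are assumed to already be mapped into feature-map space:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool one ROI on a feature map down to a fixed out_size x out_size grid.

    feature_map: (C, H, W) array from the last conv layer.
    roi: (x1, y1, x2, y2) in feature-map coordinates (half-open integer ranges).
    """
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    h, w = region.shape[1], region.shape[2]
    # Split the region's height and width into out_size roughly equal bins.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((c, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # guard empty bins
        for j in range(out_size):
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            # Max value inside this grid cell, per channel.
            out[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```

Whatever the ROI's size, the output is always `(C, 7, 7)`, which is what lets the fixed-size fully connected layers consume it.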

From this, Fast R-CNN's three main improvements can be seen:

1. Convolution is no longer performed on each region proposal separately, but once on the whole image, which removes a great deal of repeated computation. The original R-CNN ran convolution on each region proposal individually; since an image has about 2,000 region proposals with high overlap between them, this caused massive redundant computation.
2. ROI pooling transforms the feature sizes, because the fully connected layers require fixed-size inputs and region proposals of varying sizes cannot be fed in directly.
3. The regressor is moved into the network and trained jointly; each category has its own regressor, and Softmax replaces the original SVM classifier.

In actual training, each mini-batch contains 2 images and 128 region proposals (ROIs), i.e., 64 ROIs per image. About 25% of these ROIs are sampled as foreground, meaning their IoU with a ground-truth box is at least 0.5; the rest are background. In addition, only random horizontal flipping is used for data augmentation.
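The sampling scheme above can be sketched as follows. This is an illustrative helper (the function name and the background IoU window of [0.1, 0.5) are taken from the paper's described setup, but the code itself is not the original implementation):

```python
import numpy as np

def sample_rois(ious, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, bg_range=(0.1, 0.5), rng=None):
    """Sample foreground/background ROI indices for one image.

    ious: (N,) max IoU of each region proposal with any ground-truth box.
    Roughly 25% of the sampled ROIs are foreground (IoU >= 0.5);
    the remainder are background drawn from the [0.1, 0.5) IoU window.
    """
    rng = rng or np.random.default_rng(0)
    fg = np.where(ious >= fg_thresh)[0]
    bg = np.where((ious >= bg_range[0]) & (ious < bg_range[1]))[0]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))   # up to 16 foreground
    n_bg = min(rois_per_image - n_fg, len(bg))               # fill rest with background
    keep_fg = rng.choice(fg, n_fg, replace=False)
    keep_bg = rng.choice(bg, n_bg, replace=False)
    return keep_fg, keep_bg
```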

During testing, each image has approximately 2,000 ROIs.

The loss function integrates the classification loss and the regression loss. Classification uses the log loss, i.e., the negative log of the probability assigned to the true class (p_u in the figure below), and the regression loss is essentially the same as in R-CNN. The classification layer outputs K+1 dimensions, representing K object classes plus 1 background class.
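In the notation of the Fast R-CNN paper, the multi-task loss described above can be written as:

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda\,[u \ge 1]\, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}
```

Here the indicator $[u \ge 1]$ equals 1 for the K object classes and 0 for the background class $u = 0$, so background ROIs contribute no regression loss; $\lambda$ balances the two terms (set to 1 in the paper).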

This is the regression loss, where t^u denotes the predicted box offsets for the true class u, and v denotes the ground truth, i.e., the bounding-box regression target.
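The regression term L_loc is the smooth L1 loss from the paper, summed over the four box coordinates. A small NumPy sketch (hypothetical helper name):

```python
import numpy as np

def smooth_l1(t_u, v):
    """Smooth L1 regression loss summed over the 4 box coordinates.

    t_u: predicted box offsets (4,) for the true class u.
    v:   ground-truth regression targets (4,).
    Quadratic for |x| < 1, linear beyond, which makes it less
    sensitive to outliers than the L2 loss used in R-CNN.
    """
    d = np.abs(t_u - v)
    per_coord = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return per_coord.sum()
```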

SVD decomposition is used to speed up the fully connected layers. In an ordinary classification network, the computation in the fully connected layers is far less than in the convolutional layers. For object detection, however, Fast R-CNN must pass every region proposal through several fully connected layers after ROI pooling, so the fully connected layers account for nearly half of the network's computation, as shown in the figure below. The author therefore uses truncated SVD to simplify the fully connected layer computation. The R-FCN network discussed in another blog post is a newer algorithm for optimizing this fully connected layer computation.
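The idea is to factorize a fully connected layer's u×v weight matrix W into two thinner layers via a rank-t truncated SVD, cutting the multiply count from u·v to t·(u+v). A minimal sketch (hypothetical helper name; biases omitted for brevity):

```python
import numpy as np

def truncate_fc(W, t):
    """Approximate an FC weight matrix W (u x v) with a rank-t factorization.

    One layer y = W @ x is replaced by two smaller layers:
    y ≈ W2 @ (W1 @ x), where W1 is (t x v) and W2 is (u x t).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t]   # first (thin) layer: t x v
    W2 = U[:, :t]                  # second layer: u x t
    return W2, W1
```

When t is much smaller than min(u, v), the two thin layers are far cheaper than the original one at a small cost in accuracy, which is why this pays off when thousands of ROIs each pass through the fully connected layers.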

Let’s briefly summarize the structure of training and testing, as shown in the two figures below, and you will have a clearer understanding of the algorithm.

With the test structure diagram it is easier to understand what the ROI Pooling layer outputs.