Preface
This article traces the development of MobileNet, a family of lightweight deep neural networks.
With the boom in deep learning, convolutional neural network models have proliferated in computer vision: LeNet in 1998, AlexNet, which triggered the deep learning boom in 2012, then VGG in 2014 and ResNet in 2015. Deep learning models have become more and more effective at image processing, but the networks have also grown larger and more complex, and the hardware needed for training and prediction has grown with them; deep learning models can often only run on servers with high computing power. Mobile devices, constrained in hardware resources and computing power, struggle to run such complex models.
Efforts are therefore underway in the deep learning community to miniaturize neural networks: making them smaller and faster while maintaining accuracy. Since 2016, the industry has proposed lightweight network models such as SqueezeNet, ShuffleNet, NASNet, MnasNet, and MobileNet. These models make it possible to run neural networks on mobile and embedded devices, and MobileNet is among the most representative lightweight neural networks.
Google released the latest version, MobileNetV3, in May 2019. The new version introduces a number of new features, which makes MobileNet well worth studying, and this article analyzes it in detail.
Advantages of MobileNet
The MobileNet network is small, requires little computation, and achieves high accuracy, giving it a clear advantage among lightweight neural networks.
1
Smaller size
MobileNet has significantly fewer parameters than classic large networks, and the fewer the parameters, the smaller the model size.
2
Less computation
MobileNet optimizes the structure of the network so that the amount of model computation drops dramatically.
3
Higher Accuracy
MobileNet's optimized network structure allows it to outperform some larger neural networks while using fewer parameters and less computation. The latest MobileNetV3-Large achieves a Top-1 accuracy of 75.2% on the ImageNet dataset.
4
Faster
Tested on a Google Pixel 1 phone, all versions of MobileNet run in under 120 ms: the latest MobileNetV3-Large runs in 66 ms, and MobileNetV3-Small, with fewer parameters and less computation, runs in 22 ms. By comparison, GoogleNet takes about 250 ms, while VGG-16 cannot run at all: loading it requires more than 500 MB of memory at once, and the phone reports an out-of-memory error.
5
Multiple Application Scenarios
MobileNet can be used on mobile devices for a wide range of applications, including object detection, object classification, face attribute recognition, and face recognition.
Introduction of MobileNet versions
1
MobileNetV1 network structure
Not counting the average pooling layer and the softmax layer, the whole network has 28 layers;
A characteristic of the network structure is the use of convolutions with a stride of 2, which also act as downsampling;
The first layer is a standard convolution; the 26 layers that follow are repeated depthwise separable convolutions, each counted as two layers (a depthwise convolution and a pointwise convolution);
Each convolutional layer (including the regular convolution, the depthwise convolutions, and the pointwise convolutions) is immediately followed by batch normalization and a ReLU activation function;
The final fully-connected layer does not use an activation function.
2
MobileNetV2 network structure
The main structures introduced in MobileNetV2 are the linear bottleneck structure and the inverted residual structure.
The MobileNetV2 network model has 17 bottleneck layers (each bottleneck contains two pointwise convolutional layers and one depthwise convolutional layer), one standard convolutional layer (conv), and two pointwise convolutional layers (pw conv), for 54 trainable parameter layers in total.
MobileNetV2 optimizes the network with the Linear Bottleneck and Inverted Residuals structures, making the network deeper while keeping the model smaller and faster.
3
MobileNetV3 Network Architecture
MobileNetV3 is available in two versions, Large and Small, with the Large version for platforms with high compute and storage performance and the Small version for platforms with lower hardware performance.
The Large version has 15 bottleneck layers, one standard convolutional layer, and three pointwise convolutional layers.
The Small version has 12 bottleneck layers, one standard convolutional layer, and two pointwise convolutional layers.
MobileNetV3 replaces some of the 3×3 depthwise convolutions with 5×5 depthwise convolutions. It also introduces the squeeze-and-excitation (SE) module and the h-swish (HS) activation function to improve model accuracy. The final two pointwise convolutional layers do not use batch normalization, which is marked as NBN in the MobileNetV3 structure diagram.
(Image source https://arxiv.org/pdf/1905.02244.pdf)
As shown in the figure above, the ending of the MobileNetV2 structure has been optimized by removing three computationally expensive layers. The removal reduces the amount of computation and the number of parameters without any loss of model accuracy.
It is worth noting that both the Large and Small versions use Neural Architecture Search (NAS) techniques to generate the network structure.
4
Characteristics of MobileNet versions
MobileNet achieves high accuracy while reducing the amount of computation and the number of parameters, thanks to the following characteristics:
Characteristics proposed by MobileNetV1
Characteristics proposed by MobileNetV2
Features introduced in MobileNetV3
Summary of features in each version of MobileNet
The following section describes each of the features in the table above.
MobileNet features in detail
1
Depthwise separable convolution
Depthwise separable convolution has been used extensively since MobileNetV1, including in the linear bottleneck structures of V2 and V3.
Depthwise separable convolution is a convolutional structure consisting of a depthwise convolution layer followed by a pointwise convolution layer, each immediately followed by batch normalization and a ReLU activation function. Compared with standard convolution, it greatly reduces the number of parameters and the amount of computation with essentially the same accuracy.
Depthwise convolution
Depthwise convolution (DW) differs from conventional convolution: each depthwise kernel has a single channel and is responsible for exactly one input channel, and each input channel is convolved by exactly one kernel. In conventional convolution, each kernel has the same number of channels as the input; every channel is convolved and the results are summed.
Take a 5x5x3 (width and height of 5, 3 RGB channels) color image as an example. In depthwise convolution, the number of kernels equals the number of channels of the previous layer (channels and kernels correspond one to one). With padding=1 and stride=1, the three-channel image produces three feature maps after the operation, as shown below:
The number of channels of the feature map output by depthwise convolution equals the number of input channels, so the channel count cannot be expanded. Moreover, because each input channel is processed independently, the feature information of different channels at the same spatial position is not used. Pointwise convolution is therefore needed to combine the generated feature maps into new feature maps.
Pointwise Convolution
The operation of Pointwise Convolution (PW) is very similar to the standard convolution operation.
Pointwise convolution uses a kernel of size 1×1×M (M is the number of input channels) and convolves one pixel position at a time. It combines the feature maps of the previous layer as a weighted sum in the depth direction, generating a new feature map of the same spatial size as the input; by changing the number of kernels, it can reduce or increase the number of channels with little computation.
As an example, the 5x5x3 (width and height of 5, 3 RGB channels) color image is convolved with four 1x1x3 pointwise kernels, producing four feature maps. In this example, pointwise convolution upgrades the feature map from 5x5x3 to 5x5x4, as shown in the following figure:
Analysis of the depthwise separable convolution structure
Depthwise convolution and pointwise convolution combine to form a depthwise separable convolution, shown schematically in the following figure:
First, the depthwise convolution is performed; the resulting per-channel feature maps are not correlated across channels. Then the pointwise convolution correlates the channels of the depthwise convolution's output.
Depthwise separable convolution achieves the same purpose as standard convolutional layers (feature extraction) with less space (fewer parameters) and less time (less computation).
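The following PyTorch sketch builds this structure: a 3×3 depthwise convolution (groups equal to the number of input channels) followed by a 1×1 pointwise convolution, each with batch normalization and ReLU. It is an illustration, not the official implementation; the class name and hyperparameters are chosen for the example.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: groups=in_channels gives one 3x3 kernel per input channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: 1x1 convolution mixes channels and sets the output depth.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   stride=1, padding=0, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: 32 -> 64 channels on a 112x112 feature map, as in MobileNetV1's first block.
block = DepthwiseSeparableConv(32, 64)
y = block(torch.randn(1, 32, 112, 112))
print(y.shape)  # torch.Size([1, 64, 112, 112])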
In general, let Df be the edge length of the input feature map, Dk the edge length of the convolution kernel (feature map and kernel are assumed square), M the number of input channels, and N the number of output channels.
The computation of the standard convolution is: Dk × Dk × M × N × Df × Df
The computation of the depthwise convolution is: Dk × Dk × M × Df × Df
The computation of the pointwise convolution is: Df × Df × M × N
The total computation of the depthwise separable convolution is therefore Dk × Dk × M × Df × Df + Df × Df × M × N, i.e. a fraction 1/N + 1/(Dk × Dk) of the standard convolution.
The figure above shows an input feature map of size 5 × 5 × 3 and an output feature map of size 5 × 5 × 4, with padding = 1 and stride = 1; the depthwise convolution uses a 3 × 3 kernel, and the standard convolution also uses a 3 × 3 kernel. To achieve the same convolution effect, the number of parameters (excluding bias) and the amount of computation are compared in the following table:
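As a quick, hand-checked illustration of these formulas for the example above (5×5×3 input, 5×5×4 output, 3×3 kernels), the following small Python calculation derives the counts from the formulas; it is an illustration, not a copy of the table.

# input feature map 5x5x3, output 5x5x4, 3x3 kernels, stride=1, padding=1
# (so Df = 5 for both input and output); biases are ignored, as in the text.
Df, Dk, M, N = 5, 3, 3, 4

# Standard convolution
std_params = Dk * Dk * M * N                    # 108
std_madds  = Dk * Dk * M * N * Df * Df          # 2700

# Depthwise separable convolution = depthwise + pointwise
dw_params = Dk * Dk * M                         # 27
pw_params = M * N                               # 12
dws_params = dw_params + pw_params              # 39

dw_madds = Dk * Dk * M * Df * Df                # 675
pw_madds = M * N * Df * Df                      # 300
dws_madds = dw_madds + pw_madds                 # 975

print(dws_params / std_params)                  # ~0.36, i.e. 1/N + 1/Dk^2
print(dws_madds / std_madds)                    # ~0.36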
Evolution of depthwise separable convolution
In fact, depthwise separable convolution was not first proposed in MobileNetV1; it appeared in Google's Xception network architecture in 2016. MobileNetV1 improved on Xception's depthwise separable convolution to further reduce the amount of computation and the number of parameters:
Assume that M is the number of channels in the input layer and N is the number of channels in the output layer.
Xception's depthwise separable convolution starts from the input: it converts the number of channels of the input layer to the target number with a 1x1xMxN convolution, then convolves each channel with a 3x3x1 kernel, applying ReLU after each convolution.
MobileNetV1's depthwise separable convolution instead uses 3x3x1xM kernels to convolve each channel of the input layer separately, then converts the number of channels from M to N with a 1x1xMxN convolution; each convolution is followed by a batch normalization operation and then ReLU activation.
Here we take the first depthwise separable convolutional layer of the MobileNetV1 network as an example, with an input dimension of 112x112x32 and an output dimension of 112x112x64. The computation and parameter counts of Xception's versus MobileNet's depthwise separable convolution are compared in the following table:
From this it can be seen that adjusting the order of the PW and DW convolutions optimizes both the space complexity and the time complexity of the network.
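To see why the ordering matters for this layer (112×112 spatial size, 32 → 64 channels, 3×3 depthwise kernels), here is a rough calculation based on the formulas from the previous subsection; the exact figures in the table image may be laid out differently, so treat these numbers as an illustration.

Df, Dk, M, N = 112, 3, 32, 64

# Xception-style order: pointwise (M -> N) first, so the depthwise runs on N channels.
xception_madds  = (M * N * Df * Df) + (Dk * Dk * N * Df * Df)
xception_params = (M * N) + (Dk * Dk * N)

# MobileNetV1 order: depthwise runs on the narrower M channels, then pointwise M -> N.
mobilenet_madds  = (Dk * Dk * M * Df * Df) + (M * N * Df * Df)
mobilenet_params = (Dk * Dk * M) + (M * N)

print(xception_madds, mobilenet_madds)    # 32,915,456 vs 29,302,784
print(xception_params, mobilenet_params)  # 2,624 vs 2,336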
2
Width Multiplier
MobileNet's own network structure is already quite small with low execution latency, but to adapt to more customized scenarios MobileNet provides a hyperparameter, the Width Multiplier, for tuning. It is available in MobileNetV1, V2, and V3.
The width multiplier adjusts the size of the intermediate feature maps produced by the network by scaling the number of channels, and therefore adjusts the amount of computation.
The width multiplier is simply the ratio of the number of convolution kernels used per module in the new network to that of the standard MobileNet. For a depthwise convolution combined with a 1×1 pointwise convolution, the amount of computation becomes:
Dk × Dk × αM × Df × Df + αM × αN × Df × Df
Here α is the width multiplier, commonly set to 1, 0.75, 0.5, or 0.25; α = 1 gives the standard MobileNet. The width multiplier very efficiently reduces both the computation and the number of parameters to roughly α² of the original.
The following figure shows the relationship between accuracy, computation, and number of parameters on ImageNet when MobileNetV1 is tuned with different α coefficients (the leading number in each entry indicates the value of α).
(Source https://arxiv.org/pdf/1704.04861.pdf)
It can be seen that, with the input resolution fixed at 224x224, the computation and parameter count of the model shrink as the width multiplier decreases. From the table above, the accuracy of 0.25 MobileNet is about 20 percentage points lower than the standard 1.0 MobileNet, but its computation and parameter count are only about 10% of the standard version's. For mobile platforms where computing and storage resources are very tight, adjusting the number of parameters in the network via the width multiplier α is very practical, and in real use we can tune α as needed to strike a balance between accuracy and performance.
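As a sketch of the effect, the following Python snippet (an illustration, not official code) applies the width multiplier to the single-layer formula above and shows how the M-Adds shrink roughly with α².

def dws_madds(Df, Dk, M, N, alpha=1.0):
    # Scale the channel counts by the width multiplier alpha.
    M, N = int(alpha * M), int(alpha * N)
    return Dk * Dk * M * Df * Df + M * N * Df * Df

base = dws_madds(112, 3, 32, 64, alpha=1.0)
for alpha in (1.0, 0.75, 0.5, 0.25):
    madds = dws_madds(112, 3, 32, 64, alpha=alpha)
    print(alpha, madds, round(madds / base, 3))
# Ratios come out around 1.0, 0.59, 0.28, 0.09 -- close to alpha squared.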
3
Resolution Multiplier
MobileNet also provides another hyper-parameter, the Resolution Multiplier, to customize the structure of the network, which is also available in MobileNetV1, V2, and V3.
The resolution multiplier is generally denoted β. β takes values in (0, 1] and is an approximate reduction factor for the input size of each module; simply put, it makes the input data, and hence the feature maps produced in every module, smaller. Combined with the width multiplier α, the computation of a depthwise convolution plus 1×1 pointwise convolution becomes:
Dk × Dk × αM × βDf × βDf + αM × αN × βDf × βDf
The following figure shows the impact on ImageNet accuracy and computation when different β coefficients act on the standard MobileNetV1 (α fixed at 1.0):
(Source: https://arxiv.org/pdf/1704.04861.pdf)
In the figure above, 224, 192, 160, and 128 are input image resolutions, corresponding to resolution multipliers of 1, 6/7, 5/7, and 4/7 respectively.
When β = 1, the input resolution is 224x224, and the feature map sizes after successive convolutions are 224x224, 112x112, 56x56, 28x28, 14x14, and 7x7.
When β = 6/7, the input resolution is 192x192, and the feature map sizes after each convolution layer become 192x192, 96x96, 48x48, 24x24, 12x12, and 6x6.
Changing the size of the feature maps does not change the parameter count; it only changes the model's computation (M-Adds). In the figure above, the 224-resolution model achieves 70.6% accuracy on the ImageNet dataset, while the 192-resolution model achieves 69.1%, with 151M fewer M-Adds. For mobile platforms with tight computing resources, the resolution multiplier β can likewise be used to adjust the resolution of the network's input feature maps, trading off model accuracy against computation.
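The following small Python sketch (illustrative only) applies the resolution multiplier to the example layer from earlier: the spatial size shrinks with β, so the M-Adds shrink roughly with β², while the parameter count stays constant.

def dws_layer_cost(Df, Dk, M, N, beta=1.0):
    # beta scales the spatial size of the feature map at this layer.
    Df = int(beta * Df)
    madds = Dk * Dk * M * Df * Df + M * N * Df * Df
    params = Dk * Dk * M + M * N  # independent of the input resolution
    return madds, params

# The 112x112 layer from the earlier example shrinks in proportion to the input resolution.
for beta, resolution in ((1.0, 224), (6/7, 192), (5/7, 160), (4/7, 128)):
    madds, params = dws_layer_cost(112, 3, 32, 64, beta=beta)
    print(resolution, madds, params)   # M-Adds fall with beta^2, params stay at 2336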
4
Normalization
Normalization in deep learning helps accelerate the convergence of models trained with gradient descent or stochastic gradient descent and improves model accuracy; normalizing the parameters also improves the model's ability to generalize and its compressibility.
According to the objects it acts on, normalization can be divided into two categories. One category normalizes the input values: Batch Normalization, Layer Normalization, Instance Normalization, and Group Normalization all belong to it. The other category normalizes the parameters of the neural network, for example using the L0 or L1 norm.
Batch Normalization
Batch normalization follows almost every convolutional layer in MobileNetV1, V2, and V3 to speed up training convergence and improve accuracy.
Batch normalization is a special function transformation applied to numerical values: assuming the original value is x, a normalization function is applied so that x is transformed into a normalized value. Different normalization goals lead to different concrete forms of the function. This adaptive reparameterization helps overcome the difficulty of training models as neural networks grow deeper.
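For concreteness, the following PyTorch sketch writes out the batch-normalization transform on a batch of feature maps by hand (rather than using nn.BatchNorm2d): per-channel batch statistics, a small epsilon, and the learnable scale and shift (gamma, beta) that make up the reparameterization.

import torch

x = torch.randn(8, 32, 56, 56)                 # a batch of feature maps
mean = x.mean(dim=(0, 2, 3), keepdim=True)     # per-channel batch mean
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # per-channel batch variance
gamma = torch.ones(1, 32, 1, 1)                # learnable scale (initialized to 1)
beta = torch.zeros(1, 32, 1, 1)                # learnable shift (initialized to 0)

x_hat = (x - mean) / torch.sqrt(var + 1e-5)    # normalize to zero mean, unit variance
y = gamma * x_hat + beta                       # rescale and shift
print(y.mean().item(), y.std().item())         # roughly 0 and 1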
Weight Normalization
Weight Normalization (WN) is a type of normalization that sparsifies the model parameters through a sparsity penalty, removing redundant parameters from the model (setting them to 0); this can be implemented with the L1 norm.
Parameter normalization prevents the model from overfitting the training data. When training on a batch of samples, the model tends to fit the sample data more and more closely as training progresses; too many parameters raise the model's complexity and make it easy to overfit.
We need to ensure that the model stays "simple" while minimizing the training error, so that the parameters generalize well (i.e., the test error is also small); keeping the model "simple" is achieved through a regularization function.
As shown above, the classification on the left clearly underfits: the model does not fit the data. The middle plot shows a proper fit, and the right plot shows overfitting: the model fits the training samples well but violates the underlying feature-classification pattern and performs poorly on new test samples, hurting generalization. Clearly, the right-hand model has been trained with extra parameter interference. Parameter regularization makes the parameters sparse, reduces the interference of the extra parameters, and improves generalization.
Having sparse parameters (a large number of parameters in the model are 0) also facilitates the compression of the model size by the compression algorithm.
5
Linear Bottleneck
The Linear Bottleneck evolved from the Bottleneck structure and is used in MobileNetV2 and V3.
The Bottleneck structure was first proposed in ResNet. Its first layer is a pointwise convolution, its second layer a convolution with a 3×3 kernel (a depthwise convolution in MobileNet), and its third layer another pointwise convolution. In MobileNet, the last pointwise convolution of the bottleneck uses a linear activation function, which is why the structure is called a Linear Bottleneck. There are two variants: the first, used when the stride is 1, includes a residual connection; the second, used when the stride is 2, does not.
Here the number of input channels is M and the expansion factor is T, where T is a positive number. When 0 < T < 1, the first pointwise convolution reduces the dimensionality; when T > 1, it increases it.
The second layer is the depthwise convolution, with the number of input channels = number of output channels = M × T.
The third layer is the pointwise convolution, which correlates the feature maps produced by the depthwise convolution and outputs the specified number of channels N.
Relative to standard convolution, the linear bottleneck structure reduces the number of parameters and the amount of convolutional computation, optimizing the network in both space and time.
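A minimal PyTorch sketch of this structure follows: a 1×1 expansion by factor T, a 3×3 depthwise convolution, and a 1×1 projection with no activation (the "linear" part), plus the residual connection when the stride is 1. The class name, the default expansion factor, and the use of ReLU6 follow the description above and the MobileNetV2 design; this is an illustration, not the official code.

import torch
import torch.nn as nn

class LinearBottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=6):
        super().__init__()
        hidden = in_channels * expansion
        # Residual connection only when stride=1 and the shapes match.
        self.use_residual = (stride == 1 and in_channels == out_channels)
        self.block = nn.Sequential(
            # 1) pointwise expansion: M -> M*T
            nn.Conv2d(in_channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 2) 3x3 depthwise convolution on the expanded tensor
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) pointwise projection: M*T -> N, linear (no activation)
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

y = LinearBottleneck(24, 24, stride=1)(torch.randn(1, 24, 56, 56))
print(y.shape)  # torch.Size([1, 24, 56, 56])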
6
Inverted Residuals
The concept of Inverted Residuals was proposed in MobileNetV2 based on the optimization of ResNet's Residuals structure, and has since been used in MobileNetV3 as well.
The residual structure proposed in ResNet solves the problem of vanishing gradients as the network gets deeper during training: gradients can reach the shallow layers of a deep network during backpropagation, so the shallow layers' parameters can still be trained, which increases the network's feature-expression capability.
The residual structure of ResNet essentially adds a shortcut (residual) connection around the bottleneck structure, as shown in the following figure:
The residual structure in ResNet first uses a pointwise convolution to reduce the dimensionality, then applies a 3×3 convolution, and finally uses a pointwise convolution to raise the dimensionality back.
The residual structure in MobileNetV2 instead uses the first pointwise convolution to raise the dimensionality, with the ReLU6 activation function in place of ReLU; then a depthwise convolution, also with ReLU6; and then a pointwise convolution to lower the dimensionality, followed by a linear activation. This kind of convolution is better suited to mobile use (it helps reduce the number of parameters and M-Adds). Because the order of raising and lowering the dimensionality is exactly the opposite of the residual structure in ResNet, MobileNetV2 calls it Inverted Residuals.
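To make the contrast concrete, here is a shape-only sketch of the two designs; the channel counts are illustrative (256 → 64 → 256 as in a typical ResNet bottleneck, 24 → 144 → 24 for a MobileNetV2 block with expansion factor 6), not taken from any specific table.

ResNet residual (wide → narrow → wide, shortcut over the wide ends):
256 channels → 1×1 reduce → 64 → 3×3 conv → 64 → 1×1 expand → 256 (+ input)

MobileNetV2 inverted residual (narrow → wide → narrow, shortcut over the narrow ends, ReLU6 after the first two convolutions, linear 1×1 projection at the end):
24 channels → 1×1 expand → 144 → 3×3 depthwise → 144 → 1×1 project → 24 (+ input)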
7
5x5 Deep Convolution
In MobileNetV3, depthwise convolution makes heavy use of 5x5 kernels. This is because, while searching for the MobileNetV3 network structure with Neural Architecture Search (NAS), it was found that 5x5 kernels in the depthwise convolutions give better accuracy than 3x3 kernels. NAS is described in its own section below.
8
Squeeze-and-excitation module
The Squeeze-and-Excitation module (SE module for short) was first proposed in the 2017 Squeeze-and-Excitation Networks (SENet) architecture, improved in MnasNet, and later used heavily in MobileNetV3. The researchers' aim was to improve the expressiveness of the network model by accurately modeling the interactions between the channels of the convolutional features. To achieve this, they proposed a mechanism that lets the network recalibrate its features, so that effective feature weights become large and ineffective or less effective ones become small: this is the SE module.
(Image source https://arxiv.org/pdf/1905.02244.pdf)
As shown above, MobileNetV3 applies the SE module in the last part of the linear bottleneck structure: instead of going straight to the final pointwise convolution as in V2, an SE operation is performed before the pointwise convolution. The inputs and outputs of each stage of the network structure are unchanged, and only the processing in the middle is added, similar to a hook in software development.
SE module structure in detail
The following figure represents an SE module. It mainly contains two parts, Squeeze and Excitation. W, H represents the width and height of the feature map, C represents the number of channels, and the input feature map size is W×H×C.
Compression (Squeeze)
The first step is the compression (squeeze) operation, shown in the figure below.
This operation is a global average pooling. After the squeeze operation, the feature map is compressed into a 1×1×C vector.
Excitation
The next operation is the Excitation operation, shown in the following figure
It consists of two fully-connected layers, where SERatio is a scaling parameter, which is designed to reduce the number of channels and thus the amount of computation.
The first fully-connected layer has C×SERatio neurons, with input 1×1×C and output 1×1×(C×SERatio).
The second fully-connected layer has C neurons, with input 1×1×(C×SERatio) and output 1×1×C.
Scaling operation
Finally comes the scale operation: after the 1×1×C vector is obtained, the original feature map is scaled channel by channel. It is simply a per-channel weight multiplication: the original feature map is W×H×C, and the weight computed by the SE module for each channel is multiplied with the corresponding channel's two-dimensional matrix of the original feature map to produce the output.
Here we can derive the properties of the SE module:
Number of parameters = 2×C×C×SERatio
Computation = 2×C×C×SERatio
Overall, the SE module increases the network's total number of parameters and total computation. Because fully-connected layers are computationally cheap compared with convolutional layers, the added computation is small, but the number of parameters increases noticeably.
The SE module in MobileNetV3-Large increases the total number of parameters by 2M compared to MobileNetV2.
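The following PyTorch sketch assembles the three steps described above: squeeze (global average pooling), excitation (two fully-connected layers with a reduction ratio), and scale (per-channel reweighting). It uses a plain sigmoid gate as in the original SENet paper; MobileNetV3's implementation uses a hard-sigmoid variant. The class and parameter names are illustrative.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, se_ratio=0.25):
        super().__init__()
        reduced = max(1, int(channels * se_ratio))
        self.fc1 = nn.Linear(channels, reduced)   # C -> C*SERatio
        self.fc2 = nn.Linear(reduced, channels)   # C*SERatio -> C

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                    # squeeze: W x H x C -> 1 x 1 x C
        s = torch.relu(self.fc1(s))               # excitation, first FC + ReLU
        s = torch.sigmoid(self.fc2(s))            # second FC, gate values in (0, 1)
        return x * s.view(b, c, 1, 1)             # scale: reweight each channel

y = SqueezeExcite(64, se_ratio=0.25)(torch.randn(1, 64, 14, 14))
print(y.shape)  # torch.Size([1, 64, 14, 14])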
The SE module in MobileNetV3
The SE module is very flexible, and can be added to an existing network without disrupting the main structure of the network.
Adding the SE module to ResNet forms the SE-ResNet network, where the SE module is placed after the bottleneck structure, as shown on the left of the following figure.
In MobileNetV3, the SE module is added inside the bottleneck structure: the SE block follows the depthwise convolution, and the pointwise convolution is performed after the scale operation, as shown on the right of the figure above. MobileNetV3 uses an SERatio of 0.25. With the SE module, MobileNetV3 has about 2M more parameters than MobileNetV2, reaching 5.4M, but its accuracy is greatly improved, with significant gains in both image classification and object detection.
9
h-swish activation function
MobileNetV3 found that the swish activation function effectively improves network accuracy, but swish is too computationally expensive for a lightweight neural network. MobileNetV3 therefore uses an alternative activation function, h-swish (hard version of swish), which is similar to swish but much cheaper to compute; its form is shown below:
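In code, swish and its hard approximation can be written as follows (a minimal PyTorch sketch; h-swish replaces the sigmoid in swish with the cheaper ReLU6-based piecewise approximation):

import torch
import torch.nn.functional as F

def swish(x):
    # swish(x) = x * sigmoid(x)
    return x * torch.sigmoid(x)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-6, 6, 7)
print(swish(x))
print(h_swish(x))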
Comparison of sigmoid, h-sigmoid, swish, and h-swish activation functions:
(Image from https://arxiv.org/pdf/1905.02244.pdf)
This nonlinearity maintains accuracy while bringing several advantages: first, ReLU6 is available in numerous hardware and software frameworks; second, it avoids loss of numerical precision under quantization and runs fast. The h-swish nonlinearity increases the model's latency by about 15%, but its net effect on accuracy and latency is positive, and the remaining overhead can be eliminated by fusing the nonlinearity with the preceding layers.