Object detection is a computer vision task that has been around for a very long time, long before deep learning emerged as the state of the art in machine vision. In a self-driving car, for example, the camera acquires images and the system needs to detect different categories of objects in the surrounding environment, such as pedestrians, cars, bicycles, and trees.
Neural networks, and convolutional neural networks (CNNs) in particular, are a core component of state-of-the-art object detection algorithms. One of the most popular object detection algorithms involving CNNs is the Single-Shot MultiBox Detector (SSD).
SSD introduction
SSD shares many techniques with another popular object detection algorithm named YOLO (You Only Look Once). SSD is considered a significant milestone in computer vision because, before it, object detection was quite slow, requiring multiple stages of processing. The speed of object detection algorithms is critical in real-time applications such as self-driving cars. Object detection has evolved from a computer science research problem into a safety requirement in the automotive industry.
The original paper that introduced SSD can be found at http://arxiv.org/abs/1512.02325. The paper shows that SSD outperforms YOLO, making it the new state of the art in object detection. In the figure below we may observe how the performance of object detection improved over time and how CNNs boosted its accuracy.
Mean Average Precision of object detection algorithms over time
Algorithm and network architecture in brief
The first stage in the task of object detection is localization. In this stage, the objective is to produce a binary classification result on whether any object is present in the image or not. The region where an object has been localized then serves as input to a classification algorithm, which decides which category the object belongs to. The classification algorithm is assumed to be already available, obtained by training a state-of-the-art deep neural network architecture for image classification.
One of the simplest ideas that may be applied to object detection is the sliding-window technique. A window is taken and, for each position in the original image, the corresponding sub-image is passed into a CNN. The CNN produces a binary result saying whether an object is inside that sub-image or not.
If the height and width of the image are denoted by H and W, the size of the sliding window by window_size, and the stride between consecutive window positions by step_size, the following Python pseudo-code may be used for a sliding-window object localization task:
for i in range(0, H - window_size + 1, step_size):
    for j in range(0, W - window_size + 1, step_size):
        # Extract the sub-image at the current window position.
        w = img[i:i + window_size, j:j + window_size]
        # Binary prediction: does this window contain an object?
        prediction = cnn.predict(w)
The localization produces a bounding box, and its dimensions need to be normalized to the range (0, 1). The problem with this approach is that it has a complexity of O(N²), making it impractical for real-time applications. One may observe, however, that a sliding window can be implemented as a convolution. Therefore, a typical CNN with a specific architecture may be used to perform object localization.
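A minimal numpy sketch (all sizes here are hypothetical) illustrates the observation above: scoring a linear detector at every window position with an explicit loop produces exactly the same score map as one vectorized, convolution-style pass over the whole image.

```python
import numpy as np

H, W, window_size = 8, 8, 3
rng = np.random.default_rng(0)
img = rng.standard_normal((H, W))
kernel = rng.standard_normal((window_size, window_size))  # detector weights

# 1) Explicit sliding window: score each sub-image independently.
out_h, out_w = H - window_size + 1, W - window_size + 1
loop_scores = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        sub = img[i:i + window_size, j:j + window_size]
        loop_scores[i, j] = np.sum(sub * kernel)

# 2) The same computation as a single "valid" cross-correlation: gather all
#    windows at once and contract each against the kernel.
windows = np.lib.stride_tricks.sliding_window_view(img, (window_size, window_size))
conv_scores = np.einsum('ijkl,kl->ij', windows, kernel)

assert np.allclose(loop_scores, conv_scores)
```

A real CNN replaces the single linear kernel with learned filters, but the sharing of computation between overlapping windows is the same.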
Such a network architecture is very similar to a regular CNN with 3-4 convolutional layers followed by 2-3 fully connected layers. However, the particularity of CNN architectures used for object localization is that the final fully connected layers are replaced with convolutional layers of the same size. For example, if a fully connected layer in the original architecture has a shape of 128x1, then the corresponding convolutional layer will have a shape of 1x1x128. The figure below illustrates a CNN architecture with these properties, used to perform object localization:
CNN architecture employed in object localization
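The fully-connected-to-convolutional replacement described above can be checked numerically. In this sketch (the sizes are hypothetical), a 128-unit fully connected layer acting on a C-dimensional feature vector produces exactly the same numbers as 128 filters of shape 1x1xC applied to that vector viewed as a 1x1xC feature map:

```python
import numpy as np

C, units = 64, 128
rng = np.random.default_rng(1)
features = rng.standard_normal(C)        # flattened CNN features
W = rng.standard_normal((units, C))      # fully connected weights

# Fully connected layer: one 128-vector out.
fc_out = W @ features

# Equivalent 1x1 convolution: the same weights reshaped into 128 filters of
# shape 1x1xC, each applied to the 1x1xC feature map.
fmap = features.reshape(1, 1, C)
filters = W.reshape(units, 1, 1, C)
conv_out = np.array([np.sum(fmap * f) for f in filters])

assert np.allclose(fc_out, conv_out)
```

Because the two layers compute the same function on a window-sized input, the converted network can also be applied to larger inputs, which is what makes the next observation possible.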
Considering the deep learning architecture from the figure above, the input image plays the role of the sliding window, and the original image region is slightly larger, by a few pixels in width and height. When processing the image region, the window size stays fixed, so shifting the window by one pixel in each direction yields 4 window positions, and the network processes the enlarged region, containing these 4 overlapping sliding windows, in a single pass.
The extremely interesting observation is that the final 1x1x128 convolutional layer becomes a 2x2x128 layer. The 2x2x128 output contains 4 vectorized elements that correspond to the 4 input sliding windows. This is the same result as that obtained by running a 2x2 for loop over the image, but computed in a single forward pass!
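This effect can be reproduced with a toy one-layer "network", a single valid convolution whose kernel matches the sliding window (all sizes here are hypothetical): applied to an input one pixel wider and taller than the window, it yields a 2x2 output whose 4 values match evaluating the 4 shifted windows one at a time.

```python
import numpy as np

window_size = 5
rng = np.random.default_rng(2)
# Input region one pixel larger than the window in each dimension -> 4 shifts.
region = rng.standard_normal((window_size + 1, window_size + 1))
kernel = rng.standard_normal((window_size, window_size))

def net(x):
    """Toy 'network': a single linear response to one window."""
    return np.sum(x * kernel)

# Evaluating the 4 shifted windows one at a time (the 2x2 for loop)...
separate = np.array([[net(region[i:i + window_size, j:j + window_size])
                      for j in range(2)] for i in range(2)])

# ...matches a single convolutional pass over the enlarged region, whose
# output is a 2x2 map -- the "1x1 layer becomes 2x2" effect.
windows = np.lib.stride_tricks.sliding_window_view(region, (window_size, window_size))
shared = np.einsum('ijkl,kl->ij', windows, kernel)

assert shared.shape == (2, 2)
assert np.allclose(separate, shared)
```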
Another issue in object detection is the scale of the object. Object detection is a system that employs CNNs; it is not just the application of a single CNN architecture. What if the object is far away in the image? Its size will be significantly smaller than that of an object closer to the camera.
The general pattern of a CNN is that, as we go deeper through the layers, the image shrinks, and therefore the features detected become proportionally larger with every layer. The question that naturally arises is how this fact can be exploited to cope with different scales in object detection. The idea is to think of sub-parts of a CNN as feature extractors, instead of considering the entire CNN as a single feature extractor. The outputs of these sub-parts can be fed into mini-neural networks, each performing object detection only for its region of the CNN layers. This is the idea behind the SSD algorithm, and such an architecture is shown in the figure below.
CNN architecture augmented with mini-neural networks
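The multi-scale idea can be sketched as follows. This is a conceptual toy, not the real SSD heads: the shapes, the average-pooling stand-in for the convolutional layers, and the random 1x1 projection heads are all assumptions made for illustration. Feature maps shrink as we go deeper, and a small head attached at each scale produces a per-cell prediction vector.

```python
import numpy as np

def downsample(fmap):
    """Halve spatial resolution with 2x2 average pooling (a stand-in for the
    convolutional layers that shrink the feature map)."""
    h, w, c = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def mini_network(fmap, n_outputs=6):
    """Toy detection head: a 1x1 projection producing, per spatial cell,
    a small prediction vector (e.g. 4 box offsets plus class scores)."""
    rng = np.random.default_rng(fmap.shape[0])
    W = rng.standard_normal((fmap.shape[2], n_outputs))
    return fmap @ W

fmap = np.random.default_rng(3).standard_normal((32, 32, 8))
predictions = []
for _ in range(3):                 # heads at the 16x16, 8x8 and 4x4 scales
    fmap = downsample(fmap)
    predictions.append(mini_network(fmap))

# Coarser maps cover larger image regions per cell, so the later heads are
# responsible for larger objects.
shapes = [p.shape for p in predictions]
assert shapes == [(16, 16, 6), (8, 8, 6), (4, 4, 6)]
```

Each head sees the same image at a different effective receptive-field size, which is how SSD handles near and far objects without re-running the whole network at multiple image scales.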