Functioning in the real world is so easy and habitual that we are hardly aware of how complex the things we do actually are. While driving, we localize other cars, pedestrians, traffic lights, and obstacles without any apparent difficulty.
But when we try to implement this ability in self-driving cars, many challenges appear. Object detection, one of the major challenges in computer vision, can be understood in many ways. Our minds have evolved through the experiences of millions of creatures. Sapiens: A Brief History of Humankind is a perfect book for understanding how our neural networks were built over millions of years, making us capable of tasks we now perform effortlessly, while training machines to do the same still seems difficult.
The decision-making process is so fast that very little conscious effort is required. For example, if an obstacle suddenly pops up in front of you, you react even before you become consciously aware of it. That is how fast the recognition process, and hence the reaction, is. Scene analysis happens so quickly in the brain that even the fastest GPUs cannot match that speed. Yes, that's evolution!
To understand object detection algorithms, one first has to understand the process of object recognition.
Object Recognition vs Object Detection:
Object recognition is the process of understanding the patterns of pixels within an image and deciding what the object is. For example, take a CNN trained to classify dogs vs cats: if the sample image is that of a dog, the neuron patterns in the different layers will vary depending on the architecture.
What does this signify? It means our CNN has hidden layers of neurons that respond differently to different patterns of pixels in the image. Some layers will capture the edges, while others will capture more intricate features like the nose and eyes. Finally, after several epochs, our model will have an imprint of neurons that have learned to classify cats vs dogs. If we pass in a dog image, a particular pattern fires up across the layers, one that clearly corresponds to the image of a dog. This kind of visualization demonstrates the power of convolutional neural networks.
Object detection goes a step further: it not only recognizes but also localizes the object in an image. A scene may contain multiple objects that we want to detect. Once a CNN model can recognize that a given pattern is, say, a car, the next question is where the car is. With a single object in the image this is a simple task, but when multiple objects appear together, the features and boundaries of one object start overlapping with others.
This image is a classic example where the most popular and trending object detection algorithm, YOLO, failed to recognize the person. The image the person is wearing was designed after long research into how YOLO perceives an object in its layers, and it fools the object detection algorithm easily. Obviously, not just any image would fool this powerful algorithm, which is trained on millions of images. To understand this, one has to understand the concept of Generative Adversarial Networks (GANs). But the main idea is that the algorithm is confused by the infusion of new features from the additional image.
Behind object recognition and detection lies the power of deep convolutional neural networks. Let’s start with some interesting visualizations that will help us understand what makes object recognition possible.
Visualizing convolutional neural networks:
Deep learning has been fruitful in many scenarios and is progressing every day, thanks to the backbone of deep learning: convolutional neural networks! CNNs are the main architecture behind every deep learning model, yet we hardly know how to visualize them. There was a time when CNNs were known as 'black boxes' because we had no knowledge of how they make their predictions and what each layer learns. We used to adjust weights and neurons in each layer, observe patterns in the results, and infer what they actually mean.
What do CNNs learn?
Each hidden layer of a CNN has groups of filters that are applied to the image, and these filters extract patterns from it. A convolution is simply a mathematical operation that is passed over the whole image and filters out patterns according to the property of the filter applied.
If we build a simple object detection model for buildings, it will consist of various convolutions such as Gaussian filters (used for noise reduction in an image) and Sobel filters (for horizontal and vertical edge detection).
The Canny edge detection algorithm is itself nothing but a combination of convolutions. At each stage we apply certain convolutions to find features, and finally the edges emerge:
This Canny edge detector finds edges through a series of convolutions on the image: first blurring it, then finding vertical and horizontal edges with Sobel filters, and so on. The final result is the edge map shown in the image. Here is an image of the Sobel filter convolution:
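To make this concrete, here is a minimal sketch of a Sobel edge-detection convolution using NumPy and SciPy; the image path is a placeholder:

```python
import numpy as np
from PIL import Image
from scipy.signal import convolve2d

# Load an image and convert it to grayscale ("zebra.jpg" is a placeholder path).
image = np.asarray(Image.open("zebra.jpg").convert("L"), dtype=float)

# Sobel kernels for vertical and horizontal edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Slide (convolve) each kernel over the whole image.
gx = convolve2d(image, sobel_x, mode="same", boundary="symm")
gy = convolve2d(image, sobel_y, mode="same", boundary="symm")

# The gradient magnitude highlights edges in both directions.
edges = np.hypot(gx, gy)
```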
What do CNNs do? They simply apply convolutions layer by layer to finally create a map, like the edge map of a zebra shown above. The layers contain filters like these Sobel filters, and many others, each extracting a particular aspect of the image.
Once a CNN has been trained on lots of images to recognize a particular object, it builds a map for each object within its layers. Whenever a related pattern of pixels arrives as input, the neurons light up in that specific order and the model makes predictions with high confidence. For example, what do you see in the image below?
A red tree, some clouds, ripening crops, and the shadow of a tree. But how will a CNN see this? Here, a CNN trained to recognize animals is applied to the image.
The perception of this CNN is amazing. It was trained to recognize animals, yet even in this image it tries to find those neuron activations. Foxes, snakes, cats, and dogs are clearly visible in the result, even though they have no relation to the original image. This highlights the feature maps the CNN learned while being trained to recognize animals.
Visualizing a CNN is a very powerful tool in deep learning because it makes us aware of what the model is actually learning. Rather than adjusting weights blindly, we can make intuitive decisions about how to change the model. Understanding the feature maps learned by CNNs is essential for understanding object detection algorithms.
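One simple way to peek inside a trained CNN is to read out its intermediate activations. Below is a minimal sketch using PyTorch forward hooks on a pretrained ResNet-18 (assuming torchvision >= 0.13 for the `weights` argument); the chosen layer and the random input tensor are purely illustrative:

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet
model.eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on an early conv layer to capture its feature maps.
model.layer1[0].conv1.register_forward_hook(save_activation("layer1.0.conv1"))

# A random tensor stands in for a preprocessed image batch.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)

fmap = activations["layer1.0.conv1"]
print(fmap.shape)  # torch.Size([1, 64, 56, 56]): each channel is one feature map
```

Plotting each of those 64 channels as a small grayscale image is exactly the kind of visualization shown above.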
Evolution of object detection algorithms:
The most basic object detection algorithm works on the simple principle of a sliding window (a minimal code sketch follows the list below).
- Let a CNN model learn the features of objects that you want to detect.
- For example, if you want to detect cars, pedestrians, and traffic lights, the model must be trained to recognize these objects.
- Create a window much smaller than the original image. Slide the window across the complete image and crop small patches out of it.
- Send these cropped patches to the convolutional network and check the predicted class of each cropped region.
- After multiple passes with windows of increasing size, predictions for the same object cluster in nearby regions.
- This is the most basic method of detecting boxes; however, it is computationally very expensive.
- Also, the bounding boxes are not as accurate as we would like, because the window may be square while the object to be detected is rectangular.
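Here is a minimal sketch of the sliding window idea. The `classifier` function is hypothetical, standing in for any small CNN that maps an image crop to a class and a confidence:

```python
def sliding_window_detect(image, classifier, window=(64, 64), stride=16, threshold=0.9):
    """Slide a fixed-size window over the image and classify every crop.

    `image` is an HxW(xC) NumPy array. `classifier` is a hypothetical
    callable mapping a crop to a (class_id, confidence) pair, e.g. a
    small CNN wrapped in a function.
    """
    h, w = image.shape[:2]
    win_h, win_w = window
    detections = []
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            crop = image[y:y + win_h, x:x + win_w]
            class_id, confidence = classifier(crop)
            if confidence >= threshold:
                detections.append((x, y, win_w, win_h, class_id, confidence))
    return detections

# In practice this loop is repeated for several window sizes, which is
# exactly why the approach is so computationally expensive.
```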
To overcome the heavy computational cost of the sliding window algorithm, algorithms like R-CNN and its successors were developed. The problem with the sliding window algorithm is that it also classifies many regions that are of no use. A better approach is to use the Selective Search algorithm to propose regions.
RCNN: The region-based convolutional neural network is an evolved version of the traditional object detection approach described above. Its defining feature is the Selective Search algorithm, which addresses the problem of wasted regions in the previous case.
Selective Search Algorithm:
The problem with the traditional algorithm is that it classifies too many regions. The selective search algorithm works as follows:
- It starts by over-segmenting the image into a large number of small initial regions (around 2000 proposals are ultimately kept).
- It recursively combines these regions on the basis of similar color and texture, merging similar regions into one.
In this figure, you can see that initially many small boxes are generated. The algorithm then starts merging the regions, and finally, in the third image, it predicts the correct bounding boxes.
The regions extracted by the selective search algorithm are then passed to a CNN for feature extraction. These features are then used by an SVM to classify which class each object belongs to.
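For experimentation, OpenCV's contrib package ships an implementation of selective search. A minimal sketch, assuming the `opencv-contrib-python` package is installed and using a placeholder image path:

```python
import cv2  # the ximgproc module requires opencv-contrib-python

image = cv2.imread("street.jpg")  # placeholder path

# Selective search as implemented in OpenCV's contrib module.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # there is also switchToSelectiveSearchQuality()

rects = ss.process()  # array of (x, y, w, h) region proposals
print(f"{len(rects)} proposals generated")

# R-CNN would now crop each proposal, resize it, run it through a CNN,
# and classify the extracted features with an SVM.
```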
Problems with Selective Search Algorithm:
- Combining the 2000 regions is very time-consuming, taking roughly 50 seconds per image for classification.
- Hence this algorithm is not suitable for real-time applications.
- It is a fixed algorithm: no new learning happens during the training process.
To overcome the cost of processing 2000 regions separately, Fast R-CNN was introduced:
- The idea is to share the computation among the 2000 regions.
- The whole image is passed through a CNN once to produce convolutional feature maps.
- The region proposals (still generated by selective search) are then projected onto these feature maps.
- Then comes the role of the ROI pooling layer.
Importance of ROI pooling Layer in Fast R-CNN:
The main problem with RCNN was the computational expense: each region had to pass through the CNN separately for classification. But what if we could share that computation across the proposals?
Here you can see there are multiple proposals around the cat. It would be computationally very expensive to pass every one of them through the CNN for classification, and we can see that all of them surround the same object. Why not share the classification computation across all these regions at once? This is where the ROI pooling layer comes in!
ROI Pooling Layer:
As we know, a max-pooling layer in a CNN architecture divides its input into boxes of equal size and replaces each box with the maximum value inside it.
The ROI pooling layer does similar work, but instead of operating on the complete image, it operates on regions of interest. A region of interest is divided into equal blocks, and each block is replaced by the maximum value within it. This serves the purpose that multiple regions around the same object can share the same values.
In this GIF you can see a region proposal being divided into blocks, with each value replaced by the maximum within its block. The best part is that proposals close to the original one end up with nearly identical pooled outputs, so they share the same computation.
The output of the ROI layer is then fed into a fully connected layer, where the final classification and bounding box prediction are done.
- The ROI pooling layer helps in the reuse of the feature map.
- It speeds up the training and testing time significantly.
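To see ROI pooling in action, torchvision exposes it as a standalone op. A minimal sketch with a fake feature map and two overlapping proposals (all sizes are illustrative):

```python
import torch
from torchvision.ops import roi_pool

# A fake backbone output: batch of 1, 256 channels, 32x32 spatial grid,
# standing in for the feature map of a 256x256 input image.
feature_map = torch.randn(1, 256, 32, 32)

# Two overlapping proposals in (batch_index, x1, y1, x2, y2) format,
# given in the coordinates of the original 256x256 image.
rois = torch.tensor([
    [0, 40.0, 40.0, 160.0, 160.0],
    [0, 48.0, 36.0, 170.0, 150.0],
])

# spatial_scale maps image coordinates onto the 32x32 feature map (32/256).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=32 / 256)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): one fixed-size tensor per ROI
```

Both proposals are cut from the same shared feature map, which is exactly the computation reuse described above.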
Fast RCNN is much faster than RCNN because instead of working directly on the image, we work on the feature maps generated by the CNN. However, selective search is still used to propose regions, and it remains time-consuming.
Faster RCNN eliminates this selective search step with a region proposal network, which is the backbone of Faster-RCNN.
The region proposal network predicts the bounding boxes. Since it is itself a small CNN, it is much faster than selective search. The RPN consists of a classifier and a regressor: the classifier predicts the probability that a proposal contains an object, and the regressor refines the bounding boxes.
To understand the full functionality of the Region Proposal Network, read more here: Region Proposal Networks.
You can apply Faster RCNN very easily by creating and annotating your own dataset. Read the full steps to implement Faster RCNN.
I implemented Faster RCNN on a dataset I created for detecting the internal parts of a fridge. The results were very good.
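If you just want to try Faster RCNN quickly, torchvision ships a pretrained model. A minimal inference sketch (assuming torchvision >= 0.13; the random tensor stands in for a real preprocessed image, and for a custom dataset like the fridge parts you would fine-tune the box predictor head):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # pretrained on COCO
model.eval()

# The model takes a list of 3xHxW float tensors scaled to [0, 1].
image = torch.rand(3, 480, 640)  # placeholder for a real image
with torch.no_grad():
    predictions = model([image])

# Each prediction holds boxes (x1, y1, x2, y2), class labels, and scores.
out = predictions[0]
print(out["boxes"].shape, out["labels"], out["scores"])
```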
A brief comparison of RCNN vs Fast RCNN vs Faster RCNN:
Each algorithm has evolved from the previous one. Faster RCNN is far faster than the other two, and its results are quite good.
All of the algorithms discussed so far predict a bounding box around the object. Self-driving cars are a major application of object detection techniques, and a box is not always enough: imagine a car acting on a rectangular bounding box drawn around a sharply curving road; it could lead to disaster. This is where Mask RCNN comes in: instead of predicting only bounding boxes, it predicts a mask over the object, capturing its actual shape.
- Mask RCNN is an extension of Faster RCNN.
- Every step up to the ROI pooling layer is the same; only at the end is a final mask segmentation step introduced.
The segmented masks look like this:
These masks are then applied to the original image and the final detections are:
Mask RCNN is computationally expensive because the segmentation task is very time-consuming. Yet it is widely used for semantic segmentation tasks. You can check out this project about Mask R-CNN.
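torchvision also ships a pretrained Mask R-CNN. A minimal sketch showing how per-detection masks come out alongside the boxes (again assuming torchvision >= 0.13, with a placeholder input tensor):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT")  # pretrained on COCO
model.eval()

image = torch.rand(3, 480, 640)  # placeholder for a preprocessed image
with torch.no_grad():
    out = model([image])[0]

# In addition to boxes, labels, and scores, each detection carries a soft
# mask of shape (1, H, W); thresholding it yields the binary object mask.
masks = out["masks"] > 0.5
print(masks.shape)
```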
YOLO Object detection:
YOLO is a revolutionary algorithm that completely changed the object detection landscape. It is far faster than the previous object detection algorithms, processing around 45 frames per second, fast enough for real-time scenarios.
- YOLO is among the fastest object detection algorithms available as of now.
- Unlike other techniques, it uses a single neural network for both bounding box and class prediction.
- Methods like Faster RCNN often falsely detect background regions as objects; YOLO makes fewer than half as many of these background mistakes.
- YOLO gives real-time feedback, making it possible to deploy it in self-driving-car-like scenarios with high precision.
- Compressed variants of architectures like yolov2 and yolov3 are even faster than the full network, but accuracy is compromised in that case; these models are shrunken versions of the overall network.
How does YOLO work?
The image is divided into an SxS grid of cells. For each cell, the algorithm predicts B bounding boxes. If an object's center falls within a grid cell, that cell is responsible for detecting the object. For each of the B predicted bounding boxes, the network also outputs a confidence score C, indicating how sure it is that the box contains the object. The confidence is measured by comparing the predicted box with the ground truth box; the box with the maximum overlap is kept.
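A small worked example of the grid assignment and the overlap (intersection over union) measure described above; the grid size and image dimensions are illustrative:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell containing an object's center."""
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    return row, col

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes -- the overlap
    measure used to score a predicted box against the ground truth."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# An object centered at (300, 200) in a 448x448 image with a 7x7 grid:
print(responsible_cell(300, 200, 448, 448))  # (3, 4): row 3, column 4
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```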
These are the results of YOLO object detection on a simple fridge dataset used to analyze the internal parts of the fridge.
Learn more about this interesting Deep Learning project at Emuron Technology.
How to implement these object detection algorithms?
You can use any annotation tool to annotate your dataset and customize the classes. This is an amazing tool for YOLO object detection; learn more here.
If you want to use the same dataset with Fast RCNN or Faster RCNN, just change the annotation format with a simple script. These algorithms use slightly different annotation formats: YOLO stores the box center plus width and height as ratios of the image size, whereas Faster RCNN directly uses the four corner coordinates. You need not re-annotate the data for different algorithms; just convert the format with a simple script, as sketched below.
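A minimal sketch of such a conversion between the YOLO format (normalized center, width, height) and corner-style coordinates; the helper names are mine, not from any particular library:

```python
def yolo_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a YOLO box (normalized center x/y, width, height)
    to corner coordinates (xmin, ymin, xmax, ymax) in pixels."""
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return int(xmin), int(ymin), int(xmax), int(ymax)

def corners_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """The reverse conversion, producing normalized YOLO coordinates."""
    cx = (xmin + xmax) / 2 / img_w
    cy = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h

print(yolo_to_corners(0.5, 0.5, 0.25, 0.5, 640, 480))  # (240, 120, 400, 360)
```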
Then you can run any YOLO or Faster RCNN implementation from GitHub using your annotations.
These were some of the best object detection algorithms for understanding the process of object detection. There are many more algorithms, like FCNN and RetinaNet, which you can read about in our other blogs. Do share your reviews in the comment section. Stay tuned to NEURALAI for more interesting artificial intelligence and deep learning projects.