When In Doubt, Go YOLO
Walk around the streets of Los Angeles and I bet you five bucks a Model 3 will pass by every minute or so. Tesla’s best-selling car comes at a “bargain” price of just under $40k, and it is popular for good reason. The company takes pride in being the first to popularize the “smart” electric vehicle, and Tesla’s strongest selling point comes in the form of Autopilot. This feature enables a seemingly ordinary car to drive autonomously without a human driver, a piece of technology long imagined by futurists. How do they achieve this? To put it simply, their cars are fitted with cameras and sensors that gather the necessary data about the car’s surroundings. This data comes in many forms, but for now let’s focus on the images coming from the cameras. These images are then fed through vision-processing algorithms which, according to Tesla, are built on top of “deep neural networks”.
First, I have to apologize because I won’t elaborate further on Tesla Autopilot (and I’m sure some of you are sick of hearing “Tesla this, Tesla that” anyway). Instead, what we’ll discuss next is image processing or, to be more specific, digital image processing. Chakravorty (2018) and Gonzalez (2018) define image processing as the use of digital computers to process digital images through an algorithm. The common steps of image processing are:
- Getting the input. This is usually done by importing the image via image acquisition tools.
- Processing the image by means of analysis and manipulation techniques.
- Producing the output, either a report or an altered image, based on the processing steps taken.
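As a minimal sketch of these three steps, here is a toy pipeline in Python using NumPy, with a synthetic array standing in for an acquired image (in practice the input would come from a camera or a file via a library such as Pillow or OpenCV):

```python
import numpy as np

# 1. Input: a synthetic 8-bit grayscale "image" stands in for acquisition.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# 2. Processing: a simple manipulation -- binary thresholding.
threshold = 128
binary = (image > threshold).astype(np.uint8) * 255

# 3. Output: an altered image plus a small "report" about it.
report = {
    "shape": image.shape,
    "mean_intensity": float(image.mean()),
    "foreground_fraction": float((binary == 255).mean()),
}
print(report)
```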
Looking back at the definition above, emphasis falls on the word “algorithm”. Classical image processing approaches include morphological operations, Gaussian filtering, Fourier transforms, edge detection, and wavelet algorithms. Although these approaches are still largely relevant for understanding the concepts, modern image processing revolves around the use of neural networks (NN). Two of the most common NN models are the Generative Adversarial Network (GAN), with its generator and discriminator models, and the popular Convolutional Neural Network (CNN), whose stacked layers are loosely inspired by the neurons in the human brain.
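To make one of those classical approaches concrete, here is a sketch of edge detection with the standard Sobel kernels, written in plain NumPy (real code would typically use a library such as OpenCV or scipy.ndimage instead of this naive loop):

```python
import numpy as np

def convolve2d(img, kernel):
    """Naive 'valid' 2-D correlation, enough to illustrate the idea."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels approximate horizontal and vertical intensity gradients.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# A toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 255.0

gx = convolve2d(img, sobel_x)
gy = convolve2d(img, sobel_y)
edges = np.hypot(gx, gy)  # gradient magnitude; peaks along the edge
```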
Those approaches, however, still have several major drawbacks. One of the most notable is that they cannot run in real time, or at least cannot complete detection in a single pass. This poses a challenge for real-world cases that demand near-instant detection, such as spotting cars on the road when changing lanes or avoiding collisions. To solve this, a group of researchers introduced an approach called YOLO back in 2016. Over time, this approach grew more popular and evolved into more mature versions, the latest being YOLOv4, released in 2020 (many spin-off works are also available, including the unofficial YOLOv5).
YOLO is an abbreviation of “You Only Look Once”. According to the paper published by its creators, YOLO frames object detection as a single regression problem, predicting bounding boxes and their associated class probabilities directly from image pixels. At its core, YOLO is a type of CNN that runs as one single network, which means the whole pipeline can be optimized end to end for better performance. Keep in mind that YOLO relies on regression, as opposed to the classification-based pipelines used by traditional CNN detectors. YOLO has three major advantages when compared to traditional object detection models:
- Speed: YOLO runs as a single network, enabling real-time detection.
- Accuracy: YOLO may make more localization errors, but it is less likely to predict false positives than traditional models.
- Learning capabilities: YOLO learns very general representations of objects that are useful for cases like generalizing from natural images to artwork.
We now arrive at the heart of the YOLO algorithm: how it works. YOLO is built on three core techniques: residual blocks, bounding box regression, and Intersection over Union (IOU).
The algorithm first takes an input image and splits it into a grid of cells. Object classification and localization are performed independently in each cell. If an object is found, the framework predicts the bounding boxes and assigns class probabilities to the corresponding objects. The confidence score for each cell is calculated by multiplying the objectness probability by the Intersection over Union (IOU) between the predicted box and the ground truth. We will take a more detailed look at IOU later in this article.
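A toy illustration of that confidence computation, using made-up objectness probabilities and IOU values for a hypothetical 3×3 grid (one box per cell, purely for demonstration):

```python
import numpy as np

# Hypothetical 3x3 grid: each cell predicts an objectness probability,
# and we pretend we already know each box's IOU with the ground truth.
p_obj = np.array([[0.1, 0.8, 0.2],
                  [0.0, 0.9, 0.1],
                  [0.0, 0.3, 0.0]])
iou   = np.array([[0.2, 0.7, 0.1],
                  [0.0, 0.85, 0.3],
                  [0.0, 0.4, 0.0]])

# Confidence = Pr(object) * IOU, computed cell by cell.
confidence = p_obj * iou
print(confidence.round(3))
# The center cell dominates: 0.9 * 0.85 = 0.765
```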
As stated before, an object identified in a specific cell is marked with a bounding box. Each bounding box consists of five prediction attributes:
- bh: height of the box
- bw: width of the box
- bx, by: center of the box
- c: class of the object
- pc: confidence of object presence
The (bx, by) coordinates represent the center of the box relative to the bounds of the cell, while the width and height are predicted relative to the whole image. Each cell also predicts whether an object is present, expressed by the confidence prediction pc, which represents the IOU between the predicted box and any ground truth box. If an object is deemed present, the model then determines which class c the object belongs to.
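These conventions can be sketched as a small decoding helper. The grid size and image dimensions used below (a 7×7 grid on a 448×448 image) follow the original YOLO paper, but the function itself is only an illustration of the coordinate conventions, not the paper’s implementation:

```python
def decode_box(bx, by, bw, bh, cell_row, cell_col, grid_size, img_w, img_h):
    """Convert YOLO-style relative predictions to absolute pixel coordinates.

    bx, by: box center, relative to the cell's top-left corner (0..1)
    bw, bh: box width/height, relative to the whole image (0..1)
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    cx = (cell_col + bx) * cell_w          # absolute center x
    cy = (cell_row + by) * cell_h          # absolute center y
    w, h = bw * img_w, bh * img_h
    # Return as (x_min, y_min, x_max, y_max)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A box centered in cell (row 3, col 2) of a 7x7 grid on a 448x448 image,
# covering a quarter of the image in each dimension.
box = decode_box(0.5, 0.5, 0.25, 0.25,
                 cell_row=3, cell_col=2, grid_size=7, img_w=448, img_h=448)
print(box)  # -> (104.0, 168.0, 216.0, 280.0)
```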
Intersection Over Union (IOU)
This last bit is not necessarily an independent step of YOLO. However, IOU is certainly an important part, as it is used to score and filter the predicted bounding boxes. In object detection, IOU describes how much two boxes overlap. Each cell is responsible for predicting the bounding boxes and their confidence scores. The IOU equals 1 if the predicted bounding box is identical to the ground truth. Bounding boxes that deviate too far from the ground truth are then eliminated.
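IOU itself is simple to compute for axis-aligned boxes in (x_min, y_min, x_max, y_max) form; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap -> 25/175, about 0.143
```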
All at once
The execution of YOLO at a glance can be seen in the illustration above. First, it divides the image into an S×S grid, and for each grid cell it predicts B bounding boxes, confidence scores for those boxes, and C class probabilities. Based on the image, three probable classes are present: dog, bicycle, and car. Bounding boxes are formed in cells where objects are present. To determine the final result, the model uses IOU to keep only the predicted bounding boxes that best match the real objects, eliminating redundant boxes that do not fit them. The final result consists of unique bounding boxes that fit the objects tightly. In this case, the dog is marked with a blue box, the bicycle with a yellow box, and the car with a red box.
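In practice, this elimination of redundant boxes is typically implemented with non-maximum suppression. Here is a minimal, self-contained sketch of the greedy variant; the box format, scores, and the 0.5 threshold are illustrative assumptions rather than the paper’s exact settings:

```python
def iou(a, b):
    """IOU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the best-scoring box, discard others that overlap it heavily."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate "dog" boxes and one distinct "car" box.
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (200, 200, 260, 260)]
scores = [0.9, 0.75, 0.8]
print(non_max_suppression(boxes, scores))  # -> [0, 2]: duplicate box 1 is suppressed
```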
As an object detection algorithm, YOLO is relatively simple in both theory and practice compared to traditional neural network models. It is arguably the fastest state-of-the-art general-purpose object detection model currently in use, largely because it can detect objects in real time. YOLO also generalizes well to new domains, making it ideal for applications that rely on fast and robust object detection.
This article was written by someone new to deep learning for educational purposes. Feel free to give correction or voice your opinion by contacting me!
J. Redmon, S. Divvala, R. Girshick and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.