1 Introduction
The field of Artificial Intelligence (AI) encompasses a diverse range of technologies and applications, with its roots in creating intelligent systems that can perform tasks that typically require human intelligence.
One prominent subfield, Computer Vision, focuses on endowing machines with the ability to interpret and understand visual information from the world, opening up possibilities for applications in image analysis, video processing, and augmented reality.
Within the realm of Computer Vision, the You Only Look Once (YOLO) algorithm stands out as a groundbreaking approach to object detection. YOLO’s innovation lies in its unified, real-time processing capabilities, achieved through a single neural network that can simultaneously predict bounding boxes and class probabilities for objects within an image.
The influential paper introducing YOLO, authored by Joseph Redmon and colleagues in 2016 [16], has since garnered widespread attention and has become a foundational reference in computer vision, influencing subsequent developments and applications of object detection technologies. Meanwhile, India's growing population over the last 30 years has driven a corresponding rise in vehicle use.
According to Statista, as of 2023, the population of India is 1.429 billion. A study by Financial Express found that India's middle class comprised about 31% of the population in 2020–21 and is expected to rise to 61% by 2046–47; for this demographic, the two-wheeler is the most sought-after vehicle in India.
Two-wheeler domestic sales rose from 13.57 million in financial year 2022 to 15.86 million in financial year 2023, according to data from Statista. The increasing use of two-wheelers without helmets, together with reckless driving, is causing rider deaths. A Times of India article reports that in 2021, 47,000 Indians died in two-wheeler accidents because they were not wearing helmets.
Head injuries sustained by riders who do not wear helmets are a major cause of these deaths. Addressing this issue requires a comprehensive approach that combines technology and law enforcement. A study shows that using surveillance cameras in traffic has led to decreased road accidents.
The Times of India reports that in the Indian state of Kerala, road accident deaths fell from 1,669 between June 5 and October 31, 2022, to 1,081 over the same period in 2023, following the installation of AI-enabled surveillance cameras.
A study found that wearing a helmet lowers the risk of death by 37% and the risk of head injury by 69% [10]. There is therefore a need to automate helmet detection for proper law enforcement and to reduce two-wheeler fatalities.
The implementation of an automated system for monitoring helmet usage and identifying license plate numbers of non-compliant two-wheelers is a crucial step toward enhancing road safety. AI and computer vision algorithms can analyze real-time CCTV camera footage, enabling the detection of riders without helmets and the retrieval of their license plate numbers.
Our approach uses the state-of-the-art YOLOv8 model to extract the number plates of bike riders without helmets and store them in a database. This information can then be used to enforce helmet usage regulations and educate riders about the importance of helmet safety.
2 Related Work
Deep learning now underpins numerous domains, including pose detection, decision-making, self-driving vehicles, computer vision, and digital image processing. Deep learning models have demonstrated success in a variety of fields, including healthcare [18], the social sciences [11], and the earth sciences [2].
R. Meenu et al. [12] carried out research on helmet detection and number plate extraction using a Faster Region-based Convolutional Neural Network (Faster R-CNN). They used CCTV footage split into frames for analysis. Their methodology comprised four stages: motorcycle detection, head detection, helmet detection, and number plate detection.
They utilized image processing algorithms like the Gabor wavelet filter to get accurate head positions. They achieved an accuracy of around 92%, depending on the quality of the CCTV cameras. However, cases of false detection are not addressed in the solution. Kunal Dahiya et al. [3] applied algorithms like background subtraction to detect only moving motorcycles and deal with false detection rates.
They also used Gaussian models to deal with various environmental detection challenges. Further, after extracting the foreground layer, many image processing algorithms were applied, like a noise filter and a Gaussian filter, and a binary image was obtained. Furthermore, objects were detected only based on a threshold area range that can be likely classified as a motorcycle.
They used techniques like the Histogram of Oriented Gradients (HOG) and the scale-invariant feature transform for feature extraction. For classification, they used a Support Vector Machine (SVM). To remove false detections, they also consolidated the results using information from past frames.
They achieved a frame processing time of 11.58 ms and a frame generation time of around 33 ms, implying high efficiency. However, there is a lack of comprehensive evaluation on a diverse range of datasets, thus limiting the generalizability of the results.
Pushkar Sathe et al. [17] used YOLOv5 for helmet detection, achieving a mean Average Precision (mAP) of 0.995 [15]. They use two methods to check whether the rider is wearing a helmet.
First, they check the overlapping boxes of the helmet, the number plate, and the person, verifying through a set of conditions whether the person is wearing a helmet. The second method uses a range of motorcycle coordinates to check for helmets. Finally, they use EasyOCR for character recognition of number plates.
However, this work suffers from the lack of a diverse dataset that would make the model more generalizable. J. Mistry et al. [13] used YOLOv2 to first detect persons in a frame, citing the model's better performance at detecting persons than motorcycles. It then detects the helmet and, if none is found, proceeds to the number plate.
If no number plate is detected, the model infers that the detected person is a pedestrian. The model achieved an accuracy of 0.9470 for helmet detection. However, this model also suffers from limited generalizability, as not all cases of number plates, riders, and helmet positions are covered. M. M. Shidore and S. P. Narote [19] worked on techniques for efficient and accurate extraction of number plates from vehicles.
They used image processing techniques like histogram equalization and grey-scale conversions to deal with low-resolution images. Candidate number plate areas were extracted, and then true number plate areas were extracted. Character regions were enhanced, and background pixels were weakened.
Further, character segmentation was performed to obtain each number plate character, and finally an SVM was used to classify each character. The final results showed an accuracy of around 85%. However, there is no mention of the dataset used for training and testing the system, which limits evaluation of the proposed approach. Waranusast et al. (2013) [21] proposed a four-step process to automatically identify motorcycles and determine whether their riders are wearing helmets.
Utilizing machine vision methodologies, the system employs algorithms to extract dynamic entities from the scene, distinguishing between motorcycles and other objects. Following this differentiation, it proceeds to enumerate and segment the heads of riders.
Subsequently, a comprehensive analysis is conducted to determine helmet usage, facilitated by a K-Nearest Neighbor (KNN) classifier. This classifier utilizes distinct features extracted from the segmented head regions to discern whether a helmet is present or not.
Through this iterative process, the system effectively identifies motorcyclists, segments their heads, and evaluates helmet compliance. However, the paper does not discuss the model's performance under different lighting conditions or in the presence of occlusion.
Rupesh Chandrakant et al. (2022) [7] used a pre-trained model that uses the YOLO algorithm to detect whether the rider is wearing a helmet or not. Weights were tweaked as per the requirements. The authors created the dataset to ensure relevant data availability.
An accuracy of 96% and a frame detection time of around 1.35 sec were achieved. However, there is a lack of diversity in the dataset, including variations in lighting conditions, camera angles, and different types of helmets, which may limit the generalizability of the model.
Sri Uthra V. et al. (2020) [20] proposed a method comprising motorcycle detection and classification, helmet detection, and license plate recognition. Vehicle classification was performed using an SVM classifier.
Helmet detection was done by applying Convolutional Neural Network (CNN) algorithms to extract image attributes, followed by classification using the SVM classifier. License plate recognition was done using Optical Character Recognition (OCR).
The system utilized background subtraction and feature extraction using the Wavelet Transform. Accuracy was 93% for motorcycle classification, 85% for helmet classification, and about 81% for license plate recognition. The paper, however, did not discuss computational requirements.
Adil Afzal et al. (2021) [1] introduce a deep learning-based methodology for the automatic detection of helmet wear by motorcyclists in surveillance videos. Leveraging the Faster R-CNN model, the approach involves two phases: helmet detection using the Region Proposal Network (RPN) and subsequent recognition of the detected helmets.
Trained on a self-generated dataset from three distinct locations in Lahore, Pakistan, the methodology achieves a notable 97.26% accuracy in real-time surveillance video analysis. Its strengths lie in the effective utilization of deep learning techniques, the accuracy afforded by the Faster R-CNN model, and the realism added by the use of a self-generated dataset from actual surveillance footage.
However, limitations include the lack of detailed information on addressing challenges like low resolution and varying weather conditions, limited generalizability to other locations or datasets, and a lack of discussion on the computational requirements and scalability of the proposed methodology.
Further, Mamidi Kiran Kumar et al. (2023) [9] use the YOLO Darknet deep learning framework to automate the detection of motorcycle riders wearing helmets from images, simultaneously triggering alerts for non-compliance. Through bounding boxes and confidence scores, the model identifies regions of interest like riders, helmets, and number plates.
The dataset used for training encompasses a diverse collection of images with 80 object categories, capturing a broad spectrum of real-world scenarios. The strengths of the model lie in its automated and efficient solution for helmet detection, eliminating the need for manual checks, and its utilization of the YOLO Darknet framework, enabling real-time detection and alert generation.
However, the limitations include the absence of detailed information on performance metrics or evaluation results, making it challenging to assess the model’s accuracy, and a lack of specificity about the training dataset, raising concerns about its representativeness and potential biases.
3 Dataset
For this work, we collected and annotated our own images. We chose the state-of-the-art YOLOv8 for its ability to detect objects in a single pass and for its speed and efficiency in object detection tasks, and we fine-tuned it on our data.
In the subsequent sections, we provide detailed insights into our fine-tuning methodology, including the selection of hyperparameters, the augmentation strategies employed, and the evaluation metrics used to assess the model’s performance. Our objective was to harness the power of YOLOv8 to deliver precise and efficient object detection for our application.
3.1 Dataset Statistics
Our dataset was compiled from both online sources and self-collected data. Since there was no public repository of bike-rider images, we scraped various news articles for images of interest. Figure 2 shows sample images from our dataset. First, a total of 3155 images were sourced online, enriching our dataset with diverse visual data for comprehensive model training. These images were already annotated for our purpose.
Further, we collected 12 videos from outside the National Institute of Technology, Silchar campus. The images were then annotated using the Roboflow online annotation tool. Various image augmentation techniques were also applied to further diversify the dataset and ensure the model remained robust and generalizable.
Techniques like flips, rotation, blur, and adjustment of the RGB channel values were employed to reach a total of 3600 images among the self-collected data. In all, we amassed 6755 images, of which 3600 were self-collected and self-annotated and 3155 were sourced from Roboflow, as shown in Table 1.
3.2 Augmentations Applied on Images
We used various augmentation techniques to improve the training and diversify the dataset.
We applied the following augmentations:
– Horizontal flip.
– Grayscale conversion of coloured images, to simulate nighttime CCTV video feeds.
– Rotation with a magnitude between −15° and +15°.
– Random shear between −16° and +16° in the horizontal direction and between −23° and +23° in the vertical direction.
– Hue and saturation shifts between −25 and +25.
– Gaussian blur of up to 0.75 pixels.
– Brightness changes between −25% and +25%.
– Noise added to 5% of the pixels.
Figure 3 shows various augmentations applied on a sample image from the dataset.
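These augmentations were generated with Roboflow's tools. Purely as an illustration, a subset of them can be reproduced with OpenCV and NumPy as in the sketch below; the function name, probabilities, and kernel size are our assumptions, not the exact Roboflow pipeline:

```python
import random
import cv2
import numpy as np

def augment(img):
    """Apply a random subset of the augmentations listed above.
    Ranges mirror the paper; probabilities and kernel sizes are assumed."""
    h, w = img.shape[:2]
    # Horizontal flip with 50% probability.
    if random.random() < 0.5:
        img = cv2.flip(img, 1)
    # Grayscale conversion (kept 3-channel) to simulate nighttime CCTV feeds.
    if random.random() < 0.25:
        img = cv2.cvtColor(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), cv2.COLOR_GRAY2BGR)
    # Rotation between -15 and +15 degrees about the image centre.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-15, 15), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    # Brightness shift between -25% and +25% of full scale.
    beta = int(random.uniform(-0.25, 0.25) * 255)
    img = np.clip(img.astype(np.int16) + beta, 0, 255).astype(np.uint8)
    # Gaussian blur with sigma up to 0.75 pixels.
    img = cv2.GaussianBlur(img, (3, 3), sigmaX=random.uniform(0.1, 0.75))
    # Replace 5% of the pixels with random noise.
    mask = np.random.rand(h, w) < 0.05
    img[mask] = np.random.randint(0, 256, size=(int(mask.sum()), 3))
    return img
```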
3.3 Dataset Annotation and Validation
We utilized the online Roboflow annotation tools to label the images in YOLO format. In this format, each image has a companion text file with one line per object, giving the class index followed by the normalized centre coordinates, width, and height of its bounding box.
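For illustration, a label file for one image might read as follows, assuming the class-index mapping 0 = Motorcycle, 1 = WithHelmet, 2 = WithoutHelmet, 3 = NumberPlate (both the mapping and the coordinate values here are hypothetical):

```
0 0.512 0.634 0.310 0.420
2 0.498 0.402 0.085 0.110
3 0.530 0.781 0.120 0.045
```

All coordinates are normalized by the image width and height, so the same labels remain valid when the image is resized.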
The Roboflow annotation tools provided us with an interactive interface to accurately mark and label objects of interest in the images. Also, we used Roboflow’s generation tools to apply augmentations. For dataset validation, we inspected and verified the annotated dataset using the built-in validation features of the Roboflow tool.
The tool provides a visual graphic of the annotations, allowing us to quickly verify the completeness of the labelled objects. This manual validation step was important for ensuring the dataset’s quality and removing potential errors in annotations.
4 Methodology
Our first step involves segmenting the video into consecutive frames and then applying image processing techniques for better inference. We use the Open Source Computer Vision Library (OpenCV) to read the video as consecutive frames.
Then, for each frame, we first resize the frame to the standard YOLO input size (480) and apply the following transformations (a code sketch follows the list):
– Grayscale conversion: the RGB channels are combined into a single luminance value per pixel using the standard weighting:

\[ Y = 0.299\,R + 0.587\,G + 0.114\,B \]
– Histogram Equalization [4]: performed after grayscale conversion, this method improves an image's contrast by stretching out the intensity range.
Equalization maps the given intensity distribution (the image histogram) to a wider, more uniform distribution, so that intensity values are spread over the whole range. The remapping function is the cumulative distribution function of the histogram, normalized so that its maximum value is 255.
– Gaussian blur [5]: this blur takes a weighted mean in which neighbourhood pixels closer to the central pixel contribute more weight to the average. It generally helps remove noise from the image.
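A minimal sketch of this read-and-preprocess loop, assuming an input file named traffic.mp4 and the 480-pixel input size mentioned above:

```python
import cv2

def preprocess(frame):
    """Per-frame preprocessing described above (function name is ours)."""
    # Resize to the YOLO input size used in this work.
    frame = cv2.resize(frame, (480, 480))
    # Grayscale conversion: Y = 0.299 R + 0.587 G + 0.114 B.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Histogram equalization stretches the intensity range.
    eq = cv2.equalizeHist(gray)
    # Gaussian blur suppresses sensor noise before background subtraction.
    return cv2.GaussianBlur(eq, (5, 5), 0)

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    processed = preprocess(frame)
cap.release()
```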
Then, we use background subtraction [6] to separate the foreground mask from the image. To enhance the quality of the mask, we apply morphological transformations and then extract the contours whose area exceeds a threshold. This completes the first step: we have the bounding boxes of all moving objects in the frame.
This ensures that non-moving objects are never selected in a frame. Our second step passes the frame through our fine-tuned YOLOv8 model. To avoid sending duplicate boxes, we first take the union of intersecting boxes, as sketched below; then all bounding boxes of moving objects in that frame are sent to the model. The model detects four classes of objects: “Motorcycle”, “WithHelmet”, “WithoutHelmet”, and “NumberPlate”.
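Putting the second step together: a rough sketch of moving-object extraction and box merging, assuming OpenCV's MOG2 background subtractor; the weights path and area threshold are placeholders, not our exact values:

```python
import cv2
from ultralytics import YOLO

model = YOLO("helmet_yolov8.pt")   # hypothetical path to fine-tuned weights
backsub = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
MIN_AREA = 1500                    # contour-area threshold; value assumed

def moving_boxes(frame):
    """Bounding boxes (x, y, w, h) of moving objects in a frame."""
    mask = backsub.apply(frame)
    # Morphological opening then closing cleans speckle from the mask.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > MIN_AREA]

def merge_intersecting(boxes):
    """Union every pair of intersecting boxes so duplicates are not sent twice."""
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                ax, ay, aw, ah = boxes[i]
                bx, by, bw, bh = boxes[j]
                if ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah:
                    x, y = min(ax, bx), min(ay, by)
                    boxes[i] = (x, y, max(ax + aw, bx + bw) - x,
                                max(ay + ah, by + bh) - y)
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes

# Each merged region is cropped and passed to the fine-tuned detector,
# e.g. for x, y, w, h in merge_intersecting(moving_boxes(frame)):
#          detections = model(frame[y:y + h, x:x + w])
```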
With reference to Algorithm 1 and Figure 4, our pipeline first stores the bounding boxes of the NumberPlate, Motorcycle, and WithoutHelmet classes. For every motorcycle bounding box, we consider only the top 40% of the box, as this is the most likely region in which to find a rider's head.
Then, for each WithoutHelmet and NumberPlate detection, we check whether both lie within the same motorcycle's bounding box. If so, we know that one of the riders is not wearing a helmet and that the detected number plate belongs to that motorcycle, so the number plate coordinates are extracted and saved for further processing. A sketch of this association logic follows.
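A minimal sketch of the association step, assuming boxes in (x1, y1, x2, y2) form and a centre-containment test (the containment criterion is our choice; Algorithm 1 may use a different one):

```python
def inside(inner, outer):
    """True if the centre of `inner` lies within `outer`; boxes are (x1, y1, x2, y2)."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]

def violations(motorcycles, no_helmets, plates):
    """Return number-plate boxes of motorcycles carrying a helmet-less rider."""
    hits = []
    for m in motorcycles:
        x1, y1, x2, y2 = m
        # Only the top 40% of the motorcycle box can contain a rider's head.
        head_region = (x1, y1, x2, y1 + 0.4 * (y2 - y1))
        if any(inside(h, head_region) for h in no_helmets):
            for p in plates:
                if inside(p, m):
                    hits.append(p)  # plate saved for later OCR / database storage
    return hits
```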
4.1 Model Parameters
We used the Adam optimizer [8] during training. The learning rate and momentum were set to 0.00125 and 0.8, respectively; weight decay was 0.0005 for 104 weight tensors and 0.0 for the remaining 97. The number of epochs was 55, and the batch size was set to 16.
We evaluate the mean Average Precision of the object detections to measure the performance of our model. The Intersection over Union (IoU) [14] thresholds used to judge predicted bounding boxes against the ground truth range from 0.50 to 0.95 in steps of 0.05.
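For reference, the reported configuration maps directly onto the Ultralytics training API, as in the sketch below; the base checkpoint and dataset YAML filenames are illustrative placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")          # assumed base checkpoint
model.train(
    data="helmet_dataset.yaml",     # hypothetical dataset config
    epochs=55,
    batch=16,
    imgsz=480,
    optimizer="Adam",
    lr0=0.00125,
    momentum=0.8,
    weight_decay=0.0005,
)
metrics = model.val()               # reports mAP50 and mAP50-95 (IoU 0.50:0.95)
```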
5 Results
In object detection, precision, recall, and mAP are commonly used metrics to evaluate the performance of a model such as YOLO. Precision, recall, and mAP can be defined as follows:
Precision measures the accuracy of the positive predictions made by an object detection model. It is defined as the ratio of true positives to the total predicted positives:

\[ \text{Precision} = \frac{TP}{TP + FP} \]
True Positives are correctly predicted positive instances, and false positives are those predicted as positive but actually negative. In the context of object detection, a “positive” prediction typically means the model correctly identified and localized an object of interest.
Recall, also known as sensitivity or the true positive rate, measures the ability of an object detection model to capture all relevant instances. It is defined as the ratio of true positives to the total actual positives:

\[ \text{Recall} = \frac{TP}{TP + FN} \]
where true positives are correctly predicted positive instances and false negatives are instances that are actually positive but were predicted as negative. Recall assesses how well the model captures all instances of the objects in the dataset. The mAP at IoU 0.5 is calculated by averaging the per-class average precision (AP) at an IoU threshold of 0.5.
The AP of each class is the area under its precision–recall curve:

\[ AP_i = \int_0^1 p_i(r)\,dr \]

where \( p_i(r) \) is the precision of class \( i \) at recall \( r \). The mean Average Precision is then the mean of the per-class AP values:

\[ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \]

where \( N \) is the number of classes.
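A detection counts as a true positive when its IoU with a ground-truth box exceeds the threshold. A minimal sketch of the IoU computation, assuming boxes in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```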
5.1 Testing Results
The mAP serves as the principal performance metric, with higher values indicating better overall object detection accuracy; per-class results on the test set are reported in Table 2. Further analysis and adjustments may be considered to optimize and enhance model performance.
5.2 Training Results
The model training dataset comprises a total of 6755 images. The dataset is divided into three subsets: the testing, validation, and training sets. The testing set consists of 726 images, serving as a separate portion for assessing the model’s performance. The validation set, consisting of 755 images, is employed for fine-tuning and parameter optimization during the training process.
The majority of the dataset, totalling 5274 images, forms the training set, providing the foundation for training the model to recognize and generalize patterns from the input images. Figure 5 shows metrics for training and validations.
Our model was evaluated on a diverse test set containing 726 images with 2600 instances across all classes, achieving promising results for every class. Figure 6 shows some inferences from our model. The overall performance, indicated by the “all” class, demonstrated a high mAP50 of 93.6% and mAP50-95 of 75.1%, underscoring the robustness of the model.
The motorcycle class also exhibited strong performance, achieving a mAP50 of 95.2%. Additionally, the model performed well in identifying withHelmet and withoutHelmet instances, showcasing its versatility across diverse object detection scenarios. The overall performance metrics are shown in Table 2 and Figure 7. However, our model showed variations in performance across classes.
Table 1: Sources of the dataset.

| Source | Total images |
|---|---|
| Outside the campus | 12 videos collected |
| Online sources, including Google and news articles | 3600 |
| Private Roboflow repository | 3155 |
Table 2: Per-class results on the test set (P = box precision, R = recall).

| Class | Images | Instances | P | R | mAP50 | mAP50-95 | Correct instances |
|---|---|---|---|---|---|---|---|
| all | 726 | 2600 | 0.932 | 0.907 | 0.936 | 0.751 | 2402 |
| licensePlate | 726 | 762 | 0.946 | 0.966 | 0.964 | 0.755 | 737 |
| motorcycle | 726 | 819 | 0.924 | 0.939 | 0.952 | 0.845 | 778 |
| withHelmet | 726 | 686 | 0.902 | 0.834 | 0.887 | 0.672 | 586 |
| withoutHelmet | 726 | 333 | 0.955 | 0.888 | 0.939 | 0.733 | 301 |
Though the licensePlate and motorcycle classes achieved outstanding results, the withHelmet and withoutHelmet classes showed lower precision and recall values, indicating room for optimization. Per image, preprocessing takes 0.8 ms, inference 29.2 ms, and postprocessing 3.5 ms, demonstrating the model's efficiency in real-time applications.
In summary, our model with YOLOv8 architecture demonstrated high accuracy in detecting and localizing objects across multiple classes. The detailed class-wise metrics provide insights into the model’s strengths and areas for refinement, informing potential adjustments or fine-tuning strategies to enhance its overall performance.
6 Conclusion
This paper presented the development and evaluation of our fine-tuned YOLOv8 model for detecting bike riders without helmets and extracting their number plates. We employed various augmentation techniques to improve the accuracy and robustness of our model. The results show a high mAP50 score of 0.936 on the testing data, with the majority of classes correctly labelled regardless of the lighting and weather conditions in the images or videos, demonstrating that the model works under diverse scenarios. Our model can also be deployed efficiently in real-time applications to monitor traffic in cities and on highways. It will help law enforcement agencies enforce helmet laws properly and reduce fatalities resulting from failure to wear helmets, undeniably contributing to saving lives.
Further improvements can be made by increasing the size of the dataset. We anticipate that our efforts will serve as a catalyst for additional investigations in this field, fostering the creation of models that are more precise and more efficient in enhancing safety for individuals on motorcycles, including riders, passengers, and fellow commuters on the road.