1 Introduction

Road damage is a significant cause of vehicle damage, pedestrian injuries, and road accidents [1]. When vehicles encounter severe road defects at high speeds, they may experience sudden steering deviations, increasing the risk of crashes. In 2019 alone, road damage contributed to approximately 364,800 accidents in Malaysia [1]. Similarly, recent government data indicate that pothole-related accidents resulted in 5626 fatalities in India between 2018 and 2020 [2]. These statistics highlight the critical need for an effective automated road damage detection system to minimize accidents and improve road safety.

Fig. 1 ADAS field of view and sensors

Advanced Driver-Assistance Systems (ADAS) enhance vehicle safety by continuously monitoring the surrounding environment with a 360-degree field of view (see Fig. 1) and taking appropriate action. ADAS relies on multiple sensors, including cameras, LiDAR, radar, and ultrasound, whose outputs are fused and processed together. Computer vision software analyzes real-time video streams and sensor data, allowing ADAS to detect obstacles, identify hazards, and react faster than a human driver [3, 4]. Integrating road damage detection into ADAS could reduce accident risks by helping vehicles identify road hazards. However, this integration is challenging because damages such as potholes and cracks are small and irregularly shaped, making them difficult to detect against complex backgrounds. Variations in illumination conditions further degrade detection accuracy. A critical challenge is balancing inference speed and detection accuracy to ensure real-time performance without sacrificing reliability.

This paper extensively reviews the datasets employed for road damage detection, discusses state-of-the-art deep learning models, and analyzes and compares attention mechanisms for small object detection. Additionally, it summarizes previous results achieved in road damage detection and identifies key challenges that must be addressed. The paper is organized as follows: Sect. 2 discusses the evaluation metrics for object detection models. Section 3 reviews road damage detection datasets and competitions, comparing their characteristics. Section 4 explains object detection methodologies, focusing on various models and attention mechanisms, particularly the evolution of YOLO from V1 to V12. Section 5 summarizes previous research results, including findings from CRDDC2022 [5] and other datasets. Section 6 discusses the limitations and challenges of implementing road damage detection systems. Finally, Sect. 7 presents the conclusion and future research directions.

2 Evaluation metrics

To assess the performance of road damage detection algorithms, various evaluation metrics (see Eq. 1) are employed to quantify the accuracy and correctness of the detected damages [6].

$$\begin{aligned} \mathrm{{recall}}=\frac{\mathrm{{TP}}}{\mathrm{{TP}}+\mathrm{{FN}}}, \end{aligned}$$
(1a)
$$\begin{aligned} \mathrm{{precision}}=\frac{\mathrm{{TP}}}{\mathrm{{TP}}+\mathrm{{FP}}}, \end{aligned}$$
(1b)
$$\begin{aligned} \mathrm{{F1-score}}=\frac{2 \times \mathrm{{precision}} \times \mathrm{{recall}}}{\mathrm{{precision}}+\mathrm{{recall}}}, \end{aligned}$$
(1c)
$$\begin{aligned} \mathrm{{IoU}}=\frac{\mathrm{{area}}\,\mathrm{{of}}\,\mathrm{{intersection}}}{\mathrm{{area}}\,\mathrm{{of}}\,\mathrm{{union}}}, \end{aligned}$$
(1d)
$$\begin{aligned} \mathrm{{confidence}}=\mathrm{{IoU}}*p_r(\mathrm{{object}})*p_r(\mathrm{{class}}_i|\mathrm{{object}}). \end{aligned}$$
(1e)

Recall measures the model’s ability to detect actual road damage correctly, ensuring that significant damages are not missed. Precision assesses the accuracy of the model’s positive predictions, helping to reduce false alarms in automated damage detection systems, such as misclassifying shadows and lane markings as road damage. The F1-score combines precision and recall into a single measure of accuracy. This is particularly important for road damage datasets, which often exhibit class imbalances between damaged and undamaged surfaces, as it ensures balanced performance without favoring one metric over the other. The IoU metric evaluates how accurately a predicted bounding box overlaps with the ground-truth box. Since object detection relies on precise damage localization, higher IoU values indicate better alignment between predictions and the actual damage annotations, ensuring accurate mapping and responses. Confidence indicates how certain the model is that a detected box contains an object and how accurate its prediction is. High-confidence predictions can be trusted, while low-confidence ones need verification. Inference time is the duration the model takes to make a prediction, which is essential for calculating frames per second (FPS) in real-time applications.

The metrics discussed are interconnected in several ways. Increasing the confidence threshold leads to higher precision because only the most certain detections are retained. However, this improvement in precision can result in lower recall, as some true positives may be overlooked. Similarly, using a higher IoU threshold enhances the localization accuracy but may decrease the number of true positives, which can also lower recall. Additionally, lightweight models provide faster inference speeds, often at the expense of accuracy, potentially reducing precision and recall.

The confusion matrix shown in Fig. 2 contains four counts: True Positives (TP), the positive instances correctly predicted; True Negatives (TN), the negative instances correctly predicted; False Positives (FP), the negative instances incorrectly predicted as positive; and False Negatives (FN), the positive instances incorrectly predicted as negative [7].
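As a concrete illustration, the confusion-matrix counts and the metrics in Eq. (1) can be computed in a few lines of Python. This is a minimal sketch; the helper names and the (x1, y1, x2, y2) box format are our own illustrative choices, not taken from any specific library:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)  # Eq. (1d)

def precision_recall_f1(tp, fp, fn):
    """Eqs. (1a)-(1c) from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, two unit-overlap boxes `[0, 0, 2, 2]` and `[1, 1, 3, 3]` yield an IoU of 1/7, and a detector with 8 TP, 2 FP, and 2 FN scores 0.8 on all three metrics.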

Fig. 2 Confusion matrix

3 Available datasets and limitations

Table 1 summarizes the available datasets, highlighting key characteristics such as collection methods, annotation formats, class distributions, and geographic diversity. The RDD2018 dataset was introduced in the first road damage detection competition (RDDC2018) and consisted of 9,053 images with 15,435 annotated road damage instances collected from Japan [8]. It contains eight damage classes as follows: longitudinal linear cracks (D00), transverse cracks (D10), alligator cracks (D20), potholes (D40), longitudinal construction joint parts (D01), transverse construction joint parts (D11), crosswalk blur (D43), and white/yellow line blur (D44). The class distribution shows that D10, D11, D40, and D43 are underrepresented, with 742, 636, 409, and 817 instances, respectively. Meanwhile, D00, D01, D20, and D44 are overrepresented, with 2768, 3789, 2541, and 3733 instances, respectively. The images have a resolution of 600 \(\times\) 600 pixels and are annotated in PASCAL VOC format. Following RDD2018, the RDD2020 dataset was launched in the second competition, RDDC2020, to enhance road damage data collection in Japan, India, and the Czech Republic [9]. It consisted of 21,041 images and over 31,000 annotated instances, covering the same four core damage classes: D00, D10, D20, and D40. The dataset also had a class imbalance, with D00 (6592 instances) and D20 (8381 instances) being overrepresented, while D40 (5627 instances) was underrepresented. The image resolutions were 600\(\times\)600 or 720\(\times\)720 pixels, and annotations followed the PASCAL VOC format.

Fig. 3 a The D00 damage type, b The D10 damage type, c The D20 damage type, d The D40 damage type [10]

The RDD2022 dataset was introduced in the third road damage detection competition (CRDDC2022), significantly expanding on RDD2020 by including images from India, Japan, Norway, the United States, the Czech Republic, and China. This resulted in 47,420 images with 55,007 annotated instances [5]. It focused on D00, D10, D20, and D40, as shown in Fig. 3. However, class imbalance remains an issue, with D00 (26,016 instances) being overrepresented and D40 (6544 instances) being the least represented. The image resolutions range from 512\(\times\)512 to 4040\(\times\)2035 pixels, and the annotations are in PASCAL VOC format.

The UAV-PDD2023 road damage dataset was collected using an Unmanned Aerial Vehicle (UAV) at an altitude of 30 meters [11]. This dataset was collected in China with 2440 images and 11,158 annotated instances. It has six damage types: longitudinal cracks (LC), transverse cracks (TC), alligator cracks (AC), oblique cracks (OC), repair (RP), and potholes (PH). The images, taken under sunny conditions and one hour after rain, have a high resolution of 2592 \(\times\) 1944 pixels and are annotated in PASCAL VOC format.

A dataset of thermal images of potholes was collected in [12]. It contained 500 images, which included both potholes and normal roads. Information about the resolution and annotation is not available.

A publicly available dataset on Kaggle [13] includes 665 images with 1740 pothole annotations collected from various online sources. However, it has some limitations: it solely focuses on potholes and captures images from diverse perspectives rather than a vehicle’s viewpoint. The dataset follows the PASCAL VOC annotation format and is licensed under the Database Contents License (DbCL) v1.0. In 2015, Stellenbosch University developed a road damage dataset using a smartphone mounted on a vehicle’s dashboard [14, 15]. It contains 47,804 high-resolution images (3680\(\times\)2760 pixels) with 800 annotated potholes, each marked with x- and y-coordinates for precise localization. The dataset is divided into two subsets: a simpler one with clearly visible potholes and a more complex one with both positive samples (roads with potholes) and negative samples (roads without potholes).

The EGY_PDD dataset, collected in Port Said, Egypt, uniquely contributes by introducing 2D and 3D imaging techniques for road damage detection [16]. The 2D dataset consists of 14,612 images with 19,528 annotated instances, covering 11 damage types: rutting (D0), reflective and transverse cracks (D1), block cracks (D2), longitudinal cracks (D3), alligator cracks (D4), patching (D5), potholes (D6), bleeding (D7), corrugation (D8), raveling and weathering (D9), and bumps and sags (D10). The dataset, annotated in YOLO format, has a class imbalance, with D3 being overrepresented (7388 instances) and D5, D8, D9, and D10 underrepresented (489, 145, 672, and 108 instances, respectively). The image resolutions range from 320\(\times\)320 to 1280\(\times\)1280 pixels. The EGY_PDD 3D dataset, captured with an Intel RealSense depth camera D455, includes 4323 images and 8370 road damage instances. It provides distance information to assess pavement distress severity. The data are in Point Cloud Data (PCD) format and can be requested, subject to certain restrictions. A pothole dataset was collected in Slovakia under various weather and lighting conditions to improve model generalization [17]. It includes 2099 images with 3591 pothole instances captured in clear weather, sunset, evening, night, and rain, all at a resolution of 1920 \(\times\) 1080 pixels. This is the only dataset in this survey that accounts for environmental variations.

Table 1 Datasets summary comparison

These datasets are limited to specific regions like Japan, India, and Norway, creating challenges for deployment in countries with different road types and weather conditions. Most images are taken during daylight and in clear weather, which affects performance in nighttime, foggy, or rainy conditions. One dataset includes varied weather conditions but lacks foggy and snowy scenarios. The class distribution is also imbalanced, with potholes underrepresented compared to cracks, introducing potential bias in detection. Furthermore, most datasets are in 2D, providing only X and Y damage coordinates, so more 3D images are needed for accurate localization in ADAS.

As for the datasets’ privacy considerations, in the RDD series (RDD2018, RDD2020, RDD2022), any person’s face or car license plate visible in an image is blurred out based on visual inspection. In the UAV-PDD2023, the data is collected using a drone, which makes it difficult to record a car license plate, and no privacy considerations were mentioned. No privacy considerations were mentioned for the Kaggle, Stellenbosch University, EGY_PDD, and Slovakia pothole datasets. All these datasets are publicly available except the EGY_PDD and thermal datasets, which are available upon request from the authors.

4 RDD methodologies

Road damage detection can be conducted using various deep learning models that differ in inference rate, accuracy, and complexity. CNN architectures are fundamental for feature extraction in image classification but are insufficient on their own for object detection, which requires precise localization and classification. Advanced models such as YOLO, SSD, and R-CNN combine CNNs with detection mechanisms, enhancing accuracy and real-time performance and making them practical for road damage detection. Object detection models fall into two categories: two-stage detectors (e.g., R-CNN) and one-stage detectors (e.g., YOLO and SSD).

4.1 Deep learning models

The CNN architecture is a multi-layer neural network consisting of convolutional, fully connected (FC), and sub-sampling layers [18]. Convolutional layers use kernels to compute 2D activation maps, extracting image features at specific spatial positions, while pooling layers reduce the spatial size of representations to minimize parameters and avoid overfitting. The fully connected layer connects all activations from the previous layer using matrix multiplication and a bias offset. Loss functions such as cross-entropy regulate training by comparing predicted and actual labels. CNNs serve as image feature extractors; well-known CNN models include LeNet, AlexNet, ZFNet, GoogleNet, VGGNet, ResNet, Inception, ResNeXt, SENet, MobileNet, DenseNet, Xception, and EfficientNet [18].
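The convolution and pooling operations described above can be sketched in plain NumPy. This is a simplified, single-channel illustration that ignores stride, padding, batching, and learned weights:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel,
    producing an activation map as in a convolutional layer."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, shrinking each spatial dimension by `size`."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))
```

For instance, convolving a 4 \(\times\) 4 image with a 3 \(\times\) 3 kernel yields a 2 \(\times\) 2 activation map, and 2 \(\times\) 2 max pooling halves each spatial dimension, which is how deeper layers trade resolution for abstraction.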

The R-CNN is a two-stage detector that uses selective search to propose bounding boxes, resizing each region before passing it through a CNN for feature extraction [19]. Fast R-CNN is an enhanced version of R-CNN that improves speed by processing the entire image through a CNN and using Region of Interest (RoI) pooling to extract features from the feature map. It classifies objects and refines bounding boxes simultaneously but still depends on selective search, making it faster than R-CNN, though not the fastest option available [20]. Faster R-CNN (see Fig. 4) advances Fast R-CNN by using a Region Proposal Network (RPN), eliminating the need for selective search [21]. The RPN creates object proposals directly from the feature map, with multi-scale detection using anchor boxes. Although it is the fastest R-CNN variant in inference speed, it remains slower than one-stage detectors.

Fig. 4 Faster R-CNN abstraction

The SSD (see Fig. 5) is a one-stage detector that directly predicts bounding boxes and class scores in a single pass over the image [22]. It uses multiple feature maps at different scales, with default anchor boxes of various aspect ratios, to detect objects of varying sizes. SSD is faster than Faster R-CNN, though it is not as accurate as the R-CNN family, particularly on small objects.

The YOLOv1 model (see Fig. 6) is a one-stage regression model that detects objects by dividing the image into grids and using a class probability map for each cell [23]. It is significantly faster than Faster R-CNN but struggles with detecting small and overlapping objects, and its bounding boxes are not as precise as those of the R-CNN family. YOLOv2 uses anchor boxes, improving bounding-box accuracy, and introduces batch normalization. YOLOv3 introduced multi-scale detection with a deeper network, using Feature Pyramid Networks (FPN) to detect small, medium, and large objects. The YOLOv3 model is slower than SSD but faster than R-CNN.

A comparative survey of object detection models, including R-CNN, Fast R-CNN, Faster R-CNN, YOLOv1, YOLOv2, YOLOv3, and SSD, was conducted using the Pascal VOC and COCO datasets in [24]. Table 2 shows the models’ performance on the Pascal VOC 2007 dataset. Table 3 shows that one-stage detectors are faster but less accurate for small objects, while two-stage detectors are slower but more accurate.

Fig. 5 SSD abstraction

Fig. 6 YOLO abstraction

Table 2 Model performance on Pascal VOC 2007 dataset [24]
Table 3 Comparison of one-stage and two-stage models
Table 4 Performance comparison of SSD and YOLOv3 on COCO dataset [24]

The analysis of different object detection models revealed that YOLO variants achieved the highest mean Average Precision (mAP%) and Frames Per Second (FPS). YOLOv2, with a 544\(\times\)544 input image size, achieved the highest mAP of 78.6% with an inference speed of 40 FPS, as shown in Table 2. Meanwhile, YOLOv2, with a smaller 288\(\times\)288 input size, had the fastest inference rate at 91 FPS but a lower mAP of 69.0%. The highest inference rate recorded for SSD models was 46 FPS, while the best mAP reached 76.8%. R-CNN models exhibited significantly slower inference speeds, with the fastest being just 7 FPS, too slow for real-time applications. The highest mAP achieved by R-CNN was 76.4%. From Table 4, COCO benchmark results showed that YOLOv3 with a 320\(\times\)320 input size had the fastest inference rate at 45.45 FPS, while YOLOv3 with a 608\(\times\)608 input size achieved the highest mAP of 33%. The findings reveal a trade-off between input size, inference speed, and accuracy: smaller input sizes improve speed but reduce accuracy. YOLO models offer the best balance between these factors. However, their relatively low mean Average Precision (mAP) makes them unsuitable for ADAS. Adding attention modules is crucial to improve accuracy without significantly increasing complexity or inference time.

4.2 Attention modules

Attention modules improve feature extraction by enhancing important features and suppressing noise and irrelevant features in the feature maps. These feature maps encode the edges and patterns extracted from the input data and have three main dimensions: width (W), height (H), and number of channels (C).

Common attention modules include Coordinate Attention (CA), non-local/self-attention (NLNet), the Convolutional Block Attention Module (CBAM), and the Squeeze-and-Excitation network (SENet).

The Convolutional Block Attention Module (CBAM) (see Fig. 7) enhances feature extraction by sequentially inferring two types of attention maps: channel attention and spatial attention [25]. Channel attention assigns more weight to channels with important features in the feature map. It compresses the input feature map along its spatial dimensions using global average and global max pooling to obtain each channel’s mean and strongest activations. These are passed through a shared multi-layer perceptron (MLP) followed by a sigmoid activation to produce the channel attention map of shape C \(\times\) 1 \(\times\) 1, which is multiplied with the original feature map. The spatial attention identifies the areas of the feature map that require focus. It takes the channel-refined feature map as input and compresses it along the channel dimension (C), preserving only the spatial details: applying max pooling and average pooling across channels yields two feature maps of shape 1 \(\times\) H \(\times\) W. These maps are concatenated and processed through a convolution layer followed by a sigmoid activation to produce the spatial attention map. Finally, this map is multiplied with the channel-refined feature map.
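A minimal NumPy sketch of the CBAM forward pass may help make the two stages concrete. The weight shapes and the naive k \(\times\) k convolution loop are illustrative assumptions; a real implementation would learn these parameters and use an optimized convolution:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, w1, w2, conv_k):
    """Simplified CBAM forward pass on a feature map x of shape (C, H, W).
    w1: (C//r, C) and w2: (C, C//r) are the shared MLP weights (reduction r);
    conv_k: (2, k, k) is the spatial-attention convolution kernel."""
    C, H, W = x.shape
    # --- channel attention: squeeze spatial dims with avg and max pooling ---
    avg = x.mean(axis=(1, 2))                    # mean activation per channel
    mx = x.max(axis=(1, 2))                      # strongest activation per channel
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)   # shared two-layer MLP with ReLU
    ch_att = sigmoid(mlp(avg) + mlp(mx))         # (C,) channel attention weights
    x = x * ch_att[:, None, None]                # channel-refined feature map
    # --- spatial attention: squeeze channels, then a single k x k conv ---
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W) descriptor
    k = conv_k.shape[-1]; p = k // 2
    desc = np.pad(desc, ((0, 0), (p, p), (p, p)))      # same-size padding
    sp = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            sp[i, j] = np.sum(desc[:, i:i+k, j:j+k] * conv_k)
    return x * sigmoid(sp)[None, :, :]           # spatially refined output
```

The output keeps the input shape (C, H, W); only the relative weighting of channels and spatial positions changes.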

Fig. 7 CBAM attention module

The Coordinate Attention (CA) is used to enhance the model’s accuracy without increasing the model’s complexity. It emphasizes the significant pixels in an image while tracking their positions using patterns and edges identified from the input feature maps. The CA employs 1D pooling to compress and divide the feature map into two vectors, as illustrated in Fig. 8. As shown in Eq. (2), this process results in one vector of size C \(\times\) 1 \(\times\) W, obtained by pooling along the height, and another vector of size C \(\times\) H \(\times\) 1, created by pooling along the width.

$$\begin{aligned} z_H^c (h)=\frac{1}{W} \sum _{0 \le i < W} x_c(h,i), \end{aligned}$$
(2a)
$$\begin{aligned} z_W^c (w)=\frac{1}{H} \sum _{0 \le j < H} x_c (j,w). \end{aligned}$$
(2b)

A Conv2d layer concatenates and operates on the vectors \(Z_H\) and \(Z_W\) for channel reduction, integrating information from both directions to capture their relationships. Batch normalization and a non-linear activation function (such as sigmoid or ReLU) are then applied, improving feature learning. Two attention maps, one for the horizontal direction and one for the vertical, are then created from the processed feature vector. Each attention map is run through a second convolution layer (Conv2d) to learn attention representations, and a sigmoid activation function maps the output to attention weights ranging from 0 to 1. These attention weights are multiplied with the original feature map to suppress less important features and enhance the important ones [26].
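The CA computation can likewise be sketched in NumPy. The 1 \(\times\) 1 convolutions are represented as plain matrix multiplications over the channel dimension, batch normalization is omitted for brevity, and the weights are illustrative assumptions rather than learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_reduce, w_h, w_w):
    """Simplified Coordinate Attention forward pass on x of shape (C, H, W).
    w_reduce: (C_mid, C) shared channel-reduction 1x1 conv;
    w_h, w_w: (C, C_mid) per-direction 1x1 convs."""
    C, H, W = x.shape
    z_h = x.mean(axis=2)               # (C, H): 1D average pooling along width
    z_w = x.mean(axis=1)               # (C, W): 1D average pooling along height
    # concatenate along the spatial axis, reduce channels, apply non-linearity
    y = np.maximum(w_reduce @ np.concatenate([z_h, z_w], axis=1), 0)  # (C_mid, H+W)
    # split back into the two directions and form attention weights in [0, 1]
    a_h = sigmoid(w_h @ y[:, :H])      # (C, H) vertical attention
    a_w = sigmoid(w_w @ y[:, H:])      # (C, W) horizontal attention
    # broadcast both maps over the original feature map
    return x * a_h[:, :, None] * a_w[:, None, :]
```

Because the two pooled vectors keep one spatial coordinate each, the resulting weights encode *where* along each axis the informative features lie, which is what distinguishes CA from purely channel-wise schemes such as SENet.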

Fig. 8 Coordinate attention module

CBAM is more computationally complex than CA. The CBAM is often used in applications requiring general attention, such as classification tasks, while CA is particularly useful for tasks that demand precise spatial understanding, like object detection and segmentation.

4.3 YOLO models evolution

This subsection investigates the evolution of the YOLO models from YOLOv1 to YOLOv12, focusing on architecture, number of parameters, inference rate, accuracy, and loss functions. To detect objects, the YOLO model divides the image into a grid. Each grid cell is assigned a class probability indicating whether an object class is detected (1) or not (0). The model then generates candidate boxes with normalized x and y coordinates, along with the width and height of the bounding boxes. The correct class is labeled as 1, while all other classes are labeled as 0, as illustrated in Fig. 9.
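A small sketch of this label-encoding scheme follows, assuming a single box per cell and an S \(\times\) S grid. The function name and the target layout (box offsets, objectness, then one-hot class scores) are illustrative, not the exact format of any specific YOLO version:

```python
def encode_label(box, num_classes, class_id, S=7):
    """Encode one ground-truth box (cx, cy, w, h), all normalized to [0, 1],
    into a YOLO-style grid target of shape (S, S, 5 + num_classes)."""
    cx, cy, w, h = box
    target = [[[0.0] * (5 + num_classes) for _ in range(S)] for _ in range(S)]
    col, row = int(cx * S), int(cy * S)            # responsible grid cell
    x_cell, y_cell = cx * S - col, cy * S - row    # offset within that cell
    target[row][col][0:5] = [x_cell, y_cell, w, h, 1.0]  # box + objectness = 1
    target[row][col][5 + class_id] = 1.0           # one-hot class label
    return target
```

A box centered at (0.5, 0.5) on a 7 \(\times\) 7 grid lands in cell (3, 3) with an in-cell offset of (0.5, 0.5); all other cells keep objectness 0, matching the 0/1 labeling illustrated in Fig. 9.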

Fig. 9 Understanding YOLO labels

The main loss functions used in YOLO are the localization loss, confidence loss, and classification loss for each grid cell. Localization loss measures how accurately the predicted bounding box matches the ground truth by calculating the difference between the predicted coordinates (x, y), width, and height using Mean Squared Error (MSE), ensuring precise object localization. Confidence loss determines how certain the model is that a bounding box contains an object, computed using Binary Cross-Entropy (BCE) or MSE, which helps differentiate between objects and the background. Lastly, classification loss applies Cross-Entropy Loss to assess the accuracy of the predicted class label by comparing predicted class probabilities with the ground truth.
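These three components can be illustrated for a single responsible grid cell as follows. This is a simplified sketch with a hypothetical `lambda_coord` weighting, not an exact reimplementation of any YOLO version's loss:

```python
import math

def yolo_cell_loss(pred, target, lambda_coord=5.0):
    """Combined loss for one responsible grid cell (illustrative sketch).
    pred/target are dicts with 'box' = (x, y, w, h), 'conf' in [0, 1],
    and 'probs' = a class probability distribution."""
    # localization: MSE over the box coordinates (Eq.-style, unweighted here)
    loc = sum((p - t) ** 2 for p, t in zip(pred['box'], target['box']))
    # confidence: binary cross-entropy against the objectness target
    p, t = pred['conf'], target['conf']
    conf = -(t * math.log(p) + (1 - t) * math.log(1 - p))
    # classification: cross-entropy against the one-hot class label
    cls = -sum(t * math.log(p) for p, t in zip(pred['probs'], target['probs']) if t > 0)
    return lambda_coord * loc + conf + cls
```

With a perfectly localized box, a confidence of 0.9 against an objectness target of 1, and 0.8 probability on the true class, the loss reduces to \(-\ln 0.9 - \ln 0.8\), showing how the confidence and classification terms dominate once localization is correct.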

The YOLO series, from YOLOv1 to YOLOv12, had major improvements in their architectures. Table 5 summarizes the evolution of these models, focusing on metrics such as mAP@0.5:0.95, inference rate, and number of parameters based on the COCO dataset benchmark for medium-sized models. YOLOv1 started as a CNN-based network consisting of 24 convolutional layers and two fully connected layers for predicting bounding box coordinates and probabilities [27]. However, it only detects a maximum of two objects per grid cell and struggles with varying aspect ratios. YOLOv2 has 19 convolutional layers and introduces anchor boxes and pass-through layers to improve object localization. Key enhancements include the addition of batch normalization, the implementation of a high-resolution classifier through pre-training on the ImageNet dataset, the removal of dense layers in favor of relying entirely on convolutional layers, and the introduction of anchor boxes for predicting bounding boxes [28]. This model achieved a mAP@0.5:0.95 of 21.6% and a mAP@0.5 of 44.0% on the COCO dataset. YOLOv3 introduced a multi-scale feature extraction architecture to detect objects at various scales [29]. Key improvements include an object score for better prediction accuracy and class prediction using binary cross-entropy. The model introduced multi-scale prediction to predict objects at three different scales (13\(\times\)13, 26\(\times\)26, and 52\(\times\)52 for a 416\(\times\)416 input image). This enhancement allows the model to detect large, medium, and small objects more effectively. The YOLOv3 (608 \(\times\) 608) model achieved a mAP@0.5:0.95 of 33%, with an inference time of 20 FPS on a Titan X, and a mAP@0.5 of 65.7% on the COCO dataset. For YOLOv4, the architecture was improved by implementing the CSPNet, a variant of the ResNet architecture explicitly designed for object detection tasks [30]. 
The bag-of-freebies techniques were introduced to improve the model’s accuracy without increasing the inference time, applying data augmentation methods such as mosaic augmentation. Spatial Pyramid Pooling (SPP) was introduced, which helps detect small objects. Self-Adversarial Training (SAT) was added to make the model resistant to perturbations and adversarial attacks. Finally, hyperparameter optimization with a Genetic Algorithm (GA) was added. The YOLOv4 (608 \(\times\) 608) model achieved a mAP@0.5:0.95 of 43.5%, with an inference rate of 62 FPS on a V100 GPU, and a mAP@0.5 of 57.9% on the COCO dataset. YOLOv5 introduces “dynamic anchor boxes,” where ground truth bounding boxes are clustered and the centroids serve as anchor boxes [31]. The model includes five scaled versions, such as YOLOv5n (nano) and YOLOv5x (extra-large), balancing accuracy and speed. It also features “CIoU loss,” a variation of the IoU loss function, to enhance performance on unbalanced datasets. The YOLOv5m model achieved a mAP@0.5:0.95 of 45.4%, with an inference rate of 122 FPS on a V100 b1 GPU, and a mAP@0.5 of 64.1% on the COCO dataset. The complexity of this medium model is 21.2 M parameters. The YOLOv6 model adopted an anchor-free detector; the EfficientNet-L2 was used as a backbone [32]. It also began using VariFocal loss to address the imbalance between foreground and background classes [33]. The SIoU/GIoU regression losses were also introduced to the model [34]. The YOLOv6m model achieved a mAP@0.5:0.95 of 50.0%, with an inference rate of 175 FPS on a TensorRT fp16 b1. The complexity of this medium model is 34.9 M parameters. YOLOv7’s architecture has been enhanced with an Efficient Layer Aggregation Network (ELAN) for faster training and convergence [35]. It uses only nine anchor boxes to detect a wider range of object shapes and sizes than previous versions.
It introduced focal down-weighting of the loss for well-classified examples, focusing training on the hard examples. The YOLOv7 model achieved a mAP@0.5:0.95 of 51.4%, with an inference rate of 161 FPS on a TensorRT fp16 b1. The complexity of this medium model is 36.9 M parameters. The YOLOv8 model is an anchor-free architecture that minimizes box predictions, speeding up non-maximum suppression (NMS) and training [36]. It utilizes mosaic augmentation for training and introduces Coarse-to-Fine (C2f) layers and Spatial Pyramid Pooling Fast (SPPF) layers to remove fixed-size constraints. The model’s loss is calculated using bounding box loss, class loss, and Distribution Focal Loss (DFL). The YOLOv8m model achieved a mAP@0.5:0.95 of 50.2%, with an inference rate of 546 FPS on an A100 GPU. The complexity of this medium model is 25.9 M parameters. The YOLOv9 model builds on YOLOv7 and introduces key features such as the information bottleneck principle, reversible functions, Programmable Gradient Information (PGI), and the Generalized Efficient Layer Aggregation Network (GELAN) [37]. The PGI block includes a main branch for inference and an auxiliary reversible branch for accurate gradient calculations, effectively addressing deep supervision issues without additional inference costs. It can handle objects of various sizes. GELAN enhances the model’s data processing and learning capabilities, improving inference speed. The YOLOv9m model achieved a mAP@0.5:0.95 of 51.4%, with an inference rate of 155 FPS on a T4 TensorRT10. The complexity of this medium model is 20.1 M parameters. The YOLOv10 model is designed to enhance inference speed without sacrificing accuracy [38]. It eliminates post-processing latency through NMS-free training and employs a dual assignment strategy that integrates one-to-many and one-to-one matching.
Key improvements include a lightweight classification head that reduces computational demands, spatial-channel decoupled down-sampling to minimize information loss, and a rank-guided block design that optimizes parameter utilization for faster and more efficient object detection. The YOLOv10m model achieved a mAP@0.5:0.95 of 51.1%, with an inference rate of 223 FPS on a T4 TensorRT10. The complexity of this medium model is 15.4 M parameters. YOLOv11 introduced new blocks: the Cross Stage Partial block with kernel size 2 (C3k2) and the Convolutional block with Parallel Spatial Attention (C2PSA) [39]. These blocks provide multiscale feature extraction at different depths and faster processing. The YOLOv11m model achieved a mAP@0.5:0.95 of 51.5%, with an inference rate of 213 FPS on a T4 TensorRT10. The complexity of this medium model is 20.1 M parameters.

The YOLOv12 model improved inference-time efficiency compared to previous models using RepVGG-style re-parameterization. It refines the CSP (Cross-Stage Partial) network for better feature propagation and reduced computation in the backbone architecture [40]. The model includes an optimized Path Aggregation Network (PAN) and refined Feature Pyramid Network (FPN) for multiscale feature fusion, an optimized anchor-free head design, improved augmentation techniques like Mosaic, MixUp, and CutMix, and adaptive label assignment similar to Optimal Transport Assignment (OTA) to enhance the accuracy of small and medium-sized objects. The YOLOv12m model achieved a mAP@0.5:0.95 of 52.5%, with an inference rate of 206 FPS on a T4 TensorRT10. The complexity of this medium model is 20.2 M parameters.

Table 5 YOLO Evolution & Comparison of medium-sized models on COCO benchmark (FPS varies based on processing power)

5 Key findings in state-of-the-art in RDD

Table 6 summarizes previous methods for road damage detection, highlighting the various datasets used and the results achieved.

The RDD2018 dataset was used in [41] to train a specialized YOLOv5 model, DenseSPH-YOLOv5. This model improved the F1-score on RDD2018, outperforming the winner of RDDC2018. Key enhancements included integrating DenseNet blocks and Convolutional Block Attention Modules (CBAM), which improved feature extraction, and the model achieved an average F1-score of 0.812. R-CNN and YOLO models were trained on the RDD2020 dataset used in RDDC2020, the second road damage competition [42]. The u-YOLO model achieved an F1-score of 0.63, indicating high classification accuracy. In comparison, the Faster R-CNN model had an F1-score of 0.50, reflecting lower performance in identifying and categorizing road damages.

In the Crowdsensing-based Road Damage Detection Challenge 2022 (CRDDC’2022), several models demonstrated distinct methodologies and performance levels across datasets. The winning ensemble in [43] combined YOLO-series models with a Faster R-CNN baseline that used a Swin Transformer backbone and a Deformable ROI head, achieving an average F1-score of 0.7699. Its best configuration, “+Faster Swin l w12 Deform Roi ms 3”, attained the highest overall F1-score of 0.77, with per-country scores of 0.583 for India, 0.789 for Japan, and 0.844 for the US.

The second-place winner in CRDDC’2022 [44] applied transfer learning to YOLOv5x using pre-trained weights from the CRDDC’2020 winner USC-InfoLab, producing YOLOv5x-P5 and YOLOv5x-P6 models. These models were trained with tuned hyperparameters and various preprocessing techniques on the RDD2022 dataset, achieving an average F1-score of 0.7432, with distinct scores across regions. The third-place winner implemented a YOLOv7 model for automated road damage identification and classification [45], integrating coordinate attention with fine-tuning methods such as label smoothing and ensembling; it achieved an average F1-score of 0.741. The fourth- and fifth-place models, detailed in [46, 47], employed YOLOv5 and YOLOv7 in ensemble setups with attention modules and non-maximum suppression, achieving average F1-scores of 0.727 and 0.726, respectively. Further down the rankings, the eleventh-place entry [48] trained YOLOv5l6 and YOLOv5x6 models across all six countries in the RDD2022 dataset, fine-tuning with heavy data augmentation; it achieved an average F1-score of 0.53 and a 0.6 F1-score on the all-countries leaderboard, a comparatively low result. In [49], the RDD2022 dataset was used to evaluate YOLOv5x, YOLOv7 variants, and self-distillation (DINO) models, with different batch sizes and the SGD optimizer, a 640-pixel input size for the YOLO models, and a 4-scale ResNet50 backbone for DINO. YOLOv5x training yielded the best F1-score of 0.73.
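The F1-scores ranked above are detection F1-scores: a prediction counts as a true positive when it matches a previously unmatched ground-truth box of the same class with sufficient IoU (typically IoU ≥ 0.5 in these challenges). A minimal sketch of this scoring with greedy matching; the function names are illustrative, not from any cited implementation:

```python
# Greedy IoU matching of predicted boxes to ground truth, then F1.
# Boxes are (x1, y1, x2, y2); same-class overlap with IoU >= thr is a TP.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_f1(preds, gts, thr=0.5):
    """preds/gts: lists of (class_id, box). Each GT box matches at most once."""
    matched, tp = set(), 0
    for cls, box in preds:
        for i, (gcls, gbox) in enumerate(gts):
            if i not in matched and gcls == cls and iou(box, gbox) >= thr:
                matched.add(i)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```

With one matched pothole and one missed plus one spurious detection, precision and recall are both 0.5, giving F1 = 0.5, which mirrors how partially correct submissions score in the competition.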

The Kaggle pothole dataset was used in [50] to train pothole detection models based on YOLOv5 and the Single Shot Detector (SSD). YOLOv5 outperformed SSD-MobileNetV2, achieving an F1-score of 0.87 (precision 0.93, recall 0.83), while SSD-MobileNetV2 scored considerably lower: an F1-score of 0.479 (precision 0.42, recall 0.56).

Four models (YOLOv3, SSD, HOG with SVM, and Faster R-CNN) were trained and tested on the Stellenbosch University 2015 dataset [51]. YOLOv3 achieved the highest accuracy at 82%, followed by SSD at 80%, Faster R-CNN at 74%, and HOG with SVM at 27%.

Five hundred manually gathered thermal camera images were used to train a custom-designed CNN and a ResNet-based CNN for comparison [12]. The ResNet-based CNN achieved 97% accuracy, compared to 70.62% for the custom CNN.

The Dynamic Scale-Aware Fusion Detection Model (RT-DSAFDet) was designed for adaptive, multiscale road damage detection that automatically suppresses background interference [52]. Trained on the UAV-PDD2023 dataset, it achieved a mAP50 of 54.2%, 11.1% higher than the YOLOv10-m model.

In [16], the Egy_PDD 2D dataset was used to detect road damage in Egypt with YOLOv7, YOLOv7x, and YOLOv8x models. The resulting F1-scores were 0.6438 for YOLOv7, 0.6604 for YOLOv7x, and 0.6417 for YOLOv8x; inference was conducted on a Tesla T4 GPU in Google Colab.

A system implemented in [53] detects road damage and records its GPS location so it can later be repaired. A YOLOv5 model applied to the RDD2020 dataset together with a video from American Honda Motor Co. achieved an F1-score of 0.846. A YOLOv3 model was implemented to detect road damage under different weather conditions [17]: the F1-score was 0.774 in clear conditions, dropping to 0.562 in rain, 0.578 at sunset, 0.585 in the evening, and 0.219 at night.
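The detect-and-geotag pipeline of systems like [53] can be summarized in a few lines. The following is an illustrative sketch, not the authors' code: run_detector and read_gps are hypothetical stand-ins for a trained detection model and a vehicle GPS module, and the confidence threshold is an assumption:

```python
import json, time

def log_road_damage(frame, run_detector, read_gps, out_path="damage_log.jsonl"):
    """Detect damage in one frame and append geotagged records for maintenance.

    run_detector(frame) -> list of (class_name, confidence) detections (stub).
    read_gps() -> (lat, lon) from the vehicle's GPS module (stub).
    """
    records = []
    lat, lon = read_gps()
    for cls, conf in run_detector(frame):
        if conf >= 0.5:  # report only confident detections
            records.append({"class": cls, "confidence": conf,
                            "lat": lat, "lon": lon, "time": time.time()})
    with open(out_path, "a") as f:
        for rec in records:  # one JSON record per line for a central database
            f.write(json.dumps(rec) + "\n")
    return records
```

In a deployed system this would run per frame on the edge device, with the log file replaced by a transmission to a central maintenance database.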

Table 6 Summary of previous road damage detection models

6 Discussion

Most road damage datasets, except for the Slovakia potholes dataset, are collected under similar weather conditions and in daylight, biasing models toward detecting damage in well-lit environments. Studies such as [17] show that detection accuracy drops significantly at night, due to reduced illumination, and under adverse weather conditions, leading to more false detections. Real-world factors such as streetlights, vehicle headlights, and varying lighting conditions further affect model performance, requiring robust solutions to improve detection reliability.

To address these challenges, several techniques can be used. Data augmentation can simulate various weather effects, enhancing model robustness [54]. Preprocessing methods such as image enhancement and noise reduction improve clarity in low visibility. Multi-sensor fusion, combining LiDAR or thermal cameras with RGB imagery, compensates for the limitations of optical sensors. Adaptive illumination, HDR imaging, and infrared cameras also enhance nighttime road damage detection. To address dataset class imbalance, mosaic augmentation or class re-weighting can foster more effective learning.
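As an illustration of the augmentation idea, low-light and fog-like conditions can be simulated with simple pixel-level transforms. The sketch below is a minimal NumPy version with assumed parameter values; dedicated augmentation libraries offer richer, physically motivated weather effects:

```python
import numpy as np

def simulate_low_light(img, factor=0.35, noise_std=8.0, seed=None):
    """Darken an RGB uint8 image and add sensor noise to mimic night scenes."""
    rng = np.random.default_rng(seed)
    out = img.astype(np.float32) * factor          # global brightness drop
    out += rng.normal(0.0, noise_std, img.shape)   # additive sensor noise
    return np.clip(out, 0, 255).astype(np.uint8)

def simulate_fog(img, density=0.5, fog_color=200.0):
    """Blend an RGB uint8 image toward a uniform grey haze."""
    out = img.astype(np.float32) * (1 - density) + fog_color * density
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying such transforms to a fraction of daylight training images is a cheap way to expose a detector to conditions the original dataset lacks, though it cannot fully substitute for genuinely varied data.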

A vehicle-based maintenance aid system can detect and report road damage in real time using a trained road damage detection model on a camera-equipped edge device. The device processes images locally, a GPS module records the location of detected damage, and a communication module sends the data to a central database for maintenance planning, as implemented in [53]. However, such a system struggles at night because it is trained only on daylight images, lacks severity classification, and may require 3D datasets for accurate identification. An advanced implementation presented in [55] combines road damage detection with Advanced Driver Assistance Systems (ADAS) to enable autonomous vehicle responses. Accurate assessments require precise localization, real-time detection, and multi-sensor fusion (e.g., LiDAR, IMU). The decision-making system evaluates avoidance maneuvers to ensure vehicle safety; in semi-autonomous vehicles, it can notify the driver of detected road damage for manual intervention.
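The decision-making step described above can be illustrated with a rule-based sketch. All thresholds and function names here are hypothetical; a production ADAS would fuse multiple sensors and use validated severity estimates rather than fixed cut-offs:

```python
def adas_response(damage_severity: float, distance_m: float,
                  speed_kmh: float) -> str:
    """Map a detected road-damage severity (0..1) and its range to an action.

    Illustrative rules only: warn the driver for minor damage, slow down for
    moderate damage or distant hazards, and evaluate an avoidance maneuver
    for severe damage that is close at speed.
    """
    time_to_hazard = distance_m / max(speed_kmh / 3.6, 0.1)  # seconds
    if damage_severity < 0.3:
        return "warn_driver"
    if damage_severity < 0.7 or time_to_hazard > 3.0:
        return "reduce_speed"
    return "evaluate_avoidance_maneuver"
```

The key design point, echoed in the conclusion below on severity datasets, is that the action depends on an estimated severity, which current datasets do not annotate.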

7 Conclusion

This review investigated deep learning methods for real-time road damage detection, focusing on key metrics such as F1-score, IoU, precision, recall, and confidence. It analyzed several datasets, including the Kaggle pothole data, EGY_PDD, and the Slovakia potholes dataset; EGY_PDD offers 3D images, while the Slovakia dataset includes varied weather conditions, enhancing real-world applicability. Among deep learning models, R-CNN provides high accuracy but is computationally intensive, whereas YOLO effectively balances speed and accuracy for real-time use; SSD is the fastest but less accurate. The evolution of CNN architectures and attention mechanisms (CBAM, CA) has enhanced feature extraction capabilities. Despite these advances, existing implementations still face poor nighttime detection, inaccurate localization, and a lack of damage severity classification, which is crucial for practical applications. Expanding 3D image datasets for precise damage localization and developing a road damage severity dataset for effective maintenance prioritization are essential to enhance ADAS. Future datasets should incorporate diverse environmental conditions and be collected across various times of day and seasons to improve model robustness. A future ADAS could be developed by gathering a dataset annotated with road damage severity: by assessing the level of damage, the system can make better-informed decisions, such as rerouting to avoid heavily damaged roads or adjusting vehicle speed to minimize impact.