Subject Area
Computer and Control Systems Engineering
Article Type
Original Study
Abstract
End-to-end object detection is a recent trend in object detection; however, it is time- and memory-intensive because of the Transformer encoder-decoder (TED) module. DEtection TRansformer (DETR) is the first end-to-end object detector to use a TED architecture. Despite achieving competitive performance, it converges slowly because attention is computed over a long sequence that covers the whole image. In this paper, ScaledDETR is proposed to address the slow convergence of DETR and speed up training by building end-to-end detection on a recent, efficient backbone with fewer parameters. ScaledDETR reduces the parameter count by replacing the ResNet backbone with EfficientNet, an efficient CNN backbone. Relative Position Encoding (RPE) is adopted in place of standard Position Encoding (PE), which yields an improvement of 1.3% AP. ScaledDETR uses a simple architecture that runs on a single GPU, which could make it suitable for autonomous driving applications. The proposed model is trained for 20 epochs, 25x fewer than DETR, and achieves results competitive with state-of-the-art object detection methods. The proposed method achieves 41.7 AP on the COCO dataset, compared with 40.2 AP for Faster R-CNN.
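To illustrate the difference between RPE and standard PE mentioned in the abstract, the following is a minimal sketch of one common form of relative position encoding: a learned bias added to attention scores, indexed by the clipped offset between query and key positions. The abstract does not specify which RPE variant ScaledDETR uses, so the function names, the 1D layout, and the clipping scheme here are assumptions made for illustration only.

```python
# Hypothetical sketch: 1D relative position bias with clipped offsets.
# Not taken from the paper; names and the bucket scheme are assumptions.

def relative_position_index(n, max_distance):
    """Return an n x n table of bucket indices for clipped offsets (j - i)."""
    table = []
    for i in range(n):
        row = []
        for j in range(n):
            # Clip the offset so distant positions share one bucket.
            offset = max(-max_distance, min(max_distance, j - i))
            # Shift into the range 0 .. 2 * max_distance.
            row.append(offset + max_distance)
        table.append(row)
    return table


def add_relative_bias(scores, bias_per_bucket, max_distance):
    """Add a per-offset bias to raw attention scores (a square list of lists)."""
    n = len(scores)
    index = relative_position_index(n, max_distance)
    return [
        [scores[i][j] + bias_per_bucket[index[i][j]] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    raw_scores = [[0.0, 0.0], [0.0, 0.0]]
    # One bias value per bucket: offsets -1, 0, +1.
    biased = add_relative_bias(raw_scores, [10.0, 20.0, 30.0], max_distance=1)
    print(biased)
```

Unlike an absolute encoding, the bias depends only on the offset between two positions, so the same table is reused wherever a pair of tokens appears. Image models typically extend this idea to 2D offsets.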
Keywords
Object detection; Deep learning; DETR; Computer vision; CNN
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Elhenidy, Ali; Mohamed, Labib; Yassien, Amira; and Saafan, Mahmoud (2024) "ScaledDETR: An alight weight object detection model for autonomous driving," Mansoura Engineering Journal: Vol. 49: Iss. 5, Article 13.
Available at: https://doi.org/10.58491/2735-4202.3238
Included in
Architecture Commons, Engineering Commons, Life Sciences Commons