Subject Area

Computer and Control Systems Engineering

Article Type

Original Study

Abstract

End-to-end object detection is one of the recent trends in object detection; however, it is time- and memory-consuming due to the Transformer encoder-decoder (TED) module. DEtection TRansformer (DETR) is the first end-to-end object detector built on a TED architecture. Despite achieving competitive performance, it suffers from slow convergence because attention is computed over a long sequence spanning the whole image. In this paper, ScaledDETR is proposed to address the slow convergence of DETR and speed up training by building end-to-end detection on a recent efficient backbone with fewer parameters: the ResNet backbone is replaced with EfficientNet, an efficient CNN backbone. The recent Relative Position Encoding (RPE) is adopted in place of standard Position Encoding (PE), which yields a 1.3% AP improvement. ScaledDETR has a simple architecture that runs on a single GPU, making it suitable for autonomous driving applications. The proposed model is trained for 20 epochs, 25x fewer than DETR, and achieves results competitive with state-of-the-art object detection methods, reaching 41.7 AP on the COCO dataset compared with 40.2 AP for Faster R-CNN.
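The core idea behind the Relative Position Encoding mentioned above is that each attention score is offset by a learned bias looked up by the *displacement* between two tokens, rather than adding a fixed absolute encoding to the inputs. A minimal sketch of the 2D relative-position indexing this requires (not the paper's implementation; the function name and grid size are illustrative):

```python
def relative_position_index(h, w):
    """Map every pair of positions in an h x w grid to a bias-table index.

    Returns an (h*w) x (h*w) matrix of indices into a learnable bias
    table of size (2h-1)*(2w-1), one entry per distinct displacement,
    so token pairs with the same relative offset share one parameter.
    """
    coords = [(y, x) for y in range(h) for x in range(w)]
    n = len(coords)
    index = [[0] * n for _ in range(n)]
    for i, (yi, xi) in enumerate(coords):
        for j, (yj, xj) in enumerate(coords):
            dy = yi - yj + (h - 1)   # shift displacement into [0, 2h-2]
            dx = xi - xj + (w - 1)   # shift displacement into [0, 2w-2]
            index[i][j] = dy * (2 * w - 1) + dx
    return index

# For a 2x2 grid there are only (2*2-1)*(2*2-1) = 9 distinct
# displacements, so the bias table needs 9 parameters instead of
# one per token pair.
idx = relative_position_index(2, 2)
```

During attention, `bias_table[index[i][j]]` would be added to the raw score between tokens `i` and `j`, which keeps the encoding translation-invariant.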

Keywords

Object detection; Deep learning; DETR; Computer vision; CNN

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.
