An In-Depth Comparison of Plain CNN, Fine-Tune VGG16, and Vision Transformer Models in Object Detection


Najib Hassan Adamu
Dr. Anas Tukur Balarabe

Abstract

Object detection is one of the most essential yet challenging tasks in computer vision, playing a crucial role in applications such as tumour detection, quality control and inventory management, security and surveillance, autonomous systems and robotics, crop monitoring and disease detection in agriculture, defect detection in the construction industry, and item detection for home robots. With the rise of deep learning techniques, significant progress has been made through models like convolutional neural networks (CNNs) and, more recently, vision transformers (ViTs). In this research, we investigate how three object detection architectures (a plain CNN, transfer learning with a fine-tuned VGG16, and a ViT) perform in detecting multiple household objects under controlled conditions. The findings show that the ViT consistently outperformed the plain CNN and the transfer learning model in key performance areas. While the plain CNN achieved a peak IoU of 97.10% and VGG16 reached 98.01%, the ViT model attained the highest IoU of 98.42%, with a smoother and more stable learning curve. Additionally, the mean square error (MSE) was lower for the ViT at 5.197%, compared to the plain CNN and the fine-tuned VGG16, which settled at 11.40% and 10.21% respectively, indicating better prediction precision. Loss metrics for the ViT were also consistently lower, decreasing to 5.88%, compared to the plain CNN and VGG16, which settled at 11.23% and 10.01%, demonstrating more efficient learning with less fluctuation during training. Training cost, however, differed sharply across the models: the fine-tuned VGG16 required approximately 1,326 seconds per epoch, compared to about 147 and 150 seconds per epoch for the plain CNN and the ViT. By comparing the three models in terms of IoU, MSE, loss behaviour, and execution time, this research highlights the growing strength of transformer-based models in object detection tasks. These results not only reinforce the potential of ViTs but also offer valuable insights for researchers and practitioners aiming to balance performance with computational cost in real-world detection tasks.
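To make the reported metrics concrete, the sketch below illustrates how IoU and MSE are typically computed for a predicted bounding box against its ground truth. This is a minimal illustrative example in Python, not the authors' implementation; the (x1, y1, x2, y2) box format and the helper names bbox_iou and bbox_mse are assumptions made for the sake of the sketch.

    import numpy as np

    def bbox_iou(box_a, box_b):
        # Intersection over Union for axis-aligned boxes in (x1, y1, x2, y2) format.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def bbox_mse(pred, target):
        # Mean square error between predicted and ground-truth box coordinates.
        pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
        return float(np.mean((pred - target) ** 2))

    # Hypothetical boxes: a prediction slightly offset from the ground truth.
    gt, pred = (20, 30, 120, 130), (25, 28, 118, 135)
    print(f"IoU: {bbox_iou(pred, gt):.4f}")  # higher is better (overlap quality)
    print(f"MSE: {bbox_mse(pred, gt):.2f}")  # lower is better (coordinate error)

For the hypothetical boxes above, the sketch prints an IoU of about 0.87 and an MSE of 14.50, showing how the two metrics capture overlap quality and coordinate error respectively.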

Article Details

How to Cite
An In-Depth Comparison of Plain CNN, Fine-Tune VGG16, and Vision Transformer Models in Object Detection. (2026). BAYERO JOURNAL OF ENGINEERING AND TECHNOLOGY, 21(1), 46-54. https://bjet.ng/index.php/jet/article/view/142
Section
Articles
Author Biography

Dr. Anas Tukur Balarabe, Sokoto State University Sokoto

PhD; Ag. Dean, Faculty of Computing; HOD, Department of Computer Science, Sokoto State University, Sokoto

