Traffic Accident Image Classification using CLIP with a Zero-Shot Approach

Authors

  • Della Septiani, Sumatra Institute of Technology, Indonesia
  • Hartiti Fadilah, Sumatra Institute of Technology, Indonesia
  • Josua Alfa Viando, Sumatra Institute of Technology, Indonesia
  • Anissa Luthfi Alifia, Sumatra Institute of Technology, Indonesia
  • Syifa Firnanda, Sumatra Institute of Technology, Indonesia
  • Pramudya Wibowo, Sumatra Institute of Technology, Indonesia
  • Ade Lailani, Sumatra Institute of Technology, Indonesia

DOI:

https://doi.org/10.55749/ijdrr.v1i1.185

Keywords:

CLIP, Computer Vision, Crash Detection, Multimodal Model, Zero-Shot Learning

Abstract

Accident image detection is a critical step in supporting rapid-response systems during emergencies. This study uses the Contrastive Language–Image Pre-training (CLIP) model to detect accident images with a zero-shot approach, that is, without retraining on a task-specific dataset. CLIP maps text and images into a shared multimodal embedding space, so an image can be classified by comparing it against textual descriptions of the candidate classes. In experiments with the ViT-B/32 variant (Vision Transformer base, patch size 32), the method achieves a Top-1 accuracy of 32% and a Top-5 accuracy of 95.86%. The low Top-1 accuracy indicates that further optimization is needed, but the high Top-5 accuracy suggests that CLIP has significant potential for efficient accident image detection. With further development, the approach could become a reliable and flexible component of emergency response systems.
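
The zero-shot scoring and the Top-k metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompts, the 3-dimensional embeddings, and the labels below are hypothetical toy values standing in for the outputs of CLIP's text and image encoders, and the helper names are chosen only for this example. Each image is scored against every text prompt by cosine similarity, and a prediction counts as a Top-k hit when the true class is among the k most similar prompts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_prompts(image_emb, prompt_embs):
    """Return prompt indices ordered from most to least similar to the image."""
    sims = [cosine(image_emb, p) for p in prompt_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def top_k_accuracy(image_embs, labels, prompt_embs, k):
    """Fraction of images whose true label is among the k best-ranked prompts."""
    hits = sum(
        1
        for emb, label in zip(image_embs, labels)
        if label in rank_prompts(emb, prompt_embs)[:k]
    )
    return hits / len(labels)


# Hypothetical class prompts (assumption: the abstract does not list the prompts used).
PROMPTS = [
    "a photo of a car crash",
    "a photo of normal traffic",
    "a photo of an empty road",
]
# Toy embeddings standing in for encoder outputs (assumption, not real CLIP features).
PROMPT_EMBS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
IMAGE_EMBS = [[0.9, 0.3, 0.1], [0.5, 0.6, 0.1], [0.2, 0.3, 0.9]]
LABELS = [0, 0, 2]  # index of the correct prompt for each toy image

top1 = top_k_accuracy(IMAGE_EMBS, LABELS, PROMPT_EMBS, k=1)
top2 = top_k_accuracy(IMAGE_EMBS, LABELS, PROMPT_EMBS, k=2)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}")
```

In this toy setup the second image ranks the wrong prompt first but the correct one second, so it counts toward Top-2 but not Top-1. The same effect, over a larger set of candidate prompts, would explain how a Top-5 score can be far higher than a Top-1 score.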

Published

2026-06-15

How to Cite

Septiani, D., Fadilah, H., Viando, J. A., Alifia, A. L., Firnanda, S., Wibowo, P., & Lailani, A. (2026). Traffic Accident Image Classification using CLIP with a Zero-Shot Approach. Indonesian Journal of Data Risk Research, 1(1), 12–25. https://doi.org/10.55749/ijdrr.v1i1.185