Evaluation of the CLIP Architecture for Zero-Shot Image Classification on the Intel Image Classification Dataset

Ade Lailani; Rohmi Dyah Astuti; Christyan Tamaro Nadeak

doi:10.55749/ijdrr.v1i1.188

Authors

Ade Lailani, Sumatra Institute of Technology, Indonesia
Rohmi Dyah Astuti, Sumatra Institute of Technology, Indonesia
Christyan Tamaro Nadeak, Sumatra Institute of Technology, Indonesia

Keywords:

CLIP, Image Classification, ResNet, Vision Transformer, Zero-Shot

Abstract

This study evaluates several architectures of the Contrastive Language–Image Pre-training (CLIP) model for zero-shot image classification. Classification was performed on the Intel Image Classification Dataset, which consists of 3000 images representing several environmental categories. The compared CLIP variants are based on either ResNet or Vision Transformer image encoders, and their performance was measured with accuracy, F1-score, precision, and recall. The RN50x16 architecture achieved the best performance, with an accuracy of 0.925, an F1-score of 0.925, a precision of 0.929, and a recall of 0.925. The RN101, RN50x64, and ViT-B/32 architectures also performed well, with accuracy values around 0.92. In contrast, the ViT-B/16, ViT-L/14, and ViT-L/14@336px architectures performed worse, with accuracy values below 0.90. Furthermore, analysis of the Mean Cosine Similarity Matrix indicates that the ResNet-based models separate class representations more clearly than several of the Vision Transformer variants. Overall, the results suggest that the choice of architecture materially affects the zero-shot classification performance of CLIP, with RN50x16 the best-performing architecture on the dataset used.
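
The evaluation pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The embeddings are hand-made toy vectors standing in for CLIP image and text features, the three classes are hypothetical, and macro averaging is an assumption, since the abstract does not state which averaging mode was used. The sketch shows the three steps the abstract names: predicting the class whose text prompt has the highest cosine similarity to the image, scoring the predictions, and building a mean cosine similarity matrix per true class.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

def zero_shot_predict(image_embs, text_embs):
    """For each image, return the index of the most similar class prompt."""
    texts = [normalize(t) for t in text_embs]
    preds = []
    for img in image_embs:
        img = normalize(img)
        sims = [cosine(img, t) for t in texts]
        preds.append(max(range(len(sims)), key=sims.__getitem__))
    return preds

def mean_cosine_similarity_matrix(image_embs, labels, text_embs):
    """Entry [i][j] is the mean similarity between images of true class i and prompt j."""
    n = len(text_embs)
    texts = [normalize(t) for t in text_embs]
    sums = [[0.0] * n for _ in range(n)]
    counts = [0] * n
    for img, y in zip(image_embs, labels):
        img = normalize(img)
        counts[y] += 1
        for j, t in enumerate(texts):
            sums[y][j] += cosine(img, t)
    return [[s / counts[i] if counts[i] else float("nan") for s in row]
            for i, row in enumerate(sums)]

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1 (averaging mode assumed)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }

# Toy stand-ins for CLIP features: one text embedding per hypothetical class
# and six image embeddings with their true labels.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [
    [0.9, 0.1, 0.0], [0.8, 0.3, 0.1],   # class 0
    [0.2, 0.9, 0.1], [0.6, 0.5, 0.1],   # class 1 (second one is ambiguous)
    [0.1, 0.2, 0.9], [0.0, 0.1, 0.8],   # class 2
]
y_true = [0, 0, 1, 1, 2, 2]

preds = zero_shot_predict(image_embs, text_embs)
metrics = macro_metrics(y_true, preds, n_classes=3)
sim_matrix = mean_cosine_similarity_matrix(image_embs, y_true, text_embs)

print(preds)
print(metrics)
for row in sim_matrix:
    print([round(v, 3) for v in row])
```

In a real run, the toy vectors would be replaced by image and text features produced by the chosen CLIP variant, with one text prompt per dataset category. A matrix whose diagonal entries clearly exceed the off-diagonal ones in the same row corresponds to the clearer class separation the abstract attributes to the ResNet-based models.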
