Abstract
Fish species classification plays an essential role in aquaculture management, marine biodiversity conservation, and fisheries monitoring. Traditional methods rely heavily on manual identification, which is time-consuming, prone to human error, and inefficient at scale. This paper proposes a new approach, DeepLIFT-ViT, which combines the Visual Geometry Group 16 (VGG16) and Vision Transformer (ViT) architectures to improve the accuracy and efficiency of image-based fish species classification. Unlike existing methods that rely solely on CNN-based or transformer-based models, our approach introduces a hybrid architecture that integrates interpretability-based saliency features with transformer-based attention mechanisms. The pipeline begins with a VGG16 model pre-trained on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 dataset, which extracts deep visual features. The DeepLIFT interpretability technique is then used to generate heat maps highlighting the salient image regions that contribute to the model's predictions. These maps are divided into patches, which are concatenated with patches extracted from the original images to form combined vectors; the combined vectors are fed into a ViT model with a multi-layer perceptron (MLP) head for final classification. The model was trained and evaluated on public datasets containing various fish species from different aquatic environments. Experimental results show that DeepLIFT-ViT outperforms existing state-of-the-art models in classification accuracy, noise robustness, and computational efficiency. With classification accuracy of up to 99%, the approach enhances automatic fish species recognition systems and offers a scalable solution for fisheries management and aquatic research.
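To make the pipeline concrete, the sketch below shows one way the described architecture could be wired together in PyTorch, using Captum's DeepLift implementation against a frozen VGG16 backbone. This is a minimal illustration, not the authors' implementation: the patch size, embedding width, encoder depth, number of classes, and the per-patch concatenation of image and heat-map embeddings are all assumptions filled in from the abstract's description.

```python
# Minimal sketch of the DeepLIFT-ViT pipeline described in the abstract.
# Assumptions: patch size 16, embedding dim 256 per stream, 6 encoder
# layers, and a hypothetical 9-class fish dataset; none of these are
# confirmed by the paper.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights
from captum.attr import DeepLift


class DeepLiftViT(nn.Module):
    def __init__(self, num_classes=9, img_size=224, patch=16, dim=256,
                 depth=6, heads=8):
        super().__init__()
        # Frozen VGG16 pre-trained on ILSVRC 2012, used as the
        # attribution target for DeepLIFT.
        self.backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
        for m in self.backbone.modules():
            if isinstance(m, nn.ReLU):
                m.inplace = False  # Captum's DeepLift hooks need out-of-place ReLU
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.backbone.eval()
        self.deeplift = DeepLift(self.backbone)

        n_patches = (img_size // patch) ** 2
        patch_dim = 3 * patch * patch
        # Separate linear embeddings for image patches and saliency
        # patches; the two are concatenated per patch into a combined vector.
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)
        self.embed_img = nn.Linear(patch_dim, dim)
        self.embed_sal = nn.Linear(patch_dim, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 2 * dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, 2 * dim))
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlp_head = nn.Sequential(nn.LayerNorm(2 * dim),
                                      nn.Linear(2 * dim, num_classes))

    def saliency(self, x):
        # DeepLIFT heat maps w.r.t. the backbone's top predicted class.
        with torch.no_grad():
            target = self.backbone(x).argmax(dim=1)
        return self.deeplift.attribute(x, target=target).detach()

    def forward(self, x):
        sal = self.saliency(x)
        # (B, C*P*P, N) -> (B, N, C*P*P) patch sequences.
        img_patches = self.unfold(x).transpose(1, 2)
        sal_patches = self.unfold(sal).transpose(1, 2)
        tokens = torch.cat([self.embed_img(img_patches),
                            self.embed_sal(sal_patches)], dim=-1)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        out = self.encoder(tokens)
        return self.mlp_head(out[:, 0])  # classify from the [CLS] token


model = DeepLiftViT(num_classes=9)
logits = model(torch.rand(2, 3, 224, 224))  # e.g., a batch of fish images
print(logits.shape)  # torch.Size([2, 9])
```

Concatenating the two patch embeddings at each position keeps the token count equal to a standard ViT while doubling the feature width, which is one natural reading of "combined vectors"; appending the saliency patches as extra tokens would be an equally plausible alternative.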
Publication Info
- Year: 2025
- Type: article
Identifiers
- DOI: 10.21203/rs.3.rs-8017135/v1