Image Classification with Hugging Face

    In this project you'll explore the capabilities of Vision Transformers (ViT) for several image understanding tasks. The project demonstrates the versatility of ViT models by applying them to simple image classification, zero-shot image classification, and zero-shot object detection. By the end of the project you'll understand how ViT models can handle diverse image-related challenges.

    • Simple Image Classification: This task uses a pre-trained ViT model to classify images into predefined categories. It serves as a foundational example of using ViT for image understanding.
    • Zero-Shot Image Classification: Here, the project explores the ViT model's ability to classify images into categories it has never seen during training. This task demonstrates the model's capability to generalize to new classes, a crucial skill in real-world scenarios.
    • Zero-Shot Object Detection: This task goes beyond classification and involves localizing and identifying objects within images, even if the model has never encountered those objects during training. It showcases the ViT model's potential for object detection in novel contexts.

    This project is intended for readers with an intermediate to advanced knowledge of machine learning and computer vision. This may include data scientists, machine learning engineers, computer vision researchers, or anyone interested in exploring the capabilities of state-of-the-art ViT models for image-related tasks. The project assumes a basic understanding of deep learning concepts and coding proficiency in Python.

    Task 0: Setup

    For this image classification task we need the transformers, Pillow (PIL), and requests Python packages to work with pre-trained transformer-based models, including the Vision Transformer (ViT) model.


    Import the following packages.

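    • From the transformers package, import ViTFeatureExtractor and ViTForImageClassification.
    • From the PIL package, import Image.
    • Import Markdown (assumed here to be IPython.display.Markdown, since PIL does not provide a Markdown class).
    • import requests.
    • import torch.
    • import matplotlib (the pyplot module, used as plt in Task 2).
    # Model and feature extractor classes from the transformers package
    from transformers import ViTFeatureExtractor, ViTForImageClassification
    # Image handling from PIL; Markdown is assumed to come from IPython.display, not PIL
    from PIL import Image
    from IPython.display import Markdown
    # HTTP downloads, tensors, and plotting (plt.imread is used in Task 2)
    import requests
    import torch
    import matplotlib.pyplot as plt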

    Task 1: Image Classification - Loading Vision Transformer

    In this task, we initialize both the feature extractor and the pre-trained ViT model for the image classification task. We use the 'vit-base-patch16-224' model, a popular choice for various computer vision tasks. It has a patch size of 16x16 pixels and was trained on 224x224 pixel images.
    In computer vision, patch size refers to the dimensions or size of a rectangular region or "patch" that is extracted from an image. This patch typically contains a subset of the pixels from the original image and is used for various computer vision tasks, including feature extraction, object detection, image classification, and image processing.
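    To make the patch arithmetic concrete, here is a minimal sketch that counts how many non-overlapping 16x16 patches fit in a 224x224 image. The helper name count_patches is hypothetical and used only for illustration; it is not part of the transformers package.

```python
def count_patches(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# 224 / 16 = 14 patches per side, so 14 * 14 patches in total
num_patches = count_patches(224, 16)
print(num_patches)  # 196
```

    Each of these 196 patches is flattened and treated by the transformer much like a token in a sentence.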

    After loading both the model and the feature extractor, you can proceed to load images, preprocess them using the feature extractor, and pass them through the model for image classification or other related tasks.


    Image Classification - Loading Vision Transformer

    • Load the feature extractor for the vision transformer
    • Load the pre-trained weights from vision transformer
    # Load the feature extractor for the vision transformer
    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    # Load the pre-trained weights from vision transformer
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    Task 2: Image Classification - Generate features from an Image

    In the previous task, we loaded the feature extractor and the pre-trained Vision Transformer model. This task covers extracting features from an image using the pre-trained Vision Transformer (ViT) model. Extracting meaningful features from an image is a crucial step in many computer vision tasks. These features serve as a representation of the image that can be used for tasks such as image classification, object detection, image captioning, and more.

    We use the plt.imread() function to read the image for further processing. We then call feature_extractor(), which takes the image as input and prepares the inputs the ViT model needs by resizing the image and normalizing its pixel values. The return_tensors="pt" argument specifies that the result should be returned as PyTorch tensors, making it suitable as input to the ViT model.
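    To illustrate roughly what the feature extractor produces, the sketch below mimics its preprocessing with NumPy on a synthetic image. It is not the library's implementation: the function name preprocess_like_vit is hypothetical, the mean and standard deviation of 0.5 are assumed to match this checkpoint's configuration, and resizing is skipped because the synthetic image is already 224x224.

```python
import numpy as np


def preprocess_like_vit(image: np.ndarray, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Approximate ViT preprocessing for an H x W x 3 uint8 image (assumed already 224x224)."""
    pixels = image.astype(np.float32) / 255.0   # rescale pixel values to [0, 1]
    pixels = (pixels - mean) / std              # normalize; with 0.5/0.5 this maps to [-1, 1]
    pixels = np.transpose(pixels, (2, 0, 1))    # height, width, channels -> channels, height, width
    return pixels[np.newaxis, ...]              # add a leading batch dimension


# Synthetic stand-in for an image read with plt.imread()
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

pixel_values = preprocess_like_vit(image)
print(pixel_values.shape)  # (1, 3, 224, 224)
```

    The resulting array has the batch-first, channels-first layout that the model expects. With return_tensors="pt", the real feature extractor returns the same kind of data as a PyTorch tensor.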