CNNs with Self-Supervised Learning: A New Paradigm for Computer Vision

Explore the integration of CNNs with self-supervised learning for image classification. Learn about the benefits, challenges, and implementation details of the self-supervised learning approach.

Convolutional Neural Networks (CNNs) are essential for image analysis and computer vision tasks, including image classification, detection, and segmentation. Traditionally, CNNs depend on supervised learning that requires large amounts of labeled data, creating significant limitations. Self-supervised learning provides a solution by allowing models to learn from unlabeled data. It automatically generates labels by predicting parts of the input from other parts.

The integration of CNNs with self-supervised learning combines the feature-extraction capabilities of CNNs with the data efficiency of self-supervised learning. By pre-training CNNs on unlabeled data with self-supervised objectives, models can learn rich representations and semantic features, which enhances performance and generalization. This approach reduces the need for labeled data and improves the overall effectiveness of CNNs.

Illustration of self-supervised learning by rotating the entire input image | Source

This article will explore the fundamental concepts, methodologies, advantages, and challenges of integrating Convolutional Neural Networks (CNNs) with self-supervised learning. Additionally, we will implement and train a self-supervised CNN for image classification.

Understanding Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs), also known as ConvNets, are a specialized type of deep learning architecture that has transformed the field of computer vision. They excel at processing grid-like data, such as images, making them ideal for image classification, object detection, and segmentation.

CNNs are inspired by the hierarchical structure of the human visual cortex, where simple features are detected in the early layers and more complex features are built up in deeper layers. This layered approach allows CNNs to learn increasingly sophisticated representations of visual inputs.

Key Characteristics of CNNs

CNNs possess unique characteristics that make them particularly well-suited for image analysis, including:

  • Local Connectivity: Like neurons in the visual cortex, CNN neurons connect only to a local region of the input, not the entire visual field. This local connectivity enables efficiency by reducing the number of parameters.
  • Translation Invariance: CNNs can detect features regardless of their location in the visual field due to the use of convolutional layers and pooling layers. This is also referred to as shift-invariance.
  • Multiple Feature Maps: CNNs extract multiple feature maps at each stage of processing, similar to how the visual cortex operates. This is achieved through the use of multiple filters (kernels) in each convolutional layer.
  • Non-Linearity: CNNs achieve non-linearity through the use of activation functions like ReLU, which are applied after each convolution operation, allowing the network to learn complex patterns.
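
To make the local-connectivity point concrete, here is a minimal, illustrative sketch (not part of the implementation later in this article) comparing the parameter count of a small convolutional layer with that of a fully connected layer on the same 3×32×32 input, and showing that convolution plus pooling keeps output shapes stable under a small shift:

import torch
import torch.nn as nn

# A 3x3 convolution with 64 filters: the weights are shared across all spatial
# positions, so the parameter count does not depend on the image size.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# A fully connected layer mapping a flattened 3x32x32 image to 64 units:
# every output unit connects to every input pixel.
fc = nn.Linear(3 * 32 * 32, 64)

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(f"Conv2d parameters: {count_params(conv):,}")  # 1,792   (3*3*3*64 + 64)
print(f"Linear parameters: {count_params(fc):,}")    # 196,672 (3072*64 + 64)

# Shift intuition: shifting the input shifts the feature map, and pooling makes
# the downstream response largely insensitive to small translations.
x = torch.randn(1, 3, 32, 32)
shifted = torch.roll(x, shifts=2, dims=-1)           # shift the image 2 pixels
pool = nn.MaxPool2d(2)
print(pool(conv(x)).shape, pool(conv(shifted)).shape)  # both (1, 64, 16, 16)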

Core Components of CNNs

 CNNs are typically composed of several key layers, including:

  • Convolutional Layers: These layers are the fundamental building blocks of a CNN. They perform the mathematical operation of convolution, applying a sliding window function (filter or kernel) to the input image matrix. These filters extract features such as edges, corners, and textures.
  • Activation Layers: After the convolution operation, an activation function such as ReLU is applied to introduce non-linearity into the model. This allows the CNN to learn complex relationships in the data.
  • Pooling Layers: These layers downsample the feature maps, reducing their spatial dimensions and computational complexity. Common pooling operations include max pooling and average pooling.
  • Fully Connected Layers: These are typically the final layers of a CNN. They take the flattened output of the previous layers and use it to perform the final classification or regression task. They apply activation functions like Softmax for prediction.
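
As a quick illustration of how these components stack together, the following sketch (with hypothetical layer sizes chosen for a CIFAR-10-sized input, separate from the full model built later in this article) traces tensor shapes through one convolutional block and a small classification head:

import torch
import torch.nn as nn

# One convolutional block followed by a classifier head, annotated with shapes
# for a batch of 3x32x32 RGB images.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # (N, 3, 32, 32) -> (N, 16, 32, 32)
    nn.ReLU(),                                   # non-linearity, shape unchanged
    nn.MaxPool2d(2),                             # (N, 16, 32, 32) -> (N, 16, 16, 16)
)
head = nn.Sequential(
    nn.Flatten(),                                # (N, 16, 16, 16) -> (N, 4096)
    nn.Linear(16 * 16 * 16, 10),                 # (N, 4096) -> (N, 10) class scores
)

x = torch.randn(8, 3, 32, 32)
logits = head(block(x))
print(logits.shape)  # torch.Size([8, 10])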

The Emergence of Self-Supervised Learning

Self-supervised learning is a machine learning approach that addresses the limitations of supervised learning, especially in scenarios where labeled data is limited or expensive to obtain. Self-supervised learning uses the inherent structure of unlabeled data to generate its own supervisory signals. Unlike supervised learning, which relies on large amounts of manually annotated data, it enables models to learn meaningful representations without explicit human-provided labels.

Self-supervised learning | Source

The Need for Self-Supervised Learning

  • Limitations of Supervised Learning: Supervised learning requires large amounts of high-quality labeled data, which can be costly, time-consuming, and sometimes infeasible to acquire. This is a major bottleneck in various domains, particularly in specialized fields like medical imaging, where expert annotations are needed.
  • Abundance of Unlabeled Data: In contrast, unlabeled data is readily available and far more abundant than labeled data. Self-supervised learning allows us to use this vast resource, enabling models to learn from massive datasets without the need for manual annotation.
  • Generalization and Scalability: Self-supervised learning can improve the generalization performance of models, meaning they are able to make more accurate predictions on unseen data and learn new concepts after seeing only a few examples. It also provides a more scalable approach to machine learning, since models can be trained on large datasets without human annotation.

Core Concepts of Self-Supervised Learning

  • Pretext Tasks: Self-supervised learning involves defining a pretext task that allows a model to learn from unlabeled data by predicting certain properties or parts of the input. The pretext task is not the actual goal but helps the model learn representations useful for downstream tasks.
  • Supervisory Signals from Data: Self-supervised learning derives supervisory signals directly from the unlabeled data instead of depending on external labels. This is achieved by using one part of the input to predict another part or by exploiting the inherent structure or properties of the data.
  • Pseudo-Labels: Self-supervised learning generates “pseudo-labels” from unlabeled data, which serve as ground truth for training. The model is trained on these generated labels, which are refined as the model learns, and it is optimized using a loss function, just as in supervised learning.
  • Representation Learning: Through these pretext tasks, self-supervised models learn meaningful representations of the input data that capture useful features and patterns. These learned representations can then be transferred to other downstream tasks.
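
For a concrete picture of pseudo-labels derived purely from the data, here is a minimal sketch of a hypothetical colorization-style pretext dataset (distinct from the rotation task implemented later in this article): each unlabeled image supplies both the input (its grayscale version) and the target (its original colors).

from torch.utils.data import Dataset

class ColorizationPretextDataset(Dataset):
    """Builds (input, pseudo-label) pairs from unlabeled images: the input is a
    grayscale version of the image and the target is the original color image."""
    def __init__(self, dataset):
        # `dataset` is assumed to yield (image_tensor, _) pairs of 3-channel CHW tensors
        self.dataset = dataset

    def __getitem__(self, index):
        img, _ = self.dataset[index]  # any existing label is ignored
        # Standard luminance weights produce a 1-channel grayscale input
        gray = (0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]).unsqueeze(0)
        return gray, img              # the supervisory signal comes from the data itself

    def __len__(self):
        return len(self.dataset)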

Key Techniques in Self-Supervised Learning

  • Contrastive Learning: This technique involves training a model to distinguish between similar and dissimilar examples. The model learns to bring similar data points closer together in the latent space while pushing dissimilar data points farther apart. Examples include SimCLR and MoCo.
  • Generative Methods: These methods involve training models to generate new data that resembles the training data. Autoencoders, Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) fall under this category. They can also be used as pretext tasks.
    • Autoencoders are trained to reconstruct the input, forcing them to learn compressed representations of the data.
  • Predictive Learning: Models are trained to predict a hidden part of the data from other visible parts. This can involve masking parts of the input and tasking the model to reconstruct the original.
  • Context Prediction: Models are trained to predict the relationship between different parts of the data, such as patches of an image or words in a sentence.
  • Non-Contrastive Learning: This method trains a model using only positive (non-contrasting) sample pairs, rather than both positive and negative pairs as in contrastive learning. Examples include BYOL and SimSiam.
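
To ground the contrastive idea, below is a minimal, self-contained sketch of an InfoNCE-style loss in the spirit of SimCLR (a simplified illustration, not the original implementation): embeddings of two augmented views of the same image are pulled together, while the other images in the batch serve as negatives.

import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)             # (2N, D) stacked views
    sim = z @ z.t() / temperature              # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))          # an embedding may not match itself
    n = z1.size(0)
    # The positive for sample i is its other view: index i+N (and vice versa)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage sketch: z1 and z2 would come from a CNN backbone applied to two augmentations
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())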

Challenges and Limitations of Self-Supervised Learning

While self-supervised learning is a practical approach to overcoming the limitations of supervised learning, it also presents its own challenges and limitations. These issues need to be carefully considered when developing and applying self-supervised learning techniques.

  • Noisy or Incomplete Labels: One of the primary limitations of self-supervised learning is that the supervisory signals are derived from the data itself rather than explicit human annotations. This can create noisy or incomplete pseudo-labels, resulting in lower performance versus supervised learning with human-provided labels.
  • Impact on Accuracy: Inaccurate pseudo-labels generated in the initial steps of training can be counterproductive and impact overall model accuracy.
  • Increased Processing Needs: Self-supervised learning often requires more computational power and resources compared to supervised learning. The model needs to both generate labels from unlabeled data and learn from these generated labels, adding to the computational burden.
  • Multiple Stages of Training: Due to multiple stages of training (e.g., generating pseudo-labels and then training on these labels), the overall time taken to train a self-supervised learning model is high, especially when compared to supervised learning.
  • Large Data Requirements: Current self-supervised learning approaches often require huge amounts of data to achieve accuracy levels comparable to supervised learning methods.
  • Implementation and Tuning: Some self-supervised learning techniques, such as contrastive learning and unsupervised representation learning, can be more complex to implement and tune than supervised learning methods. This requires specialized knowledge and careful parameter selection.
  • Choosing the Right Pretext Task: The choice of the pretext task is crucial for the success of self-supervised learning. A poorly chosen pretext task can lead to the model learning trivial or irrelevant patterns, which do not generalize well to downstream tasks.
  • Expert Knowledge Required: Formulating effective pretext tasks can be challenging and may require expert knowledge and understanding of the underlying data. It’s important to ensure that the pretext task forces the model to learn high-level latent features and not low-level trivial features.
  • Limited Task Scope: Self-supervised learning may not be as effective for tasks where the data is more complex or unstructured, limiting its applicability to certain types of problems.

CNN with Self-Supervised Learning for Image Classification

Now, let’s examine the detailed implementation of self-supervised learning with convolutional neural networks (CNNs) for image classification. Using a rotation prediction task as our self-supervised learning approach, we’ll cover everything from setup to evaluation.

Required Dependencies

pip install torch torchvision matplotlib numpy scikit-learn pillow seaborn

Imports

import os
import numpy as np
import seaborn as sns
from PIL import Image
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict
from sklearn.metrics import confusion_matrix, classification_report

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import ImageFolder

CNN Implementation

This class defines the architecture of the CNN model for image classification. It includes the following components:

  • Backbone CNN: This is the main feature extraction part of the network. It consists of several convolutional layers followed by batch normalization, ReLU activation, and pooling layers. These layers progressively extract higher-level features from the input image.
  • Rotation prediction head (for self-supervised pre-training): This head is used during the pre-training phase, where the model is trained to predict the rotation of an image. It takes the output of the backbone CNN and uses additional layers to predict one of four possible rotations (0, 90, 180, or 270 degrees).
  • Classification head (for downstream task): This head is used during the fine-tuning phase, where the model is trained for a specific image classification task. It also takes the output of the backbone CNN and uses additional layers to predict the class label of the image.
  • forward_rotation and forward_classification methods: These methods define how the data flows through the network for rotation prediction and image classification tasks, respectively.
class ImageClassificationCNN(nn.Module):
    """
    CNN architecture for self-supervised learning followed by image classification
    """
    def __init__(self, num_classes: int = 10):
        super(ImageClassificationCNN, self).__init__()
        # Backbone CNN architecture
        self.backbone = nn.Sequential(
            # First block
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            # Second block
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            # Third block
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            # Fourth block
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        
        # Rotation prediction head (for self-supervised pre-training)
        self.rotation_head = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 4)  # 4 rotations (0, 90, 180, 270 degrees)
        )
        
        # Classification head (for downstream task)
        self.classification_head = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward_rotation(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)
        return self.rotation_head(features)

    def forward_classification(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)
        return self.classification_head(features)

Creating Rotation Dataset

Random Rotation:

  • A random integer between 0 and 3 is generated using torch.randint(0, 4, (1,)) to select a random rotation angle.
  • The rotation_idx is multiplied by 90 degrees to get the rotation angle (0, 90, 180, or 270 degrees).
  • The transforms.functional.rotate() function is used to rotate the image by the calculated rotation_angle.
class RotationDataset(Dataset):
    """Dataset wrapper for self-supervised rotation prediction"""
    def __init__(self, dataset: Dataset):
        self.dataset = dataset

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
        img, _ = self.dataset[index]
        
        # Random rotation
        rotation_idx = torch.randint(0, 4, (1,)).item()
        rotation_angle = rotation_idx * 90
        rotated_img = transforms.functional.rotate(img, rotation_angle)
        
        return rotated_img, rotation_idx

    def __len__(self) -> int:
        return len(self.dataset)

Self-Supervised Pre-training

Initialization (__init__)

  • model: This is the instance of the ImageClassificationCNN class, representing the neural network architecture.
  • device: This specifies the device (CPU or GPU) where the training will be executed. Using a GPU significantly accelerates training.
  • criterion: This is the loss function used to measure the discrepancy between the model’s predictions and the ground truth labels. In this case, nn.CrossEntropyLoss() is used, which is suitable for multi-class classification problems.

Pre-training (pretrain)

  • Steps:
    1. Optimizer: An Adam optimizer is created with a learning rate of 0.001. This optimizer will adjust the model’s parameters to minimize the loss.
    2. Epoch Loop: The training process iterates over a specified number of epochs (default: 50).
      • Training Mode: The model is switched to training mode (model.train()), enabling training-specific behavior such as dropout and the updating of batch normalization statistics.
      • Batch Loop: The code iterates through each batch of images and their corresponding rotation labels from the train_loader.
        • Data Transfer: The images and labels are transferred to the designated device (device).
        • Gradient Clearing: The gradients accumulated from previous training steps are cleared (optimizer.zero_grad()).
        • Forward Pass: The model predicts the rotation of the input images using the forward_rotation() method.
        • Loss Calculation: The loss is computed using the criterion between the predicted rotations and the actual labels.
        • Backpropagation: The gradients of the loss with respect to the model’s parameters are calculated (loss.backward()).
        • Parameter Update: The optimizer updates the model’s parameters based on the calculated gradients (optimizer.step()).
        • Loss Accumulation: The loss of the current batch is added to the epoch_loss variable.
      • Average Loss: After processing all batches in an epoch, the average loss (avg_loss) is calculated by dividing the epoch_loss by the number of batches.
      • Loss Tracking: The avg_loss is appended to the losses list for later analysis.
      • Progress Printing: Every 10 epochs, the training progress is printed, showing the current epoch and the average loss.
    3. Return: The pretrain method returns the list of average losses recorded during the pre-training phase.

Fine-tuning (finetune)

  • Steps:
    1. Freeze Backbone: Initially, the parameters of the model’s backbone (convolutional layers) are frozen (param.requires_grad = False). This prevents these layers from being updated during the initial phase of fine-tuning, preserving the features learned during pre-training.
    2. Train Classification Head: An Adam optimizer is created with a learning rate of 0.001, but this time it only optimizes the parameters of the classification head (fully connected layers).
    3. Epoch Loop: The fine-tuning process iterates over a specified number of epochs (default: 30).
      • Unfreeze Backbone: After 10 epochs, the parameters of the backbone are unfrozen (param.requires_grad = True), allowing them to be fine-tuned as well. A new Adam optimizer with a lower learning rate (0.0001) is created to optimize all parameters.
      • Training: Similar to the pre-training loop, the model processes batches of images, calculates the loss, performs backpropagation, and updates the parameters. The training loss for each epoch is accumulated and stored in the train_losses list.
      • Validation: After each epoch, the model is switched to evaluation mode (model.eval()). Gradient tracking is disabled (with torch.no_grad()) to avoid unnecessary computation during validation. The model predicts the class labels for the validation data and calculates the accuracy, which is stored in the val_accuracies list.
      • Progress Printing: Every 5 epochs, the training progress is printed, showing the current epoch, training loss, and validation accuracy.
    4. Return: The finetune method returns a dictionary containing the lists of training losses (train_losses) and validation accuracies (val_accuracies) recorded during the fine-tuning phase.
class Trainer:
    def __init__(self, model: nn.Module, device: torch.device):
        self.model = model
        self.device = device
        self.criterion = nn.CrossEntropyLoss()
        
    def pretrain(self, train_loader: DataLoader, num_epochs: int = 50) -> List[float]:
        """Self-supervised pre-training using rotation prediction"""
        optimizer = optim.Adam(self.model.parameters(), lr=0.001)
        losses = []
        
        for epoch in range(num_epochs):
            self.model.train()
            epoch_loss = 0
            
            for batch_idx, (inputs, targets) in enumerate(train_loader):
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                
                optimizer.zero_grad()
                outputs = self.model.forward_rotation(inputs)
                loss = self.criterion(outputs, targets)
                
                loss.backward()
                optimizer.step()
                
                epoch_loss += loss.item()
            
            avg_loss = epoch_loss / len(train_loader)
            losses.append(avg_loss)
            
            if (epoch + 1) % 10 == 0:
                print(f'Pretraining Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')
        
        return losses

    def finetune(self, train_loader: DataLoader, val_loader: DataLoader, 
                 num_epochs: int = 30) -> Dict[str, List[float]]:
        """Fine-tuning for image classification"""
        # Freeze backbone initially
        for param in self.model.backbone.parameters():
            param.requires_grad = False
        
        # Train only classification head first
        optimizer = optim.Adam(self.model.classification_head.parameters(), lr=0.001)
        
        train_losses = []
        val_accuracies = []
        
        for epoch in range(num_epochs):
            # After 10 epochs, unfreeze backbone for fine-tuning
            if epoch == 10:
                for param in self.model.backbone.parameters():
                    param.requires_grad = True
                optimizer = optim.Adam(self.model.parameters(), lr=0.0001)
            
            # Training
            self.model.train()
            epoch_loss = 0
            
            for inputs, targets in train_loader:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                
                optimizer.zero_grad()
                outputs = self.model.forward_classification(inputs)
                loss = self.criterion(outputs, targets)
                
                loss.backward()
                optimizer.step()
                
                epoch_loss += loss.item()
            
            avg_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_loss)
            
            # Validation
            self.model.eval()
            correct = 0
            total = 0
            
            with torch.no_grad():
                for inputs, targets in val_loader:
                    inputs, targets = inputs.to(self.device), targets.to(self.device)
                    outputs = self.model.forward_classification(inputs)
                    _, predicted = outputs.max(1)
                    total += targets.size(0)
                    correct += predicted.eq(targets).sum().item()
            
            accuracy = 100. * correct / total
            val_accuracies.append(accuracy)
            
            if (epoch + 1) % 5 == 0:
                print(f'Finetuning Epoch [{epoch+1}/{num_epochs}], '
                      f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
        
        return {'train_losses': train_losses, 'val_accuracies': val_accuracies}

Visualization

Plot pretraining loss, fine-tuning loss, and validation accuracy using Matplotlib.

def plot_results(pretrain_losses: List[float], finetune_results: Dict[str, List[float]]):
    """Plot training and validation metrics"""
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
    
    # Plot pretraining loss
    ax1.plot(pretrain_losses)
    ax1.set_title('Pretraining Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    
    # Plot finetuning loss
    ax2.plot(finetune_results['train_losses'])
    ax2.set_title('Finetuning Loss')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Loss')
    
    # Plot validation accuracy
    ax3.plot(finetune_results['val_accuracies'])
    ax3.set_title('Validation Accuracy')
    ax3.set_xlabel('Epoch')
    ax3.set_ylabel('Accuracy (%)')
    
    plt.tight_layout()
    plt.show()

Initialize Training

  • Data transformations 
  • Training Phases
  • Fine-tuning
  • Evaluation and Saving
def main():
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Data transforms
    transform = transforms.Compose([
        transforms.Resize(32),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
    # Load CIFAR-10 dataset
    train_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform
    )
    test_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform
    )
    
    # Create dataloaders
    rotation_dataset = RotationDataset(train_dataset)
    rotation_loader = DataLoader(rotation_dataset, batch_size=64, shuffle=True, num_workers=2)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=2)
    
    # Initialize model and trainer
    model = ImageClassificationCNN().to(device)
    trainer = Trainer(model, device)
    
    # Pre-training phase
    print("Starting self-supervised pre-training...")
    pretrain_losses = trainer.pretrain(rotation_loader, num_epochs=50)
    
    # Fine-tuning phase
    print("\nStarting supervised fine-tuning...")
    finetune_results = trainer.finetune(train_loader, test_loader, num_epochs=30)
    
    # Plot results
    plot_results(pretrain_losses, finetune_results)
    
    # Save the model
    torch.save(model.state_dict(), 'cifar10_classifier.pth')
    
    # Final evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model.forward_classification(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    
    final_accuracy = 100. * correct / total
    print(f'\nFinal Test Accuracy: {final_accuracy:.2f}%')

if __name__ == "__main__":
    main()

Output:

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100%|██████████| 170M/170M [00:03<00:00, 47.6MB/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Starting self-supervised pre-training...
Pretraining Epoch [10/50], Loss: 0.5294
Pretraining Epoch [20/50], Loss: 0.3461
Pretraining Epoch [30/50], Loss: 0.2263
Pretraining Epoch [40/50], Loss: 0.1487
Pretraining Epoch [50/50], Loss: 0.1083

Starting supervised fine-tuning...
Finetuning Epoch [5/30], Loss: 0.8701, Accuracy: 69.90%
Finetuning Epoch [10/30], Loss: 0.7947, Accuracy: 71.87%
Finetuning Epoch [15/30], Loss: 0.4730, Accuracy: 78.44%
Finetuning Epoch [20/30], Loss: 0.3043, Accuracy: 79.86%
Finetuning Epoch [25/30], Loss: 0.1721, Accuracy: 80.34%
Finetuning Epoch [30/30], Loss: 0.0955, Accuracy: 80.32%

Visualization of the results: pretraining loss, fine-tuning loss, and validation accuracy

Inference

Use the ImageClassifier class for:

  • Single image predictions with class probabilities.
  • Visualizing results for individual test samples.
class ImageClassifier:
    def __init__(self, model_path: str, device: str = None):
        if device is None:
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        else:
            self.device = device
            
        # CIFAR-10 classes
        self.classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                       'dog', 'frog', 'horse', 'ship', 'truck']
        
        # Initialize and load model
        self.model = ImageClassificationCNN().to(self.device)
        self.model.load_state_dict(torch.load(model_path, map_location=self.device))
        self.model.eval()
        
        # Define transforms
        self.transform = transforms.Compose([
            transforms.Resize((32, 32)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

    def predict_single_image(self, image_path: str) -> Tuple[str, float]:
        """Predict class for a single image"""
        # Load and transform image
        image = Image.open(image_path).convert('RGB')
        image_tensor = self.transform(image).unsqueeze(0).to(self.device)
        
        # Get prediction
        with torch.no_grad():
            outputs = self.model.forward_classification(image_tensor)
            probabilities = torch.nn.functional.softmax(outputs, dim=1)
            pred_prob, pred_class = torch.max(probabilities, 1)
            
        return self.classes[pred_class.item()], pred_prob.item()

    def visualize_prediction(self, image_path: str):
        """Visualize image with prediction"""
        # Get prediction
        pred_class, confidence = self.predict_single_image(image_path)
        
        # Load and display image
        image = Image.open(image_path).convert('RGB')
        plt.figure(figsize=(8, 6))
        plt.imshow(image)
        plt.title(f'Prediction: {pred_class}\nConfidence: {confidence:.2%}')
        plt.axis('off')
        plt.show()

    def plot_confusion_matrix(self, confusion_mat: np.ndarray):
        """Plot confusion matrix"""
        plt.figure(figsize=(12, 8))
        sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues',
                   xticklabels=self.classes, yticklabels=self.classes)
        plt.title('Confusion Matrix')
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.show()
classifier = ImageClassifier('/content/cifar10_classifier.pth')

# Test single image
pred_class, confidence = classifier.predict_single_image('/content/download (2).jpeg')
print(f"Prediction: {pred_class}, Confidence: {confidence:.2%}")

Output:

Prediction: cat, Confidence: 99.99%

# Visualize prediction
classifier.visualize_prediction('/content/download (2).jpeg')

Output: the input image is displayed with the predicted class and confidence shown in the plot title.
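
The ImageClassifier class also defines plot_confusion_matrix, which the walkthrough above never exercises. Here is a minimal sketch of how it could be used, assuming the imports and the CIFAR-10 test set from earlier in the article are available; the hypothetical loop below aggregates predictions and builds the matrix with scikit-learn's confusion_matrix.

# Hypothetical usage of plot_confusion_matrix on the CIFAR-10 test set,
# reusing the classifier's own transform and device (model is already in eval mode).
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=classifier.transform
)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

all_preds, all_targets = [], []
with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = classifier.model.forward_classification(inputs.to(classifier.device))
        all_preds.extend(outputs.argmax(dim=1).cpu().numpy())
        all_targets.extend(targets.numpy())

cm = confusion_matrix(all_targets, all_preds)
classifier.plot_confusion_matrix(cm)
print(classification_report(all_targets, all_preds, target_names=classifier.classes))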

Conclusion

Self-supervised learning has become an essential paradigm in machine learning. Unlike supervised learning, which relies on labeled datasets, self-supervised learning generates its own supervisory signals, or “pseudo-labels,” from the data itself. This allows models to learn valuable representations without human annotations. The approach has shown promise in various domains, particularly computer vision (CV) and natural language processing (NLP).

FAQs

Q 1. What is self-supervised learning?

Self-supervised learning is a machine learning paradigm in which the model learns from the data itself without requiring explicit labels. It creates its own “supervision” by leveraging inherent structure or relationships within the data.

Q 2. What is the difference between unsupervised and self-supervised?

Unsupervised learning aims to discover hidden patterns or structures within unlabeled data, for example through clustering or dimensionality reduction. Self-supervised learning also uses unlabeled data, but it creates a supervised learning task from the data itself, such as predicting parts of an image from other parts, or predicting the next word in a sentence.

Q 3. What is an example of a self-supervised learning algorithm?

Rotation Prediction: As seen in the code example, training a model to predict the rotation of an image.

Q 4. What do you mean by image classification?

Image classification is the task of assigning a class label (e.g., “cat,” “dog,” “car”) to an input image.

Q 5. What is image classification in CNN?

Convolutional Neural Networks (CNNs) are particularly well-suited for image classification. They use convolutional layers to extract hierarchical features from images, followed by fully connected layers to classify the extracted features into different classes.  
