Animal Image Classification

Author: Davyd Antoniuk

Project Goal: Develop and compare machine learning and deep learning models (Random Forest, XGBoost, CNN, advanced CNN, and Transfer Learning) for classifying animal images into 10 categories.

1. Introduction

This project explores various architectures, from basic ML models like Random Forest and XGBoost to CNNs and pre-trained models, analyzing their performance on the Animals-10 dataset. The goal is to find the best model in terms of accuracy, efficiency, and generalization.

2. Data Preparation & Exploration

In [2]:
import os
import shutil
import random
import time
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import copy

import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from torchvision.models import resnet18
import torch.nn as nn
import torch.optim as optim
from torchvision import models
from PIL import Image

2.1 Dataset Overview

The Animals-10 dataset contains images of 10 animal classes: dog, cat, horse, spider, butterfly, chicken, cow, sheep, squirrel, and elephant. The dataset is organized into one folder per class, but the folder names are in Italian. I will translate them to English for clarity before processing.

Now, let's load and explore the dataset!

2.2 Load and Preprocess Dataset

Step 1: Define Paths and Translations

Set the dataset path and create a dictionary to rename folders from Italian to English.

In [2]:
# Original dataset path
data_path = "data"

# Translation dictionary (Italian → English)
translate = {
    "cane": "dog", "cavallo": "horse", "elefante": "elephant", "farfalla": "butterfly",
    "gallina": "chicken", "gatto": "cat", "mucca": "cow", "pecora": "sheep",
    "scoiattolo": "squirrel", "ragno": "spider"
}

Step 2: Rename Folders to English

Rename all class folders using the translation dictionary.

In [3]:
for folder in os.listdir(data_path):
    if folder in translate:
        old_folder = os.path.join(data_path, folder)
        new_folder = os.path.join(data_path, translate[folder])
        os.rename(old_folder, new_folder)

print("Folders renamed successfully!")
Folders renamed successfully!

Step 3: Count Images per Class

Count and visualize how many images each class has.

In [13]:
class_counts = {cls: len(os.listdir(os.path.join(data_path, cls))) for cls in translate.values()}
colors = plt.cm.tab10(range(len(class_counts)))
# Plot class distribution
plt.bar(class_counts.keys(), class_counts.values(), color=colors)
plt.xticks(rotation=45)
plt.title("Class Distribution Before Balancing")
plt.show()

The dataset is imbalanced: dogs and spiders have the most images, while elephants have the fewest. This is not a problem here, because I will reduce each class to 200 images, which gives every class equal weight during training and keeps the computational cost manageable.

Step 4: Balance the Dataset

Reduce each class to 200 images to manage computational resources.

In [ ]:
target_count = 200

for cls in translate.values():
    class_path = os.path.join(data_path, cls)
    images = sorted(os.listdir(class_path))  # Sort so the kept subset is reproducible
    
    # Keep only the first 200 images and delete the rest
    for img in images[target_count:]:
        os.remove(os.path.join(class_path, img))

print("Dataset balanced: Each class now has 200 images.")
Dataset balanced: Each class now has 200 images.
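
Keeping the first 200 entries returned by os.listdir depends on the directory listing order, which is not guaranteed to be stable across systems. A minimal sketch of a seeded alternative follows. `select_balanced_subset` is a hypothetical helper that is not part of this notebook, and it only decides which file names to keep without deleting anything.

```python
import random


def select_balanced_subset(filenames, target_count=200, seed=42):
    """Return (keep, drop) lists of file names (hypothetical helper)."""
    ordered = sorted(filenames)      # fix the order before sampling
    rng = random.Random(seed)        # local RNG keeps the choice reproducible
    if len(ordered) <= target_count:
        return ordered, []
    chosen = set(rng.sample(ordered, target_count))
    keep = [name for name in ordered if name in chosen]
    drop = [name for name in ordered if name not in chosen]
    return keep, drop


# Illustrative file names only; the real names come from os.listdir(class_path)
names = [f"img_{i:04d}.jpeg" for i in range(1000)]
keep, drop = select_balanced_subset(names)
print(len(keep), len(drop))
```

The `drop` list could then be passed to os.remove, or the `keep` list could be copied to a new folder so that the original dataset stays intact.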

Step 5: Verify New Distribution

Confirm that each class now contains 200 images.

In [16]:
new_class_counts = {cls: len(os.listdir(os.path.join(data_path, cls))) for cls in translate.values()}

colors = plt.cm.tab10(range(len(new_class_counts)))
plt.bar(new_class_counts.keys(), new_class_counts.values(), color=colors)
plt.xticks(rotation=45)
plt.title("Class Distribution After Balancing")
plt.show()

Now every class has the same number of images, and I can move on to the next stage.

2.3 Detailed Dataset Exploration

Step 1: Define the Visualization Function

Now, let's visualize random images from the dataset to better understand the data distribution and quality. Below is a function to display a grid of random images with their corresponding class labels. The function allows specifying the number of images to visualize dynamically.

In [18]:
def visualize_random_images(folder, num_images=9):
    """
    Displays a grid of random images from the dataset.
    
    Args:
        folder (str): Path to the dataset folder.
        num_images (int): Number of images to visualize.
    """
    # Get all class names (folders)
    class_names = os.listdir(folder)
    
    # Collect random images with labels
    images = []
    labels = []
    
    for _ in range(num_images):
        selected_class = random.choice(class_names)  # Choose a random class
        class_path = os.path.join(folder, selected_class)
        image_name = random.choice(os.listdir(class_path))  # Choose a random image
        image_path = os.path.join(class_path, image_name)
        
        # Read and store the image
        img = cv2.imread(image_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert from BGR to RGB
        images.append(img)
        labels.append(selected_class)

    # Determine the grid size dynamically
    grid_size = int(num_images ** 0.5)  # Find the closest square root
    while grid_size * grid_size < num_images:
        grid_size += 1  # Adjust for non-perfect squares

    # Set figure size dynamically
    fig, axes = plt.subplots(grid_size, grid_size, figsize=(grid_size * 2, grid_size * 2))
    fig.suptitle("Random Sample Images", fontsize=14)

    # Flatten axes for easy iteration (handles cases where grid isn't perfect)
    axes = axes.flatten()

    for i in range(grid_size * grid_size):
        if i < num_images:
            axes[i].imshow(images[i])
            axes[i].set_title(labels[i])
        axes[i].axis('off')  # Hide axes

    plt.show()

Step 2: Visualize Random Images

In [22]:
visualize_random_images("data", num_images=25)

The images are well-labeled and clearly represent their respective classes. The dataset quality looks good for training, so we can proceed to the next step.

2.4 Train-Validation-Test Split

It's crucial to split the data before augmentation and standardization to prevent data leakage and ensure a fair evaluation of the model.

I will split the dataset into Train (70%), Validation (15%), and Test (15%), ensuring each set has an equal distribution of classes.

Step 1: Define Paths and Create Split Folders

Create separate folders for train, validation, and test sets.

In [24]:
# Define paths
dataset_path = "data"
split_path = "split_data"

# Create train, val, and test directories
for split in ["train", "val", "test"]:
    os.makedirs(os.path.join(split_path, split), exist_ok=True)

Step 2: Perform Train-Validation-Test Split

Split images and copy them into new structured folders.

In [25]:
split_ratio = {"train": 0.7, "val": 0.15, "test": 0.15}

for cls in os.listdir(dataset_path):
    class_path = os.path.join(dataset_path, cls)
    images = os.listdir(class_path)
    
    # Shuffle and split images
    train_imgs, temp_imgs = train_test_split(images, test_size=(1 - split_ratio["train"]), random_state=42)
    val_imgs, test_imgs = train_test_split(temp_imgs, test_size=split_ratio["test"] / (split_ratio["val"] + split_ratio["test"]), random_state=42)
    
    # Copy images to respective folders
    for split, img_list in zip(["train", "val", "test"], [train_imgs, val_imgs, test_imgs]):
        split_class_path = os.path.join(split_path, split, cls)
        os.makedirs(split_class_path, exist_ok=True)
        
        for img in img_list:
            src_path = os.path.join(class_path, img)
            dst_path = os.path.join(split_class_path, img)
            shutil.copy(src_path, dst_path)

print("Dataset successfully split into Train, Validation, and Test sets!")
Dataset successfully split into Train, Validation, and Test sets!

Step 3: Visualize Class Distribution After Splitting

Plot the number of images per class for Train, Validation, and Test sets.

In [30]:
# Count images in each split
split_counts = {"train": {}, "val": {}, "test": {}}

for split in ["train", "val", "test"]:
    for cls in os.listdir(os.path.join(split_path, split)):
        split_counts[split][cls] = len(os.listdir(os.path.join(split_path, split, cls)))

# Define colors for each split
colors = {"train": "blue", "val": "green", "test": "red"}

# Plot class distribution per split
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
splits = ["train", "val", "test"]

for i, split in enumerate(splits):
    axes[i].bar(split_counts[split].keys(), split_counts[split].values(), color=colors[split])
    axes[i].set_title(f"{split.capitalize()} Set Distribution")
    axes[i].set_xticks(range(len(split_counts[split].keys())))
    axes[i].set_xticklabels(split_counts[split].keys(), rotation=45)

plt.show()

Within each split, every class has the same number of images.

The dataset is successfully split, so we can move on to image preprocessing!
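
The per-class split sizes can be worked out by hand. The sketch below assumes that train_test_split rounds the requested test fraction up and gives the remainder to the other side, which is my understanding of scikit-learn's behavior when only test_size is passed. `split_sizes` is a hypothetical helper, not part of the notebook. Because 1 - 0.7 evaluates to 0.30000000000000004 in floating point, the hold-out set gets 61 images per class rather than 60.

```python
import math


def split_sizes(n_samples, train=0.7, val=0.15, test=0.15):
    """Mimic the two-stage split above for one class (hypothetical helper)."""
    holdout_fraction = 1 - train                   # 0.30000000000000004, not 0.3
    n_holdout = math.ceil(holdout_fraction * n_samples)
    n_train = n_samples - n_holdout
    test_fraction = test / (val + test)            # exactly 0.5
    n_test = math.ceil(test_fraction * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


print(split_sizes(200))  # (139, 30, 31)
```

Over 10 classes this corresponds to 1,390 training, 300 validation, and 310 test images, which is close to, but not exactly, a 70/15/15 split.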

2.5 Image Preprocessing

Now, I will prepare the images for model training by applying augmentation, normalization, and resizing while ensuring different models receive the correct preprocessing.

Since different models require different preprocessing, I will create two versions of the dataset:

  • Standardized dataset → For Random Forest, XGBoost, CNN
  • Lightly Processed dataset → For Transfer Learning models (EfficientNet, ResNet, MobileNet)

2.5.1 Define Image Transformations

  1. Apply Data Augmentation (Only for Training Set)
    As discussed here, augmentation must be applied before normalization to avoid distorting pixel distributions.

  2. Apply Normalization (For RF, XGBoost, CNN)
    As discussed here, normalization (rescaling pixel values to a fixed range, here [-1, 1]) is preferred over standardization for these models, on the reasoning that it preserves the relative relationships between pixel intensities within an image.

Apply augmentation (for training only) and normalization where needed.
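
To make the normalization step concrete, here is a small sketch of the arithmetic that a mean of 0.5 and a standard deviation of 0.5 perform on a pixel value already scaled to [0, 1]. The `normalize` and `unnormalize` functions are illustrative stand-ins written for this example, not torchvision calls.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel in [0, 1] to [-1, 1] via (pixel - mean) / std."""
    return (pixel - mean) / std


def unnormalize(value, mean=0.5, std=0.5):
    """Invert the mapping so the image can be displayed again."""
    return value * std + mean


for pixel in (0.0, 0.25, 0.5, 1.0):
    scaled = normalize(pixel)
    print(pixel, "->", scaled, "->", unnormalize(scaled))
```

The inverse mapping is the same operation that is needed to bring a normalized tensor back into the displayable [0, 1] range before plotting.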

In [248]:
IMAGE_SIZE = (224, 224)

# Augmentation + Normalization (For CNN, RF, XGBoost)
standardized_train_transforms = transforms.Compose([
    transforms.Resize(IMAGE_SIZE),
    transforms.RandomHorizontalFlip(p=0.5),  # Flip image with 50% chance
    transforms.RandomRotation(15),  # Rotate by a random angle between -15 and +15 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # Adjust colors
    transforms.RandomResizedCrop(IMAGE_SIZE, scale=(0.8, 1.0)),  # Random zoom
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalization
])

# Only Normalization (Validation & Test - No Augmentation)
standardized_test_transforms = transforms.Compose([
    transforms.Resize(IMAGE_SIZE),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalization
])

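# Only Resize (For Transfer Learning Models)
transfer_learning_transforms = transforms.Compose([
    transforms.Resize(IMAGE_SIZE),
    # No normalization is applied here. Note: torchvision's pre-trained models do not
    # normalize inputs internally; they were trained on ImageNet-normalized images.
    transforms.ToTensor()
])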

2.5.2 Load the Dataset

Load datasets with the correct preprocessing for each model type.

In [249]:
# Paths to datasets
train_path = "split_data/train"
val_path = "split_data/val"
test_path = "split_data/test"

# Load datasets with respective transformations
standardized_train_dataset = datasets.ImageFolder(root=train_path, transform=standardized_train_transforms)
standardized_val_dataset = datasets.ImageFolder(root=val_path, transform=standardized_test_transforms)
standardized_test_dataset = datasets.ImageFolder(root=test_path, transform=standardized_test_transforms)

transfer_train_dataset = datasets.ImageFolder(root=train_path, transform=transfer_learning_transforms)
transfer_val_dataset = datasets.ImageFolder(root=val_path, transform=transfer_learning_transforms)
transfer_test_dataset = datasets.ImageFolder(root=test_path, transform=transfer_learning_transforms)

print("Datasets successfully loaded with augmentation and transformations!")
Datasets successfully loaded with augmentation and transformations!

2.5.3 Create Data Loaders

Create efficient batch loaders for each dataset version.

In [250]:
BATCH_SIZE = 32

# Standardized dataset loaders (For CNN, RF, XGBoost)
standardized_train_loader = DataLoader(standardized_train_dataset, batch_size=BATCH_SIZE, shuffle=True)
standardized_val_loader = DataLoader(standardized_val_dataset, batch_size=BATCH_SIZE, shuffle=False)
standardized_test_loader = DataLoader(standardized_test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Transfer Learning dataset loaders
transfer_train_loader = DataLoader(transfer_train_dataset, batch_size=BATCH_SIZE, shuffle=True)
transfer_val_loader = DataLoader(transfer_val_dataset, batch_size=BATCH_SIZE, shuffle=False)
transfer_test_loader = DataLoader(transfer_test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print("DataLoaders created successfully!")
DataLoaders created successfully!

2.5.4 Verify Preprocessing (Visualization at Each Step)

To better understand how image preprocessing works, I will visualize random images before and after applying augmentation and normalization.

Function to Visualize Original Images (Before Processing)

In [ ]:
def show_original_images(folder, num_images=5):
    """ Display original images from dataset """
    class_names = os.listdir(folder)
    fig, axes = plt.subplots(1, num_images, figsize=(15, 5))
    
    for i in range(num_images):
        selected_class = random.choice(class_names)
        class_path = os.path.join(folder, selected_class)
        img_name = random.choice(os.listdir(class_path))
        img_path = os.path.join(class_path, img_name)
        
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert to RGB
        
        axes[i].imshow(img)
        axes[i].set_title(selected_class)
        axes[i].axis("off")

    plt.show()

Function to Visualize Images After Augmentation & Normalization

In [252]:
def show_augmented_images(dataloader, title):
    images, labels = next(iter(dataloader))
    fig, axes = plt.subplots(1, 5, figsize=(15, 5))
    
    for i in range(5):
        img = images[i].permute(1, 2, 0).numpy()  # Convert tensor to image format
        img = (img * 0.5) + 0.5  # Unnormalize for visualization
        axes[i].imshow(img)
        axes[i].set_title(f"Class: {labels[i].item()}")
        axes[i].axis("off")

    plt.suptitle(title)
    plt.show()

Now that the visualization functions are defined, I will display random images at each step of the preprocessing pipeline.

Display original images before any transformations.

In [53]:
show_original_images("split_data/train", num_images=5)

Show images after augmentation & normalization.

In [54]:
show_augmented_images(standardized_train_loader, "Augmented & Normalized Images")

Augmentation increases the diversity of the training images, which should make the models more robust to variations in orientation, lighting, and scale. These transformations are intended to reduce overfitting and improve generalization to unseen data.

Next Step: Model Building!

Now, I can start training models:

  1. Random Forest, XGBoost
  2. Simple CNN Model
  3. Deeper CNN Model
  4. Transfer Learning (ResNet, VGG, EfficientNet, etc.)

3. Model Building

3.1 Model Evaluation Function

Before training models, I need a function to calculate and return all evaluation metrics, including Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix.

Define the Evaluation Function

Computes Accuracy, Precision, Recall, F1-score, and Confusion Matrix.

In [7]:
def evaluate_model(model, dataloader, dataset, device):
    """
    Evaluates the model and returns accuracy, precision, recall, F1-score, and confusion matrix.
    
    Args:
        model: Trained model to evaluate.
        dataloader: DataLoader for test/validation dataset.
        dataset: Corresponding dataset to map class indices to labels.
        device: Device (CPU/GPU) for evaluation.
        
    Returns:
        metrics_dict: Dictionary with all evaluation metrics.
        conf_matrix: Confusion matrix (optional for visualization).
    """
    model.eval()  # Set model to evaluation mode
    all_preds, all_labels = [], []

    with torch.no_grad():  # No gradients needed during evaluation
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)  # Get predicted class indices
            
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    # Convert lists to numpy arrays
    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)

    # Calculate Accuracy
    accuracy = accuracy_score(all_labels, all_preds)

    # Calculate Precision, Recall, and F1-score (macro-averaged)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)

    # Compute Confusion Matrix
    conf_matrix = confusion_matrix(all_labels, all_preds)

    # Store metrics in a dictionary
    metrics_dict = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1
    }

    return metrics_dict, conf_matrix
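
To show what macro averaging means in practice, the sketch below recomputes the same four metrics directly from a confusion matrix, with rows as true labels and columns as predicted labels. The 3-class matrix is made up for illustration and `macro_metrics` is a hypothetical helper, not part of the notebook.

```python
def macro_metrics(conf):
    """Accuracy plus macro-averaged precision, recall, and F1 from a confusion matrix."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    accuracy = sum(conf[k][k] for k in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = conf[k][k]
        predicted_k = sum(conf[i][k] for i in range(n))  # column sum
        actual_k = sum(conf[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0     # same idea as zero_division=0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    # Macro averaging: every class counts equally, regardless of its size
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


example = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
accuracy, precision, recall, f1 = macro_metrics(example)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because every class has the same number of test images in this project, macro-averaged recall and accuracy coincide, which is why the two values match in the reported results.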

Function to Display Confusion Matrix

Generates a heatmap of the confusion matrix for better analysis.

In [64]:
def plot_confusion_matrix(conf_matrix, dataset):
    """Plots the confusion matrix."""
    plt.figure(figsize=(10, 8))
    sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=dataset.classes, yticklabels=dataset.classes)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Confusion Matrix")
    plt.show()

3.2 Random Forest Model

Now, I will train Random Forest, which requires feature extraction since it cannot directly process images.

Step 1: Extract Features Using Pre-trained CNN (ResNet-18)

ResNet-18 is used to extract image features before training Random Forest.

In [60]:
def extract_features(dataloader, model, device):
    """
    Extracts image features using a pre-trained CNN (ResNet-18).
    
    Args:
        dataloader: DataLoader containing images.
        model: Pre-trained CNN model.
        device: CPU or GPU.
    
    Returns:
        features: Extracted feature vectors.
        labels: Corresponding labels.
    """
    model.eval()  # Set to evaluation mode
    all_features, all_labels = [], []

    with torch.no_grad():
        for images, labels in dataloader:
            images = images.to(device)
            features = model(images)  # Extract features
            all_features.append(features.cpu().numpy())
            all_labels.extend(labels.numpy())

    return np.vstack(all_features), np.array(all_labels)

# Load pre-trained ResNet-18 and drop its final fully connected layer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feature_extractor = resnet18(pretrained=True)  # Newer torchvision versions use the `weights=` argument instead
feature_extractor = torch.nn.Sequential(*list(feature_extractor.children())[:-1])  # Remove last FC layer
feature_extractor.to(device)

# Extract features on the same device as the model (falls back to CPU when CUDA is unavailable)
train_features, train_labels = extract_features(standardized_train_loader, feature_extractor, device)
test_features, test_labels = extract_features(standardized_test_loader, feature_extractor, device)

Step 2: Train Random Forest

Train Random Forest on the extracted image features and record the training time.

In [61]:
# Train RF and measure time
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
start_time = time.time()
rf_model.fit(train_features.reshape(train_features.shape[0], -1), train_labels)
training_time = time.time() - start_time

print(f"Random Forest Training Time: {training_time:.2f} seconds")
Random Forest Training Time: 1.70 seconds

Step 3: Evaluate Random Forest

Evaluate Random Forest on the extracted test features and store the results.

In [62]:
# Make Predictions
rf_preds = rf_model.predict(test_features.reshape(test_features.shape[0], -1))

# Compute Metrics
accuracy = accuracy_score(test_labels, rf_preds)
precision, recall, f1, _ = precision_recall_fscore_support(test_labels, rf_preds, average="macro", zero_division=0)
conf_matrix = confusion_matrix(test_labels, rf_preds)

# Store Results
rf_metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1,
    "Training Time (s)": training_time
}

print("Random Forest Evaluation Metrics:")
print(rf_metrics)
Random Forest Evaluation Metrics:
{'Accuracy': 0.9354838709677419, 'Precision': 0.9381661349403284, 'Recall': 0.9354838709677418, 'F1-Score': 0.9349129843719057, 'Training Time (s)': 1.6964747905731201}

Confusion matrix for Random Forest.

In [65]:
plot_confusion_matrix(conf_matrix, standardized_test_dataset)

Random Forest performed very well: with an accuracy of 93.5%, it classifies the test images reliably even though it is a traditional machine learning model. Note that it operates on features extracted by a pre-trained ResNet-18, so much of this performance comes from the quality of those features. Training was very fast, taking only 1.7 seconds (feature extraction time is not included in this figure).

The confusion matrix is nearly diagonal, meaning the model classifies most images correctly without a major bias toward any class. Only minor misclassifications exist, primarily between visually similar animals (e.g., dogs, cats, and cows).

3.3 XGBoost Model

XGBoost (Extreme Gradient Boosting) is a powerful tree-based model that builds trees iteratively with gradient boosting, so that each new tree corrects the errors of the previous ones.

Train XGBoost on the ResNet-18 extracted features and record the training time.

Step 1: Train XGBoost on Extracted Features

In [70]:
# Define XGBoost model
xgb_model = xgb.XGBClassifier(n_estimators=100, eval_metric="mlogloss", random_state=42)

# Train and measure time
start_time = time.time()
xgb_model.fit(train_features.reshape(train_features.shape[0], -1), train_labels)
training_time = time.time() - start_time

print(f"XGBoost Training Time: {training_time:.2f} seconds")
XGBoost Training Time: 4.67 seconds

Step 2: Evaluate XGBoost

Evaluate XGBoost on the test data and store the results.

In [71]:
# Make Predictions
xgb_preds = xgb_model.predict(test_features.reshape(test_features.shape[0], -1))

# Compute Metrics
accuracy = accuracy_score(test_labels, xgb_preds)
precision, recall, f1, _ = precision_recall_fscore_support(test_labels, xgb_preds, average="macro", zero_division=0)
conf_matrix = confusion_matrix(test_labels, xgb_preds)

# Store Results
xgb_metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1,
    "Training Time (s)": training_time
}

print("XGBoost Evaluation Metrics:")
print(xgb_metrics)
XGBoost Evaluation Metrics:
{'Accuracy': 0.9193548387096774, 'Precision': 0.9234516887450702, 'Recall': 0.9193548387096774, 'F1-Score': 0.9198869715122658, 'Training Time (s)': 4.672809839248657}

XGBoost performed slightly worse than Random Forest, with an accuracy of 91.9% compared to 93.5%. The precision, recall, and F1-score are also slightly lower. Training time rose from 1.7 to 4.67 seconds. Possible reason: with a small training set and high-dimensional features, sequentially boosted trees may overfit more readily than the averaged, independently grown trees of a Random Forest. The gap is small, though, so it should not be over-interpreted.
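
To put the gap between the two models in perspective, the sketch below converts the reported accuracies back into counts of correctly classified images and estimates the sampling uncertainty of a single accuracy figure. It assumes a test set of 310 images, which is what the reported accuracies imply (290/310 and 285/310), and it uses a simple binomial standard error rather than a paired test, so it is only a rough guide.

```python
import math

N_TEST = 310  # assumption: implied by the reported accuracies, not stated in the notebook


def correct_count(accuracy, n):
    """Number of correctly classified images behind an accuracy value."""
    return round(accuracy * n)


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


rf_acc, xgb_acc = 0.9354838709677419, 0.9193548387096774
rf_correct = correct_count(rf_acc, N_TEST)
xgb_correct = correct_count(xgb_acc, N_TEST)
se = standard_error(rf_acc, N_TEST)

print(f"Random Forest correct: {rf_correct}, XGBoost correct: {xgb_correct}")
print(f"Difference: {rf_correct - xgb_correct} images")
print(f"Standard error of one accuracy estimate: {se:.3f}")
```

Under these assumptions the two models differ by a handful of test images, and the difference in accuracy is of the same order as the standard error of either estimate.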

3.4 Simple CNN Model

Now, I will build a basic CNN model for image classification. This is the first step before making it more complex.

Step 1: Define the Simple CNN Model

In [ ]:
# Simple CNN Model
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 56 * 56, 512) 
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # Flatten to (batch, 64 * 56 * 56)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    
# Initialize model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cnn_model = SimpleCNN(num_classes=10).to(device)

What I did:

  • 2 convolutional layers for basic feature extraction.
  • Max-Pooling for downsampling.
  • Two fully connected layers for classification.
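
The input size of the first fully connected layer (64 * 56 * 56) follows from the layer settings above. The sketch below traces the spatial size of a 224x224 input through the two convolution and pooling stages using the standard output-size formulas; `conv2d_out` and `pool_out` are helper names introduced only for this example.

```python
def conv2d_out(size, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of max pooling along one dimension."""
    return (size - kernel) // stride + 1


size = 224
for _ in range(2):                      # two conv + pool stages
    size = pool_out(conv2d_out(size))   # the conv keeps the size, the pool halves it

flat_features = 64 * size * size        # 64 channels after conv2
fc1_params = flat_features * 512 + 512  # weights plus biases of fc1

print(size, flat_features, fc1_params)
```

Almost all of the model's parameters sit in this first fully connected layer (over 100 million), which is worth keeping in mind when judging how easily the model can overfit a training set of this size.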

Step 2: Define Training Components

  • CrossEntropyLoss for multi-class classification.
  • Adam optimizer for adaptive learning.
  • LR Scheduler to reduce learning rate if validation loss stops improving.

In [ ]:
# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)

# Learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=4, factor=0.1, min_lr=1e-7)
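
The plateau rule behind this scheduler can be sketched in a few lines. This is a simplified re-implementation written for illustration, based on my reading of the PyTorch documentation (relative improvement threshold, no cooldown); it is not the library's code, and `plateau_schedule` is a hypothetical name.

```python
def plateau_schedule(losses, lr=1e-3, patience=4, factor=0.1, min_lr=1e-7, threshold=1e-4):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best * (1 - threshold):   # counts as an improvement
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:           # patience exceeded: shrink the learning rate
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses: two improvements, then a plateau
history = plateau_schedule([1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95])
print(history)
```

With patience=4 the reduction arrives on the fifth consecutive epoch without improvement. If early stopping uses a patience of 5, as in the training loop of this section, it appears to fire on that same epoch, so the reduced learning rate would rarely get a chance to take effect.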

Step 3: Training Loop

In [258]:
# Training parameters
num_epochs = 100
train_losses, val_losses, train_accs, val_accs = [], [], [], []

# Early stopping settings
best_val_loss = float("inf")
best_model_wts = copy.deepcopy(cnn_model.state_dict())
early_stop_counter = 0
patience = 5  # Stop if val loss doesn't improve for 5 epochs

start_time = time.time()

for epoch in range(num_epochs):
    cnn_model.train()
    running_loss, correct, total = 0, 0, 0

    for inputs, labels in standardized_train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = cnn_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    train_loss = running_loss / len(standardized_train_dataset)
    train_acc = correct / total
    train_losses.append(train_loss)
    train_accs.append(train_acc)

    # Validation phase
    cnn_model.eval()
    val_running_loss, val_correct, val_total = 0, 0, 0

    with torch.no_grad():
        for inputs, labels in standardized_val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = cnn_model(inputs)
            loss = criterion(outputs, labels)

            val_running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            val_total += labels.size(0)
            val_correct += (predicted == labels).sum().item()

    val_loss = val_running_loss / len(standardized_val_dataset)
    val_acc = val_correct / val_total
    val_losses.append(val_loss)
    val_accs.append(val_acc)

    # Learning rate scheduling
    scheduler.step(val_loss)

    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model_wts = copy.deepcopy(cnn_model.state_dict())
        early_stop_counter = 0
    else:
        early_stop_counter += 1

    if early_stop_counter >= patience:
        print(f"Early stopping triggered at epoch {epoch+1}")
        break

    print(f'Epoch {epoch+1}/{num_epochs}')
    print(f'Train Loss: {train_loss:.4f} | Acc: {train_acc:.4f}')
    print(f'Val Loss: {val_loss:.4f} | Acc: {val_acc:.4f}')
    print('-' * 50)

# Load the best model
cnn_model.load_state_dict(best_model_wts)

# Training time
training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")
Epoch 1/100
Train Loss: 2.9343 | Acc: 0.1281
Val Loss: 2.2268 | Acc: 0.1967
--------------------------------------------------
Epoch 2/100
Train Loss: 2.1060 | Acc: 0.2482
Val Loss: 2.1139 | Acc: 0.2267
--------------------------------------------------
Epoch 3/100
Train Loss: 1.9559 | Acc: 0.3058
Val Loss: 1.9254 | Acc: 0.3067
--------------------------------------------------
Epoch 4/100
Train Loss: 1.8210 | Acc: 0.3662
Val Loss: 1.9267 | Acc: 0.2900
--------------------------------------------------
Epoch 5/100
Train Loss: 1.7703 | Acc: 0.3576
Val Loss: 1.8200 | Acc: 0.3567
--------------------------------------------------
Epoch 6/100
Train Loss: 1.6989 | Acc: 0.4165
Val Loss: 1.9632 | Acc: 0.3700
--------------------------------------------------
Epoch 7/100
Train Loss: 1.6404 | Acc: 0.4317
Val Loss: 1.8308 | Acc: 0.3667
--------------------------------------------------
Epoch 8/100
Train Loss: 1.5553 | Acc: 0.4525
Val Loss: 1.7867 | Acc: 0.3867
--------------------------------------------------
Epoch 9/100
Train Loss: 1.5927 | Acc: 0.4590
Val Loss: 1.7773 | Acc: 0.4100
--------------------------------------------------
Epoch 10/100
Train Loss: 1.5108 | Acc: 0.4849
Val Loss: 1.7864 | Acc: 0.4233
--------------------------------------------------
Epoch 11/100
Train Loss: 1.4260 | Acc: 0.5065
Val Loss: 1.8128 | Acc: 0.4033
--------------------------------------------------
Epoch 12/100
Train Loss: 1.3386 | Acc: 0.5482
Val Loss: 1.8696 | Acc: 0.3833
--------------------------------------------------
Epoch 13/100
Train Loss: 1.3221 | Acc: 0.5561
Val Loss: 1.9388 | Acc: 0.3967
--------------------------------------------------
Early stopping triggered at epoch 14
Training completed in 449.06 seconds

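Early stopping halted training at epoch 14: the validation loss reached its minimum at epoch 9 (1.7773) and did not improve over the next five epochs, while the training loss kept falling. This gap indicates overfitting, so the next step is to improve the CNN architecture to achieve better results.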
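
The stopping point can be checked by replaying the validation losses printed in the log above through the same early-stopping rule. The sketch below uses only the 13 logged values (the loss of epoch 14 is not printed because the loop breaks before the print statements), and `replay_early_stopping` is a hypothetical helper.

```python
def replay_early_stopping(losses, patience=5):
    """Return (best_epoch, best_loss, bad_epochs, stop_epoch) for a list of losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return best_epoch, best_loss, bad_epochs, epoch
    return best_epoch, best_loss, bad_epochs, None


# Validation losses for epochs 1-13, copied from the training log
logged_val_losses = [2.2268, 2.1139, 1.9254, 1.9267, 1.8200, 1.9632, 1.8308,
                     1.7867, 1.7773, 1.7864, 1.8128, 1.8696, 1.9388]

result = replay_early_stopping(logged_val_losses)
print(result)
```

After epoch 13 the counter stands at four non-improving epochs since the best loss at epoch 9, so one more epoch without improvement is enough to trigger the stop at epoch 14.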

Step 4: Training Performance Visualization

Now, let's build and use the plot_training_curves function to visualize loss and accuracy trends over epochs. Additionally, I will use that function for the future models.

Build a function

In [10]:
def plot_training_curves(train_losses, val_losses, train_accs, val_accs):
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Loss Curve
    axes[0].plot(train_losses, label="Train Loss")
    axes[0].plot(val_losses, label="Val Loss")
    axes[0].set_title("Loss Curve")
    axes[0].set_xlabel("Epochs")
    axes[0].set_ylabel("Loss")
    axes[0].legend()

    # Accuracy Curve
    axes[1].plot(train_accs, label="Train Acc")
    axes[1].plot(val_accs, label="Val Acc")
    axes[1].set_title("Accuracy Curve")
    axes[1].set_xlabel("Epochs")
    axes[1].set_ylabel("Accuracy")
    axes[1].legend()

    plt.show()

Run the function

Visualize how accuracy and loss change during training.

In [260]:
plot_training_curves(train_losses, val_losses, train_accs, val_accs)

The curves show why early stopping ended training: the validation loss stopped improving while the training loss continued to decrease.

Step 5: Final Model Evaluation

Evaluate the Simple CNN on the test data and store the results.

In [ ]:
# Evaluate the trained Simple CNN model
simple_cnn_metrics, simple_cnn_conf_matrix = evaluate_model(cnn_model, standardized_test_loader, standardized_test_dataset, device)

# Add training time
simple_cnn_metrics["Training Time (s)"] = training_time

# Print formatted evaluation results
print("Simple CNN Model Evaluation Metrics:")
print(simple_cnn_metrics)
Simple CNN Model Evaluation Metrics:
{'Accuracy': 0.3225806451612903, 'Precision': 0.3394414938174116, 'Recall': 0.3225806451612903, 'F1-Score': 0.3110693701907908, 'Training Time (s)': 449.0628399848938}

The Simple CNN produced much weaker results than the RF and XGBoost models (32.3% test accuracy), so in the next step I will build a more complex CNN to achieve better results.

3.5 Complex CNN ModelΒΆ

Step 1 Modify Data Loaders

In this step I exclusively for this model change data transformation to apply z-score standardization instead of min-max normalization.

Standardization may improve model performance when data distribution varies significantly across features, helping the network learn better feature representations and improving convergence.
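
To make the difference concrete, here is a minimal sketch of the arithmetic that ToTensor followed by Normalize applies to a single pixel. The pixel value is hypothetical, and the helper names (to_unit_range, standardize) are illustrative only; they are not torchvision functions.

```python
# ImageNet channel statistics, the same values passed to transforms.Normalize
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def to_unit_range(pixel):
    """Min-max scaling of an 8-bit value to [0, 1] (what ToTensor does for uint8 images)."""
    return pixel / 255.0


def standardize(value, mean, std):
    """Z-score style standardization: subtract the mean, divide by the std."""
    return (value - mean) / std


rgb = (128, 64, 255)  # hypothetical pixel, one value per channel
scaled = [to_unit_range(p) for p in rgb]
standardized = [
    standardize(v, m, s) for v, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)
]

print("min-max scaled:", [round(v, 3) for v in scaled])
print("standardized:  ", [round(v, 3) for v in standardized])
```

After min-max scaling every value lies in [0, 1]; after standardization the values are centered near zero and can be negative, which is the input range the pretrained ImageNet models expect.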

In [ ]:
# Data transformations with augmentation for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(150),  # Random crop and resize to 150x150
    transforms.RandomHorizontalFlip(),  # Random horizontal flip for augmentation
    transforms.ToTensor(),  # Convert image to tensor
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # Normalize (ImageNet stats)
])

# Transformations for validation and test (no augmentation)
val_test_transform = transforms.Compose([
    transforms.Resize(150),  # Resize the shorter side to 150 (aspect ratio is kept)
    transforms.CenterCrop(150),  # Center crop to match training size
    transforms.ToTensor(),  # Convert image to tensor
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # Normalize (ImageNet stats)
])

# Dataset directories
split_path = "split_data"
train_dir, val_dir, test_dir = [os.path.join(split_path, x) for x in ['train', 'val', 'test']]

# Create datasets
train_dataset = datasets.ImageFolder(train_dir, transform=train_transform)
val_dataset = datasets.ImageFolder(val_dir, transform=val_test_transform)
test_dataset = datasets.ImageFolder(test_dir, transform=val_test_transform)

# Data loaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

Step 2: Define Training Components

In [5]:
# Updated CNN Model definition
class ComplexCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(ComplexCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            nn.AdaptiveAvgPool2d((1, 1)) 
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(256, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Initialize model, loss function, optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
complexcnn_model = ComplexCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(complexcnn_model.parameters(), lr=0.001)

print(complexcnn_model)
ComplexCNN(
  (features): Sequential(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU()
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU()
    (17): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (18): AdaptiveAvgPool2d(output_size=(1, 1))
  )
  (classifier): Sequential(
    (0): Linear(in_features=256, out_features=1024, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=1024, out_features=512, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=512, out_features=10, bias=True)
  )
)

Structure of complex CNN model
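
As a rough sanity check on the printed structure, the spatial size and the number of trainable parameters can be worked out by hand. The sketch below assumes 150x150 inputs (as produced by the transforms above) and uses the standard formulas for 3x3 convolutions with padding 1, 2x2 max pooling, batch norm, and linear layers; the helper names are illustrative.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Output width/height of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output width/height of max pooling (floor mode)."""
    return (size - kernel) // stride + 1


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of a Conv2d layer."""
    return c_in * c_out * kernel * kernel + c_out


def bn_params(channels):
    """Learnable scale and shift of BatchNorm2d (running stats are buffers)."""
    return 2 * channels


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


# (in_channels, out_channels, followed_by_max_pool), in the order of ComplexCNN
conv_blocks = [
    (3, 16, False), (16, 32, True),
    (32, 64, False), (64, 128, True),
    (128, 256, True),
]

size = 150
sizes = [size]
total_params = 0
for c_in, c_out, pooled in conv_blocks:
    size = conv_out(size)
    total_params += conv_params(c_in, c_out) + bn_params(c_out)
    if pooled:
        size = pool_out(size)
    sizes.append(size)

for n_in, n_out in [(256, 1024), (1024, 512), (512, 10)]:
    total_params += linear_params(n_in, n_out)

print("spatial size after each conv block:", sizes)
print("trainable parameters:", total_params)
```

The adaptive average pooling then reduces the final feature map to 1x1, which is why the classifier starts from 256 input features regardless of the input resolution.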

Step 3: Training Loop

In [ ]:
# Training parameters
num_epochs = 100
train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []


# Initialize training time tracking
start_time = time.time()

# Training loop
for epoch in range(num_epochs):
    # Training phase
    complexcnn_model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = complexcnn_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * inputs.size(0)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    epoch_loss = running_loss / len(train_dataset)
    epoch_acc = correct / total
    train_losses.append(epoch_loss)
    train_accuracies.append(epoch_acc)
    
    # Validation phase
    complexcnn_model.eval()
    val_running_loss = 0.0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = complexcnn_model(inputs)
            loss = criterion(outputs, labels)
            
            val_running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            val_total += labels.size(0)
            val_correct += (predicted == labels).sum().item()
    
    val_epoch_loss = val_running_loss / len(val_dataset)
    val_epoch_acc = val_correct / val_total
    val_losses.append(val_epoch_loss)
    val_accuracies.append(val_epoch_acc)
    
    # Print epoch statistics
    print(f'Epoch {epoch+1}/{num_epochs}')
    print(f'Train Loss: {epoch_loss:.4f} | Acc: {epoch_acc:.4f}')
    print(f'Val Loss: {val_epoch_loss:.4f} | Acc: {val_epoch_acc:.4f}')
    print('-' * 50)

# Calculate total training time
training_time = time.time() - start_time
print(f"✅ Training completed in {training_time:.2f} seconds")

The per-epoch training log is omitted here because it is too long for 100 epochs.
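
One detail of the loop above is worth spelling out: it multiplies each batch loss by the batch size and divides by the dataset size, which gives a per-sample average. The transfer-learning loops later in this notebook instead divide the sum of batch losses by the number of batches. The two agree only when all batches have the same size. A minimal sketch with hypothetical numbers (70 samples, batch size 32, so the last batch has 6 samples):

```python
def sample_weighted_mean(batch_losses, batch_sizes):
    """Average loss per sample: weight each batch mean by its batch size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


def mean_of_batch_means(batch_losses):
    """Average of the per-batch mean losses, ignoring batch sizes."""
    return sum(batch_losses) / len(batch_losses)


# Hypothetical values: the small last batch happens to have a high loss
batch_sizes = [32, 32, 6]
batch_losses = [0.50, 0.40, 1.20]

weighted = sample_weighted_mean(batch_losses, batch_sizes)
unweighted = mean_of_batch_means(batch_losses)

print(f"per-sample average:     {weighted:.4f}")
print(f"average of batch means: {unweighted:.4f}")
```

The difference is usually small, but it matters when comparing loss values between models whose loops average differently.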

Step 4: Training Performance Visualization

Visualize how accuracy and loss change during training.

In [283]:
plot_training_curves(train_losses, val_losses, train_accuracies, val_accuracies)
[Figure: training and validation loss and accuracy curves for the Complex CNN]

The validation loss fluctuates, suggesting some instability or a tendency to overfit. The accuracy curves show consistent improvement, with validation accuracy occasionally surpassing training accuracy; this is an encouraging sign for generalization, although it probably also reflects that augmentation and dropout are active only during training. When I trained the model for more epochs, however, it started to overfit: validation loss and accuracy stayed flat while training loss and accuracy kept improving.

Step 5: Complex CNN Model Evaluation

Evaluate the Complex CNN on the test data and store the results.

In [285]:
# Evaluate the trained Complex CNN model
complex_cnn_metrics, complex_cnn_conf_matrix = evaluate_model(complexcnn_model, test_loader, test_dataset, device)

# Add training time
complex_cnn_metrics["Training Time (s)"] = training_time

# Print formatted evaluation results
print("Complex CNN Model Evaluation Metrics:")
print(complex_cnn_metrics)
Complex CNN Model Evaluation Metrics:
{'Accuracy': 0.632258064516129, 'Precision': 0.6426096184448588, 'Recall': 0.6322580645161291, 'F1-Score': 0.6319930702454176, 'Training Time (s)': 639.4477109909058}

The Complex CNN model performed significantly better than the Simple CNN model, achieving 63.2% accuracy. However, this is still much lower than Random Forest and XGBoost. To improve performance, I attempted multiple approaches, including tuning batch size, number of layers, neurons, different callbacks, learning rate, and other optimization techniques. Unfortunately, none of these experiments improved the CNN's training accuracy beyond 70%, indicating that a fundamental limitation exists in training CNNs from scratch on this dataset.

Given this, the next step is to use Transfer Learning to achieve better results with CNN-based models. Below, I outline possible reasons why CNN models performed worse than traditional ML models.


❗❗❗ Why Did CNN Models Perform Worse Than Traditional ML Models? ❗❗❗

Despite being powerful for image tasks, the Complex CNN model underperformed compared to Random Forest and XGBoost on the Animals-10 dataset. Several reasons could explain this:

  1. Limited Data (Only 200 Images per Class)

    • CNNs require large datasets to learn effective features. 200 images per class may not be enough for a deep CNN to generalize well, leading to overfitting or poor feature extraction.
    • Traditional ML models like Random Forest and XGBoost work well on small datasets because they rely on pre-defined feature extraction rather than learning from scratch.
  2. Lack of Pretrained Knowledge

    • CNNs trained from scratch start with random weights, requiring extensive training data to learn meaningful patterns.
    • ML models, on the other hand, work well with smaller datasets as they leverage structured data features rather than learning hierarchical patterns from raw pixels.
  3. Insufficient Training Time & Hyperparameter Tuning

    • Training a deep CNN from scratch with limited data requires more epochs and fine-tuning, which may not have been fully optimized.
    • ML models can reach high accuracy quickly without requiring GPU-intensive training.

Why Is Transfer Learning the Solution?

Instead of training CNN models from scratch, Transfer Learning lets us leverage powerful pretrained CNN architectures (e.g., ResNet, EfficientNet, MobileNet) trained on massive datasets (ImageNet). Benefits:

✅ Pretrained Features: These models have already learned powerful, generalizable features from millions of images.
✅ Better Accuracy: Instead of learning from scratch, the model adapts existing knowledge to classify Animals-10 more effectively.
✅ Faster Training: Requires less data and training time, as only the top layers need to be fine-tuned.
✅ Less Overfitting: Since pretrained models have seen diverse image variations, they tend to generalize better on small datasets.

🔹 Next Step: Implement Transfer Learning to enhance CNN-based models and outperform traditional ML approaches.

3.6 Transfer Learning models

3.6.1 EfficientNet-B2 Model

Now, I will implement EfficientNet-B2, a state-of-the-art CNN architecture optimized for accuracy and efficiency.

Unlike the custom CNNs, EfficientNet-B2 is pre-trained on ImageNet, meaning it already understands image features like edges, textures, and shapes. I will replace its final layer and fine-tune the network for our specific classification task.

Step 1: Load Pre-Trained EfficientNet-B2

Use a pre-trained EfficientNet-B2 and replace the final classification layer to match the 10 classes.

In [112]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load EfficientNet-B2 Pretrained Model
efficientnet_b2 = models.efficientnet_b2(pretrained=True)

# Modify the final classification layer
num_classes = len(transfer_train_dataset.classes)
efficientnet_b2.classifier[1] = nn.Linear(efficientnet_b2.classifier[1].in_features, num_classes)

# Move model to GPU if available
efficientnet_b2 = efficientnet_b2.to(device)

Step 2: Define Loss and Optimizer

Use CrossEntropyLoss for multi-class classification and Adam optimizer for efficient training.

In [113]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(efficientnet_b2.parameters(), lr=0.0001)

Step 3: Fine-Tuning the Model

Fine-tune EfficientNet-B2 on the dataset while tracking loss & accuracy.

In [114]:
def train_transfer_model(model, train_loader, val_loader, criterion, optimizer, epochs=10):
    model.train()
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    start_time = time.time()

    for epoch in range(epochs):
        total_loss, correct_train, total_train = 0, 0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            correct_train += (preds == labels).sum().item()
            total_train += labels.size(0)

        train_loss = total_loss / len(train_loader)
        train_acc = correct_train / total_train
        train_losses.append(train_loss)
        train_accs.append(train_acc)

        # Validate model
        model.eval()
        with torch.no_grad():
            total_val_loss, correct_val, total_val = 0, 0, 0
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)

                outputs = model(images)
                val_loss = criterion(outputs, labels)

                total_val_loss += val_loss.item()
                _, preds = torch.max(outputs, 1)
                correct_val += (preds == labels).sum().item()
                total_val += labels.size(0)

        val_loss = total_val_loss / len(val_loader)
        val_acc = correct_val / total_val
        val_losses.append(val_loss)
        val_accs.append(val_acc)

        print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        model.train()

    training_time = time.time() - start_time
    return training_time, train_losses, val_losses, train_accs, val_accs

# Train EfficientNet-B2
training_time, train_losses, val_losses, train_accs, val_accs = train_transfer_model(efficientnet_b2, transfer_train_loader, transfer_val_loader, criterion, optimizer, epochs=10)
Epoch [1/10] - Train Loss: 1.7012, Train Acc: 0.6194 | Val Loss: 0.8963, Val Acc: 0.9233
Epoch [2/10] - Train Loss: 0.5561, Train Acc: 0.9482 | Val Loss: 0.2976, Val Acc: 0.9667
Epoch [3/10] - Train Loss: 0.2159, Train Acc: 0.9647 | Val Loss: 0.1843, Val Acc: 0.9767
Epoch [4/10] - Train Loss: 0.0956, Train Acc: 0.9906 | Val Loss: 0.1566, Val Acc: 0.9667
Epoch [5/10] - Train Loss: 0.0577, Train Acc: 0.9935 | Val Loss: 0.1441, Val Acc: 0.9700
Epoch [6/10] - Train Loss: 0.0609, Train Acc: 0.9935 | Val Loss: 0.1191, Val Acc: 0.9767
Epoch [7/10] - Train Loss: 0.0321, Train Acc: 0.9957 | Val Loss: 0.1341, Val Acc: 0.9633
Epoch [8/10] - Train Loss: 0.0253, Train Acc: 0.9971 | Val Loss: 0.1327, Val Acc: 0.9633
Epoch [9/10] - Train Loss: 0.0190, Train Acc: 0.9986 | Val Loss: 0.1152, Val Acc: 0.9733
Epoch [10/10] - Train Loss: 0.0212, Train Acc: 0.9971 | Val Loss: 0.1208, Val Acc: 0.9700

Step 4: Plot Accuracy & Loss Curves

Visualize how accuracy and loss change during training.

In [ ]:
plot_training_curves(train_losses, val_losses, train_accs, val_accs)
[Figure: training and validation loss and accuracy curves for EfficientNet-B2]

After 3 epochs, EfficientNet-B2 reached a validation accuracy of about 97.7%, and later epochs did not improve on it. So for future models, I will use early stopping to prevent overfitting.
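
As a quick check on where training could have stopped, the validation values printed in the log above can be scanned for the best epoch. This is only a sketch over the rounded numbers from the log; the variable names are illustrative.

```python
# Validation metrics copied from the EfficientNet-B2 training log (epochs 1-10)
val_losses = [0.8963, 0.2976, 0.1843, 0.1566, 0.1441,
              0.1191, 0.1341, 0.1327, 0.1152, 0.1208]
val_accs = [0.9233, 0.9667, 0.9767, 0.9667, 0.9700,
            0.9767, 0.9633, 0.9633, 0.9733, 0.9700]

# Epoch (1-based) with the lowest validation loss
best_loss_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# All epochs (1-based) that reach the highest validation accuracy
best_acc = max(val_accs)
best_acc_epochs = [i + 1 for i, acc in enumerate(val_accs) if acc == best_acc]

print("lowest validation loss at epoch:", best_loss_epoch)
print("highest validation accuracy at epochs:", best_acc_epochs)
```

On these rounded values, validation accuracy peaks at epochs 3 and 6 while validation loss reaches its minimum at epoch 9, so the stopping point depends on which metric is monitored.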

Step 5: Evaluate EfficientNet-B2 Performance

Compute accuracy, precision, recall, and F1-score for EfficientNet-B2.

In [116]:
effnet_metrics, effnet_conf_matrix = evaluate_model(efficientnet_b2, transfer_test_loader, transfer_test_dataset, device)

# Add training time
effnet_metrics["Training Time (s)"] = training_time

print("EfficientNet-B2 Evaluation Metrics:")
print(effnet_metrics)
EfficientNet-B2 Evaluation Metrics:
{'Accuracy': 0.9741935483870968, 'Precision': 0.9741451149425288, 'Recall': 0.9741935483870968, 'F1-Score': 0.9739708561020036, 'Training Time (s)': 105.8400604724884}
In [117]:
plot_confusion_matrix(effnet_conf_matrix, transfer_test_dataset)
[Figure: confusion matrix for EfficientNet-B2 on the test set]

On the test set, EfficientNet-B2 achieved an accuracy of 97.4%. Precision, recall, and F1-score are also high, indicating that the model accurately classifies most images. The confusion matrix is nearly diagonal, indicating few misclassifications.
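
In the reported metrics, Recall equals Accuracy up to floating-point rounding, which is what support-weighted averaging produces. Since evaluate_model is defined earlier in the notebook, treating its averages as support-weighted is an assumption here. The sketch below derives such metrics from a small hypothetical confusion matrix; weighted_metrics is an illustrative helper, not the notebook's function.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall and F1 from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                                # true samples of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # predictions of class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1-Score": f1}


# Hypothetical 3-class confusion matrix with 10 samples per class
toy_cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
metrics = weighted_metrics(toy_cm)
print(metrics)
```

With this kind of averaging, recall always matches accuracy, so precision and F1-score are the values that add information beyond the accuracy figure.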

3.6.2 ResNet-50 Model

ResNet-50 is a popular CNN architecture known for its residual connections, which help prevent vanishing gradients and improve training.

To avoid repeating the mistakes I made while training EfficientNet-B2, I will improve the ResNet-50 training with:

  • Early stopping → Stop training if validation loss stops improving to avoid overfitting.
  • Model checkpointing → Save the best model during training for later use.

Additionally, I will keep only the last few layers unfrozen → the model fine-tunes its deepest layers for better feature extraction.
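
The early stopping and checkpointing logic can be sketched independently of PyTorch. The class below is a hypothetical helper (EarlyStopping is not a PyTorch class), the validation losses are made up, and a plain dictionary stands in for model.state_dict().

```python
import copy


class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = None
        self.counter = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # New best: checkpoint a copy of the state and reset the counter
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            self.best_epoch = epoch
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.48]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    state = {"epoch": epoch}  # stand-in for model.state_dict()
    if stopper.step(epoch, loss, state):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at)
print("best epoch:", stopper.best_epoch, "best loss:", stopper.best_loss)
```

Copying the state at each improvement is what allows the best weights, rather than the last ones, to be restored after stopping.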

Step 1: Load Pre-Trained ResNet-50

Load the pre-trained ResNet-50 and replace the fully connected layer to match the dataset (10 classes).

In [ ]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load ResNet-50 Pretrained Model
resnet50 = models.resnet50(pretrained=True)

# Freeze every parameter tensor except the last 10 so only the final layers are fine-tuned
for param in list(resnet50.parameters())[:-10]:  # each weight/bias tensor counts separately
    param.requires_grad = False

# Modify the final classification layer
num_classes = len(transfer_train_dataset.classes)
resnet50.fc = nn.Linear(resnet50.fc.in_features, num_classes)

# Move model to GPU if available
resnet50 = resnet50.to(device)
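
Slicing parameters() counts parameter tensors, not layers: each weight and each bias is a separate entry. The sketch below imitates the slice on a hard-coded list of names so it runs without PyTorch. The names follow the layout I would expect from torchvision's ResNet-50 named_parameters() (an assumption worth verifying on the real model), and split_trainable is a hypothetical helper.

```python
# Assumed tail of ResNet-50's parameter list, in registration order
tail_names = [
    "layer4.2.conv1.weight",
    "layer4.2.bn1.weight", "layer4.2.bn1.bias",
    "layer4.2.conv2.weight",
    "layer4.2.bn2.weight", "layer4.2.bn2.bias",
    "layer4.2.conv3.weight",
    "layer4.2.bn3.weight", "layer4.2.bn3.bias",
    "fc.weight", "fc.bias",
]


def split_trainable(names, keep_last):
    """Mimic `for p in params[:-keep_last]: p.requires_grad = False`."""
    frozen = names[:-keep_last]
    trainable = names[-keep_last:]
    return frozen, trainable


frozen, trainable = split_trainable(tail_names, keep_last=10)

# Group the trainable tensors by the module they belong to
trainable_modules = sorted({name.rsplit(".", 1)[0] for name in trainable})

print("still trainable tensors:", len(trainable))
print("modules they belong to:", trainable_modules)
```

Under that assumption, the slice leaves trainable only part of the last bottleneck block plus the original fc layer, and the replacement fc layer created afterwards is trainable by default.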

Step 2: Define Loss and Optimizer

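Use CrossEntropyLoss for multi-class classification, the Adam optimizer for efficient training, and a ReduceLROnPlateau scheduler that halves the learning rate when the validation loss stops improving.

In [ ]: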
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(resnet50.parameters(), lr=0.0001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)
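
To see what these scheduler settings do, here is a simplified pure-Python re-implementation of a reduce-on-plateau rule in 'min' mode. It is a sketch, not PyTorch's code: it assumes a relative improvement threshold of 1e-4 and ignores options such as cooldown and min_lr. The validation losses are hypothetical.

```python
def simulate_plateau(val_losses, lr=1e-4, factor=0.5, patience=3, threshold=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1 - threshold):
            # Meaningful improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` epochs without improvement: shrink the rate
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: improvement stalls after epoch 3
losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.16, 0.155, 0.14]
lr_history = simulate_plateau(losses)

for epoch, (loss, rate) in enumerate(zip(losses, lr_history), start=1):
    print(f"epoch {epoch}: val_loss={loss:.3f} lr={rate:.6f}")
```

In this sketch the rate is halved only on the fourth consecutive epoch without improvement, because the rule waits for more than patience bad epochs.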

Step 3: Fine-Tuning the Model

Fine-tune ResNet-50 on our dataset while tracking loss & accuracy.

In [ ]:
def train_transfer_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=20, patience=5):
    model.train()
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    start_time = time.time()
    
    best_val_loss = float("inf")
    best_model_wts = copy.deepcopy(model.state_dict())
    early_stop_counter = 0

    for epoch in range(epochs):
        total_loss, correct_train, total_train = 0, 0, 0

        # Training phase
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            correct_train += (preds == labels).sum().item()
            total_train += labels.size(0)

        train_loss = total_loss / len(train_loader)
        train_acc = correct_train / total_train
        train_losses.append(train_loss)
        train_accs.append(train_acc)

        # Validation phase
        model.eval()
        with torch.no_grad():
            total_val_loss, correct_val, total_val = 0, 0, 0
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)

                outputs = model(images)
                val_loss = criterion(outputs, labels)

                total_val_loss += val_loss.item()
                _, preds = torch.max(outputs, 1)
                correct_val += (preds == labels).sum().item()
                total_val += labels.size(0)

        val_loss = total_val_loss / len(val_loader)
        val_acc = correct_val / total_val
        val_losses.append(val_loss)
        val_accs.append(val_acc)

        # Learning Rate Scheduler Update
        scheduler.step(val_loss)

        print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # Check for best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_wts = copy.deepcopy(model.state_dict())
            early_stop_counter = 0
            print("🔥 New Best Model Saved!")
        else:
            early_stop_counter += 1
            print(f"⏳ Early Stop Counter: {early_stop_counter}/{patience}")

        # Early Stopping
        if early_stop_counter >= patience:
            print("⏹ Early stopping triggered! Restoring best model.")
            model.load_state_dict(best_model_wts)
            break

    training_time = time.time() - start_time
    return training_time, train_losses, val_losses, train_accs, val_accs, best_model_wts

# Train ResNet-50 
training_time, train_losses, val_losses, train_accs, val_accs, best_model_wts = train_transfer_model(
    resnet50, transfer_train_loader, transfer_val_loader, criterion, optimizer, scheduler, epochs=20, patience=5
)

# Load best model weights
resnet50.load_state_dict(best_model_wts)
Epoch [1/20] - Train Loss: 1.0279, Train Acc: 0.8151 | Val Loss: 0.2806, Val Acc: 0.9667
🔥 New Best Model Saved!
Epoch [2/20] - Train Loss: 0.2037, Train Acc: 0.9712 | Val Loss: 0.1850, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [3/20] - Train Loss: 0.1057, Train Acc: 0.9928 | Val Loss: 0.1454, Val Acc: 0.9733
🔥 New Best Model Saved!
Epoch [4/20] - Train Loss: 0.0634, Train Acc: 0.9964 | Val Loss: 0.1363, Val Acc: 0.9667
🔥 New Best Model Saved!
Epoch [5/20] - Train Loss: 0.0457, Train Acc: 0.9957 | Val Loss: 0.1301, Val Acc: 0.9733
🔥 New Best Model Saved!
Epoch [6/20] - Train Loss: 0.0436, Train Acc: 0.9964 | Val Loss: 0.1202, Val Acc: 0.9667
🔥 New Best Model Saved!
Epoch [7/20] - Train Loss: 0.0290, Train Acc: 0.9986 | Val Loss: 0.1176, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [8/20] - Train Loss: 0.0248, Train Acc: 0.9993 | Val Loss: 0.1138, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [9/20] - Train Loss: 0.0189, Train Acc: 1.0000 | Val Loss: 0.1091, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [10/20] - Train Loss: 0.0174, Train Acc: 0.9993 | Val Loss: 0.1193, Val Acc: 0.9733
⏳ Early Stop Counter: 1/5
Epoch [11/20] - Train Loss: 0.0188, Train Acc: 0.9978 | Val Loss: 0.1207, Val Acc: 0.9700
⏳ Early Stop Counter: 2/5
Epoch [12/20] - Train Loss: 0.0151, Train Acc: 0.9986 | Val Loss: 0.1037, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [13/20] - Train Loss: 0.0111, Train Acc: 1.0000 | Val Loss: 0.1091, Val Acc: 0.9700
⏳ Early Stop Counter: 1/5
Epoch [14/20] - Train Loss: 0.0105, Train Acc: 0.9993 | Val Loss: 0.1032, Val Acc: 0.9700
🔥 New Best Model Saved!
Epoch [15/20] - Train Loss: 0.0082, Train Acc: 1.0000 | Val Loss: 0.1095, Val Acc: 0.9700
⏳ Early Stop Counter: 1/5
Epoch [16/20] - Train Loss: 0.0075, Train Acc: 1.0000 | Val Loss: 0.1083, Val Acc: 0.9667
⏳ Early Stop Counter: 2/5
Epoch [17/20] - Train Loss: 0.0076, Train Acc: 1.0000 | Val Loss: 0.1114, Val Acc: 0.9667
⏳ Early Stop Counter: 3/5
Epoch [18/20] - Train Loss: 0.0077, Train Acc: 0.9993 | Val Loss: 0.1041, Val Acc: 0.9700
⏳ Early Stop Counter: 4/5
Epoch [19/20] - Train Loss: 0.0062, Train Acc: 1.0000 | Val Loss: 0.1037, Val Acc: 0.9700
⏳ Early Stop Counter: 5/5
⏹ Early stopping triggered! Restoring best model.
Out[ ]:
<All keys matched successfully>

Step 4: Plot Accuracy & Loss Curves

Visualize how accuracy and loss change during training.

In [127]:
plot_training_curves(train_losses, val_losses, train_accs, val_accs)
[Figure: training and validation loss and accuracy curves for ResNet-50]

Because I set a fairly large patience for early stopping (5), training stopped only at epoch 19 of 20, with a validation accuracy of 97%. The restored best weights are from epoch 14, which had the lowest validation loss (0.1032).

Step 5: Evaluate ResNet-50 Performance

Compute accuracy, precision, recall, and F1-score for ResNet-50.

In [128]:
resnet_metrics, resnet_conf_matrix = evaluate_model(resnet50, transfer_test_loader, transfer_test_dataset, device)

# Add training time
resnet_metrics["Training Time (s)"] = training_time

print("ResNet-50 Evaluation Metrics:")
print(resnet_metrics)
ResNet-50 Evaluation Metrics:
{'Accuracy': 0.9709677419354839, 'Precision': 0.9713257575757577, 'Recall': 0.970967741935484, 'F1-Score': 0.9709032656778558, 'Training Time (s)': 136.02926087379456}

Display confusion matrix for ResNet-50

In [129]:
plot_confusion_matrix(resnet_conf_matrix, transfer_test_dataset)
[Figure: confusion matrix for ResNet-50 on the test set]

ResNet-50 showed slightly lower results than EfficientNet-B2, with an accuracy of 97.1% on the test set. Precision, recall, and F1-score are also slightly lower, indicating a few more misclassifications.

The confusion matrix for ResNet-50 is very similar to that of EfficientNet-B2, with misclassifications in the same positions. This suggests that some images are very hard to classify; I will analyze them later. Before that, I will try other models.

3.6.3 MobileNetV3-Small

Now, I will train MobileNetV3-Small. It is smaller and faster than ResNet or EfficientNet but still provides good accuracy.

Like before, I will:

  • Keep only the last layers trainable (here, the last 5 parameter tensors)
  • Use early stopping (with patience decreased to 3)
  • Save the best model

Step 1: Load Pre-Trained MobileNetV3-Small

Freeze everything except the last 5 parameter tensors and replace the classifier output layer for the 10-class dataset.

In [ ]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load MobileNetV3-Small Pretrained Model
mobilenet_v3 = models.mobilenet_v3_small(pretrained=True)

# Freeze every parameter tensor except the last 5 so only the final layers are fine-tuned
for param in list(mobilenet_v3.parameters())[:-5]:  # each weight/bias tensor counts separately
    param.requires_grad = False

# Modify the final classification layer
num_classes = len(transfer_train_dataset.classes)
mobilenet_v3.classifier[3] = nn.Linear(mobilenet_v3.classifier[3].in_features, num_classes)

# Move model to GPU if available
mobilenet_v3 = mobilenet_v3.to(device)

Step 2: Define Loss, Optimizer & Scheduler

Use CrossEntropyLoss and Adam optimizer, with learning rate scheduling to adjust training dynamically.

In [ ]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(mobilenet_v3.parameters(), lr=0.0001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)

Step 3: Implement Early Stopping & Checkpointing

In [144]:
def train_transfer_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=20, patience=5):
    model.train()
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    start_time = time.time()
    
    best_val_loss = float("inf")
    best_model_wts = copy.deepcopy(model.state_dict())
    early_stop_counter = 0

    for epoch in range(epochs):
        total_loss, correct_train, total_train = 0, 0, 0

        # Training phase
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            correct_train += (preds == labels).sum().item()
            total_train += labels.size(0)

        train_loss = total_loss / len(train_loader)
        train_acc = correct_train / total_train
        train_losses.append(train_loss)
        train_accs.append(train_acc)

        # Validation phase
        model.eval()
        with torch.no_grad():
            total_val_loss, correct_val, total_val = 0, 0, 0
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)

                outputs = model(images)
                val_loss = criterion(outputs, labels)

                total_val_loss += val_loss.item()
                _, preds = torch.max(outputs, 1)
                correct_val += (preds == labels).sum().item()
                total_val += labels.size(0)

        val_loss = total_val_loss / len(val_loader)
        val_acc = correct_val / total_val
        val_losses.append(val_loss)
        val_accs.append(val_acc)

        # Learning Rate Scheduler Update
        scheduler.step(val_loss)

        print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # Check for best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_wts = copy.deepcopy(model.state_dict())
            early_stop_counter = 0
            print("🔥 New Best Model Saved!")
        else:
            early_stop_counter += 1
            print(f"⏳ Early Stop Counter: {early_stop_counter}/{patience}")

        # Early Stopping
        if early_stop_counter >= patience:
            print("⏹ Early stopping triggered! Restoring best model.")
            model.load_state_dict(best_model_wts)
            break

    training_time = time.time() - start_time
    return training_time, train_losses, val_losses, train_accs, val_accs, best_model_wts

# Train MobileNetV3-Small with Early Stopping
training_time, train_losses, val_losses, train_accs, val_accs, best_model_wts = train_transfer_model(
    mobilenet_v3, transfer_train_loader, transfer_val_loader, criterion, optimizer, scheduler, epochs=20, patience=3
)

# Load best model weights
mobilenet_v3.load_state_dict(best_model_wts)
Epoch [1/20] - Train Loss: 1.8422, Train Acc: 0.4827 | Val Loss: 1.4235, Val Acc: 0.6900
🔥 New Best Model Saved!
Epoch [2/20] - Train Loss: 1.2387, Train Acc: 0.7669 | Val Loss: 0.9999, Val Acc: 0.8167
🔥 New Best Model Saved!
Epoch [3/20] - Train Loss: 0.8664, Train Acc: 0.8482 | Val Loss: 0.8011, Val Acc: 0.8500
🔥 New Best Model Saved!
Epoch [4/20] - Train Loss: 0.6658, Train Acc: 0.8727 | Val Loss: 0.6836, Val Acc: 0.8500
🔥 New Best Model Saved!
Epoch [5/20] - Train Loss: 0.5501, Train Acc: 0.8856 | Val Loss: 0.5913, Val Acc: 0.8567
🔥 New Best Model Saved!
Epoch [6/20] - Train Loss: 0.4547, Train Acc: 0.9036 | Val Loss: 0.5113, Val Acc: 0.8933
🔥 New Best Model Saved!
Epoch [7/20] - Train Loss: 0.4187, Train Acc: 0.9036 | Val Loss: 0.4474, Val Acc: 0.9033
🔥 New Best Model Saved!
Epoch [8/20] - Train Loss: 0.3596, Train Acc: 0.9216 | Val Loss: 0.4069, Val Acc: 0.9067
🔥 New Best Model Saved!
Epoch [9/20] - Train Loss: 0.3375, Train Acc: 0.9137 | Val Loss: 0.3767, Val Acc: 0.8933
🔥 New Best Model Saved!
Epoch [10/20] - Train Loss: 0.3110, Train Acc: 0.9216 | Val Loss: 0.3566, Val Acc: 0.8967
🔥 New Best Model Saved!
Epoch [11/20] - Train Loss: 0.2732, Train Acc: 0.9374 | Val Loss: 0.3486, Val Acc: 0.8967
🔥 New Best Model Saved!
Epoch [12/20] - Train Loss: 0.2655, Train Acc: 0.9338 | Val Loss: 0.3346, Val Acc: 0.8933
🔥 New Best Model Saved!
Epoch [13/20] - Train Loss: 0.2498, Train Acc: 0.9410 | Val Loss: 0.3347, Val Acc: 0.8867
⏳ Early Stop Counter: 1/3
Epoch [14/20] - Train Loss: 0.2414, Train Acc: 0.9403 | Val Loss: 0.3329, Val Acc: 0.8933
πŸ”₯ New Best Model Saved!
Epoch [15/20] - Train Loss: 0.2296, Train Acc: 0.9410 | Val Loss: 0.3143, Val Acc: 0.8833
πŸ”₯ New Best Model Saved!
Epoch [16/20] - Train Loss: 0.2051, Train Acc: 0.9540 | Val Loss: 0.3147, Val Acc: 0.8900
⏳ Early Stop Counter: 1/3
Epoch [17/20] - Train Loss: 0.1974, Train Acc: 0.9554 | Val Loss: 0.3102, Val Acc: 0.8967
πŸ”₯ New Best Model Saved!
Epoch [18/20] - Train Loss: 0.1849, Train Acc: 0.9518 | Val Loss: 0.3115, Val Acc: 0.8900
⏳ Early Stop Counter: 1/3
Epoch [19/20] - Train Loss: 0.1798, Train Acc: 0.9576 | Val Loss: 0.3110, Val Acc: 0.9000
⏳ Early Stop Counter: 2/3
Epoch [20/20] - Train Loss: 0.1623, Train Acc: 0.9612 | Val Loss: 0.3093, Val Acc: 0.8967
πŸ”₯ New Best Model Saved!
Out[144]:
<All keys matched successfully>

Early stopping was never triggered (the counter never reached 3), so the model trained for all 20 epochs.
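The log above is consistent with a patience-based rule keyed on validation loss: the counter resets whenever the loss reaches a new minimum, and training would stop once the counter reaches the patience value. The following is a minimal sketch of that rule, not the notebook's `train_transfer_model` itself. The `EarlyStopping` class and its `step` method are hypothetical names, and "improvement" is assumed to mean a strictly lower validation loss.

```python
class EarlyStopping:
    """Patience-based early stopping on validation loss (illustrative helper)."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss:
            # New best loss: remember it and reset the counter
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Validation losses copied from the MobileNetV3-Small log above
val_losses_log = [
    1.4235, 0.9999, 0.8011, 0.6836, 0.5913, 0.5113, 0.4474, 0.4069, 0.3767, 0.3566,
    0.3486, 0.3346, 0.3347, 0.3329, 0.3143, 0.3147, 0.3102, 0.3115, 0.3110, 0.3093,
]

stopper = EarlyStopping(patience=3)
stopped_at = None
max_counter = 0
for epoch, loss in enumerate(val_losses_log, start=1):
    should_stop = stopper.step(loss)
    max_counter = max(max_counter, stopper.counter)
    if should_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss, max_counter)
```

Replaying the logged losses through this rule never exhausts the patience of 3, which matches the run completing all 20 epochs.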

Step 4: Plot Accuracy & Loss Curves

Visualize training progress.

In [145]:
plot_training_curves(train_losses, val_losses, train_accs, val_accs)
[Figure: training and validation loss and accuracy curves for MobileNetV3-Small]

The training curves closely resemble those of the previous models. Validation accuracy climbs quickly to about 90% by epoch 7 and then plateaus between roughly 88% and 91%, while training accuracy keeps rising to about 96%. The widening gap suggests that the later epochs mostly fit the training set rather than improve generalization.

Step 5: Evaluate MobileNetV3-Small Performance

Compute accuracy, precision, recall, and F1-score for MobileNetV3-Small.

In [146]:
mobilenet_metrics, mobilenet_conf_matrix = evaluate_model(mobilenet_v3, transfer_test_loader, transfer_test_dataset, device)

# Add training time
mobilenet_metrics["Training Time (s)"] = training_time

print("MobileNetV3-Small Evaluation Metrics:")
print(mobilenet_metrics)
MobileNetV3-Small Evaluation Metrics:
{'Accuracy': 0.9129032258064517, 'Precision': 0.9175651917658559, 'Recall': 0.9129032258064516, 'F1-Score': 0.9136413690288328, 'Training Time (s)': 119.07905864715576}

The model showed results very similar to XGBoost, with an accuracy of 91.3% on the test set. Precision, recall, and F1-score are also lower than those of the other transfer-learning models, indicating more misclassifications.
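Accuracy and recall in the output above agree up to floating-point rounding, which is what support-weighted averaging produces: weighted recall is algebraically equal to accuracy. The NumPy sketch below illustrates this, assuming `evaluate_model` uses weighted averaging (its definition lies outside this section). The `weighted_metrics` helper is a hypothetical name, and the 3-class confusion matrix is a made-up toy example, not project data.

```python
import numpy as np

def weighted_metrics(conf_matrix):
    """Accuracy plus support-weighted precision, recall and F1.

    Rows are true classes, columns are predicted classes.
    """
    cm = np.asarray(conf_matrix, dtype=float)
    true_pos = np.diag(cm)
    support = cm.sum(axis=1)      # samples per true class
    predicted = cm.sum(axis=0)    # samples per predicted class

    # Classes with an empty row or column get a score of 0
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, support, out=np.zeros_like(true_pos), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(true_pos), where=denom > 0)

    weights = support / support.sum()
    return {
        "Accuracy": float(true_pos.sum() / cm.sum()),
        "Precision": float(weights @ precision),
        "Recall": float(weights @ recall),
        "F1-Score": float(weights @ f1),
    }

# Toy confusion matrix for illustration only
toy_cm = [[8, 1, 1],
          [2, 6, 2],
          [0, 1, 9]]
toy_metrics = weighted_metrics(toy_cm)
print(toy_metrics)
```

Because the per-class recalls are weighted by class support, the supports cancel and the weighted recall reduces to correct predictions divided by total samples.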

Finally, let's move on to the next section to compare all models and analyze the results.

4. Models Evaluation & Interpretation

4.1 Comparing Model Performance

Let's compare the results of all models, choose the best one, and then analyze that model's results in more depth.

In [290]:
# Store all metrics (named model_metrics so the torchvision `models` import is not shadowed)
model_metrics = {
    "Random Forest": rf_metrics,
    "XGBoost": xgb_metrics,
    "Simple CNN": simple_cnn_metrics,
    "Complex CNN": complex_cnn_metrics,
    "EfficientNet": effnet_metrics,
    "ResNet": resnet_metrics,
    "MobileNet": mobilenet_metrics,
}

metrics_keys = ["Accuracy", "Precision", "Recall", "F1-Score", "Training Time (s)"]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle("Comparison of Model Metrics", fontsize=16)

# Hide the sixth subplot
axes[1, 2].axis('off')

for idx, metric in enumerate(metrics_keys):
    row = idx // 3
    col = idx % 3
    ax = axes[row, col]

    # Extract values for the current metric and sort
    sorted_models = sorted(model_metrics.items(), key=lambda x: x[1][metric], reverse=True)
    model_names = [m[0] for m in sorted_models]
    values = [m[1][metric] for m in sorted_models]

    # Plot bar chart
    ax.barh(model_names, values, color=plt.cm.Paired(np.linspace(0, 1, len(model_metrics))))
    ax.set_xlabel(metric)
    ax.set_title(f"{metric} Comparison")
    ax.invert_yaxis()

plt.tight_layout(rect=[0, 0, 1, 0.96])

# Adjust positions of the second row's subplots to center them
ax3 = axes[1, 0]
ax4 = axes[1, 1]

# Get current positions
pos3 = ax3.get_position()
pos4 = ax4.get_position()

# Calculate new x positions to center both plots
total_width = pos4.x1 - pos3.x0
new_x0 = (1 - total_width) / 2

# Set new positions
ax3.set_position([new_x0, pos3.y0, pos3.width, pos3.height])
ax4.set_position([new_x0 + (pos3.x1 - pos3.x0) + (pos4.x0 - pos3.x1), 
                  pos4.y0, pos4.width, pos4.height])

plt.show()
[Figure: horizontal bar charts comparing all models on accuracy, precision, recall, F1-score, and training time]

EfficientNet and ResNet achieve the highest accuracy, precision, recall, and F1-score, making them the best-performing models. Random Forest and XGBoost also perform well, only slightly below the transfer-learning models. The Simple CNN and Complex CNN perform significantly worse, especially in accuracy and F1-score. The deep learning models, however, take much longer to train: the Complex CNN is the slowest, while Random Forest is the fastest.
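When ranking models it helps to remember that training time is a "lower is better" metric, unlike the other four. A minimal sketch of a direction-aware ranking follows. The `rank_models` helper and the `LOWER_IS_BETTER` set are hypothetical names, and the numbers are placeholders for illustration only, not the project's results.

```python
# Metrics where a smaller value is better; everything else is "higher is better"
LOWER_IS_BETTER = {"Training Time (s)"}

def rank_models(all_metrics, metric):
    """Return model names ordered from best to worst for one metric."""
    reverse = metric not in LOWER_IS_BETTER
    return sorted(all_metrics, key=lambda name: all_metrics[name][metric], reverse=reverse)

# Placeholder numbers for illustration only, NOT the project's results
example_metrics = {
    "model_a": {"Accuracy": 0.90, "Training Time (s)": 300.0},
    "model_b": {"Accuracy": 0.95, "Training Time (s)": 900.0},
    "model_c": {"Accuracy": 0.80, "Training Time (s)": 20.0},
}

by_accuracy = rank_models(example_metrics, "Accuracy")
by_time = rank_models(example_metrics, "Training Time (s)")
print(by_accuracy)
print(by_time)
```

The same function could be applied to the notebook's metrics dictionary, since each entry maps a model name to a dictionary with these keys.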

4.2 Analyzing Misclassifications

Since EfficientNet-B2 achieved the best results, I will analyze its misclassifications to understand where it fails and why.

What I Do:

  • Identify misclassified images from the test set.
  • Display them in a grid plot (2 rows x 4 columns).
  • Use green text for the true label and red text for the predicted label.
  • Show model confidence in prediction (probability of predicted class).

Why EfficientNet-B2?

It showed the best accuracy, so analyzing its mistakes will give us the most useful insights about dataset complexity.

Step 1: Identify Misclassified Images

In [ ]:
def get_misclassified_images(model, dataloader, dataset, device, num_images=8):
    model.eval()
    misclassified = []

    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            probs = torch.nn.functional.softmax(outputs, dim=1) 
            confidences, preds = torch.max(probs, 1)

            for i in range(len(labels)):
                if preds[i] != labels[i]:  # If misclassified
                    misclassified.append((images[i].cpu(), labels[i].cpu(), preds[i].cpu(), confidences[i].cpu().item()))
                if len(misclassified) >= num_images:
                    return misclassified
    return misclassified

# Get 8 misclassified images
misclassified_samples = get_misclassified_images(efficientnet_b2, transfer_test_loader, transfer_test_dataset, device, num_images=8)

Step 2: Visualize Misclassifications

In [ ]:
def plot_misclassified_images(misclassified_samples, class_names):
    fig, axes = plt.subplots(2, 4, figsize=(14, 7))
    plt.subplots_adjust(top=0.85)  # Adjust top margin for titles

    for i, (image, true_label, pred_label, confidence) in enumerate(misclassified_samples):
        ax = axes[i // 4, i % 4]
        image = image.permute(1, 2, 0).numpy()
        image = (image - image.min()) / (image.max() - image.min())
        
        ax.imshow(image)
        ax.axis("off")

        true_class = class_names[true_label]
        predicted_class = class_names[pred_label]

        # Add titles with custom positioning and colors
        ax.text(0.5, 1.1, f"True: {true_class}", 
                color='green', fontsize=12, ha='center', va='bottom', 
                transform=ax.transAxes)
        
        ax.text(0.5, 1.02, f"Pred: {predicted_class} ({confidence:.2%})", 
                color='red', fontsize=12, ha='center', va='bottom', 
                transform=ax.transAxes)

    plt.tight_layout()
    plt.show()

# Get class names
class_names = transfer_test_dataset.classes

# Plot misclassified images with correct title formatting
plot_misclassified_images(misclassified_samples, class_names)
[Figure: 2x4 grid of misclassified test images, each with its true label (green) and predicted label with confidence (red)]

Some misclassified images are genuinely confusing, making it difficult even for humans to predict correctly. However, there are also cases where the true class is obvious, yet the model makes a high-confidence incorrect prediction.

Overall, most errors in this sample of eight come from hard-to-classify images, which suggests that the model is strong and struggles mainly with edge cases.
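To look beyond a handful of images, the misclassified samples can be sorted by confidence and grouped by (true, predicted) pair. The sketch below assumes the tuple layout `(image, true_label, pred_label, confidence)` returned by `get_misclassified_images`. The `summarize_errors` helper is a hypothetical name, and the demo data is a toy stand-in rather than real test-set output.

```python
from collections import Counter

def summarize_errors(samples, class_names, top_k=3):
    """Return the top_k most confident errors and counts of (true, predicted) pairs.

    Each sample is a tuple (image, true_idx, pred_idx, confidence).
    """
    ranked = sorted(samples, key=lambda s: s[3], reverse=True)
    most_confident = [
        (class_names[int(t)], class_names[int(p)], conf) for _, t, p, conf in ranked[:top_k]
    ]
    pair_counts = Counter((class_names[int(t)], class_names[int(p)]) for _, t, p, _ in samples)
    return most_confident, pair_counts

# Toy stand-in data (images replaced by None); not taken from the real test set
demo_classes = ["cat", "dog", "sheep"]
demo_samples = [
    (None, 0, 1, 0.55),
    (None, 2, 1, 0.97),
    (None, 0, 1, 0.81),
]
top_errors, pairs = summarize_errors(demo_samples, demo_classes, top_k=2)
print(top_errors)
print(pairs.most_common(1))
```

Calling `get_misclassified_images` with a large `num_images` and passing the result to such a helper would show which class pairs are confused most often.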

For deeper analysis, let's use Grad-CAM to understand what the model focuses on when making predictions.

4.3 Grad-CAM: Model Interpretation

Now, I will use Grad-CAM to understand why EfficientNet-B2 makes mistakes and what features it focuses on during predictions.

What I Do:

  1. Apply Grad-CAM to misclassified images → See what the model focused on.
  2. Create a function for predicting new images → Can take random images from a folder or a user-provided image path.
  3. Display predictions with confidence scores → Helps evaluate model reliability.

Step 1: Implement Grad-CAM on Misclassified Images

This class applies Grad-CAM to any CNN-based model, including EfficientNet-B2.

In [201]:
class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None
        
        # Register hooks (register_backward_hook is deprecated; the full hook has the same signature)
        self.forward_hook = target_layer.register_forward_hook(self.save_activations)
        self.backward_hook = target_layer.register_full_backward_hook(self.save_gradients)

    def save_activations(self, module, input, output):
        self.activations = output.detach()

    def save_gradients(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def generate_heatmap(self, image, class_idx):
        # Forward pass through full model
        output = self.model(image.unsqueeze(0))
        self.model.zero_grad()
        
        # Backward pass for target class
        output[0, class_idx].backward()
        
        # Use registered activations and gradients
        gradients = self.gradients.cpu().numpy()
        activations = self.activations.cpu().numpy()
        
        # Pool gradients and weight activations
        weights = np.mean(gradients, axis=(2, 3))
        heatmap = np.zeros(activations.shape[2:], dtype=np.float32)
        
        for i, w in enumerate(weights[0]):
            heatmap += w * activations[0, i]
            
        heatmap = np.maximum(heatmap, 0)
        max_val = np.max(heatmap)
        if max_val > 0:  # Avoid division by zero when the map is all zeros
            heatmap /= max_val  # Normalize to [0, 1]
        return heatmap

    def __del__(self):
        self.forward_hook.remove()
        self.backward_hook.remove()
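The loop in `generate_heatmap` computes the standard Grad-CAM map: each channel of the target layer's activations is weighted by the spatial mean of its gradient, the weighted channels are summed, and a ReLU keeps only positive evidence. The NumPy sketch below expresses the same computation as a single contraction. `gradcam_from_arrays` is a hypothetical helper operating on toy arrays, not a replacement for the class above.

```python
import numpy as np

def gradcam_from_arrays(activations, gradients):
    """Grad-CAM map from one sample's feature maps.

    activations, gradients: arrays of shape (channels, height, width).
    """
    weights = gradients.mean(axis=(1, 2))                      # one weight per channel
    cam = np.tensordot(weights, activations, axes=([0], [0]))  # weighted sum over channels
    cam = np.maximum(cam, 0)                                   # ReLU
    peak = cam.max()
    return cam / peak if peak > 0 else cam                     # skip normalization for an all-zero map

# Toy feature maps: 2 channels of size 2x2
acts = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[4.0, 3.0], [2.0, 1.0]]])
grads = np.stack([np.full((2, 2), 1.0), np.full((2, 2), -0.5)])

cam = gradcam_from_arrays(acts, grads)
print(cam)
```

Here the channel weights are 1.0 and -0.5, so the second channel suppresses the regions where it is strongly active before the ReLU is applied.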

Step 2: Apply Grad-CAM on Misclassified Images

This function:

  • Applies Grad-CAM to misclassified images
  • Overlays heatmaps on the original image
  • Displays where the model focused during prediction

In [207]:
def apply_heatmap(image, heatmap):
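    heatmap = cv2.resize(heatmap, (image.shape[1], image.shape[0]))  # Resize heatmap to match image size
    heatmap = np.uint8(255 * heatmap)
    heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)  # applyColorMap returns BGR; the image and matplotlib use RGB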

    # Ensure image is uint8 before blending
    image = (image * 255).astype(np.uint8)  

    # Blend original image with heatmap
    superimposed = cv2.addWeighted(image, 0.6, heatmap, 0.4, 0)
    return superimposed

def visualize_gradcam_grid(model, misclassified_samples, target_layer, class_names):
    num_images = len(misclassified_samples[:8])  # We visualize 8 images (each with its Grad-CAM)
    
    fig, axes = plt.subplots(4, 4, figsize=(14, 14))
    plt.subplots_adjust(hspace=0.5)  # Add spacing between rows

    axes = axes.flatten()  # Flatten the 2D array of axes to 1D

    for idx, (image, true_label, pred_label, confidence) in enumerate(misclassified_samples[:8]):  
        # Process image for Grad-CAM
        image_tensor = image.to(device)
        grad_cam = GradCAM(model, target_layer)
        heatmap = grad_cam.generate_heatmap(image_tensor, pred_label)

        # Convert images to numpy
        image_np = image.cpu().permute(1, 2, 0).numpy()
        image_np = (image_np - image_np.min()) / (image_np.max() - image_np.min())  # Normalize
        heatmap_img = apply_heatmap(image_np, heatmap)

        # Flattened indexing for 4x4 grid
        ax_pred = axes[idx * 2]  # First: Prediction Image
        ax_cam = axes[idx * 2 + 1]  # Second: Grad-CAM

        # Plot Prediction Image
        ax_pred.imshow(image_np)
        ax_pred.axis("off")
        
        # Add title with colors for True (Green) and Pred (Red)
        ax_pred.text(0.5, 1.1, f"True: {class_names[true_label]}", 
                     color='green', fontsize=12, ha='center', va='bottom', 
                     transform=ax_pred.transAxes)

        ax_pred.text(0.5, 1.02, f"Pred: {class_names[pred_label]} ({confidence:.2%})", 
                     color='red', fontsize=12, ha='center', va='bottom', 
                     transform=ax_pred.transAxes)

        # Plot Grad-CAM Heatmap
        ax_cam.imshow(heatmap_img)
        ax_cam.axis("off")
        ax_cam.set_title("Grad-CAM", fontsize=12)

    plt.tight_layout()
    plt.show()

Step 3: Run Grad-CAM on Misclassified Images

This step displays Grad-CAM heatmaps for the misclassified images.

In [208]:
# Target layer selection for EfficientNet-B2
target_layer = efficientnet_b2.features[-2]  # Second-to-last feature block, used as the Grad-CAM target

# Run Grad-CAM on misclassified images in grid format
visualize_gradcam_grid(efficientnet_b2, misclassified_samples, target_layer, class_names)
[Figure: misclassified images shown next to their Grad-CAM heatmaps]

Step 4: Function to Predict Images from Folder or User Input

In [215]:
def predict_image_with_gradcam(model, path, class_names, target_layer):
    model.eval()

    # Define transformation (same as training)
    transform = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor()
    ])

    # Select image: from folder or direct file
    if os.path.isdir(path):  
        image_files = []
        for root, _, files in os.walk(path):
            for file in files:
                if file.endswith(('.jpg', '.png', '.jpeg')):  
                    image_files.append(os.path.join(root, file))

        if len(image_files) == 0:
            print("No images found in folder.")
            return
        
        image_path = random.choice(image_files)
    else:  # If path is a direct file
        if not os.path.isfile(path):
            print("File not found.")
            return
        image_path = path

    # Load and preprocess image
    image = Image.open(image_path).convert("RGB")
    image_tensor = transform(image).unsqueeze(0).to(device)

    # Make prediction
    with torch.no_grad():
        outputs = model(image_tensor)
        probs = torch.nn.functional.softmax(outputs, dim=1)
        confidence, pred_idx = torch.max(probs, 1)

    predicted_label = class_names[pred_idx.item()]

    # Convert image for Grad-CAM processing
    image_np = np.array(image)
    image_np = (image_np - image_np.min()) / (image_np.max() - image_np.min())  

    # Apply Grad-CAM
    grad_cam = GradCAM(model, target_layer)
    heatmap = grad_cam.generate_heatmap(image_tensor.squeeze(0), pred_idx.item())
    heatmap_img = apply_heatmap(image_np, heatmap)

    # Plot prediction & Grad-CAM side by side
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))

    # Original Image
    axes[0].imshow(image)
    axes[0].axis("off")
    axes[0].set_title(f"Predicted: {predicted_label} ({confidence.item():.2%})", color="red")

    # Grad-CAM Visualization
    axes[1].imshow(heatmap_img)
    axes[1].axis("off")
    axes[1].set_title("Grad-CAM")

    plt.show()

In [217]:
# Correct Target Layer Selection for EfficientNet-B2
target_layer = efficientnet_b2.features[-2] 

# Predict a random image from a folder and visualize Grad-CAM
predict_image_with_gradcam(efficientnet_b2, "data", class_names, target_layer)
[Figure: predicted image next to its Grad-CAM heatmap]

Grad-CAM works well, highlighting the regions the model relies on during classification. In these examples it picks out relevant areas such as faces or distinctive textures, which suggests that the model bases its decisions on meaningful features.

5. Model Deployment & Inference

5.1 Save and Load the Best Model

Step 1: Save the Best Model

In [ ]:
# Define model save path
model_save_path = "best_efficientnet_b2.pth"

# Save the model
torch.save({
    'model_state_dict': efficientnet_b2.state_dict(),
    'class_names': class_names 
}, model_save_path)

print(f"Model saved successfully at {model_save_path}")
Model saved successfully at best_efficientnet_b2.pth

Step 2: Load the Saved Model

In [219]:
# Reload model
def load_model(model_path, model_architecture, device):
    checkpoint = torch.load(model_path, map_location=device)

    # Load model architecture
    model = model_architecture.to(device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()  # Set to evaluation mode

    # Load class names
    loaded_class_names = checkpoint.get('class_names', None)

    print("Model loaded successfully!")
    return model, loaded_class_names

# Example usage:
loaded_model, loaded_class_names = load_model(model_save_path, efficientnet_b2, device)
Model loaded successfully!

Step 3: Define the Target Layer for Grad-CAM

Grad-CAM needs a late convolutional layer of EfficientNet-B2 as its target layer.

In [ ]:
# Define target layer for Grad-CAM 
target_layer = loaded_model.features[-2] 

5.2 Predicting on New Images

Now I will use the loaded model, class names, and target layer to predict on new images, reusing the predict_image_with_gradcam function from Section 4.3 (Step 4).

In [221]:
# Example usage: Predict a random image from the test folder
predict_image_with_gradcam(loaded_model, "test_images", loaded_class_names, target_layer)
[Figure: prediction and Grad-CAM heatmap for a random image from the test folder]

In [223]:
# Predict a specific image
predict_image_with_gradcam(loaded_model, "test_images/cat.jpg", loaded_class_names, target_layer)
[Figure: prediction and Grad-CAM heatmap for test_images/cat.jpg]

6. Conclusion

In this project, I explored various machine learning and deep learning models for image classification, including EfficientNet-B2, ResNet-50, and MobileNetV3-Small. I implemented data preprocessing, model training with early stopping and checkpointing, and evaluated model performance using accuracy, precision, recall, and F1-score. EfficientNet-B2 achieved the highest accuracy, and Grad-CAM was used for model interpretation. The best model was saved and loaded for future predictions. Overall, EfficientNet-B2 demonstrated strong performance, making it the preferred model for this task.