Neural Networks & Backpropagation
A neural network is a series of layers where each layer performs a linear transformation followed by a non-linear activation. Backpropagation computes gradients to update weights via chain rule.
Key Components
- Layers - Input, Hidden (1+), Output. Each performs a linear transform: z = Wx + b
- Activation Functions:
  - ReLU(x) = max(0, x) - Default choice, fast, avoids vanishing gradients
  - Sigmoid(x) = 1/(1+e^-x) - Output layer for binary classification
  - Softmax - Output layer for multi-class (probability distribution)
  - GELU - Used in transformers, smoother than ReLU
- Loss Functions - Cross-entropy (classification), MSE (regression), Focal Loss (imbalanced)
- Backpropagation - Chain rule applied layer by layer to compute dL/dW for each weight
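Of the activations above, ReLU and sigmoid are one-liners (they appear in the network code below); softmax and GELU can be sketched the same way. The erf-based form here is the exact GELU; the tanh approximation used in some implementations differs slightly:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    # Exact form: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```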
Forward & Backward Pass
```python
import math, random

def relu(x): return max(0, x)
def relu_deriv(x): return 1.0 if x > 0 else 0.0
def sigmoid(x): return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            # He initialization: scale = sqrt(2 / fan_in), suited to ReLU layers
            scale = math.sqrt(2.0 / layer_sizes[i])
            w = [[random.gauss(0, scale) for _ in range(layer_sizes[i+1])]
                 for _ in range(layer_sizes[i])]
            b = [0.0] * layer_sizes[i+1]
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        activations = [x]
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            # z_k = sum_j x_j * W[j][k] + b_k  (one output unit per column k)
            z = [sum(x[j] * w[j][k] for j in range(len(x))) + b[k]
                 for k in range(len(b))]
            if i < len(self.weights) - 1:
                x = [relu(z_k) for z_k in z]     # Hidden: ReLU
            else:
                x = [sigmoid(z_k) for z_k in z]  # Output: Sigmoid
            activations.append(x)
        return activations
```
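The backward pass is not shown above. As a minimal illustration of the chain rule it applies, here is the gradient for a single sigmoid neuron with squared-error loss, checked against a finite-difference estimate (all names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Forward: one neuron, squared error L = (sigmoid(w*x + b) - y)^2
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)  # sigmoid'(z) = a * (1 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw

# Finite-difference check of the analytic gradient
w, b, x, y = 0.5, -0.2, 1.3, 1.0
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
```

Backpropagation applies exactly this per-weight chain rule layer by layer, reusing the intermediate activations stored during the forward pass.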
Training Tips
- Use BatchNorm between layers for faster, more stable training
- Use Dropout (0.1-0.5) for regularization
- Start with Adam optimizer, switch to SGD+momentum for fine-tuning
- Use learning rate warmup + cosine decay schedule
- Monitor train vs val loss for overfitting detection
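The warmup + cosine decay schedule from the tips above can be written as a plain function of the training step (the step counts and peak rate here are illustrative defaults):

```python
import math

def lr_at(step, warmup_steps=1000, total_steps=10000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch the same shape is usually built from `torch.optim.lr_scheduler` classes (e.g. `CosineAnnealingLR` combined with a warmup scheduler) rather than hand-rolled.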
Convolutional Neural Networks (CNN)
CNNs use learnable convolutional filters that slide across spatial dimensions to detect features like edges, textures, and objects. Key innovation: weight sharing and local connectivity.
Core Operations
- Convolution - Learnable filters extract local features. 3x3 or 5x5 kernels typical.
- Pooling - Downsample spatial dimensions. MaxPool (most common) or AvgPool.
- Stride - Skip positions during convolution (reduces spatial size)
- Padding - Add zeros around input to preserve spatial dimensions
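The convolution, stride, and padding operations above can be sketched naively in plain Python. Real frameworks use heavily optimized kernels, and (like them) this computes cross-correlation rather than flipping the kernel:

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution over a 2D list-of-lists image."""
    if padding:
        h, w = len(image), len(image[0])
        padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
        for i in range(h):
            for j in range(w):
                padded[i + padding][j + padding] = image[i][j]
        image = padded
    kh, kw = len(kernel), len(kernel[0])
    # Output size: (input - kernel) // stride + 1 along each dimension
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i * stride + di][j * stride + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out
```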
Key Architectures (Evolution)
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU training | 8 |
| VGG | 2014 | Small 3x3 filters, deeper networks | 16-19 |
| GoogLeNet/Inception | 2014 | Inception modules (parallel filter sizes) | 22 |
| ResNet | 2015 | Skip connections enable 100+ layers | 50-152 |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | varies |
| Vision Transformer (ViT) | 2020 | Transformer applied to image patches | 12-24 |
| ConvNeXt | 2022 | Modernized CNN matching ViT performance | varies |
Transfer Learning Pattern
```python
# PyTorch transfer learning
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
model = models.resnet50(weights='IMAGENET1K_V2')

# Freeze backbone
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head (new layers have requires_grad=True by default)
num_classes = 10  # adjust to your task
model.fc = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes)
)
# Only train the new head (then optionally unfreeze and fine-tune all)
```
Applications
- Image Classification - What is in this image?
- Object Detection - Where are objects? (YOLO, Faster R-CNN)
- Semantic Segmentation - Pixel-level classification (U-Net)
- Image Generation - Style transfer, super-resolution
Vision Architecture Selection (2025)
| Scenario | Recommended |
|---|---|
| Large-scale + abundant data + compute | Vision Transformers (ViT) |
| Small datasets / limited labels | CNN (ResNet, EfficientNet) + transfer learning |
| Mobile / edge deployment | EfficientNet, MobileViT |
| Medical imaging / specialized domains | CNNs or hybrid architectures |
| General purpose with pre-training | ViT with fine-tuning |
Emerging trend: Hybrid architectures (CvT, CoAtNet, MobileViT) blend CNN efficiency with Transformer contextual strength — the field is converging toward these combined approaches.
RNN / LSTM / GRU
Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. LSTMs and GRUs use gating to mitigate the vanishing gradient problem.
Architecture Comparison
| Type | Gates | Key Feature | Parameters |
|---|---|---|---|
| Vanilla RNN | None | Simple but vanishing gradients | Fewest |
| LSTM | Input, Forget, Output | Cell state preserves long-term memory | 4x RNN |
| GRU | Reset, Update | Simplified LSTM, similar performance | 3x RNN |
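The 4x/3x parameter ratios in the table come directly from the number of gates, since each gate owns a weight matrix over the concatenated [x_t; h_prev] plus a bias. A quick count for one layer:

```python
def rnn_params(input_size, hidden_size, num_gates):
    """Weights + biases for one recurrent layer.

    Each gate: hidden x (input + hidden) weight matrix plus a bias vector.
    Vanilla RNN has 1 such block, LSTM has 4 gates, GRU has 3.
    """
    return num_gates * (hidden_size * (input_size + hidden_size) + hidden_size)

vanilla = rnn_params(128, 256, 1)
lstm = rnn_params(128, 256, 4)
gru = rnn_params(128, 256, 3)
```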
LSTM Cell Equations
```python
# LSTM forward pass (single time step), using numpy; here W and b are dicts
# holding one weight matrix / bias vector per gate ('f', 'i', 'c', 'o')
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Concatenate input and previous hidden state
    combined = np.concatenate([x_t, h_prev])
    # Gate computations
    f_t = sigmoid(W['f'] @ combined + b['f'])    # Forget gate
    i_t = sigmoid(W['i'] @ combined + b['i'])    # Input gate
    c_hat = np.tanh(W['c'] @ combined + b['c'])  # Candidate cell state
    o_t = sigmoid(W['o'] @ combined + b['o'])    # Output gate
    # Update cell and hidden state
    c_t = f_t * c_prev + i_t * c_hat             # New cell state
    h_t = o_t * np.tanh(c_t)                     # New hidden state
    return h_t, c_t
```
RNN vs Transformer
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range deps | Struggles beyond ~100 tokens | Handles thousands of tokens |
| Memory | O(1) per step | O(n^2) attention matrix |
| Best for | Streaming, low-memory, time series | NLP, vision, most tasks |
Transformers
The Transformer architecture, introduced in "Attention Is All You Need" (2017), replaced recurrence with self-attention, enabling parallel processing of sequences and better modeling of long-range dependencies.
Self-Attention Mechanism
```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: matrices (lists of row vectors) of queries, keys, values
    Returns: attention-weighted values and the attention weights
    """
    d_k = len(K[0])  # Key dimension
    # Compute attention scores: Q @ K^T, scaled by sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Optional: apply causal mask for decoder
    # scores = apply_mask(scores, mask)
    # Softmax over each row to get attention weights
    weights = [softmax(row) for row in scores]
    # Weighted sum of values
    output = matmul(weights, V)
    return output, weights
```
Transformer Components
- Multi-Head Attention - Run attention in parallel with different learned projections, then concatenate
- Positional Encoding - Inject position information (sinusoidal or learned embeddings)
- Layer Normalization - Pre-norm (modern) or post-norm (original) placement
- Feed-Forward Network - Two linear layers with an activation between: FFN(x) = W2 * GELU(W1 * x + b1) + b2
- Residual Connections - output = LayerNorm(x + Sublayer(x))
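The sinusoidal positional encoding mentioned above follows its standard definition, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos of the same argument:

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as a max_len x d_model matrix."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe
```

Because each dimension oscillates at a different frequency, every position gets a unique pattern, and relative offsets correspond to fixed linear transformations of the encoding.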
Transformer Variants
| Model | Type | Architecture | Best For |
|---|---|---|---|
| BERT | Encoder-only | Bidirectional attention | Classification, NER, Q&A |
| GPT | Decoder-only | Causal (left-to-right) attention | Text generation, reasoning |
| T5/BART | Encoder-Decoder | Full sequence-to-sequence | Translation, summarization |
| ViT | Encoder-only | Image patches as tokens | Image classification |
| Whisper | Encoder-Decoder | Audio spectrogram input | Speech recognition |
Autoencoders
Encoder-decoder architecture that learns compressed representations by training to reconstruct its input. The bottleneck layer forces the network to learn meaningful features.
Variants
- Vanilla Autoencoder - Dimensionality reduction, feature learning
- Variational Autoencoder (VAE) - Learns a continuous latent space for generation; adds KL divergence to loss
- Denoising Autoencoder - Trained to reconstruct clean inputs from corrupted versions
- Sparse Autoencoder - Adds sparsity constraint to learn more meaningful features
VAE: Variational Autoencoder
Unlike vanilla autoencoders, VAEs learn a probabilistic latent space, enabling generation of new data by sampling from the learned distribution.
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.log_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std  # Differentiable sampling

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

# Loss = Reconstruction + KL Divergence
# KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
```
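The KL term in the comment above is the closed-form divergence between N(mu, sigma^2) and the standard normal prior N(0, 1), summed over latent dimensions; a plain-Python version for checking values:

```python
import math

def kl_divergence(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

A latent code that matches the prior exactly (mu = 0, log_var = 0) contributes zero KL; any deviation is penalized, which is what keeps the latent space continuous and sampleable.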
Applications
- Anomaly detection (high reconstruction error = anomaly)
- Image denoising and super-resolution
- Feature extraction for downstream tasks
- Data generation (VAE) — drug discovery, data augmentation
- Latent space interpolation and exploration
Generative Adversarial Networks (GANs)
Two networks trained adversarially: a Generator creates fake data, a Discriminator distinguishes real from fake. They compete until the generator produces realistic data.
Training Dynamics
```python
# GAN training loop (conceptual)
for epoch in range(epochs):
    # 1. Train Discriminator
    d_optimizer.zero_grad()
    real_data = sample_real_batch()
    fake_data = generator(random_noise())
    d_loss_real = loss(discriminator(real_data), label=1)
    d_loss_fake = loss(discriminator(fake_data.detach()), label=0)  # detach: don't update G here
    d_loss = d_loss_real + d_loss_fake
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator
    g_optimizer.zero_grad()
    fake_data = generator(random_noise())
    g_loss = loss(discriminator(fake_data), label=1)  # Fool the discriminator
    g_loss.backward()
    g_optimizer.step()
```
GAN Variants
- DCGAN - Convolutional architecture, stable training guidelines
- WGAN - Wasserstein distance for more stable training
- StyleGAN - High-quality face generation with style control
- CycleGAN - Unpaired image-to-image translation
- Pix2Pix - Paired image-to-image translation
Training Challenges
- Mode collapse - Generator produces limited variety
- Training instability - Requires careful hyperparameter tuning
- Evaluation difficulty - No single metric for generation quality (use FID, IS)
Diffusion Models (DDPM)
Diffusion models learn to generate data by gradually denoising a sample from pure noise. The forward process adds Gaussian noise over T steps; the reverse process learns to remove it.
How It Works
- Forward Process (Diffusion) - Gradually add noise; in closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
- Reverse Process (Denoising) - Neural network predicts and removes noise at each step
- Training - Sample random timestep t, add noise, train network to predict the added noise
- Sampling - Start from pure noise, iteratively denoise through T steps
Simplified DDPM Training
```python
# DDPM training step (simplified)
import random
import torch
import torch.nn.functional as F

def train_step(model, x_0, noise_schedule, T):
    # 1. Sample random timestep
    t = random.randint(0, T - 1)
    # 2. Sample noise
    epsilon = torch.randn_like(x_0)
    # 3. Create noisy image (closed form of the forward process)
    alpha_bar_t = noise_schedule.alpha_bar[t]
    x_t = (alpha_bar_t ** 0.5) * x_0 + ((1 - alpha_bar_t) ** 0.5) * epsilon
    # 4. Predict the noise from the noisy image and timestep
    epsilon_pred = model(x_t, t)
    # 5. Loss = MSE between actual and predicted noise
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```
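The alpha_bar values used above come from a variance schedule. A minimal linear-beta schedule can be computed as follows (the endpoints 1e-4 and 0.02 follow the original DDPM paper; other schedules such as cosine are common too):

```python
def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bar = []
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar
```

As t grows, alpha_bar shrinks toward zero, so x_t approaches pure Gaussian noise, which is exactly why sampling can start from noise.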
Generative Model Comparison
| Aspect | Diffusion | GANs | VAEs |
|---|---|---|---|
| Output quality | Very high | High (sharp) | Moderate (blurry) |
| Training stability | Very stable | Unstable | Very stable |
| Mode coverage | Full distribution | Mode collapse risk | Full distribution |
| Sampling speed | Slow (many steps) | Fast (single pass) | Fast (single pass) |
| Controllability | Excellent (text-guided) | Moderate | Good (structured latent) |
| Status (2025) | Dominant for image gen | Niche (upscaling, style) | Anomaly detection, science |
Graph Neural Networks (GNN)
Process graph-structured data (social networks, molecules, knowledge graphs) using message passing: each node aggregates information from its neighbors to update its representation.
Key Architectures
- GCN (Graph Convolutional Network) - Spectral-based, simple neighborhood aggregation
- GAT (Graph Attention Network) - Attention-weighted neighbor aggregation
- GraphSAGE - Samples and aggregates from neighbors, scales to large graphs
Message Passing Framework
```python
# GNN message passing (conceptual): graph provides .nodes and .neighbors();
# mean, relu, and concatenate stand in for the usual tensor ops
def gnn_layer(graph, node_features, W):
    new_features = {}
    for node in graph.nodes:
        # Aggregate neighbor features
        neighbor_feats = [node_features[n] for n in graph.neighbors(node)]
        aggregated = mean(neighbor_feats)  # or sum, max, attention
        # Update node representation from its own features plus the aggregate
        new_features[node] = relu(W @ concatenate(node_features[node], aggregated))
    return new_features
```
Architecture Comparison
| Feature | GCN | GraphSAGE | GAT |
|---|---|---|---|
| Learning type | Transductive | Inductive | Both |
| Neighbor handling | Full neighborhood | Sampled subset | Attention-weighted |
| Scalability | Limited (full graph) | High (sampling) | Moderate |
| Best for | Static graphs | Large/dynamic graphs | Heterogeneous graphs |
Applications
- Social network analysis, recommendation systems (PinSAGE: 3B+ nodes)
- Drug discovery and molecular property prediction
- Traffic prediction, knowledge graph completion
- Cybersecurity: anomaly detection from network logs
- Computer vision: object detection, video action recognition
Reinforcement Learning
RL trains agents to make sequential decisions by maximizing cumulative reward through trial and error. The agent learns a policy mapping states to actions.
Key Algorithms
| Algorithm | Type | Key Idea |
|---|---|---|
| Q-Learning | Value-based, Off-policy | Learn Q(s,a) table via Bellman equation |
| SARSA | Value-based, On-policy | Like Q-learning but follows current policy |
| DQN | Value-based + NN | Q-learning with neural network function approximation |
| Policy Gradient | Policy-based | Directly optimize policy with gradient ascent |
| PPO | Policy-based | Clipped surrogate objective for stable updates |
| A3C/A2C | Actor-Critic | Combine value and policy learning |
Q-Learning Implementation
```python
import random

def q_learning(env, episodes=1000, lr=0.1, gamma=0.99, epsilon=0.1):
    Q = {}  # Q-table: state -> {action: value}
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (explore unseen states too)
            actions = Q.get(state, {})
            if random.random() < epsilon or not actions:
                action = env.random_action()
            else:
                action = max(actions, key=actions.get)
            next_state, reward, done = env.step(action)
            # Bellman update
            old_q = Q.get(state, {}).get(action, 0)
            max_next = max(Q.get(next_state, {}).values(), default=0)
            new_q = old_q + lr * (reward + gamma * max_next - old_q)
            Q.setdefault(state, {})[action] = new_q
            state = next_state
    return Q
```
RLHF for LLMs
Reinforcement Learning from Human Feedback aligns language models with human preferences:
- Collect human preference data (which response is better)
- Train a reward model from preferences
- Fine-tune LLM with PPO to maximize reward model score
- Alternative: DPO (Direct Preference Optimization) skips the reward model
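The DPO objective mentioned in the last step has a simple closed form: it pushes up the policy's log-probability margin for the chosen response over the rejected one, measured relative to a frozen reference model. A scalar sketch (the log-probability inputs here are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probabilities the loss sits at log 2; it falls as the policy favors the chosen response more than the reference does, so no separate reward model or RL loop is needed.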
Deep Learning Operations
Practical techniques for training deep learning models at scale.
Essential Techniques
- AdamW Optimizer - Adam with decoupled weight decay. Default for transformers.
- Learning Rate Schedules - Warmup + cosine decay, or OneCycleLR
- Gradient Clipping - Prevent exploding gradients (e.g. clip to max_norm=1.0)
- Mixed Precision (FP16/BF16) - 2x speedup with minimal accuracy loss
- Gradient Accumulation - Simulate larger batch sizes on limited GPU memory
- Checkpointing - Save model state periodically for recovery
- Early Stopping - Stop when validation loss stops improving
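Gradient clipping by global norm (the behavior of PyTorch's `torch.nn.utils.clip_grad_norm_`) is small enough to sketch in plain Python; gradients here are a flat list of floats for illustration:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their global L2 norm <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Scaling every gradient by the same factor preserves the update direction, unlike clipping each component independently.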
Regularization Techniques
- Dropout - Randomly zero out neurons during training (0.1-0.5)
- Weight Decay - L2 regularization on weights
- Data Augmentation - Random crops, flips, color jitter for images
- Label Smoothing - Replace hard labels with soft targets
- Stochastic Depth - Randomly drop entire layers during training
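Label smoothing from the list above is a one-line transform of a one-hot target: the correct class gets 1 - epsilon and the remaining mass is spread over the other classes (epsilon = 0.1 is a common default; some implementations instead mix with a uniform distribution over all K classes):

```python
def smooth_labels(num_classes, true_class, epsilon=0.1):
    """Soft target: 1 - epsilon on the true class, epsilon / (K - 1) elsewhere."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value
            for k in range(num_classes)]
```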