Neural Networks & Backpropagation

When to Use
When you have sufficient data (>10K samples) and non-linear relationships. The building block for all deep learning architectures.

A neural network is a series of layers where each layer performs a linear transformation followed by a non-linear activation. Backpropagation computes gradients to update weights via chain rule.

Key Components

  • Layers - Input, Hidden (1+), Output. Each is a linear transform: z = Wx + b
  • Activation Functions:
    • ReLU(x) = max(0, x) - Default choice, fast, avoids vanishing gradient
    • Sigmoid(x) = 1/(1+e^-x) - Output layer for binary classification
    • Softmax - Output layer for multi-class (probability distribution)
    • GELU - Used in transformers, smoother than ReLU
  • Loss Functions - Cross-entropy (classification), MSE (regression), Focal Loss (imbalanced)
  • Backpropagation - Chain rule applied layer by layer to compute dL/dW for each weight

Forward & Backward Pass

import math, random

def relu(x): return max(0, x)
def relu_deriv(x): return 1.0 if x > 0 else 0.0
def sigmoid(x): return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            # He initialization (std = sqrt(2 / fan_in), suited to ReLU)
            scale = math.sqrt(2.0 / layer_sizes[i])
            w = [[random.gauss(0, scale) for _ in range(layer_sizes[i+1])]
                 for _ in range(layer_sizes[i])]
            b = [0.0] * layer_sizes[i+1]
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        activations = [x]
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            z = [sum(x_j * w_row[k] for x_j, w_row in zip(x, w)) + b[k]
                 for k in range(len(b))]
            if i < len(self.weights) - 1:
                x = [relu(z_k) for z_k in z]  # Hidden: ReLU
            else:
                x = [sigmoid(z_k) for z_k in z]  # Output: Sigmoid
            activations.append(x)
        return activations
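
The heading promises a backward pass as well; as a minimal illustration of the chain rule at work, here is a hedged single-path sketch (one hidden unit, scalar weights, helper names are hypothetical) that computes dL/dW for both weights:

```python
import math

def bce(y, t):
    # Binary cross-entropy loss for a sigmoid output
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def forward_scalar(w1, w2, x):
    z1 = w1 * x                       # hidden pre-activation
    a1 = max(0.0, z1)                 # ReLU
    z2 = w2 * a1                      # output pre-activation
    y = 1.0 / (1.0 + math.exp(-z2))   # sigmoid
    return z1, a1, z2, y

def backward_scalar(w1, w2, x, t):
    z1, a1, z2, y = forward_scalar(w1, w2, x)
    # Chain rule, applied from the loss backwards:
    dz2 = y - t                           # dL/dz2 for sigmoid + BCE
    dw2 = dz2 * a1                        # dL/dw2
    da1 = dz2 * w2                        # dL/da1
    dz1 = da1 * (1.0 if z1 > 0 else 0.0)  # through the ReLU
    dw1 = dz1 * x                         # dL/dw1
    return dw1, dw2
```

An SGD update is then simply `w1 -= lr * dw1; w2 -= lr * dw2`, repeated over batches.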

Training Tips

  • Use BatchNorm between layers for faster, more stable training
  • Use Dropout (0.1-0.5) for regularization
  • Start with Adam optimizer, switch to SGD+momentum for fine-tuning
  • Use learning rate warmup + cosine decay schedule
  • Monitor train vs val loss for overfitting detection
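
The last tip (monitoring for overfitting) is often automated as early stopping; a minimal sketch, assuming a per-epoch list of validation losses:

```python
def should_stop(val_losses, patience=3, min_delta=1e-4):
    # Stop when the best validation loss has not improved by at least
    # min_delta within the last `patience` epochs.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta
```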

Convolutional Neural Networks (CNN)

Agent Instruction
For image tasks: use pretrained CNN (ResNet, EfficientNet) with transfer learning. Only train from scratch with >100K labeled images. For modern tasks, consider Vision Transformer (ViT).

CNNs use learnable convolutional filters that slide across spatial dimensions to detect features like edges, textures, and objects. Key innovation: weight sharing and local connectivity.

Core Operations

  • Convolution - Learnable filters extract local features. 3x3 or 5x5 kernels typical.
  • Pooling - Downsample spatial dimensions. MaxPool (most common) or AvgPool.
  • Stride - Skip positions during convolution (reduces spatial size)
  • Padding - Add zeros around input to preserve spatial dimensions
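
The four operations above combine into a single formula for output size: out = (in + 2*padding - kernel) // stride + 1. A minimal pure-Python 2D convolution sketch (single channel, lists of lists):

```python
def conv2d(image, kernel, stride=1, padding=0):
    # Zero-pad the input to control output spatial size
    if padding:
        width = len(image[0])
        pad_row = [0] * (width + 2 * padding)
        image = ([list(pad_row) for _ in range(padding)]
                 + [[0] * padding + list(row) + [0] * padding for row in image]
                 + [list(pad_row) for _ in range(padding)])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Slide the kernel: elementwise multiply the window and sum
            row.append(sum(image[i * stride + di][j * stride + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        output.append(row)
    return output
```

For example, a 2x2 all-ones kernel over a 3x3 all-ones image yields a 2x2 output of 4s.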

Key Architectures (Evolution)

Architecture              Year  Key Innovation                             Depth
LeNet-5                   1998  First practical CNN (digit recognition)    5
AlexNet                   2012  ReLU, Dropout, GPU training                8
VGG                       2014  Small 3x3 filters, deeper networks         16-19
GoogLeNet/Inception       2014  Inception modules (parallel filter sizes)  22
ResNet                    2015  Skip connections enable 100+ layers        50-152
EfficientNet              2019  Compound scaling (depth/width/resolution)  varies
Vision Transformer (ViT)  2020  Transformer applied to image patches       12-24
ConvNeXt                  2022  Modernized CNN matching ViT performance    varies

Transfer Learning Pattern

# PyTorch transfer learning
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
model = models.resnet50(weights='IMAGENET1K_V2')

# Freeze backbone
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head
model.fc = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes)
)

# Only train the new head (then optionally unfreeze and fine-tune all)
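
With the backbone frozen, only parameters with requires_grad=True should be handed to the optimizer. A hedged sketch of that pattern with a small stand-in model (names hypothetical, not the ResNet above):

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for a pretrained backbone plus a new classifier head
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 3)
model = nn.Sequential(backbone, head)

# Freeze the backbone
for param in backbone.parameters():
    param.requires_grad = False

# Optimize only the trainable (head) parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-3)
```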

Applications

  • Image Classification - What is in this image?
  • Object Detection - Where are objects? (YOLO, Faster R-CNN)
  • Semantic Segmentation - Pixel-level classification (U-Net)
  • Image Generation - Style transfer, super-resolution

Vision Architecture Selection (2025)

Scenario                               Recommended
Large-scale + abundant data + compute  Vision Transformers (ViT)
Small datasets / limited labels        CNN (ResNet, EfficientNet) + transfer learning
Mobile / edge deployment               EfficientNet, MobileViT
Medical imaging / specialized domains  CNNs or hybrid architectures
General purpose with pre-training      ViT with fine-tuning

Emerging trend: Hybrid architectures (CvT, CoAtNet, MobileViT) blend CNN efficiency with Transformer contextual strength — the field is converging toward these combined approaches.

RNN / LSTM / GRU

When to Use
Sequential data where order matters: time series, speech. However, Transformers have largely replaced RNNs for most NLP tasks due to better parallelization and long-range dependencies.

Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. LSTMs and GRUs solve the vanishing gradient problem.

Architecture Comparison

Type         Gates                  Key Feature                            Parameters
Vanilla RNN  None                   Simple but vanishing gradients         Fewest
LSTM         Input, Forget, Output  Cell state preserves long-term memory  4x RNN
GRU          Reset, Update          Simplified LSTM, similar performance   3x RNN

LSTM Cell Equations

# LSTM forward pass (single time step), runnable with NumPy
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold one weight matrix / bias vector per gate: 'f', 'i', 'c', 'o'
    # Concatenate input and previous hidden state
    combined = np.concatenate([x_t, h_prev])

    # Gate computations
    f_t = sigmoid(W['f'] @ combined + b['f'])    # Forget gate
    i_t = sigmoid(W['i'] @ combined + b['i'])    # Input gate
    c_hat = np.tanh(W['c'] @ combined + b['c'])  # Candidate cell state
    o_t = sigmoid(W['o'] @ combined + b['o'])    # Output gate

    # Update cell and hidden state
    c_t = f_t * c_prev + i_t * c_hat             # New cell state
    h_t = o_t * np.tanh(c_t)                     # New hidden state

    return h_t, c_t

RNN vs Transformer

Aspect           RNN/LSTM                            Transformer
Parallelization  Sequential (slow)                   Fully parallel (fast)
Long-range deps  Struggles beyond ~100 tokens        Handles thousands of tokens
Memory           O(1) per step                       O(n^2) attention matrix
Best for         Streaming, low-memory, time series  NLP, vision, most tasks

Transformers

Agent Instruction
Transformers are the dominant architecture for NLP, and increasingly for vision and multimodal tasks. Understand self-attention, positional encoding, and the encoder-decoder structure.

The Transformer architecture, introduced in "Attention Is All You Need" (2017), replaced recurrence with self-attention, enabling parallel processing of sequences and better modeling of long-range dependencies.

Self-Attention Mechanism

import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: matrices (lists of row vectors) of queries, keys, values
    Returns: attention-weighted values and the attention weights
    """
    d_k = len(K[0])  # Key dimension
    # Compute attention scores: Q @ K^T, scaled by sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]

    # Optional: apply causal mask for decoder
    # scores = apply_mask(scores, mask)

    # Softmax to get attention weights
    weights = [softmax(row) for row in scores]

    # Weighted sum of values
    output = matmul(weights, V)
    return output, weights

Transformer Components

  • Multi-Head Attention - Run attention in parallel with different learned projections, then concatenate
  • Positional Encoding - Inject position information (sinusoidal or learned embeddings)
  • Layer Normalization - Pre-norm (modern) or post-norm (original) placement
  • Feed-Forward Network - Two linear layers with activation between: FFN(x) = W2 * GELU(W1 * x + b1) + b2
  • Residual Connections - Post-norm (original): output = LayerNorm(x + Sublayer(x)); pre-norm (modern): output = x + Sublayer(LayerNorm(x))
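
The sinusoidal positional encoding mentioned above can be computed directly. A minimal sketch (assumes d_model is even):

```python
import math

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        pe.append(row)
    return pe
```

Nearby positions get similar encodings, and each dimension oscillates at a different frequency, letting attention recover relative offsets.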

Transformer Variants

Model    Type             Architecture                      Best For
BERT     Encoder-only     Bidirectional attention           Classification, NER, Q&A
GPT      Decoder-only     Causal (left-to-right) attention  Text generation, reasoning
T5/BART  Encoder-Decoder  Full sequence-to-sequence         Translation, summarization
ViT      Encoder-only     Image patches as tokens           Image classification
Whisper  Encoder-Decoder  Audio spectrogram input           Speech recognition

Autoencoders

Encoder-decoder architecture that learns compressed representations by training to reconstruct its input. The bottleneck layer forces the network to learn meaningful features.

Variants

  • Vanilla Autoencoder - Dimensionality reduction, feature learning
  • Variational Autoencoder (VAE) - Learns a continuous latent space for generation; adds KL divergence to loss
  • Denoising Autoencoder - Trained to reconstruct clean inputs from corrupted versions
  • Sparse Autoencoder - Adds sparsity constraint to learn more meaningful features

VAE: Variational Autoencoder

Unlike vanilla autoencoders, VAEs learn a probabilistic latent space, enabling generation of new data by sampling from the learned distribution.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.log_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std  # Differentiable sampling

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

# Loss = Reconstruction + KL Divergence
# KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
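
The loss comments above can be made concrete. A pure-Python sketch for a single sample, assuming a binary cross-entropy reconstruction term (inputs in [0, 1], matching the Sigmoid decoder):

```python
import math

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: binary cross-entropy, summed over dimensions
    recon = -sum(xi * math.log(ri) + (1 - xi) * math.log(1 - ri)
                 for xi, ri in zip(x, x_recon))
    # KL divergence between N(mu, sigma^2) and the prior N(0, 1)
    kl = -0.5 * sum(1 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + kl
```

Note the KL term vanishes exactly when mu = 0 and log_var = 0, i.e. when the encoder matches the prior.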

Applications

  • Anomaly detection (high reconstruction error = anomaly)
  • Image denoising and super-resolution
  • Feature extraction for downstream tasks
  • Data generation (VAE) — drug discovery, data augmentation
  • Latent space interpolation and exploration

Generative Adversarial Networks (GANs)

Two networks trained adversarially: a Generator creates fake data, a Discriminator distinguishes real from fake. They compete until the generator produces realistic data.

Training Dynamics

# GAN training loop (conceptual)
for epoch in range(epochs):
    # 1. Train Discriminator
    real_data = sample_real_batch()
    fake_data = generator(random_noise())

    d_optimizer.zero_grad()
    d_loss_real = loss(discriminator(real_data), label=1)
    d_loss_fake = loss(discriminator(fake_data.detach()), label=0)  # detach: no generator grads
    d_loss = d_loss_real + d_loss_fake
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator
    g_optimizer.zero_grad()
    fake_data = generator(random_noise())
    g_loss = loss(discriminator(fake_data), label=1)  # Fool discriminator
    g_loss.backward()
    g_optimizer.step()

GAN Variants

  • DCGAN - Convolutional architecture, stable training guidelines
  • WGAN - Wasserstein distance for more stable training
  • StyleGAN - High-quality face generation with style control
  • CycleGAN - Unpaired image-to-image translation
  • Pix2Pix - Paired image-to-image translation

Training Challenges

  • Mode collapse - Generator produces limited variety
  • Training instability - Requires careful hyperparameter tuning
  • Evaluation difficulty - No single metric for generation quality (use FID, IS)

Diffusion Models (DDPM)

Agent Instruction
Diffusion models have surpassed GANs for image generation quality. Key concepts: forward noising process, reverse denoising, noise schedule, U-Net backbone. Powers Stable Diffusion, DALL-E 3, Midjourney.

Diffusion models learn to generate data by gradually denoising a sample from pure noise. The forward process adds Gaussian noise over T steps; the reverse process learns to remove it.

How It Works

  • Forward Process (Diffusion) - Gradually add noise: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where alpha_bar_t is the cumulative product of the noise schedule
  • Reverse Process (Denoising) - Neural network predicts and removes noise at each step
  • Training - Sample random timestep t, add noise, train network to predict the added noise
  • Sampling - Start from pure noise, iteratively denoise through T steps

Simplified DDPM Training

# DDPM training step (simplified)
import math, random
import torch
import torch.nn.functional as F

def train_step(model, x_0, noise_schedule, T):
    # 1. Sample random timestep
    t = random.randint(0, T - 1)

    # 2. Sample noise
    epsilon = torch.randn_like(x_0)

    # 3. Create noisy image (closed-form forward process; alpha_bar is a float)
    alpha_bar_t = noise_schedule.alpha_bar[t]
    x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1 - alpha_bar_t) * epsilon

    # 4. Predict noise
    epsilon_pred = model(x_t, t)

    # 5. Loss = MSE between actual and predicted noise
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
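
For completeness, the reverse (sampling) loop can be sketched too. This is a toy scalar version under assumed names: `alphas` and `alpha_bars` are a hypothetical precomputed schedule, `model` a stand-in noise predictor, and sigma_t^2 = beta_t is one standard variance choice from the DDPM paper:

```python
import math, random

def ddpm_sample(model, alphas, alpha_bars, T):
    # Start from pure noise and iteratively denoise through T steps
    x = random.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps_pred = model(x, t)
        # Posterior mean: remove the predicted noise component
        mean = (x - (1 - alphas[t]) / math.sqrt(1 - alpha_bars[t]) * eps_pred) \
               / math.sqrt(alphas[t])
        if t > 0:
            sigma = math.sqrt(1 - alphas[t])  # sigma_t^2 = beta_t
            x = mean + sigma * random.gauss(0.0, 1.0)
        else:
            x = mean  # final step is deterministic
    return x
```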

Generative Model Comparison

Aspect              Diffusion                GANs                      VAEs
Output quality      Very high                High (sharp)              Moderate (blurry)
Training stability  Very stable              Unstable                  Very stable
Mode coverage       Full distribution        Mode collapse risk        Full distribution
Sampling speed      Slow (many steps)        Fast (single pass)        Fast (single pass)
Controllability     Excellent (text-guided)  Moderate                  Good (structured latent)
Status (2025)       Dominant for image gen   Niche (upscaling, style)  Anomaly detection, science

Graph Neural Networks (GNN)

Process graph-structured data (social networks, molecules, knowledge graphs) using message passing: each node aggregates information from its neighbors to update its representation.

Key Architectures

  • GCN (Graph Convolutional Network) - Spectral-based, simple neighborhood aggregation
  • GAT (Graph Attention Network) - Attention-weighted neighbor aggregation
  • GraphSAGE - Samples and aggregates from neighbors, scales to large graphs

Message Passing Framework

# GNN message passing (one layer, runnable sketch)
def gnn_layer(node_features, adjacency, W):
    """
    node_features: {node: feature vector}
    adjacency: {node: list of neighbor nodes} (each node has >= 1 neighbor)
    W: weight matrix applied to [own features | aggregated neighbor features]
    """
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    updated = {}
    for node, neighbors in adjacency.items():
        # Aggregate neighbor features (mean; sum, max, or attention also work)
        feats = [node_features[n] for n in neighbors]
        aggregated = [sum(col) / len(feats) for col in zip(*feats)]
        # Update node representation: ReLU(W @ [h_v ; aggregated])
        combined = list(node_features[node]) + aggregated
        updated[node] = [max(0.0, z) for z in matvec(W, combined)]
    return updated

Architecture Comparison

Feature            GCN                   GraphSAGE             GAT
Learning type      Transductive          Inductive             Both
Neighbor handling  Full neighborhood     Sampled subset        Attention-weighted
Scalability        Limited (full graph)  High (sampling)       Moderate
Best for           Static graphs         Large/dynamic graphs  Heterogeneous graphs

Applications

  • Social network analysis, recommendation systems (PinSAGE: 3B+ nodes)
  • Drug discovery and molecular property prediction
  • Traffic prediction, knowledge graph completion
  • Cybersecurity: anomaly detection from network logs
  • Computer vision: object detection, video action recognition

Reinforcement Learning

When to Use
Decision-making problems where an agent interacts with an environment: game AI, robotics, recommendation systems, RLHF for LLM alignment.

RL trains agents to make sequential decisions by maximizing cumulative reward through trial and error. The agent learns a policy mapping states to actions.

Key Algorithms

Algorithm        Type                     Key Idea
Q-Learning       Value-based, Off-policy  Learn Q(s,a) table via Bellman equation
SARSA            Value-based, On-policy   Like Q-learning but follows current policy
DQN              Value-based + NN         Q-learning with neural network function approximation
Policy Gradient  Policy-based             Directly optimize policy with gradient ascent
PPO              Policy-based             Clipped surrogate objective for stable updates
A3C/A2C          Actor-Critic             Combine value and policy learning

Q-Learning Implementation

import random

def q_learning(env, episodes=1000, lr=0.1, gamma=0.99, epsilon=0.1):
    Q = {}  # Q-table: state -> {action: value}

    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            # Epsilon-greedy action selection
            q_vals = Q.get(state, {})
            if random.random() < epsilon or not q_vals:
                action = env.random_action()
            else:
                action = max(q_vals, key=q_vals.get)

            next_state, reward, done = env.step(action)

            # Bellman update
            old_q = Q.get(state, {}).get(action, 0)
            max_next = max(Q.get(next_state, {}).values(), default=0)
            new_q = old_q + lr * (reward + gamma * max_next - old_q)

            Q.setdefault(state, {})[action] = new_q
            state = next_state

    return Q

RLHF for LLMs

Reinforcement Learning from Human Feedback aligns language models with human preferences:

  • Collect human preference data (which response is better)
  • Train a reward model from preferences
  • Fine-tune LLM with PPO to maximize reward model score
  • Alternative: DPO (Direct Preference Optimization) skips the reward model
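
The reward-model step above is typically trained with a pairwise (Bradley-Terry) objective. A minimal pure-Python sketch, where r_chosen and r_rejected stand for the reward model's scalar scores on the preferred and rejected responses:

```python
import math

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: maximize the probability that the
    # chosen response outscores the rejected one.
    # L = -log(sigmoid(r_chosen - r_rejected))
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log(2) when the scores tie and shrinks as the chosen response is scored higher, which is the gradient signal that shapes the reward model.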

Deep Learning Operations

Practical techniques for training deep learning models at scale.

Essential Techniques

  • AdamW Optimizer - Adam with decoupled weight decay. Default for transformers.
  • Learning Rate Schedules - Warmup + cosine decay, or OneCycleLR
  • Gradient Clipping - Prevent exploding gradients: max_norm=1.0
  • Mixed Precision (FP16/BF16) - 2x speedup with minimal accuracy loss
  • Gradient Accumulation - Simulate larger batch sizes on limited GPU memory
  • Checkpointing - Save model state periodically for recovery
  • Early Stopping - Stop when validation loss stops improving
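
Gradient accumulation and clipping from the list above can be sketched without any framework. A toy scalar example (the micro-batch gradients are hypothetical inputs; real frameworks accumulate tensors in place):

```python
def accumulate_and_step(w, micro_grads, lr=0.1, accum_steps=4, max_norm=1.0):
    # Average gradients over accum_steps micro-batches, clip, then update,
    # simulating one large-batch step on limited GPU memory.
    g = sum(micro_grads[:accum_steps]) / accum_steps
    if abs(g) > max_norm:  # gradient clipping (scalar norm)
        g = max_norm if g > 0 else -max_norm
    return w - lr * g
```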

Regularization Techniques

  • Dropout - Randomly zero out neurons during training (0.1-0.5)
  • Weight Decay - L2 regularization on weights
  • Data Augmentation - Random crops, flips, color jitter for images
  • Label Smoothing - Replace hard labels with soft targets
  • Stochastic Depth - Randomly drop entire layers during training
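
Label smoothing from the list above is a one-liner transform on the targets. A minimal sketch: the correct class receives 1 - epsilon plus its share of the redistributed mass, every other class receives epsilon / num_classes:

```python
def smooth_labels(num_classes, target, epsilon=0.1):
    # Soft target: (1 - epsilon) * one_hot + epsilon * uniform
    off = epsilon / num_classes
    soft = [off] * num_classes
    soft[target] += 1.0 - epsilon
    return soft
```

The soft targets still sum to 1 and are used with cross-entropy in place of the hard one-hot vector, discouraging over-confident logits.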