Neural Networks & Backpropagation
A neural network is a series of layers where each layer performs a linear transformation followed by a non-linear activation. Backpropagation computes gradients to update weights via chain rule.
Key Components
- Layers - Input, Hidden (1+), Output. Each performs a linear transform: z = Wx + b
- Activation Functions:
  - ReLU(x) = max(0, x) - Default choice, fast, avoids vanishing gradients
  - Sigmoid(x) = 1/(1+e^-x) - Output layer for binary classification
  - Softmax - Output layer for multi-class (probability distribution)
  - GELU - Used in transformers, smoother than ReLU
- Loss Functions - Cross-entropy (classification), MSE (regression), Focal Loss (imbalanced)
- Backpropagation - Chain rule applied layer by layer to compute dL/dW for each weight
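Of the activations above, ReLU and sigmoid are one-liners (they appear in the network code below); softmax and GELU can be sketched the same way. The erf-based form here is the exact GELU; the tanh approximation used in some implementations differs slightly:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    # Exact form: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```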
Forward & Backward Pass
```python
import math, random

def relu(x): return max(0, x)
def relu_deriv(x): return 1.0 if x > 0 else 0.0
def sigmoid(x): return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            # He initialization: scale = sqrt(2 / fan_in), suited to ReLU layers
            scale = math.sqrt(2.0 / layer_sizes[i])
            w = [[random.gauss(0, scale) for _ in range(layer_sizes[i+1])]
                 for _ in range(layer_sizes[i])]
            b = [0.0] * layer_sizes[i+1]
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        activations = [x]
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            # z_k = sum_j x_j * W[j][k] + b_k  (one output unit per column k)
            z = [sum(x[j] * w[j][k] for j in range(len(x))) + b[k]
                 for k in range(len(b))]
            if i < len(self.weights) - 1:
                x = [relu(z_k) for z_k in z]     # Hidden: ReLU
            else:
                x = [sigmoid(z_k) for z_k in z]  # Output: Sigmoid
            activations.append(x)
        return activations
```
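The backward pass is not shown above. As a minimal illustration of the chain rule it applies, here is the gradient for a single sigmoid neuron with squared-error loss, checked against a finite-difference estimate (all names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Forward: one neuron, squared error L = (sigmoid(w*x + b) - y)^2
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)  # sigmoid'(z) = a * (1 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw

# Finite-difference check of the analytic gradient
w, b, x, y = 0.5, -0.2, 1.3, 1.0
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
```

Backpropagation applies exactly this per-weight chain rule layer by layer, reusing the intermediate activations stored during the forward pass.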
Training Tips
- Use BatchNorm between layers for faster, more stable training
- Use Dropout (0.1-0.5) for regularization
- Start with Adam optimizer, switch to SGD+momentum for fine-tuning
- Use learning rate warmup + cosine decay schedule
- Monitor train vs val loss for overfitting detection
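The warmup + cosine decay schedule from the tips above can be written as a plain function of the training step (the step counts and peak rate here are illustrative defaults):

```python
import math

def lr_at(step, warmup_steps=1000, total_steps=10000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch the same shape is usually built from `torch.optim.lr_scheduler` classes (e.g. `CosineAnnealingLR` combined with a warmup scheduler) rather than hand-rolled.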
Convolutional Neural Networks (CNN)
CNNs use learnable convolutional filters that slide across spatial dimensions to detect features like edges, textures, and objects. Key innovation: weight sharing and local connectivity.
Core Operations
- Convolution - Learnable filters extract local features. 3x3 or 5x5 kernels typical.
- Pooling - Downsample spatial dimensions. MaxPool (most common) or AvgPool.
- Stride - Skip positions during convolution (reduces spatial size)
- Padding - Add zeros around input to preserve spatial dimensions
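The convolution, stride, and padding operations above can be sketched naively in plain Python. Real frameworks use heavily optimized kernels, and (like them) this computes cross-correlation rather than flipping the kernel:

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution over a 2D list-of-lists image."""
    if padding:
        h, w = len(image), len(image[0])
        padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
        for i in range(h):
            for j in range(w):
                padded[i + padding][j + padding] = image[i][j]
        image = padded
    kh, kw = len(kernel), len(kernel[0])
    # Output size: (input - kernel) // stride + 1 along each dimension
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i * stride + di][j * stride + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out
```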
Key Architectures (Evolution)
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU training | 8 |
| VGG | 2014 | Small 3x3 filters, deeper networks | 16-19 |
| GoogLeNet/Inception | 2014 | Inception modules (parallel filter sizes) | 22 |
| ResNet | 2015 | Skip connections enable 100+ layers | 50-152 |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | varies |
| Vision Transformer (ViT) | 2020 | Transformer applied to image patches | 12-24 |
| ConvNeXt | 2022 | Modernized CNN matching ViT performance | varies |
Transfer Learning Pattern
```python
# PyTorch transfer learning
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
model = models.resnet50(weights='IMAGENET1K_V2')

# Freeze backbone
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head (new layers have requires_grad=True by default)
num_classes = 10  # adjust to your task
model.fc = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes)
)
# Only train the new head (then optionally unfreeze and fine-tune all)
```
Applications
- Image Classification - What is in this image?
- Object Detection - Where are objects? (YOLO, Faster R-CNN)
- Semantic Segmentation - Pixel-level classification (U-Net)
- Image Generation - Style transfer, super-resolution
Vision Architecture Selection (2025)
| Scenario | Recommended |
|---|---|
| Large-scale + abundant data + compute | Vision Transformers (ViT) |
| Small datasets / limited labels | CNN (ResNet, EfficientNet) + transfer learning |
| Mobile / edge deployment | EfficientNet, MobileViT |
| Medical imaging / specialized domains | CNNs or hybrid architectures |
| General purpose with pre-training | ViT with fine-tuning |
Emerging trend: Hybrid architectures (CvT, CoAtNet, MobileViT) blend CNN efficiency with Transformer contextual strength — the field is converging toward these combined approaches.
RNN / LSTM / GRU
Recurrent Neural Networks process sequential data by maintaining a hidden state that carries information across time steps. LSTMs and GRUs use gating to mitigate the vanishing gradient problem.
Architecture Comparison
| Type | Gates | Key Feature | Parameters |
|---|---|---|---|
| Vanilla RNN | None | Simple but vanishing gradients | Fewest |
| LSTM | Input, Forget, Output | Cell state preserves long-term memory | 4x RNN |
| GRU | Reset, Update | Simplified LSTM, similar performance | 3x RNN |
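The 4x/3x parameter ratios in the table come directly from the number of gates, since each gate owns a weight matrix over the concatenated [x_t; h_prev] plus a bias. A quick count for one layer:

```python
def rnn_params(input_size, hidden_size, num_gates):
    """Weights + biases for one recurrent layer.

    Each gate: hidden x (input + hidden) weight matrix plus a bias vector.
    Vanilla RNN has 1 such block, LSTM has 4 gates, GRU has 3.
    """
    return num_gates * (hidden_size * (input_size + hidden_size) + hidden_size)

vanilla = rnn_params(128, 256, 1)
lstm = rnn_params(128, 256, 4)
gru = rnn_params(128, 256, 3)
```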
LSTM Cell Equations
```python
# LSTM forward pass (single time step), using numpy; here W and b are dicts
# holding one weight matrix / bias vector per gate ('f', 'i', 'c', 'o')
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Concatenate input and previous hidden state
    combined = np.concatenate([x_t, h_prev])
    # Gate computations
    f_t = sigmoid(W['f'] @ combined + b['f'])    # Forget gate
    i_t = sigmoid(W['i'] @ combined + b['i'])    # Input gate
    c_hat = np.tanh(W['c'] @ combined + b['c'])  # Candidate cell state
    o_t = sigmoid(W['o'] @ combined + b['o'])    # Output gate
    # Update cell and hidden state
    c_t = f_t * c_prev + i_t * c_hat             # New cell state
    h_t = o_t * np.tanh(c_t)                     # New hidden state
    return h_t, c_t
```
RNN vs Transformer
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range deps | Struggles beyond ~100 tokens | Handles thousands of tokens |
| Memory | O(1) per step | O(n^2) attention matrix |
| Best for | Streaming, low-memory, time series | NLP, vision, most tasks |
Transformers
The Transformer architecture, introduced in "Attention Is All You Need" (2017), replaced recurrence with self-attention, enabling parallel processing of sequences and better modeling of long-range dependencies.
Self-Attention Mechanism
```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: matrices (lists of row vectors) of queries, keys, values
    Returns: attention-weighted values and the attention weights
    """
    d_k = len(K[0])  # Key dimension
    # Compute attention scores: Q @ K^T, scaled by sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Optional: apply causal mask for decoder
    # scores = apply_mask(scores, mask)
    # Softmax over each row to get attention weights
    weights = [softmax(row) for row in scores]
    # Weighted sum of values
    output = matmul(weights, V)
    return output, weights
```
Transformer Components
- Multi-Head Attention - Run attention in parallel with different learned projections, then concatenate
- Positional Encoding - Inject position information (sinusoidal or learned embeddings)
- Layer Normalization - Pre-norm (modern) or post-norm (original) placement
- Feed-Forward Network - Two linear layers with an activation between: FFN(x) = W2 * GELU(W1 * x + b1) + b2
- Residual Connections - output = LayerNorm(x + Sublayer(x))
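The sinusoidal positional encoding mentioned above follows its standard definition, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos of the same argument:

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as a max_len x d_model matrix."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe
```

Because each dimension oscillates at a different frequency, every position gets a unique pattern, and relative offsets correspond to fixed linear transformations of the encoding.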
Transformer Variants
| Model | Type | Architecture | Best For |
|---|---|---|---|
| BERT | Encoder-only | Bidirectional attention | Classification, NER, Q&A |
| GPT | Decoder-only | Causal (left-to-right) attention | Text generation, reasoning |
| T5/BART | Encoder-Decoder | Full sequence-to-sequence | Translation, summarization |
| ViT | Encoder-only | Image patches as tokens | Image classification |
| Whisper | Encoder-Decoder | Audio spectrogram input | Speech recognition |
Autoencoders
Encoder-decoder architecture that learns compressed representations by training to reconstruct its input. The bottleneck layer forces the network to learn meaningful features.
Variants
- Vanilla Autoencoder - Dimensionality reduction, feature learning
- Variational Autoencoder (VAE) - Learns a continuous latent space for generation; adds KL divergence to loss
- Denoising Autoencoder - Trained to reconstruct clean inputs from corrupted versions
- Sparse Autoencoder - Adds sparsity constraint to learn more meaningful features
VAE: Variational Autoencoder
Unlike vanilla autoencoders, VAEs learn a probabilistic latent space, enabling generation of new data by sampling from the learned distribution.
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.log_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std  # Differentiable sampling

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

# Loss = Reconstruction + KL Divergence
# KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
```
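The KL term in the comment above is the closed-form divergence between N(mu, sigma^2) and the standard normal prior N(0, 1), summed over latent dimensions; a plain-Python version for checking values:

```python
import math

def kl_divergence(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

A latent code that matches the prior exactly (mu = 0, log_var = 0) contributes zero KL; any deviation is penalized, which is what keeps the latent space continuous and sampleable.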
Applications
- Anomaly detection (high reconstruction error = anomaly)
- Image denoising and super-resolution
- Feature extraction for downstream tasks
- Data generation (VAE) — drug discovery, data augmentation
- Latent space interpolation and exploration
Generative Adversarial Networks (GANs)
Two networks trained adversarially: a Generator creates fake data, a Discriminator distinguishes real from fake. They compete until the generator produces realistic data.
Training Dynamics
```python
# GAN training loop (conceptual)
for epoch in range(epochs):
    # 1. Train Discriminator
    d_optimizer.zero_grad()
    real_data = sample_real_batch()
    fake_data = generator(random_noise())
    d_loss_real = loss(discriminator(real_data), label=1)
    d_loss_fake = loss(discriminator(fake_data.detach()), label=0)  # detach: don't update G here
    d_loss = d_loss_real + d_loss_fake
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator
    g_optimizer.zero_grad()
    fake_data = generator(random_noise())
    g_loss = loss(discriminator(fake_data), label=1)  # Fool the discriminator
    g_loss.backward()
    g_optimizer.step()
```
GAN Variants
- DCGAN - Convolutional architecture, stable training guidelines
- WGAN - Wasserstein distance for more stable training
- StyleGAN - High-quality face generation with style control
- CycleGAN - Unpaired image-to-image translation
- Pix2Pix - Paired image-to-image translation
Training Challenges
- Mode collapse - Generator produces limited variety
- Training instability - Requires careful hyperparameter tuning
- Evaluation difficulty - No single metric for generation quality (use FID, IS)
Diffusion Models (DDPM)
Diffusion models learn to generate data by gradually denoising a sample from pure noise. The forward process adds Gaussian noise over T steps; the reverse process learns to remove it.
How It Works
- Forward Process (Diffusion) - Gradually add noise; in closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
- Reverse Process (Denoising) - Neural network predicts and removes noise at each step
- Training - Sample random timestep t, add noise, train network to predict the added noise
- Sampling - Start from pure noise, iteratively denoise through T steps
Simplified DDPM Training
```python
# DDPM training step (simplified)
import random
import torch
import torch.nn.functional as F

def train_step(model, x_0, noise_schedule, T):
    # 1. Sample random timestep
    t = random.randint(0, T - 1)
    # 2. Sample noise
    epsilon = torch.randn_like(x_0)
    # 3. Create noisy image (closed form of the forward process)
    alpha_bar_t = noise_schedule.alpha_bar[t]
    x_t = (alpha_bar_t ** 0.5) * x_0 + ((1 - alpha_bar_t) ** 0.5) * epsilon
    # 4. Predict the noise from the noisy image and timestep
    epsilon_pred = model(x_t, t)
    # 5. Loss = MSE between actual and predicted noise
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```
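The alpha_bar values used above come from a variance schedule. A minimal linear-beta schedule can be computed as follows (the endpoints 1e-4 and 0.02 follow the original DDPM paper; other schedules such as cosine are common too):

```python
def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bar = []
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar
```

As t grows, alpha_bar shrinks toward zero, so x_t approaches pure Gaussian noise, which is exactly why sampling can start from noise.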
Generative Model Comparison
| Aspect | Diffusion | GANs | VAEs |
|---|---|---|---|
| Output quality | Very high | High (sharp) | Moderate (blurry) |
| Training stability | Very stable | Unstable | Very stable |
| Mode coverage | Full distribution | Mode collapse risk | Full distribution |
| Sampling speed | Slow (many steps) | Fast (single pass) | Fast (single pass) |
| Controllability | Excellent (text-guided) | Moderate | Good (structured latent) |
| Status (2025) | Dominant for image gen | Niche (upscaling, style) | Anomaly detection, science |
Graph Neural Networks (GNN)
Process graph-structured data (social networks, molecules, knowledge graphs) using message passing: each node aggregates information from its neighbors to update its representation.
Key Architectures
- GCN (Graph Convolutional Network) - Spectral-based, simple neighborhood aggregation
- GAT (Graph Attention Network) - Attention-weighted neighbor aggregation
- GraphSAGE - Samples and aggregates from neighbors, scales to large graphs
Message Passing Framework
```python
# GNN message passing (conceptual): graph provides .nodes and .neighbors();
# mean, relu, and concatenate stand in for the usual tensor ops
def gnn_layer(graph, node_features, W):
    new_features = {}
    for node in graph.nodes:
        # Aggregate neighbor features
        neighbor_feats = [node_features[n] for n in graph.neighbors(node)]
        aggregated = mean(neighbor_feats)  # or sum, max, attention
        # Update node representation from its own features plus the aggregate
        new_features[node] = relu(W @ concatenate(node_features[node], aggregated))
    return new_features
```
Architecture Comparison
| Feature | GCN | GraphSAGE | GAT |
|---|---|---|---|
| Learning type | Transductive | Inductive | Both |
| Neighbor handling | Full neighborhood | Sampled subset | Attention-weighted |
| Scalability | Limited (full graph) | High (sampling) | Moderate |
| Best for | Static graphs | Large/dynamic graphs | Heterogeneous graphs |
Applications
- Social network analysis, recommendation systems (PinSAGE: 3B+ nodes)
- Drug discovery and molecular property prediction
- Traffic prediction, knowledge graph completion
- Cybersecurity: anomaly detection from network logs
- Computer vision: object detection, video action recognition
Reinforcement Learning
RL trains agents to make sequential decisions by maximizing cumulative reward through trial and error. The agent learns a policy mapping states to actions.
Key Algorithms
| Algorithm | Type | Key Idea |
|---|---|---|
| Q-Learning | Value-based, Off-policy | Learn Q(s,a) table via Bellman equation |
| SARSA | Value-based, On-policy | Like Q-learning but follows current policy |
| DQN | Value-based + NN | Q-learning with neural network function approximation |
| Policy Gradient | Policy-based | Directly optimize policy with gradient ascent |
| PPO | Policy-based | Clipped surrogate objective for stable updates |
| A3C/A2C | Actor-Critic | Combine value and policy learning |
Q-Learning Implementation
```python
import random

def q_learning(env, episodes=1000, lr=0.1, gamma=0.99, epsilon=0.1):
    Q = {}  # Q-table: state -> {action: value}
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (explore unseen states too)
            actions = Q.get(state, {})
            if random.random() < epsilon or not actions:
                action = env.random_action()
            else:
                action = max(actions, key=actions.get)
            next_state, reward, done = env.step(action)
            # Bellman update
            old_q = Q.get(state, {}).get(action, 0)
            max_next = max(Q.get(next_state, {}).values(), default=0)
            new_q = old_q + lr * (reward + gamma * max_next - old_q)
            Q.setdefault(state, {})[action] = new_q
            state = next_state
    return Q
```
RLHF for LLMs
Reinforcement Learning from Human Feedback aligns language models with human preferences:
- Collect human preference data (which response is better)
- Train a reward model from preferences
- Fine-tune LLM with PPO to maximize reward model score
- Alternative: DPO (Direct Preference Optimization) skips the reward model
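The DPO objective mentioned in the last step has a simple closed form: it pushes up the policy's log-probability margin for the chosen response over the rejected one, measured relative to a frozen reference model. A scalar sketch (the log-probability inputs here are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probabilities the loss sits at log 2; it falls as the policy favors the chosen response more than the reference does, so no separate reward model or RL loop is needed.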
Deep Learning Operations
Practical techniques for training deep learning models at scale.
Essential Techniques
- AdamW Optimizer - Adam with decoupled weight decay. Default for transformers.
- Learning Rate Schedules - Warmup + cosine decay, or OneCycleLR
- Gradient Clipping - Prevent exploding gradients (e.g. clip to max_norm=1.0)
- Mixed Precision (FP16/BF16) - 2x speedup with minimal accuracy loss
- Gradient Accumulation - Simulate larger batch sizes on limited GPU memory
- Checkpointing - Save model state periodically for recovery
- Early Stopping - Stop when validation loss stops improving
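Gradient clipping by global norm (the behavior of PyTorch's `torch.nn.utils.clip_grad_norm_`) is small enough to sketch in plain Python; gradients here are a flat list of floats for illustration:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their global L2 norm <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Scaling every gradient by the same factor preserves the update direction, unlike clipping each component independently.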
Regularization Techniques
- Dropout - Randomly zero out neurons during training (0.1-0.5)
- Weight Decay - L2 regularization on weights
- Data Augmentation - Random crops, flips, color jitter for images
- Label Smoothing - Replace hard labels with soft targets
- Stochastic Depth - Randomly drop entire layers during training
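Label smoothing from the list above is a one-line transform of a one-hot target: the correct class gets 1 - epsilon and the remaining mass is spread over the other classes (epsilon = 0.1 is a common default; some implementations instead mix with a uniform distribution over all K classes):

```python
def smooth_labels(num_classes, true_class, epsilon=0.1):
    """Soft target: 1 - epsilon on the true class, epsilon / (K - 1) elsewhere."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value
            for k in range(num_classes)]
```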