Attention Mechanisms
Attention allows a model to dynamically focus on relevant parts of the input when producing each output element. It computes a weighted sum of values, where weights are determined by query-key compatibility.
Types of Attention
- Self-Attention - Q, K, V all come from the same sequence. Each token attends to all others.
- Cross-Attention - Q from decoder, K/V from encoder. Used in translation, image captioning.
- Causal (Masked) Attention - Each position can only attend to previous positions. Used in GPT-style decoders.
- Multi-Head Attention - Run h parallel attention heads with different projections, concatenate results.
Scaled Dot-Product Attention
import math

def softmax(row):
    # Numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """
    Q: (seq_len, d_k) - Queries
    K: (seq_len, d_k) - Keys
    V: (seq_len, d_v) - Values
    causal: if True, mask out future positions (decoder-style)
    """
    d_k = len(Q[0])
    # Step 1: Compute scaled attention scores, QK^T / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
               for j in range(len(K))]
              for i in range(len(Q))]
    # Step 2: Apply causal mask (for decoder)
    if causal:
        for i in range(len(scores)):
            for j in range(len(scores[i])):
                if j > i:
                    scores[i][j] = float('-inf')
    # Step 3: Softmax each row of scores into attention weights
    weights = [softmax(row) for row in scores]
    # Step 4: Weighted sum of values
    output = [[sum(weights[i][j] * V[j][k] for j in range(len(V)))
               for k in range(len(V[0]))]
              for i in range(len(weights))]
    return output, weights
Multi-Head Attention
def multi_head_attention(Q, K, V, num_heads, d_model):
    # Pseudocode: linear() and the W_q/W_k/W_v/W_o matrices stand in
    # for learned projection weights.
    d_k = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Project Q, K, V with this head's learned weight matrices
        Q_h = linear(Q, W_q[h])  # (seq, d_k)
        K_h = linear(K, W_k[h])
        V_h = linear(V, W_v[h])
        # Compute scaled dot-product attention for this head
        head_output, _ = attention(Q_h, K_h, V_h)
        heads.append(head_output)
    # Concatenate all heads along the feature dim and project back to d_model
    concat = concatenate(heads, dim=-1)  # (seq, d_model)
    output = linear(concat, W_o)
    return output
Efficient Attention Variants
- Flash Attention - IO-aware exact attention, 2-4x faster via tiling
- Multi-Query Attention (MQA) - Shared K/V across heads, faster inference
- Grouped-Query Attention (GQA) - Groups of heads share K/V (used in Llama 2/3)
- Sliding Window Attention - Each token attends to local window (Mistral)
- KV Cache - Cache K/V from previous tokens for faster autoregressive generation
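The KV-cache idea can be sketched in plain Python. This is a hypothetical minimal decoder step, not any library's API: keys and values for past tokens are stored once, so each new token computes attention only against the cache instead of re-encoding the whole prefix.

```python
import math

def decode_step(q, k, v, kv_cache):
    """One autoregressive step: append this token's K/V to the cache,
    then attend the new query against all cached keys/values."""
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    d_k = len(q)
    # Scores against every cached key (past tokens + current one)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_k)
              for key in kv_cache["k"]]
    # Softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of cached values
    return [sum(w * val[j] for w, val in zip(weights, kv_cache["v"]))
            for j in range(len(v))]

cache = {"k": [], "v": []}
out1 = decode_step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], cache)
out2 = decode_step([0.0, 1.0], [0.0, 1.0], [0.0, 3.0], cache)
```

Because causality means past tokens never attend to future ones, cached K/V entries never change; only one new row is computed per generated token.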
Tokenization & BPE
Tokenization converts raw text into tokens (subword units) that models can process. The tokenizer determines the model's vocabulary and directly impacts performance.
Tokenization Methods
| Method | Used By | Key Idea |
|---|---|---|
| BPE (Byte-Pair Encoding) | GPT, Llama | Iteratively merge most frequent character pairs |
| WordPiece | BERT | Like BPE but uses likelihood-based merging |
| SentencePiece | T5, mBART | Language-agnostic; tokenizes raw text with no whitespace pre-tokenization |
| tiktoken | GPT-3.5 / GPT-4 (OpenAI) | Fast byte-level BPE implementation |
BPE Algorithm
def byte_pair_encoding(corpus, vocab_size):
    """Build a BPE vocabulary from a corpus (list of words)."""
    # Start with a character-level vocabulary
    vocab = set(char for word in corpus for char in word)
    tokens = [[char for char in word] for word in corpus]
    while len(vocab) < vocab_size:
        # Count all adjacent token pairs across the corpus
        pair_counts = {}
        for word_tokens in tokens:
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
        if not pair_counts:
            break
        # Find the most frequent pair and add its merge to the vocabulary
        best_pair = max(pair_counts, key=pair_counts.get)
        merged = best_pair[0] + best_pair[1]
        vocab.add(merged)
        # Merge this pair everywhere it occurs
        for j, word_tokens in enumerate(tokens):
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if (i < len(word_tokens) - 1 and
                        word_tokens[i] == best_pair[0] and
                        word_tokens[i + 1] == best_pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            tokens[j] = new_tokens
    return vocab, tokens
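Training produces an ordered list of merges; encoding new text replays those merges in priority order. A minimal sketch (the `apply_bpe` helper and the toy merge list are illustrative, not part of any library):

```python
def apply_bpe(word, merges):
    """Encode a word by replaying learned merges in order."""
    tokens = list(word)
    for a, b in merges:  # merges ordered by learning priority
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge in place, recheck position i
            else:
                i += 1
    return tokens

merges = [("l", "o"), ("lo", "w")]
print(apply_bpe("lower", merges))  # ['low', 'e', 'r']
```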
Fine-tuning & LoRA/PEFT
Fine-tuning Approaches
| Method | Parameters Trained | GPU Memory | Best For |
|---|---|---|---|
| Full Fine-tuning | 100% | Very High | Maximum quality when data and compute are plentiful |
| LoRA | 0.1-1% | Low | Most practical fine-tuning tasks |
| QLoRA | 0.1-1% | Very Low | Consumer GPUs (single 24GB GPU) |
| Prefix Tuning | <0.1% | Minimal | Lightweight task adaptation |
| Prompt Tuning | <0.01% | Minimal | Simple task steering |
LoRA (Low-Rank Adaptation)
class LoRALayer:
    """
    Instead of updating the full weight matrix W (d x d),
    learn a low-rank update: delta_W = A @ B,
    where A is (d x r) and B is (r x d), with r << d.
    (Pseudocode: random_normal and zeros stand in for tensor initializers.)
    """
    def __init__(self, d_model, rank=8, alpha=16):
        self.A = random_normal(d_model, rank)  # small random init
        self.B = zeros(rank, d_model)  # zero init, so delta_W starts at zero
        self.scaling = alpha / rank  # applied once, in forward
    def forward(self, x, original_weight):
        # Frozen base output + scaled low-rank adaptation
        base_output = x @ original_weight
        lora_output = (x @ self.A @ self.B) * self.scaling
        return base_output + lora_output
Key Decisions
- Rank (r) - 4-64 typical. Higher = more capacity but more parameters. Start with 8.
- Alpha - Scaling factor. Common: alpha = 2*rank
- Target Modules - Apply to attention (q, k, v, o) layers. Optionally MLP layers too.
- Learning Rate - 1e-4 to 3e-4 typical for LoRA (higher than full fine-tuning)
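A quick back-of-envelope check of why LoRA is cheap: for one d x d weight at rank r, the adapter trains d*r + r*d parameters instead of d*d. The numbers below (d_model=4096, rank=8) are illustrative:

```python
d_model, rank = 4096, 8
full_params = d_model * d_model                 # full fine-tune of one matrix
lora_params = d_model * rank + rank * d_model   # A (d x r) plus B (r x d)
print(full_params, lora_params, lora_params / full_params)
# prints: 16777216 65536 0.00390625  (~0.4% of the full matrix)
```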
RAG (Retrieval-Augmented Generation)
RAG combines information retrieval with text generation. Instead of relying solely on the model's training data, it retrieves relevant documents and includes them in the prompt context.
RAG Pipeline
# Simplified RAG pipeline
class RAGPipeline:
    def __init__(self, embedding_model, vector_store, llm):
        self.embedder = embedding_model
        self.store = vector_store
        self.llm = llm

    def index(self, documents):
        """Index documents into the vector store."""
        for doc in documents:
            chunks = self.chunk(doc, size=512, overlap=50)
            for chunk in chunks:
                embedding = self.embedder.encode(chunk)
                self.store.add(embedding, chunk)

    def query(self, question, top_k=5):
        """Answer a question using retrieved context."""
        # 1. Embed the question
        q_embedding = self.embedder.encode(question)
        # 2. Retrieve relevant chunks
        results = self.store.search(q_embedding, top_k=top_k)
        # 3. Build prompt with context
        context = "\n\n".join(r.text for r in results)
        prompt = f"""Answer based on the following context:
{context}

Question: {question}
Answer:"""
        # 4. Generate answer
        return self.llm.generate(prompt)
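The `chunk` helper used by `index` is not defined above; a minimal character-window sketch under that assumption (real pipelines usually chunk by tokens or sentences rather than characters):

```python
def chunk(text, size=512, overlap=50):
    """Split text into overlapping fixed-size character windows."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

print(chunk("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.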
RAG Variants
- Naive RAG - Simple retrieve-then-generate pipeline
- Self-RAG - Model decides when to retrieve and self-reflects on relevance
- CRAG (Corrective RAG) - Evaluates retrieval quality, falls back to web search
- GraphRAG - Uses knowledge graphs for structured retrieval
- Agentic RAG - Agent decides retrieval strategy, query reformulation, multi-hop
Vector Search Methods
| Method | Speed | Accuracy | Memory |
|---|---|---|---|
| Flat (Exact) | Slow O(n) | Perfect | O(n*d) |
| IVF (Inverted File) | Fast | High | O(n*d) |
| HNSW | Very Fast | Very High | O(n*d*M) |
| Product Quantization | Fast | Good | Low O(n*m) |
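Flat (exact) search from the table is just a brute-force scan; a pure-Python sketch with cosine similarity (illustrative, not any specific library):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flat_search(query, vectors, top_k=2):
    """Exact nearest-neighbor search: score every vector, O(n*d)."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(flat_search([1.0, 0.0], vectors, top_k=2))  # [0, 2]
```

IVF, HNSW, and PQ all trade some of this exactness for sublinear search or lower memory.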
Knowledge Distillation
Transfer knowledge from a large "teacher" model to a smaller "student" model. The student learns from the teacher's soft probability outputs (dark knowledge) rather than hard labels.
How It Works
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """
    T: Temperature (higher = softer distributions)
    alpha: Weight between soft (distillation) and hard (label) loss
    (Pseudocode helpers: softmax, kl_divergence, cross_entropy.)
    """
    # Soft loss: KL(teacher || student) between temperature-softened
    # distributions; the T^2 factor keeps gradient magnitudes comparable
    # across temperatures
    soft_student = softmax(student_logits / T)
    soft_teacher = softmax(teacher_logits / T)
    soft_loss = kl_divergence(soft_teacher, soft_student) * (T * T)
    # Hard loss: standard cross-entropy with the true labels
    hard_loss = cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
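The effect of temperature is easy to see numerically: dividing logits by T > 1 flattens the distribution, exposing the teacher's relative preferences among the non-top classes (the "dark knowledge"). A small sketch with made-up logits:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [6.0, 2.0, 1.0]
sharp = softmax(logits, T=1.0)  # nearly one-hot
soft = softmax(logits, T=4.0)   # softer: wrong classes get visible mass
```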
Applications
- Deploy large model quality on mobile/edge devices
- Reduce inference cost (smaller model = cheaper)
- Model compression pipelines
Contrastive Learning (SimCLR)
Self-supervised learning that creates representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space.
SimCLR Framework
- Create two augmented views of each image
- Encode both views with shared encoder
- Maximize agreement between positive pairs (same image) via NT-Xent loss
- Negative pairs are all other images in the batch
import math

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Normalized Temperature-scaled Cross Entropy loss.
    (Pseudocode helpers: concatenate, cosine_similarity_matrix.)"""
    batch_size = len(z_i)
    # Cosine similarity between all pairs of the 2N embeddings
    z = concatenate(z_i, z_j)          # 2N embeddings
    sim = cosine_similarity_matrix(z)  # (2N, 2N)
    # Positive pairs: (i, i+N) and (i+N, i);
    # the denominator sums over all 2N-1 other samples
    loss = 0.0
    for i in range(2 * batch_size):
        pos = (i + batch_size) % (2 * batch_size)
        pos_sim = sim[i][pos] / temperature
        other_sims = [sim[i][j] / temperature
                      for j in range(2 * batch_size) if j != i]
        loss -= pos_sim - math.log(sum(math.exp(s) for s in other_sims))
    return loss / (2 * batch_size)
MLOps & Production ML
Practices for deploying, monitoring, and maintaining ML models in production.
Key Components
- Feature Store - Centralized feature management ensuring train-serve consistency
- Model Registry - Version control for models with metadata and lineage
- Model Serving - REST APIs, batch inference, streaming inference
- Monitoring - Track prediction drift, data drift, model performance
- A/B Testing - Compare models in production with statistical rigor
- CI/CD for ML - Automated training, validation, and deployment pipelines
Data & Model Drift Detection
import math

def population_stability_index(expected, actual, bins=10):
    """PSI measures distribution shift between training and serving data.
    expected/actual are per-bin counts from the two distributions."""
    psi = 0.0
    for i in range(bins):
        e_pct = expected[i] / sum(expected)
        a_pct = actual[i] / sum(actual)
        if e_pct > 0 and a_pct > 0:
            psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    # PSI < 0.1: no shift; 0.1-0.2: moderate; > 0.2: significant
    return psi
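The PSI function consumes per-bin counts; a sketch of producing those from raw feature values with equal-width bins (toy numbers, illustrative only):

```python
def histogram(values, bins, lo, hi):
    """Bin raw values into equal-width counts suitable for PSI."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp top edge
        counts[idx] += 1
    return counts

train_scores = [0.1, 0.2, 0.2, 0.4, 0.5, 0.7]
print(histogram(train_scores, bins=5, lo=0.0, hi=1.0))  # [1, 2, 2, 1, 0]
```

In practice the bin edges are fixed from the training distribution (often by quantiles) and reused when binning serving data.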
Deployment Strategies
- Canary Release - Route small % of traffic to new model, monitor metrics
- Blue-Green - Two identical environments, switch traffic instantly
- Shadow Mode - New model runs alongside old, compare outputs without serving
- Feature Flags - Enable/disable model features without deployment
Model Evaluation & Metrics
Classification Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only |
| Precision | TP/(TP+FP) | Cost of false positives is high (spam) |
| Recall | TP/(TP+FN) | Cost of false negatives is high (disease) |
| F1 Score | 2*P*R/(P+R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Compare models, threshold-independent |
| Log Loss | -avg(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters |
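These all follow directly from the confusion-matrix counts; a quick check with made-up counts:

```python
TP, FP, FN, TN = 40, 10, 20, 30
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.7
precision = TP / (TP + FP)                           # 0.8
recall = TP / (TP + FN)                              # ~0.667
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(f1, 3))  # 0.727
```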
Regression Metrics
| Metric | Properties |
|---|---|
| MAE | Robust to outliers, interpretable in original units |
| RMSE | Penalizes large errors more, same units as target |
| R-squared | Proportion of variance explained (0-1) |
| MAPE | Percentage error, scale-independent |
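A toy example with one large miss shows why RMSE penalizes outliers more than MAE:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 1.0]  # one large error
errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)             # 2.0
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # 4.0
```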
Retrieval Metrics
| Metric | What It Measures |
|---|---|
| MRR (Mean Reciprocal Rank) | Rank of first relevant result |
| NDCG | Quality of ranking with graded relevance |
| Recall@K | Fraction of relevant items in top-K |
| Hit@K | Whether any relevant item appears in top-K |
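A minimal sketch of MRR and Recall@K over ranked result lists (the document IDs and relevance sets below are toy data):

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs appearing in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

ranked_lists = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant_sets = [{"d1"}, {"d9", "d4"}]
print(mrr(ranked_lists, relevant_sets))                     # 0.5
print(recall_at_k(["d2", "d9", "d4"], {"d9", "d4"}, k=2))   # 0.5
```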
Prompt Engineering
The practice of designing and optimizing input prompts to get desired outputs from LLMs without modifying model weights.
Key Techniques
| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | Direct instruction with no examples | Simple, well-defined tasks |
| Few-shot | Provide input-output examples in the prompt | Tasks with subtle formatting or conventions |
| Chain-of-thought (CoT) | Ask the model to reason step by step | Complex reasoning, math, logic |
| Tree-of-thought | Explore multiple reasoning paths | Problems with multiple valid approaches |
| ReAct | Interleave reasoning and tool-use actions | Agentic workflows, multi-step tasks |
| Self-consistency | Generate multiple CoT paths, take majority vote | Improving reliability on hard problems |
| Reflection | Ask the model to critique/improve its own output | Quality refinement |
| Prompt chaining | Break complex tasks into sequential prompts | Multi-step workflows |
Best Practices
# 1. Be specific - state task, context, format, tone
prompt = """You are a senior data scientist. Given the following dataset
description, recommend the top 3 ML algorithms with reasons.
Dataset: {description}
Constraints: {constraints}
Output format: JSON array with fields: algorithm, reason, complexity"""
# 2. Few-shot example pattern
prompt = """Classify the sentiment of each review.
Review: "Great product, fast shipping!" -> positive
Review: "Terrible quality, broke in a day" -> negative
Review: "{user_input}" ->"""
# 3. Chain-of-thought
prompt = """Solve step by step:
Q: If a store has 240 items and sells 30% on Monday,
then 50% of the remainder on Tuesday, how many are left?
A: Let me think step by step...
Monday: 240 * 0.30 = 72 sold, 240 - 72 = 168 remain
Tuesday: 168 * 0.50 = 84 sold, 168 - 84 = 84 remain
Answer: 84 items"""
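Self-consistency from the table above can be sketched as sampling several chain-of-thought answers and majority-voting; `sample_answer` is a hypothetical stand-in for a sampled LLM call:

```python
from collections import Counter

def self_consistency(sample_answer, question, n=5):
    """Sample n chain-of-thought answers and return the majority vote."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stubbed model: usually right, occasionally wrong
samples = iter(["84", "84", "72", "84", "84"])
answer = self_consistency(lambda q: next(samples), "items left?", n=5)
print(answer)  # 84
```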
Common Mistakes
- Over-engineering prompts (longer is not always better)
- Assuming the model will infer intent — be explicit
- Using every technique at once instead of selecting what fits
- Ignoring output formatting constraints (JSON, markdown, etc.)
RAG vs Fine-tuning
- RAG - Factual accuracy, dynamic knowledge, auditability. Depends on retrieval quality.
- Fine-tuning - Consistent style/format, lower inference cost at scale. Expensive to train.
2025 Trends
- Context Engineering - broader discipline encompassing prompt engineering
- Auto-prompting - automated prompt optimization using model feedback
- Multi-agent Chaining - coordinating prompts across specialized agents
- Prompt Versioning - tracking prompt versions and performance at scale
Quantization
Reducing the numerical precision of model weights and activations (e.g., FP32 to INT8 or INT4) to decrease model size, memory usage, and inference time.
Quantization Types
| Method | Description | Accuracy Loss | Use Case |
|---|---|---|---|
| PTQ (Post-Training) | Applied after training, no retraining needed | Moderate | Quick deployment |
| QAT (Quantization-Aware Training) | Simulates quantization during training | Minimal | Best accuracy |
| GPTQ | Weight-only quantization for LLMs (3-4 bit) | Low | LLM deployment |
| AWQ | Protects salient weights during quantization | Low | LLM deployment |
| GGUF/GGML | Format for llama.cpp CPU inference | Varies | Local/CPU inference |
Implementation Example
# Post-training quantization with PyTorch
import torch

# Dynamic quantization (weights quantized ahead of time,
# activations quantized on the fly at runtime)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize Linear layers
    dtype=torch.qint8
)

# Static quantization (both weights and activations)
model.qconfig = torch.quantization.get_default_qconfig('x86')
torch.quantization.prepare(model, inplace=True)
# Run calibration data through the model...
torch.quantization.convert(model, inplace=True)

# Using bitsandbytes for LLM quantization (4-bit)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config
)
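The arithmetic at the core of weight quantization is simple; a pure-Python sketch of absmax symmetric INT8 quantization (an illustration of the idea, not a production scheme):

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]  # store these as int8
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int codes and the scale."""
    return [qi * scale for qi in q]

w = [0.4, -0.8, 0.1, 0.25]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Rounding error is bounded by half the scale per weight; GPTQ and AWQ refine which weights get that error, rather than changing this basic mapping.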
Precision Comparison
| Precision | Bits | Model Size (7B params) | Typical Use |
|---|---|---|---|
| FP32 | 32 | ~28 GB | Training (full precision) |
| FP16 / BF16 | 16 | ~14 GB | Training (mixed precision) |
| INT8 | 8 | ~7 GB | Inference (good quality) |
| INT4 / NF4 | 4 | ~3.5 GB | Inference (consumer GPU) |
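The size column is just parameter count times bytes per parameter; a quick sanity check (weights only, ignoring activations and runtime overhead):

```python
def model_size_gb(n_params, bits):
    """Approximate weight-memory footprint in GB."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(bits, round(model_size_gb(7e9, bits), 1))
# 32 -> 28.0, 16 -> 14.0, 8 -> 7.0, 4 -> 3.5
```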