Attention Mechanisms

Agent Instruction
Attention is the core mechanism behind transformers. Understand scaled dot-product attention, multi-head attention, causal masking for decoders, and cross-attention for encoder-decoder models.

Attention allows a model to dynamically focus on relevant parts of the input when producing each output element. It computes a weighted sum of values, where weights are determined by query-key compatibility.

Types of Attention

  • Self-Attention - Q, K, V all come from the same sequence. Each token attends to all others.
  • Cross-Attention - Q from decoder, K/V from encoder. Used in translation, image captioning.
  • Causal (Masked) Attention - Each position can only attend to previous positions. Used in GPT-style decoders.
  • Multi-Head Attention - Run h parallel attention heads with different projections, concatenate results.

Scaled Dot-Product Attention

import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """
    Q: (seq_len, d_k) - queries
    K: (seq_len, d_k) - keys
    V: (seq_len, d_v) - values
    causal: if True, position i may only attend to positions j <= i
    """
    d_k = len(Q[0])

    # Step 1: Compute scaled attention scores Q K^T / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                for j in range(len(K))]
               for i in range(len(Q))]

    # Step 2: Apply causal mask (for decoder)
    if causal:
        for i in range(len(scores)):
            for j in range(i + 1, len(scores[i])):
                scores[i][j] = float('-inf')

    # Step 3: Softmax over each row
    weights = [softmax(row) for row in scores]

    # Step 4: Weighted sum of values
    output = [[sum(weights[i][j] * V[j][k] for j in range(len(V)))
                for k in range(len(V[0]))]
               for i in range(len(weights))]

    return output, weights
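The same computation in vectorised form makes a useful sanity check. This NumPy sketch (illustrative only; `attention_np` is not part of the listing above) reproduces the four steps, including the causal mask:

```python
import numpy as np

def attention_np(Q, K, V, causal=False):
    """Vectorised scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq)
    if causal:
        seq = scores.shape[0]
        upper = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(upper, -np.inf, scores)        # mask future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention_np(Q, K, V, causal=True)
# Every row of w sums to 1, and w[i, j] == 0 for j > i (causal mask).
```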

Multi-Head Attention

def multi_head_attention(Q, K, V, num_heads, d_model):
    """Pseudocode: linear(), concatenate(), and the learned weight
    matrices W_q, W_k, W_v, W_o are assumed to exist."""
    d_k = d_model // num_heads
    heads = []

    for h in range(num_heads):
        # Project Q, K, V with this head's learned weight matrices
        Q_h = linear(Q, W_q[h])  # (seq, d_k)
        K_h = linear(K, W_k[h])
        V_h = linear(V, W_v[h])

        # Compute scaled dot-product attention for this head
        head_output, _ = attention(Q_h, K_h, V_h)
        heads.append(head_output)

    # Concatenate all heads back to d_model and apply the output projection
    concat = concatenate(heads, dim=-1)  # (seq, d_model)
    output = linear(concat, W_o)
    return output
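Since the projection helpers above are pseudocode, here is a self-contained NumPy sketch of the same idea with random (untrained) weights, useful for checking shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention_np(x, num_heads, rng):
    """Self-attention over x with random (untrained) per-head projections."""
    seq, d_model = x.shape
    d_k = d_model // num_heads
    W_q = rng.normal(size=(num_heads, d_model, d_k))
    W_k = rng.normal(size=(num_heads, d_model, d_k))
    W_v = rng.normal(size=(num_heads, d_model, d_k))
    W_o = rng.normal(size=(d_model, d_model))

    heads = []
    for h in range(num_heads):
        Q, K, V = x @ W_q[h], x @ W_k[h], x @ W_v[h]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq, seq) per head
        heads.append(weights @ V)                   # (seq, d_k)

    return np.concatenate(heads, axis=-1) @ W_o    # back to (seq, d_model)

rng = np.random.default_rng(0)
out = multi_head_attention_np(rng.normal(size=(5, 32)), num_heads=4, rng=rng)
```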

Efficient Attention Variants

  • Flash Attention - IO-aware exact attention, 2-4x faster via tiling
  • Multi-Query Attention (MQA) - Shared K/V across heads, faster inference
  • Grouped-Query Attention (GQA) - Groups of heads share K/V (used in Llama 2/3)
  • Sliding Window Attention - Each token attends to local window (Mistral)
  • KV Cache - Cache K/V from previous tokens for faster autoregressive generation
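The KV-cache bullet can be made concrete: during autoregressive decoding, each new token's K/V are computed once and appended, so each step attends over the cache instead of recomputing every key and value (NumPy sketch, random untrained projections, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache, outputs = [], [], []
for step in range(6):                      # autoregressive decoding loop
    x_t = rng.normal(size=(d,))            # embedding of the newest token
    k_cache.append(x_t @ W_k)              # K/V computed once per token...
    v_cache.append(x_t @ W_v)              # ...then reused at every later step
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ (x_t @ W_q) / np.sqrt(d)  # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    outputs.append(w @ V)
# Each step costs O(step * d) instead of re-deriving all K/V from scratch.
```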

Tokenization & BPE

Tokenization converts raw text into tokens (subword units) that models can process. The tokenizer determines the model's vocabulary and directly impacts performance.

Tokenization Methods

Method                   | Used By               | Key Idea
BPE (Byte-Pair Encoding) | GPT, Llama            | Iteratively merge the most frequent adjacent pairs
WordPiece                | BERT                  | Like BPE but uses likelihood-based merging
SentencePiece            | T5, mBART             | Language-agnostic; works on raw text, no pre-tokenization
Tiktoken                 | GPT-4 (OpenAI models) | Fast BPE implementation with byte-level fallback

BPE Algorithm

def byte_pair_encoding(corpus, vocab_size):
    """Build BPE vocabulary from corpus."""
    # Start with character-level vocabulary
    vocab = set(char for word in corpus for char in word)
    tokens = [[char for char in word] for word in corpus]

    while len(vocab) < vocab_size:
        # Count all adjacent pairs
        pair_counts = {}
        for word_tokens in tokens:
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i+1])
                pair_counts[pair] = pair_counts.get(pair, 0) + 1

        if not pair_counts:
            break

        # Find most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        merged = best_pair[0] + best_pair[1]
        vocab.add(merged)

        # Merge this pair everywhere
        for j, word_tokens in enumerate(tokens):
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if (i < len(word_tokens) - 1 and
                    word_tokens[i] == best_pair[0] and
                    word_tokens[i+1] == best_pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            tokens[j] = new_tokens

    return vocab, tokens
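Training learns the vocabulary; encoding new text then replays the merges in the order they were learned. A minimal sketch (it assumes an ordered merge list was recorded during training, which the builder above does not return):

```python
def apply_bpe(word, merges):
    """Encode a word by replaying BPE merges in learned order."""
    tokens = list(word)
    for a, b in merges:                  # merges: ordered list of pairs
        i, out = 0, []
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)        # merge the pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))   # ['low', 'er']
```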

Fine-tuning & LoRA/PEFT

Agent Instruction
For adapting LLMs: use LoRA (rank 8-64) on attention layers. QLoRA adds 4-bit quantization for consumer GPUs. Full fine-tuning only with large GPU clusters.

Fine-tuning Approaches

Method           | Parameters Trained | GPU Memory | Best For
Full Fine-tuning | 100%               | Very High  | Enough data + compute, maximum quality
LoRA             | 0.1-1%             | Low        | Most practical fine-tuning tasks
QLoRA            | 0.1-1%             | Very Low   | Consumer GPUs (single 24GB GPU)
Prefix Tuning    | <0.1%              | Minimal    | Lightweight task adaptation
Prompt Tuning    | <0.01%             | Minimal    | Simple task steering

LoRA (Low-Rank Adaptation)

import numpy as np

class LoRALayer:
    """
    Instead of updating the full weight matrix W (d x d), learn a
    low-rank update delta_W = A @ B, where A is (d x r) and B is
    (r x d) with r << d. Only A and B are trained.
    """
    def __init__(self, d_model, rank=8, alpha=16):
        # A gets a small random init; B starts at zero so that
        # delta_W = A @ B = 0 at the start of training.
        self.A = np.random.normal(scale=0.01, size=(d_model, rank))
        self.B = np.zeros((rank, d_model))
        self.scaling = alpha / rank

    def forward(self, x, original_weight):
        # Frozen base output + scaled low-rank adaptation
        base_output = x @ original_weight
        lora_output = (x @ self.A @ self.B) * self.scaling
        return base_output + lora_output
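The parameter savings are easy to quantify: for a d x d weight with d = 4096 and rank r = 8, LoRA trains 2*d*r parameters instead of d*d:

```python
d, r = 4096, 8
full = d * d            # parameters updated by full fine-tuning of one matrix
lora = 2 * d * r        # A (d x r) plus B (r x d)
print(full, lora, full / lora)   # 16777216 65536 256.0
```

That is roughly 0.4% of the original matrix, consistent with the 0.1-1% range in the table above.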

Key Decisions

  • Rank (r) - 4-64 typical. Higher = more capacity but more parameters. Start with 8.
  • Alpha - Scaling factor. Common: alpha = 2*rank
  • Target Modules - Apply to attention (q, k, v, o) layers. Optionally MLP layers too.
  • Learning Rate - 1e-4 to 3e-4 typical for LoRA (higher than full fine-tuning)

RAG (Retrieval-Augmented Generation)

Agent Instruction
RAG is the standard pattern for grounding LLM responses in external knowledge. Key components: embedding model, vector store, retrieval, prompt construction. Use when accuracy and freshness matter more than speed.

RAG combines information retrieval with text generation. Instead of relying solely on the model's training data, it retrieves relevant documents and includes them in the prompt context.

RAG Pipeline

# Simplified RAG pipeline
class RAGPipeline:
    def __init__(self, embedding_model, vector_store, llm):
        self.embedder = embedding_model
        self.store = vector_store
        self.llm = llm

    def index(self, documents):
        """Index documents into the vector store."""
        for doc in documents:
            # chunk() splits a doc into overlapping windows (helper not shown)
            chunks = self.chunk(doc, size=512, overlap=50)
            for chunk in chunks:
                embedding = self.embedder.encode(chunk)
                self.store.add(embedding, chunk)

    def query(self, question, top_k=5):
        """Answer question using retrieved context."""
        # 1. Embed the question
        q_embedding = self.embedder.encode(question)

        # 2. Retrieve relevant chunks
        results = self.store.search(q_embedding, top_k=top_k)

        # 3. Build prompt with context
        context = "\n\n".join([r.text for r in results])
        prompt = f"""Answer based on the following context:

{context}

Question: {question}
Answer:"""

        # 4. Generate answer
        return self.llm.generate(prompt)

RAG Variants

  • Naive RAG - Simple retrieve-then-generate pipeline
  • Self-RAG - Model decides when to retrieve and self-reflects on relevance
  • CRAG (Corrective RAG) - Evaluates retrieval quality, falls back to web search
  • GraphRAG - Uses knowledge graphs for structured retrieval
  • Agentic RAG - Agent decides retrieval strategy, query reformulation, multi-hop

Vector Search Methods

Method               | Speed      | Accuracy  | Memory
Flat (Exact)         | Slow, O(n) | Perfect   | O(n*d)
IVF (Inverted File)  | Fast       | High      | O(n*d)
HNSW                 | Very Fast  | Very High | O(n*d*M)
Product Quantization | Fast       | Good      | Low, O(n*m)
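The Flat (exact) method is just a cosine-similarity scan over every stored vector; a minimal NumPy sketch on synthetic data:

```python
import numpy as np

def flat_search(query, vectors, top_k=3):
    """Exact nearest neighbours by cosine similarity: O(n*d) per query."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                         # cosine similarity to every vector
    idx = np.argsort(-sims)[:top_k]      # indices of the top-k most similar
    return idx, sims[idx]

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))                 # 100 stored embeddings
query = docs[42] + 0.01 * rng.normal(size=64)     # near-duplicate of doc 42
idx, sims = flat_search(query, docs)
print(idx[0])   # 42
```

The approximate methods in the table (IVF, HNSW, PQ) trade a little recall for avoiding this full scan.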

Knowledge Distillation

Transfer knowledge from a large "teacher" model to a smaller "student" model. The student learns from the teacher's soft probability outputs (dark knowledge) rather than hard labels.

How It Works

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """
    T: temperature (higher = softer distributions)
    alpha: weight on the soft (distillation) loss
    """
    # Soft loss: KL(teacher || student) between temperature-softened
    # distributions; the T^2 factor keeps gradient magnitudes comparable
    soft_student = softmax(student_logits / T)
    soft_teacher = softmax(teacher_logits / T)
    soft_loss = np.mean(np.sum(
        soft_teacher * np.log(soft_teacher / soft_student), axis=-1)) * (T * T)

    # Hard loss: standard cross-entropy with the true labels
    probs = softmax(student_logits)
    hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

    return alpha * soft_loss + (1 - alpha) * hard_loss

Applications

  • Deploy large model quality on mobile/edge devices
  • Reduce inference cost (smaller model = cheaper)
  • Model compression pipelines

Contrastive Learning (SimCLR)

Self-supervised learning that creates representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space.

SimCLR Framework

  • Create two augmented views of each image
  • Encode both views with shared encoder
  • Maximize agreement between positive pairs (same image) via NT-Xent loss
  • Negative pairs are all other images in the batch

import numpy as np

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Normalized Temperature-scaled Cross Entropy (NT-Xent) loss."""
    n = len(z_i)
    z = np.concatenate([z_i, z_j])                    # 2N embeddings
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature                       # (2N, 2N) cosine sims

    loss = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)        # the other view of the same image
        # Denominator: every pair except (i, i) itself, positive included
        others = np.concatenate([sim[i, :i], sim[i, i + 1:]])
        loss += -sim[i, pos] + np.log(np.sum(np.exp(others)))

    return loss / (2 * n)

MLOps & Production ML

Practices for deploying, monitoring, and maintaining ML models in production.

Key Components

  • Feature Store - Centralized feature management ensuring train-serve consistency
  • Model Registry - Version control for models with metadata and lineage
  • Model Serving - REST APIs, batch inference, streaming inference
  • Monitoring - Track prediction drift, data drift, model performance
  • A/B Testing - Compare models in production with statistical rigor
  • CI/CD for ML - Automated training, validation, and deployment pipelines

Data & Model Drift Detection

import math

def population_stability_index(expected, actual):
    """PSI measures distribution shift between training and serving data.

    expected / actual: per-bin counts over the same bin edges
    (e.g. histograms of a feature at training vs serving time).
    """
    e_total, a_total = sum(expected), sum(actual)
    psi = 0.0
    for e, a in zip(expected, actual):
        e_pct, a_pct = e / e_total, a / a_total
        if e_pct > 0 and a_pct > 0:
            psi += (a_pct - e_pct) * math.log(a_pct / e_pct)

    # Rule of thumb: PSI < 0.1: no shift, 0.1-0.2: moderate, > 0.2: significant
    return psi

Deployment Strategies

  • Canary Release - Route small % of traffic to new model, monitor metrics
  • Blue-Green - Two identical environments, switch traffic instantly
  • Shadow Mode - New model runs alongside old, compare outputs without serving
  • Feature Flags - Enable/disable model features without deployment

Model Evaluation & Metrics

Classification Metrics

Metric    | Formula               | When to Use
Accuracy  | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only
Precision | TP/(TP+FP)            | Cost of false positives is high (spam)
Recall    | TP/(TP+FN)            | Cost of false negatives is high (disease)
F1 Score  | 2*P*R/(P+R)           | Balance precision and recall
AUC-ROC   | Area under ROC curve  | Compare models, threshold-independent
Log Loss  | -avg(y*log(p))        | When probability calibration matters
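The formulas above computed directly from confusion-matrix counts (a minimal sketch with made-up counts):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the table's formulas from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
# accuracy 0.85, precision ~0.889, recall 0.8, f1 ~0.842
```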

Regression Metrics

Metric    | Properties
MAE       | Robust to outliers, interpretable in original units
RMSE      | Penalizes large errors more, same units as target
R-squared | Proportion of variance explained (1 = perfect; negative for fits worse than the mean)
MAPE      | Percentage error, scale-independent (undefined when targets are 0)

Retrieval Metrics

Metric                     | What It Measures
MRR (Mean Reciprocal Rank) | Rank of the first relevant result
NDCG                       | Ranking quality with graded relevance
Recall@K                   | Fraction of relevant items in the top-K
Hit@K                      | Whether any relevant item appears in the top-K
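MRR and Recall@K are short enough to sketch directly (1-based ranks; `d1`, `d5`, etc. are made-up document ids):

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

queries = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
relevant = [{"d1"}, {"d5", "d9"}]
print(mrr(queries, relevant))                       # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(queries[1], relevant[1], k=2))    # 1 of 2 relevant -> 0.5
```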

Prompt Engineering

The practice of designing and optimizing input prompts to get desired outputs from LLMs without modifying model weights.

Key Techniques

Technique              | Description                                      | When to Use
Zero-shot              | Direct instruction with no examples              | Simple, well-defined tasks
Few-shot               | Provide input-output examples in the prompt      | Tasks with subtle formatting or conventions
Chain-of-thought (CoT) | Ask the model to reason step by step             | Complex reasoning, math, logic
Tree-of-thought        | Explore multiple reasoning paths                 | Problems with multiple valid approaches
ReAct                  | Interleave reasoning and tool-use actions        | Agentic workflows, multi-step tasks
Self-consistency       | Generate multiple CoT paths, take majority vote  | Improving reliability on hard problems
Reflection             | Ask the model to critique/improve its own output | Quality refinement
Prompt chaining        | Break complex tasks into sequential prompts      | Multi-step workflows

Best Practices

# 1. Be specific - state task, context, format, tone
prompt = """You are a senior data scientist. Given the following dataset
description, recommend the top 3 ML algorithms with reasons.

Dataset: {description}
Constraints: {constraints}

Output format: JSON array with fields: algorithm, reason, complexity"""

# 2. Few-shot example pattern
prompt = """Classify the sentiment of each review.

Review: "Great product, fast shipping!" -> positive
Review: "Terrible quality, broke in a day" -> negative
Review: "{user_input}" ->"""

# 3. Chain-of-thought
prompt = """Solve step by step:
Q: If a store has 240 items and sells 30% on Monday,
   then 50% of the remainder on Tuesday, how many are left?
A: Let me think step by step...
   Monday: 240 * 0.30 = 72 sold, 240 - 72 = 168 remain
   Tuesday: 168 * 0.50 = 84 sold, 168 - 84 = 84 remain
   Answer: 84 items"""

Common Mistakes

  • Over-engineering prompts (longer is not always better)
  • Assuming the model will infer intent — be explicit
  • Using every technique at once instead of selecting what fits
  • Ignoring output formatting constraints (JSON, markdown, etc.)

RAG vs Fine-tuning vs Prompt Engineering

  • Prompt Engineering - Fast iteration, changing policies. Limited by context window.
  • RAG - Factual accuracy, dynamic knowledge, auditability. Needs retrieval quality.
  • Fine-tuning - Consistent style/format, lower inference cost at scale. Expensive to train.

2025 Trends

  • Context Engineering - broader discipline encompassing prompt engineering
  • Auto-prompting - automated prompt optimization using model feedback
  • Multi-agent Chaining - coordinating prompts across specialized agents
  • Prompt Versioning - tracking prompt versions and performance at scale

Quantization

Reducing the numerical precision of model weights and activations (e.g., FP32 to INT8 or INT4) to decrease model size, memory usage, and inference time.

Impact
FP32 → INT8 = 4x smaller model. FP32 → INT4 = 8x smaller. Unlocks LLM inference on consumer GPUs and edge devices.

Quantization Types

Method                            | Description                                  | Accuracy Loss | Use Case
PTQ (Post-Training)               | Applied after training, no retraining needed | Moderate      | Quick deployment
QAT (Quantization-Aware Training) | Simulates quantization during training       | Minimal       | Best accuracy
GPTQ                              | Weight-only quantization for LLMs (3-4 bit)  | Low           | LLM deployment
AWQ                               | Protects salient weights during quantization | Low           | LLM deployment
GGUF/GGML                         | Formats for llama.cpp CPU/local inference    | Varies        | Local/CPU inference

Implementation Example

# Post-training quantization with PyTorch
import torch

# Dynamic quantization (weights only, activations quantized at runtime)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize Linear layers
    dtype=torch.qint8
)

# Static quantization (both weights and activations)
model.qconfig = torch.quantization.get_default_qconfig('x86')
torch.quantization.prepare(model, inplace=True)
# Run calibration data through model...
torch.quantization.convert(model, inplace=True)

# Using bitsandbytes for LLM quantization (4-bit)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config
)

Precision Comparison

Precision   | Bits | Model Size (7B params) | Typical Use
FP32        | 32   | ~28 GB                 | Training (full precision)
FP16 / BF16 | 16   | ~14 GB                 | Training (mixed precision)
INT8        | 8    | ~7 GB                  | Inference (good quality)
INT4 / NF4  | 4    | ~3.5 GB                | Inference (consumer GPU)
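The sizes in the table follow directly from params * bits / 8 bytes (7e9 parameters, ignoring activations and runtime overhead):

```python
params = 7e9
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    size_gb = params * bits / 8 / 1e9   # bits -> bytes -> GB
    print(f"{name}: {size_gb:.1f} GB")  # 28.0, 14.0, 7.0, 3.5
```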