Attention Mechanisms
Attention allows a model to dynamically focus on relevant parts of the input when producing each output element. It computes a weighted sum of values, where weights are determined by query-key compatibility.
Types of Attention
- Self-Attention - Q, K, V all come from the same sequence. Each token attends to all others.
- Cross-Attention - Q from decoder, K/V from encoder. Used in translation, image captioning.
- Causal (Masked) Attention - Each position can only attend to previous positions. Used in GPT-style decoders.
- Multi-Head Attention - Run h parallel attention heads with different projections, concatenate results.
Scaled Dot-Product Attention
import math

def softmax(row):
    # Numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """
    Q: (seq_len, d_k) - Queries
    K: (seq_len, d_k) - Keys
    V: (seq_len, d_v) - Values
    causal: if True, mask out future positions (decoder-style)
    """
    d_k = len(Q[0])
    # Step 1: Compute scaled attention scores, QK^T / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
               for j in range(len(K))]
              for i in range(len(Q))]
    # Step 2: Apply causal mask (for decoder)
    if causal:
        for i in range(len(scores)):
            for j in range(len(scores[i])):
                if j > i:
                    scores[i][j] = float('-inf')
    # Step 3: Softmax each row of scores into attention weights
    weights = [softmax(row) for row in scores]
    # Step 4: Weighted sum of values
    output = [[sum(weights[i][j] * V[j][k] for j in range(len(V)))
               for k in range(len(V[0]))]
              for i in range(len(weights))]
    return output, weights
Multi-Head Attention
def multi_head_attention(Q, K, V, num_heads, d_model):
    # Pseudocode: linear() and the W_q/W_k/W_v/W_o matrices stand in
    # for learned projection weights.
    d_k = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Project Q, K, V with this head's learned weight matrices
        Q_h = linear(Q, W_q[h])  # (seq, d_k)
        K_h = linear(K, W_k[h])
        V_h = linear(V, W_v[h])
        # Compute scaled dot-product attention for this head
        head_output, _ = attention(Q_h, K_h, V_h)
        heads.append(head_output)
    # Concatenate all heads along the feature dim and project back to d_model
    concat = concatenate(heads, dim=-1)  # (seq, d_model)
    output = linear(concat, W_o)
    return output
Efficient Attention Variants
- Flash Attention - IO-aware exact attention, 2-4x faster via tiling
- Multi-Query Attention (MQA) - Shared K/V across heads, faster inference
- Grouped-Query Attention (GQA) - Groups of heads share K/V (used in Llama 2/3)
- Sliding Window Attention - Each token attends to local window (Mistral)
- KV Cache - Cache K/V from previous tokens for faster autoregressive generation
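The KV-cache idea can be sketched in plain Python. This is a hypothetical minimal decoder step, not any library's API: keys and values for past tokens are stored once, so each new token computes attention only against the cache instead of re-encoding the whole prefix.

```python
import math

def decode_step(q, k, v, kv_cache):
    """One autoregressive step: append this token's K/V to the cache,
    then attend the new query against all cached keys/values."""
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    d_k = len(q)
    # Scores against every cached key (past tokens + current one)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_k)
              for key in kv_cache["k"]]
    # Softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of cached values
    return [sum(w * val[j] for w, val in zip(weights, kv_cache["v"]))
            for j in range(len(v))]

cache = {"k": [], "v": []}
out1 = decode_step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], cache)
out2 = decode_step([0.0, 1.0], [0.0, 1.0], [0.0, 3.0], cache)
```

Because causality means past tokens never attend to future ones, cached K/V entries never change; only one new row is computed per generated token.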
Tokenization & BPE
Tokenization converts raw text into tokens (subword units) that models can process. The tokenizer determines the model's vocabulary and directly impacts performance.
Tokenization Methods
| Method | Used By | Key Idea |
|---|---|---|
| BPE (Byte-Pair Encoding) | GPT, Llama | Iteratively merge most frequent character pairs |
| WordPiece | BERT | Like BPE but uses likelihood-based merging |
| SentencePiece | T5, mBART | Language-agnostic; tokenizes raw text with no whitespace pre-tokenization |
| tiktoken | GPT-3.5 / GPT-4 (OpenAI) | Fast byte-level BPE implementation |
BPE Algorithm
def byte_pair_encoding(corpus, vocab_size):
    """Build a BPE vocabulary from a corpus (list of words)."""
    # Start with a character-level vocabulary
    vocab = set(char for word in corpus for char in word)
    tokens = [[char for char in word] for word in corpus]
    while len(vocab) < vocab_size:
        # Count all adjacent token pairs across the corpus
        pair_counts = {}
        for word_tokens in tokens:
            for i in range(len(word_tokens) - 1):
                pair = (word_tokens[i], word_tokens[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
        if not pair_counts:
            break
        # Find the most frequent pair and add its merge to the vocabulary
        best_pair = max(pair_counts, key=pair_counts.get)
        merged = best_pair[0] + best_pair[1]
        vocab.add(merged)
        # Merge this pair everywhere it occurs
        for j, word_tokens in enumerate(tokens):
            new_tokens = []
            i = 0
            while i < len(word_tokens):
                if (i < len(word_tokens) - 1 and
                        word_tokens[i] == best_pair[0] and
                        word_tokens[i + 1] == best_pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(word_tokens[i])
                    i += 1
            tokens[j] = new_tokens
    return vocab, tokens
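Training produces an ordered list of merges; encoding new text replays those merges in priority order. A minimal sketch (the `apply_bpe` helper and the toy merge list are illustrative, not part of any library):

```python
def apply_bpe(word, merges):
    """Encode a word by replaying learned merges in order."""
    tokens = list(word)
    for a, b in merges:  # merges ordered by learning priority
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge in place, recheck position i
            else:
                i += 1
    return tokens

merges = [("l", "o"), ("lo", "w")]
print(apply_bpe("lower", merges))  # ['low', 'e', 'r']
```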
Fine-tuning & LoRA/PEFT
Fine-tuning Approaches
| Method | Parameters Trained | GPU Memory | Best For |
|---|---|---|---|
| Full Fine-tuning | 100% | Very High | Maximum quality when data and compute are plentiful |
| LoRA | 0.1-1% | Low | Most practical fine-tuning tasks |
| QLoRA | 0.1-1% | Very Low | Consumer GPUs (single 24GB GPU) |
| Prefix Tuning | <0.1% | Minimal | Lightweight task adaptation |
| Prompt Tuning | <0.01% | Minimal | Simple task steering |
LoRA (Low-Rank Adaptation)
class LoRALayer:
    """
    Instead of updating the full weight matrix W (d x d),
    learn a low-rank update: delta_W = A @ B,
    where A is (d x r) and B is (r x d), with r << d.
    (Pseudocode: random_normal and zeros stand in for tensor initializers.)
    """
    def __init__(self, d_model, rank=8, alpha=16):
        self.A = random_normal(d_model, rank)  # small random init
        self.B = zeros(rank, d_model)  # zero init, so delta_W starts at zero
        self.scaling = alpha / rank  # applied once, in forward
    def forward(self, x, original_weight):
        # Frozen base output + scaled low-rank adaptation
        base_output = x @ original_weight
        lora_output = (x @ self.A @ self.B) * self.scaling
        return base_output + lora_output
Key Decisions
- Rank (r) - 4-64 typical. Higher = more capacity but more parameters. Start with 8.
- Alpha - Scaling factor. Common: alpha = 2*rank
- Target Modules - Apply to attention (q, k, v, o) layers. Optionally MLP layers too.
- Learning Rate - 1e-4 to 3e-4 typical for LoRA (higher than full fine-tuning)
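A quick back-of-envelope check of why LoRA is cheap: for one d x d weight at rank r, the adapter trains d*r + r*d parameters instead of d*d. The numbers below (d_model=4096, rank=8) are illustrative:

```python
d_model, rank = 4096, 8
full_params = d_model * d_model                 # full fine-tune of one matrix
lora_params = d_model * rank + rank * d_model   # A (d x r) plus B (r x d)
print(full_params, lora_params, lora_params / full_params)
# prints: 16777216 65536 0.00390625  (~0.4% of the full matrix)
```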
RAG (Retrieval-Augmented Generation)
RAG combines information retrieval with text generation. Instead of relying solely on the model's training data, it retrieves relevant documents and includes them in the prompt context.
RAG Pipeline
# Simplified RAG pipeline
class RAGPipeline:
    def __init__(self, embedding_model, vector_store, llm):
        self.embedder = embedding_model
        self.store = vector_store
        self.llm = llm

    def index(self, documents):
        """Index documents into the vector store."""
        for doc in documents:
            chunks = self.chunk(doc, size=512, overlap=50)
            for chunk in chunks:
                embedding = self.embedder.encode(chunk)
                self.store.add(embedding, chunk)

    def query(self, question, top_k=5):
        """Answer a question using retrieved context."""
        # 1. Embed the question
        q_embedding = self.embedder.encode(question)
        # 2. Retrieve relevant chunks
        results = self.store.search(q_embedding, top_k=top_k)
        # 3. Build prompt with context
        context = "\n\n".join(r.text for r in results)
        prompt = f"""Answer based on the following context:
{context}

Question: {question}
Answer:"""
        # 4. Generate answer
        return self.llm.generate(prompt)
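The `chunk` helper used by `index` is not defined above; a minimal character-window sketch under that assumption (real pipelines usually chunk by tokens or sentences rather than characters):

```python
def chunk(text, size=512, overlap=50):
    """Split text into overlapping fixed-size character windows."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

print(chunk("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.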
RAG Variants
- Naive RAG - Simple retrieve-then-generate pipeline
- Self-RAG - Model decides when to retrieve and self-reflects on relevance
- CRAG (Corrective RAG) - Evaluates retrieval quality, falls back to web search
- GraphRAG - Uses knowledge graphs for structured retrieval
- Agentic RAG - Agent decides retrieval strategy, query reformulation, multi-hop
Vector Search Methods
| Method | Speed | Accuracy | Memory |
|---|---|---|---|
| Flat (Exact) | Slow O(n) | Perfect | O(n*d) |
| IVF (Inverted File) | Fast | High | O(n*d) |
| HNSW | Very Fast | Very High | O(n*d*M) |
| Product Quantization | Fast | Good | Low O(n*m) |
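Flat (exact) search from the table is just a brute-force scan; a pure-Python sketch with cosine similarity (illustrative, not any specific library):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flat_search(query, vectors, top_k=2):
    """Exact nearest-neighbor search: score every vector, O(n*d)."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(flat_search([1.0, 0.0], vectors, top_k=2))  # [0, 2]
```

IVF, HNSW, and PQ all trade some of this exactness for sublinear search or lower memory.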
Knowledge Distillation
Transfer knowledge from a large "teacher" model to a smaller "student" model. The student learns from the teacher's soft probability outputs (dark knowledge) rather than hard labels.
How It Works
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """
    T: Temperature (higher = softer distributions)
    alpha: Weight between soft (distillation) and hard (label) loss
    (Pseudocode helpers: softmax, kl_divergence, cross_entropy.)
    """
    # Soft loss: KL(teacher || student) between temperature-softened
    # distributions; the T^2 factor keeps gradient magnitudes comparable
    # across temperatures
    soft_student = softmax(student_logits / T)
    soft_teacher = softmax(teacher_logits / T)
    soft_loss = kl_divergence(soft_teacher, soft_student) * (T * T)
    # Hard loss: standard cross-entropy with the true labels
    hard_loss = cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
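The effect of temperature is easy to see numerically: dividing logits by T > 1 flattens the distribution, exposing the teacher's relative preferences among the non-top classes (the "dark knowledge"). A small sketch with made-up logits:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [6.0, 2.0, 1.0]
sharp = softmax(logits, T=1.0)  # nearly one-hot
soft = softmax(logits, T=4.0)   # softer: wrong classes get visible mass
```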
Applications
- Deploy large model quality on mobile/edge devices
- Reduce inference cost (smaller model = cheaper)
- Model compression pipelines
Contrastive Learning (SimCLR)
Self-supervised learning that creates representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space.
SimCLR Framework
- Create two augmented views of each image
- Encode both views with shared encoder
- Maximize agreement between positive pairs (same image) via NT-Xent loss
- Negative pairs are all other images in the batch
import math

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Normalized Temperature-scaled Cross Entropy loss.
    (Pseudocode helpers: concatenate, cosine_similarity_matrix.)"""
    batch_size = len(z_i)
    # Cosine similarity between all pairs of the 2N embeddings
    z = concatenate(z_i, z_j)          # 2N embeddings
    sim = cosine_similarity_matrix(z)  # (2N, 2N)
    # Positive pairs: (i, i+N) and (i+N, i);
    # the denominator sums over all 2N-1 other samples
    loss = 0.0
    for i in range(2 * batch_size):
        pos = (i + batch_size) % (2 * batch_size)
        pos_sim = sim[i][pos] / temperature
        other_sims = [sim[i][j] / temperature
                      for j in range(2 * batch_size) if j != i]
        loss -= pos_sim - math.log(sum(math.exp(s) for s in other_sims))
    return loss / (2 * batch_size)
MLOps & Production ML
Practices for deploying, monitoring, and maintaining ML models in production.
Key Components
- Feature Store - Centralized feature management ensuring train-serve consistency
- Model Registry - Version control for models with metadata and lineage
- Model Serving - REST APIs, batch inference, streaming inference
- Monitoring - Track prediction drift, data drift, model performance
- A/B Testing - Compare models in production with statistical rigor
- CI/CD for ML - Automated training, validation, and deployment pipelines
Data & Model Drift Detection
import math

def population_stability_index(expected, actual, bins=10):
    """PSI measures distribution shift between training and serving data.
    expected/actual are per-bin counts from the two distributions."""
    psi = 0.0
    for i in range(bins):
        e_pct = expected[i] / sum(expected)
        a_pct = actual[i] / sum(actual)
        if e_pct > 0 and a_pct > 0:
            psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    # PSI < 0.1: no shift; 0.1-0.2: moderate; > 0.2: significant
    return psi
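The PSI function consumes per-bin counts; a sketch of producing those from raw feature values with equal-width bins (toy numbers, illustrative only):

```python
def histogram(values, bins, lo, hi):
    """Bin raw values into equal-width counts suitable for PSI."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp top edge
        counts[idx] += 1
    return counts

train_scores = [0.1, 0.2, 0.2, 0.4, 0.5, 0.7]
print(histogram(train_scores, bins=5, lo=0.0, hi=1.0))  # [1, 2, 2, 1, 0]
```

In practice the bin edges are fixed from the training distribution (often by quantiles) and reused when binning serving data.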
Deployment Strategies
- Canary Release - Route small % of traffic to new model, monitor metrics
- Blue-Green - Two identical environments, switch traffic instantly
- Shadow Mode - New model runs alongside old, compare outputs without serving
- Feature Flags - Enable/disable model features without deployment
Model Evaluation & Metrics
Classification Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only |
| Precision | TP/(TP+FP) | Cost of false positives is high (spam) |
| Recall | TP/(TP+FN) | Cost of false negatives is high (disease) |
| F1 Score | 2*P*R/(P+R) | Balance precision and recall |
| AUC-ROC | Area under ROC curve | Compare models, threshold-independent |
| Log Loss | -avg(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters |
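These all follow directly from the confusion-matrix counts; a quick check with made-up counts:

```python
TP, FP, FN, TN = 40, 10, 20, 30
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.7
precision = TP / (TP + FP)                           # 0.8
recall = TP / (TP + FN)                              # ~0.667
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(f1, 3))  # 0.727
```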
Regression Metrics
| Metric | Properties |
|---|---|
| MAE | Robust to outliers, interpretable in original units |
| RMSE | Penalizes large errors more, same units as target |
| R-squared | Proportion of variance explained (0-1) |
| MAPE | Percentage error, scale-independent |
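A toy example with one large miss shows why RMSE penalizes outliers more than MAE:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 1.0]  # one large error
errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)             # 2.0
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # 4.0
```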
Retrieval Metrics
| Metric | What It Measures |
|---|---|
| MRR (Mean Reciprocal Rank) | Rank of first relevant result |
| NDCG | Quality of ranking with graded relevance |
| Recall@K | Fraction of relevant items in top-K |
| Hit@K | Whether any relevant item appears in top-K |
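A minimal sketch of MRR and Recall@K over ranked result lists (the document IDs and relevance sets below are toy data):

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs appearing in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

ranked_lists = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant_sets = [{"d1"}, {"d9", "d4"}]
print(mrr(ranked_lists, relevant_sets))                     # 0.5
print(recall_at_k(["d2", "d9", "d4"], {"d9", "d4"}, k=2))   # 0.5
```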
Prompt Engineering
The practice of designing and optimizing input prompts to get desired outputs from LLMs without modifying model weights.
Key Techniques
| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | Direct instruction with no examples | Simple, well-defined tasks |
| Few-shot | Provide input-output examples in the prompt | Tasks with subtle formatting or conventions |
| Chain-of-thought (CoT) | Ask the model to reason step by step | Complex reasoning, math, logic |
| Tree-of-thought | Explore multiple reasoning paths | Problems with multiple valid approaches |
| ReAct | Interleave reasoning and tool-use actions | Agentic workflows, multi-step tasks |
| Self-consistency | Generate multiple CoT paths, take majority vote | Improving reliability on hard problems |
| Reflection | Ask the model to critique/improve its own output | Quality refinement |
| Prompt chaining | Break complex tasks into sequential prompts | Multi-step workflows |
Best Practices
# 1. Be specific - state task, context, format, tone
prompt = """You are a senior data scientist. Given the following dataset
description, recommend the top 3 ML algorithms with reasons.
Dataset: {description}
Constraints: {constraints}
Output format: JSON array with fields: algorithm, reason, complexity"""
# 2. Few-shot example pattern
prompt = """Classify the sentiment of each review.
Review: "Great product, fast shipping!" -> positive
Review: "Terrible quality, broke in a day" -> negative
Review: "{user_input}" ->"""
# 3. Chain-of-thought
prompt = """Solve step by step:
Q: If a store has 240 items and sells 30% on Monday,
then 50% of the remainder on Tuesday, how many are left?
A: Let me think step by step...
Monday: 240 * 0.30 = 72 sold, 240 - 72 = 168 remain
Tuesday: 168 * 0.50 = 84 sold, 168 - 84 = 84 remain
Answer: 84 items"""
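Self-consistency from the table above can be sketched as sampling several chain-of-thought answers and majority-voting; `sample_answer` is a hypothetical stand-in for a sampled LLM call:

```python
from collections import Counter

def self_consistency(sample_answer, question, n=5):
    """Sample n chain-of-thought answers and return the majority vote."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stubbed model: usually right, occasionally wrong
samples = iter(["84", "84", "72", "84", "84"])
answer = self_consistency(lambda q: next(samples), "items left?", n=5)
print(answer)  # 84
```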
Common Mistakes
- Over-engineering prompts (longer is not always better)
- Assuming the model will infer intent — be explicit
- Using every technique at once instead of selecting what fits
- Ignoring output formatting constraints (JSON, markdown, etc.)
RAG vs Fine-tuning
- RAG - Factual accuracy, dynamic knowledge, auditability. Depends on retrieval quality.
- Fine-tuning - Consistent style/format, lower inference cost at scale. Expensive to train.
2025 Trends
- Context Engineering - broader discipline encompassing prompt engineering
- Auto-prompting - automated prompt optimization using model feedback
- Multi-agent Chaining - coordinating prompts across specialized agents
- Prompt Versioning - tracking prompt versions and performance at scale
Quantization
Reducing the numerical precision of model weights and activations (e.g., FP32 to INT8 or INT4) to decrease model size, memory usage, and inference time.
Quantization Types
| Method | Description | Accuracy Loss | Use Case |
|---|---|---|---|
| PTQ (Post-Training) | Applied after training, no retraining needed | Moderate | Quick deployment |
| QAT (Quantization-Aware Training) | Simulates quantization during training | Minimal | Best accuracy |
| GPTQ | Weight-only quantization for LLMs (3-4 bit) | Low | LLM deployment |
| AWQ | Protects salient weights during quantization | Low | LLM deployment |
| GGUF/GGML | Format for llama.cpp CPU inference | Varies | Local/CPU inference |
Implementation Example
# Post-training quantization with PyTorch
import torch

# Dynamic quantization (weights quantized ahead of time,
# activations quantized on the fly at runtime)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize Linear layers
    dtype=torch.qint8
)

# Static quantization (both weights and activations)
model.qconfig = torch.quantization.get_default_qconfig('x86')
torch.quantization.prepare(model, inplace=True)
# Run calibration data through the model...
torch.quantization.convert(model, inplace=True)

# Using bitsandbytes for LLM quantization (4-bit)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config
)
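The arithmetic at the core of weight quantization is simple; a pure-Python sketch of absmax symmetric INT8 quantization (an illustration of the idea, not a production scheme):

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]  # store these as int8
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int codes and the scale."""
    return [qi * scale for qi in q]

w = [0.4, -0.8, 0.1, 0.25]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Rounding error is bounded by half the scale per weight; GPTQ and AWQ refine which weights get that error, rather than changing this basic mapping.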
Precision Comparison
| Precision | Bits | Model Size (7B params) | Typical Use |
|---|---|---|---|
| FP32 | 32 | ~28 GB | Training (full precision) |
| FP16 / BF16 | 16 | ~14 GB | Training (mixed precision) |
| INT8 | 8 | ~7 GB | Inference (good quality) |
| INT4 / NF4 | 4 | ~3.5 GB | Inference (consumer GPU) |
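The size column is just parameter count times bytes per parameter; a quick sanity check (weights only, ignoring activations and runtime overhead):

```python
def model_size_gb(n_params, bits):
    """Approximate weight-memory footprint in GB."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(bits, round(model_size_gb(7e9, bits), 1))
# 32 -> 28.0, 16 -> 14.0, 8 -> 7.0, 4 -> 3.5
```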