Linear Regression

When to Use
Predicting continuous numeric values when you expect a roughly linear relationship between features and target. Start here for any regression problem as a baseline.

Linear regression fits a line (or hyperplane) to minimize the sum of squared residuals between predicted and actual values. It's the foundation of predictive modeling.

Key Concepts

  • Ordinary Least Squares (OLS) - Closed-form solution: w = (X^T X)^(-1) X^T y
  • Ridge Regression (L2) - Adds an L2 penalty to shrink weights and prevent overfitting: loss + lambda * ||w||_2^2
  • Lasso Regression (L1) - Adds an L1 penalty that drives some weights to exactly zero (feature selection): loss + lambda * ||w||_1
  • Elastic Net - Combines L1 and L2 regularization
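
For small feature counts, the OLS/ridge closed form above can be computed directly; a minimal numpy sketch (the function name and the appended bias column are mine, and note this version also penalizes the intercept when ridge_lambda > 0):

```python
import numpy as np

def ols_closed_form(X, y, ridge_lambda=0.0):
    """Solve w = (X^T X + lambda*I)^(-1) X^T y directly.
    A column of ones is appended so the intercept is learned too."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge_lambda * np.eye(Xb.shape[1])
    # np.linalg.solve is preferred over explicitly inverting A
    return np.linalg.solve(A, Xb.T @ y)  # last entry is the bias
```

With ridge_lambda=0 this reproduces plain OLS; a small positive value stabilizes the solve when X^T X is near-singular.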

Pure Python Implementation

def linear_regression(X, y, lr=0.01, epochs=1000):
    """Gradient descent linear regression from scratch."""
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0

    for _ in range(epochs):
        for i in range(n):
            pred = sum(weights[j] * X[i][j] for j in range(m)) + bias
            error = pred - y[i]
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n

    return weights, bias

Complexity & Trade-offs

  • Time Complexity - O(n*m) per epoch for SGD; O(n*m^2 + m^3) for the OLS closed form (forming X^T X, then solving)
  • Space Complexity - O(m) for weights
  • Pros - Interpretable, fast, good baseline, well-understood theory
  • Cons - Assumes linearity, sensitive to outliers, can't capture non-linear patterns
  • Use When - Linear relationship expected, interpretability needed, small-medium data
  • Don't Use When - Complex non-linear relationships, lots of categorical features

Logistic Regression

When to Use
Binary or multi-class classification when you need probability outputs and interpretable feature importance. Great baseline for classification.

Despite the name, logistic regression is a classification algorithm. It applies the sigmoid function to a linear combination of features to output probabilities.

Key Concepts

  • Sigmoid Function - sigma(z) = 1 / (1 + e^(-z)) maps any real value to (0, 1)
  • Log Loss - Cross-entropy loss: -[y*log(p) + (1-y)*log(1-p)]
  • Decision Boundary - Linear boundary in feature space at p = 0.5
  • Regularization - L1 for sparse features, L2 for multicollinearity
  • Multi-class - One-vs-Rest (OvR) or Softmax for multiple classes
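
The sigmoid and log-loss formulas above transcribe directly (helper names are mine):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy for one sample: -[y*log(p) + (1-y)*log(1-p)]."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
```

A maximally uncertain prediction (p = 0.5) costs log 2 ≈ 0.693 per sample; confident correct predictions cost close to zero.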

Pure Python Implementation

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-500, min(500, z))))

def logistic_regression(X, y, lr=0.1, epochs=500):
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0

    for _ in range(epochs):
        for i in range(n):
            z = sum(weights[j] * X[i][j] for j in range(m)) + bias
            pred = sigmoid(z)
            error = pred - y[i]
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n

    return weights, bias

Complexity & Trade-offs

  • Time - O(n*m) per epoch
  • Pros - Probabilistic output, interpretable coefficients, fast training, good baseline
  • Cons - Assumes linear decision boundary, can't capture complex interactions
  • Use When - Binary classification, need probabilities, interpretability required
  • Don't Use When - Non-linear decision boundaries, image/text data

Stochastic Gradient Descent (SGD)

When to Use
SGD is the optimization algorithm behind most ML models. Understand it to understand how models learn.

SGD updates model parameters using the gradient of the loss function computed on a single sample (or mini-batch) at a time, rather than the full dataset.

Variants

  • Batch GD - Uses entire dataset per update (stable but slow)
  • Mini-batch SGD - Uses small batches (best of both worlds)
  • SGD with Momentum - Accelerates convergence with velocity term
  • Adam - Adaptive learning rates per parameter (most popular for deep learning)
  • AdamW - Adam with decoupled weight decay (recommended for transformers)
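
The momentum and Adam update rules above can be sketched as single-step functions over parameter lists (function names and defaults are mine, not from any specific library):

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: v = beta*v - lr*grad; w += v."""
    v_new = [beta * v - lr * g for v, g in zip(velocity, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first/second moment estimates
    give each parameter its own effective learning rate. t starts at 1."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]      # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

AdamW differs from Adam only in applying weight decay directly to w rather than folding it into the gradient.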

Learning Rate Schedule

# Cosine annealing with warmup
import math

def cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr=0.0):
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

Perceptron

The simplest neural network - a single-layer linear classifier that learns a decision boundary using the perceptron learning rule. Converges only for linearly separable data.

Algorithm

def perceptron(X, y, lr=0.01, epochs=100):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for i in range(len(X)):
            pred = 1 if sum(w * x for w, x in zip(weights, X[i])) + bias > 0 else 0
            error = y[i] - pred
            for j in range(len(weights)):
                weights[j] += lr * error * X[i][j]
            bias += lr * error
    return weights, bias

Decision Trees (CART)

When to Use
When you need a highly interpretable model. Best for understanding feature importance and explaining predictions to non-technical stakeholders.

CART (Classification and Regression Trees) recursively splits data on feature thresholds to create a tree of decisions. Each leaf node represents a prediction.

Splitting Criteria

  • Gini Impurity - 1 - sum(p_i^2) - measures probability of misclassification
  • Entropy / Information Gain - -sum(p_i * log(p_i)) - measures disorder
  • MSE - Mean squared error for regression trees
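
The entropy criterion above transcribes directly (helper name is mine; Gini is implemented in the next section):

```python
import math

def entropy(labels):
    """Shannon entropy -sum(p_i * log2(p_i)) over class frequencies."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

A pure node has entropy 0; a 50/50 binary node has entropy 1 bit, the maximum for two classes.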

Pure Python Implementation

def gini_impurity(groups, classes):
    total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:
            continue
        score = 0.0
        for cls in classes:
            p = sum(1 for row in group if row[-1] == cls) / size
            score += p * p
        gini += (1.0 - score) * (size / total)
    return gini

def best_split(dataset):
    classes = list(set(row[-1] for row in dataset))
    best_feature, best_value, best_score, best_groups = None, None, float('inf'), None
    for col in range(len(dataset[0]) - 1):
        for row in dataset:
            left = [r for r in dataset if r[col] < row[col]]
            right = [r for r in dataset if r[col] >= row[col]]
            gini = gini_impurity([left, right], classes)
            if gini < best_score:
                best_feature, best_value, best_score = col, row[col], gini
                best_groups = (left, right)
    return {'feature': best_feature, 'value': best_value, 'groups': best_groups}

Pruning Strategies

  • Pre-pruning - Set max_depth, min_samples_split, min_samples_leaf
  • Post-pruning - Grow full tree then prune based on validation error
  • Cost-complexity pruning - Balance tree size vs accuracy with alpha parameter

  • Pros - Highly interpretable, handles mixed feature types, no scaling needed
  • Cons - Prone to overfitting, unstable (small changes in data give a different tree), high variance
  • Fix - Use ensemble methods (Random Forest, Gradient Boosting) to reduce variance

Naive Bayes

When to Use
Text classification (spam, sentiment), when features are roughly independent, or when you need a very fast baseline.

Applies Bayes' theorem with the "naive" assumption of feature independence. Surprisingly effective for text classification despite the simplifying assumption.

Variants

  • Gaussian NB - Continuous features assumed normally distributed
  • Multinomial NB - Count features (word counts in text)
  • Bernoulli NB - Binary features (word presence/absence)

  • Pros - Extremely fast, works with small data, good for text, probabilistic
  • Cons - Strong independence assumption, can't learn feature interactions
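
A minimal multinomial NB sketch with Laplace smoothing, assuming documents are already tokenized into word lists (all function names are mine):

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit multinomial NB. docs: lists of tokens. alpha: Laplace smoothing."""
    classes = set(labels)
    n = len(labels)
    priors = {c: math.log(sum(1 for l in labels if l == c) / n) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, l in zip(docs, labels):
        counts[l].update(doc)
    vocab = set(w for doc in docs for w in doc)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        # smoothed log P(word | class)
        loglik[c] = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                     for w in vocab}
    return priors, loglik

def predict_nb(doc, priors, loglik):
    """argmax over classes of log prior + sum of word log-likelihoods."""
    def score(c):
        return priors[c] + sum(loglik[c][w] for w in doc if w in loglik[c])
    return max(priors, key=score)
```

Working in log space avoids underflow from multiplying many small probabilities; unknown words are simply skipped at prediction time.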

K-Nearest Neighbors (KNN)

When to Use
Small datasets where you need a simple, non-parametric classifier. Good for recommendation systems and anomaly detection.

KNN classifies new points by majority vote of the K nearest training examples. It's a "lazy learner" - no training phase, all computation happens at prediction time.

Key Decisions

  • K value - Small K = noisy, large K = smooth but biased. Use cross-validation.
  • Distance metric - Euclidean (default), Manhattan, Minkowski, Cosine
  • Feature scaling - Essential! Features must be on same scale.
  • Weighted voting - Weight closer neighbors more heavily

  • Time - O(1) training, O(n*m) per prediction (slow for large datasets)
  • Pros - Simple, no training, works for any decision boundary shape
  • Cons - Slow prediction, curse of dimensionality, needs feature scaling
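
The whole algorithm fits in a few lines; a Euclidean-distance sketch (function name is mine):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k nearest training points."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

This is the brute-force O(n) scan; real libraries use KD-trees or ball trees to speed up the neighbor search.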

Support Vector Machine (SVM)

When to Use
Binary classification with clear margin of separation. Works well in high-dimensional spaces and when n_features > n_samples.

SVM finds the hyperplane that maximizes the margin between two classes. The "support vectors" are the closest points to the boundary that define it.

Key Concepts

  • Maximum Margin - Find the decision boundary with largest gap between classes
  • Kernel Trick - Map to higher dimensions without computing the transformation: RBF, Polynomial, Sigmoid
  • Soft Margin - Allow some misclassifications with penalty C
  • Hinge Loss - max(0, 1 - y * f(x))

  • Pros - Effective in high dimensions, memory efficient (only stores support vectors), kernel trick
  • Cons - Slow for large datasets (O(n^2) to O(n^3) training), sensitive to feature scaling, no probabilities by default
  • Don't Use When - Large datasets (>100K samples), lots of noise, multi-class (use an OvR wrapper)
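
A primal linear SVM trained by subgradient descent on the hinge loss plus an L2 penalty (a sketch, not the usual dual/SMO solver; names and defaults are mine):

```python
def linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Minimize lam/2*||w||^2 + mean hinge loss. Labels y must be in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge subgradient is -y*x
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # correct side: only the regularizer shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b
```

Kernel SVMs replace the dot product with a kernel function and are normally solved in the dual instead.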

Random Forest

When to Use
General-purpose classification and regression. Best "set it and forget it" algorithm - robust with minimal tuning.

Random Forest builds many decision trees on random subsets of data (bagging) and random subsets of features, then aggregates predictions by voting/averaging.

Why It Works

  • Bootstrap Aggregation (Bagging) - Each tree trained on random sample with replacement
  • Random Feature Selection - Each split considers only sqrt(m) features (classification) or m/3 (regression)
  • Variance Reduction - Averaging many uncorrelated trees reduces overfitting
  • Out-of-Bag (OOB) Score - Free validation using samples not in each tree's bootstrap

Key Hyperparameters

  • n_estimators - Number of trees (more = better but diminishing returns, typically 100-500)
  • max_depth - Limit tree depth (None = fully grown, default)
  • max_features - Features per split (sqrt(m) for classification, m/3 for regression)
  • min_samples_leaf - Minimum samples in leaf nodes

  • Pros - Robust, handles missing data, feature importance, minimal tuning, parallelizable
  • Cons - Less interpretable than a single tree, slower than linear models, large memory
  • vs XGBoost - RF is simpler and more robust; XGBoost is usually more accurate with tuning

AdaBoost

Adaptive Boosting sequentially trains weak learners (usually decision stumps), giving more weight to misclassified samples. The final model is a weighted combination of all learners.

How It Works

  • Initialize equal weights for all samples
  • Train weak learner, calculate weighted error
  • Compute learner weight: alpha = 0.5 * ln((1-error)/error)
  • Update sample weights: increase for misclassified, decrease for correct
  • Repeat for T rounds

  • Pros - Simple, focuses on hard examples, less prone to overfitting than expected
  • Cons - Sensitive to noisy data and outliers, sequential (can't parallelize)
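
The loop above, with decision stumps as the weak learner, can be sketched end to end (all names are mine; this brute-forces every feature/threshold/polarity stump):

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """A decision stump: +polarity if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity

def adaboost(X, y, rounds=10):
    """AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n  # equal initial sample weights
    ensemble = []
    for _ in range(rounds):
        best, best_err = None, float("inf")
        for f in range(len(X[0])):
            for t in set(x[f] for x in X):
                for pol in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, f, t, pol) != yi)
                    if err < best_err:
                        best_err, best = err, (f, t, pol)
        err = max(best_err, 1e-10)  # avoid log of zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        f, t, pol = best
        ensemble.append((alpha, f, t, pol))
        # reweight: misclassified samples gain weight, correct ones lose it
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * stump_predict(x, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1
```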

XGBoost / LightGBM / CatBoost

Agent Instruction
For tabular data: always try XGBoost/LightGBM first. They consistently outperform other algorithms on structured data. Use CatBoost for high-cardinality categorical features.

Gradient Boosting builds trees sequentially where each new tree corrects errors of the ensemble so far. XGBoost adds regularization, second-order optimization, and engineering optimizations.

XGBoost Key Innovations

  • Second-order Taylor expansion - Uses both gradient and Hessian for better split gains
  • Regularization - L1/L2 on leaf weights prevents overfitting
  • Sparsity-aware - Handles missing values natively
  • Column subsampling - Random feature selection like Random Forest
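
Plain least-squares gradient boosting (without XGBoost's second-order machinery or regularization) can be sketched with stumps fit to residuals; all names here are mine:

```python
def fit_stump(X, residuals):
    """Best single-feature threshold split minimizing squared error."""
    best, best_sse = None, float("inf")
    for f in range(len(X[0])):
        for t in set(x[f] for x in X):
            left = [r for x, r in zip(X, residuals) if x[f] < t]
            right = [r for x, r in zip(X, residuals) if x[f] >= t]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if sse < best_sse:
                best_sse, best = sse, (f, t, lmean, rmean)
    return best

def gradient_boost(X, y, n_trees=50, lr=0.1):
    """Each stump fits the current residuals; predictions accumulate with shrinkage lr."""
    f0 = sum(y) / len(y)  # initial prediction: the mean
    stumps, pred = [], [f0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        f, t, lmean, rmean = stump
        stumps.append(stump)
        pred = [pi + lr * (lmean if x[f] < t else rmean) for x, pi in zip(X, pred)]
    return f0, stumps

def gb_predict(x, f0, stumps, lr=0.1):  # lr must match training
    out = f0
    for f, t, lmean, rmean in stumps:
        out += lr * (lmean if x[f] < t else rmean)
    return out
```

XGBoost replaces the residual fit with a second-order (gradient + Hessian) split gain and adds L1/L2 penalties on leaf weights.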

XGBoost vs LightGBM vs CatBoost

Feature          | XGBoost         | LightGBM           | CatBoost
-----------------|-----------------|--------------------|------------------------
Split strategy   | Level-wise      | Leaf-wise (faster) | Symmetric (balanced)
Categoricals     | Need encoding   | Native support     | Best native support
Speed            | Good            | Fastest            | Moderate
Overfitting risk | Medium          | Higher (leaf-wise) | Lower
Best for         | General purpose | Large datasets     | Categorical-heavy data

Key Hyperparameters

# XGBoost typical starting point
params = {
    'max_depth': 6,           # Tree depth (3-10)
    'learning_rate': 0.1,     # Step size (0.01-0.3)
    'n_estimators': 500,      # Number of trees
    'subsample': 0.8,         # Row sampling ratio
    'colsample_bytree': 0.8,  # Column sampling ratio
    'reg_alpha': 0,           # L1 regularization
    'reg_lambda': 1,          # L2 regularization
    'min_child_weight': 1,    # Minimum leaf weight
}

K-Means Clustering

When to Use
Customer segmentation, data compression, image color quantization, or any task where you need to group similar items. Use when clusters are roughly spherical.

K-Means partitions n observations into K clusters by iteratively assigning points to the nearest centroid and updating centroids to cluster means.

Algorithm

import random

def kmeans(data, k, max_iter=100):
    # K-Means++ initialization
    centroids = [random.choice(data)]
    for _ in range(1, k):
        dists = [min(sum((a-b)**2 for a,b in zip(p,c)) for c in centroids) for p in data]
        total = sum(dists)
        probs = [d/total for d in dists]
        # Weighted random selection
        r = random.random()
        cumsum = 0
        for i, p in enumerate(probs):
            cumsum += p
            if cumsum >= r:
                centroids.append(data[i])
                break

    for _ in range(max_iter):
        # Assign clusters
        clusters = [[] for _ in range(k)]
        for point in data:
            distances = [sum((a-b)**2 for a,b in zip(point,c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)

        # Update centroids
        new_centroids = []
        for cluster in clusters:
            if cluster:
                new_centroids.append([sum(dim)/len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(random.choice(data))

        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, clusters

Choosing K

  • Elbow Method - Plot inertia vs K, look for the "elbow"
  • Silhouette Score - Measures cluster cohesion vs separation (-1 to 1, higher is better)
  • Gap Statistic - Compares clustering to random uniform distribution
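
The inertia plotted by the elbow method is just the quantity K-Means minimizes, transcribed here directly (function name is mine):

```python
def inertia(centroids, clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for c, cluster in zip(centroids, clusters):
        for p in cluster:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total
```

Inertia always decreases as K grows, which is why you look for the elbow rather than the minimum.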

  • Time - O(n * K * m * iterations); fast in practice
  • Pros - Simple, fast, scales well, easy to interpret
  • Cons - Must specify K, assumes spherical clusters, sensitive to initialization and outliers
  • Alternatives - DBSCAN for arbitrary shapes, GMM for probabilistic assignments

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape by grouping dense regions separated by sparse areas. Automatically identifies noise/outliers.

Key Parameters

  • eps (epsilon) - Radius of neighborhood
  • min_samples - Minimum points in neighborhood to form a core point

  • Pros - No need to specify K, finds arbitrary shapes, identifies outliers
  • Cons - Sensitive to eps/min_samples, struggles with varying density clusters
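
A brute-force DBSCAN sketch (all names mine; real implementations use a spatial index for the neighbor queries):

```python
import math

def dbscan(points, eps=1.0, min_samples=3):
    """Return a cluster id per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:           # expand the cluster through core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point = border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_samples:  # j is itself a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels
```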

Gaussian Mixture Model (GMM)

Probabilistic model that assumes data is generated from a mixture of K Gaussian distributions. Uses EM algorithm to find parameters. Provides soft (probabilistic) cluster assignments.

EM Algorithm Steps

  • E-step - Compute responsibility (probability each point belongs to each Gaussian)
  • M-step - Update means, covariances, and mixing coefficients using responsibilities
  • Repeat until log-likelihood converges

  • vs K-Means - GMM gives probabilistic assignments, handles elliptical clusters, more flexible but slower
  • Use When - Clusters are elliptical, you need probabilities, data is Gaussian-like
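
The EM loop above can be sketched for a two-component 1-D mixture (a toy version; names and the data-extreme initialization are mine):

```python
import math

def gmm_em_1d(data, iters=100):
    """EM for a two-component 1-D Gaussian mixture.
    Returns (means, variances, mixing weights)."""
    mu = [min(data), max(data)]  # crude but adequate initialization
    var = [1.0, 1.0]
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [mix[c] / math.sqrt(2 * math.pi * var[c])
                    * math.exp(-(x - mu[c]) ** 2 / (2 * var[c])) for c in (0, 1)]
            s = dens[0] + dens[1]
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from responsibilities
        for c in (0, 1):
            nc = sum(r[c] for r in resp)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var[c] = max(sum(r[c] * (x - mu[c]) ** 2
                             for r, x in zip(resp, data)) / nc, 1e-6)  # floor avoids collapse
            mix[c] = nc / len(data)
    return mu, var, mix
```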

Principal Component Analysis (PCA)

When to Use
Dimensionality reduction before training (speeds up models, reduces overfitting), visualization of high-dimensional data, noise reduction.

PCA finds orthogonal directions (principal components) of maximum variance. Projects data onto these directions to reduce dimensions while preserving the most information.

Algorithm

  • Center the data (subtract mean)
  • Compute covariance matrix
  • Compute eigenvalues and eigenvectors
  • Sort by eigenvalue (largest = most variance)
  • Project data onto top-K eigenvectors
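
The five steps above map line-for-line onto numpy (function name is mine):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix.
    Returns (projected data, components, explained-variance ratios)."""
    Xc = X - X.mean(axis=0)                 # 1. center the data
    cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigendecomposition (symmetric)
    order = np.argsort(eigvals)[::-1]       # 4. sort by eigenvalue, descending
    components = eigvecs[:, order[:k]]
    ratios = eigvals[order] / eigvals.sum()
    return Xc @ components, components, ratios[:k]  # 5. project onto top-K
```

Production code typically uses the SVD of the centered data instead, which is numerically more stable for wide matrices.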

How Many Components?

  • Plot cumulative explained variance - pick K that explains 95% of variance
  • Use scree plot - look for the elbow in eigenvalue magnitudes

  • Pros - Reduces overfitting, speeds up training, removes multicollinearity
  • Cons - Loses interpretability, assumes linear relationships, sensitive to scaling
  • Alternatives - t-SNE/UMAP for visualization (non-linear), LDA for supervised reduction

t-SNE

t-Distributed Stochastic Neighbor Embedding — a non-linear dimensionality reduction technique specialized for visualizing high-dimensional data in 2D or 3D.

When to Use
Visualizing clusters in high-dimensional data (embeddings, image features, genomics). Best for exploratory analysis on small-medium datasets (<10K points). NOT for preprocessing — use PCA or UMAP instead.

How It Works

Computes pairwise similarities in high-dimensional space using Gaussian distributions, then creates a matching distribution in low-dimensional space using Student's t-distribution (heavier tails prevent crowding). Minimizes KL divergence between the two distributions via gradient descent.

Pure Python Core

import numpy as np

def tsne_similarity(X, perplexity=30.0):
    """Compute pairwise affinities with perplexity-based bandwidth."""
    n = X.shape[0]
    distances = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        # Binary search for sigma that achieves target perplexity
        lo, hi = 1e-10, 1e4
        for _ in range(50):
            sigma = (lo + hi) / 2
            pi = np.exp(-distances[i] / (2 * sigma ** 2))
            pi[i] = 0
            pi /= pi.sum() + 1e-8
            entropy = -np.sum(pi * np.log2(pi + 1e-8))
            if entropy > np.log2(perplexity):
                hi = sigma
            else:
                lo = sigma
        P[i] = pi
    return (P + P.T) / (2 * n)

# Low-dimensional: Student-t kernel
# q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum
# Gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)

  • Complexity - O(n²) time and space (quadratic in data points)
  • Key Params - perplexity (5-50), learning_rate (10-1000), n_iter (1000+)
  • Pros - Excellent local structure preservation, reveals clusters beautifully
  • Cons - Slow on large data, non-deterministic, inter-cluster distances not meaningful
  • Applications - Single-cell RNA-seq, document embeddings, image dataset exploration

UMAP

Uniform Manifold Approximation and Projection — preserves both local and global data structure, offering a faster and more scalable alternative to t-SNE.

When to Use
Large-scale visualization, preprocessing before clustering (works great with HDBSCAN), when t-SNE is too slow. Also usable for general dimensionality reduction (unlike t-SNE).

How It Works

Based on Riemannian geometry and algebraic topology. Constructs a fuzzy simplicial complex (weighted graph) in high-dimensional space, then optimizes a low-dimensional layout that preserves the topological structure using cross-entropy loss.

Key Advantages over t-SNE

  • Speed — ~O(n^1.14) empirically vs O(n²) for t-SNE
  • Global structure — preserves relative cluster positions, not just local neighborhoods
  • Transform support — can embed new unseen data points
  • General purpose — usable as preprocessing for ML models, not just visualization

  • Complexity - ~O(n^1.14) time, O(n*k) space where k = n_neighbors
  • Key Params - n_neighbors (5-50), min_dist (0.0-0.99), n_components, metric
  • Pros - Fast, preserves global structure, supports transform, scalable
  • Cons - Stochastic, hyperparameter-sensitive, harder to interpret theoretically
  • Applications - Single-cell genomics, image embeddings, clustering pipelines, feature exploration

Dimensionality Reduction Comparison

Method | Type       | Speed     | Global Structure | Transform New Data | Best For
-------|------------|-----------|------------------|--------------------|-------------------------------------------
PCA    | Linear     | Very Fast | Yes              | Yes                | Feature reduction, preprocessing
t-SNE  | Non-linear | Slow      | No               | No                 | Small dataset visualization
UMAP   | Non-linear | Fast      | Partial          | Yes                | Large-scale visualization + preprocessing

Algorithm Comparison Chart

Quick reference for choosing the right algorithm based on your requirements.

Algorithm           | Type           | Data Size | Interpretable? | Speed       | Best For
--------------------|----------------|-----------|----------------|-------------|---------------------------------
Linear Regression   | Regression     | Any       | High           | Very Fast   | Linear relationships
Logistic Regression | Classification | Any       | High           | Very Fast   | Binary classification baseline
Decision Tree       | Both           | Small-Med | Very High      | Fast        | Explainability
Random Forest       | Both           | Any       | Medium         | Medium      | General purpose, minimal tuning
XGBoost             | Both           | Any       | Low-Med        | Medium      | Tabular data, competitions
SVM                 | Classification | Small-Med | Low            | Slow        | High-dimensional, clear margins
KNN                 | Classification | Small     | Medium         | Slow (pred) | Simple problems, recommendations
Naive Bayes         | Classification | Any       | Medium         | Very Fast   | Text classification, spam
K-Means             | Clustering     | Any       | High           | Fast        | Customer segmentation
DBSCAN              | Clustering     | Small-Med | Medium         | Medium     | Arbitrary shapes, outliers
PCA                 | Dim. Reduction | Any       | Low            | Fast        | Feature reduction, visualization
t-SNE               | Dim. Reduction | Small     | Low            | Slow        | Cluster visualization
UMAP                | Dim. Reduction | Any       | Low            | Fast        | Visualization + preprocessing