## Linear Regression

Linear regression fits a line (or hyperplane) that minimizes the sum of squared residuals between predicted and actual values. It's the foundation of predictive modeling.

### Key Concepts

- Ordinary Least Squares (OLS) - Closed-form solution: `w = (X^T X)^(-1) X^T y`
- Ridge Regression (L2) - Adds an L2 penalty to prevent overfitting: `loss + lambda * ||w||_2^2`
- Lasso Regression (L1) - Adds an L1 penalty for feature selection: `loss + lambda * ||w||_1`
- Elastic Net - Combines L1 and L2 regularization
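As a sanity check on the closed-form solution, here is a minimal NumPy sketch that solves the normal equations directly (using `solve` rather than an explicit inverse, which is more numerically stable). `ols_fit` and its intercept handling are illustrative, not a library API:

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: w = (X^T X)^(-1) X^T y, solved without an explicit inverse."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w[0], w[1:]                          # (bias, weights)
```

On noise-free data `y = 2x + 1` this recovers the bias 1 and slope 2 exactly.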
### Pure Python Implementation

```python
def linear_regression(X, y, lr=0.01, epochs=1000):
    """Gradient descent linear regression from scratch."""
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0
    for _ in range(epochs):
        for i in range(n):
            # Prediction and residual for one sample
            pred = sum(weights[j] * X[i][j] for j in range(m)) + bias
            error = pred - y[i]
            # Per-sample update, scaled by 1/n
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n
    return weights, bias
```
### Complexity & Trade-offs
| Aspect | Details |
|---|---|
| Time Complexity | O(n*m) per epoch for SGD, O(m^3) for OLS closed-form |
| Space Complexity | O(m) for weights |
| Pros | Interpretable, fast, good baseline, well-understood theory |
| Cons | Assumes linearity, sensitive to outliers, can't capture non-linear patterns |
| Use When | Linear relationship expected, interpretability needed, small-medium data |
| Don't Use When | Complex non-linear relationships, lots of categorical features |
## Logistic Regression

Despite the name, logistic regression is a classification algorithm. It applies the sigmoid function to a linear combination of features to output probabilities.

### Key Concepts

- Sigmoid Function - `sigma(z) = 1 / (1 + e^(-z))` maps any real value to (0, 1)
- Log Loss - Cross-entropy loss: `-[y*log(p) + (1-y)*log(1-p)]`
- Decision Boundary - Linear boundary in feature space at p = 0.5
- Regularization - L1 for sparse features, L2 for multicollinearity
- Multi-class - One-vs-Rest (OvR) or Softmax for multiple classes
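The log loss above can be evaluated directly; a tiny illustrative helper (the clipping epsilon is an assumption to avoid `log(0)`, not part of the formula):

```python
import math

def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy for one sample; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

Note how a confident correct prediction is cheap (`log_loss(1, 0.9)` is about 0.105) while a confident wrong one is expensive (`log_loss(0, 0.9)` is about 2.303).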
### Pure Python Implementation

```python
import math

def sigmoid(z):
    # Clamp z to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-max(-500, min(500, z))))

def logistic_regression(X, y, lr=0.1, epochs=500):
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0
    for _ in range(epochs):
        for i in range(n):
            z = sum(weights[j] * X[i][j] for j in range(m)) + bias
            pred = sigmoid(z)
            error = pred - y[i]
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n
    return weights, bias
```
### Complexity & Trade-offs
| Aspect | Details |
|---|---|
| Time | O(n*m) per epoch |
| Pros | Probabilistic output, interpretable coefficients, fast training, good baseline |
| Cons | Assumes linear decision boundary, can't capture complex interactions |
| Use When | Binary classification, need probabilities, interpretability required |
| Don't Use When | Non-linear decision boundaries, image/text data |
## Stochastic Gradient Descent (SGD)

SGD updates model parameters using the gradient of the loss computed on a single sample (or mini-batch) at a time, rather than on the full dataset.

### Variants
- Batch GD - Uses entire dataset per update (stable but slow)
- Mini-batch SGD - Uses small batches (best of both worlds)
- SGD with Momentum - Accelerates convergence with velocity term
- Adam - Adaptive learning rates per parameter (most popular for deep learning)
- AdamW - Adam with decoupled weight decay (recommended for transformers)
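The momentum variant can be sketched in a few lines. This is a minimal 1-D illustration using the common formulation `v = beta*v + grad`; `sgd_momentum` and the quadratic test objective are illustrative choices, not a library API:

```python
def sgd_momentum(grad_fn, x, lr=0.1, beta=0.9, steps=200):
    """Minimize a 1-D objective with momentum: velocity accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(x)  # exponentially decayed sum of gradients
        x -= lr * v                # step along the velocity, not the raw gradient
    return x
```

On a quadratic like `f(x) = x^2` (gradient `2x`), momentum overshoots and oscillates briefly but converges faster than plain SGD on ill-conditioned problems.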
### Learning Rate Schedule

```python
import math

# Cosine annealing with linear warmup
def cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr=0.0):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```
## Perceptron

The simplest neural network - a single-layer linear classifier that learns a decision boundary using the perceptron learning rule. It converges only for linearly separable data.

### Algorithm
```python
def perceptron(X, y, lr=0.01, epochs=100):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for i in range(len(X)):
            # Step activation: predict 1 if the weighted sum is positive
            pred = 1 if sum(w * x for w, x in zip(weights, X[i])) + bias > 0 else 0
            error = y[i] - pred
            for j in range(len(weights)):
                weights[j] += lr * error * X[i][j]
            bias += lr * error
    return weights, bias
```
## Decision Trees (CART)

CART (Classification and Regression Trees) recursively splits data on feature thresholds to create a tree of decisions. Each leaf node represents a prediction.

### Splitting Criteria

- Gini Impurity - `1 - sum(p_i^2)` - measures the probability of misclassifying a randomly chosen sample
- Entropy / Information Gain - `-sum(p_i * log(p_i))` - measures disorder
- MSE - Mean squared error, used for regression trees
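For intuition, the two classification criteria evaluate like this on a vector of class probabilities (a tiny illustrative helper; entropy here is in bits, i.e. base-2 log):

```python
import math

def gini(probs):
    """Gini impurity of a class-probability vector: 0 = pure, 0.5 = 50/50 binary split."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits: 0 = pure, 1 = 50/50 binary split."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Both are maximized by an even class mix and zero for a pure node, which is why either works as a split criterion.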
### Pure Python Implementation

```python
def gini_impurity(groups, classes):
    """Weighted Gini impurity of a candidate split."""
    total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:
            continue
        score = 0.0
        for cls in classes:
            # Proportion of this class in the group (label is the last column)
            p = sum(1 for row in group if row[-1] == cls) / size
            score += p * p
        gini += (1.0 - score) * (size / total)
    return gini

def best_split(dataset):
    """Exhaustively try every (feature, threshold) pair."""
    classes = list(set(row[-1] for row in dataset))
    best_feature, best_value, best_score, best_groups = None, None, float('inf'), None
    for col in range(len(dataset[0]) - 1):
        for row in dataset:
            left = [r for r in dataset if r[col] < row[col]]
            right = [r for r in dataset if r[col] >= row[col]]
            gini = gini_impurity([left, right], classes)
            if gini < best_score:
                best_feature, best_value, best_score = col, row[col], gini
                best_groups = (left, right)
    return {'feature': best_feature, 'value': best_value, 'groups': best_groups}
```
### Pruning Strategies

- Pre-pruning - Set `max_depth`, `min_samples_split`, `min_samples_leaf`
- Post-pruning - Grow the full tree, then prune based on validation error
- Cost-complexity pruning - Balance tree size vs. accuracy with an alpha parameter
| Aspect | Details |
|---|---|
| Pros | Highly interpretable, handles mixed feature types, no scaling needed |
| Cons | Prone to overfitting, unstable (small data changes = different tree), high variance |
| Fix | Use ensemble methods (Random Forest, Gradient Boosting) to reduce variance |
## Naive Bayes

Applies Bayes' theorem with the "naive" assumption of feature independence. Surprisingly effective for text classification despite the simplifying assumption.

### Variants
- Gaussian NB - Continuous features assumed normally distributed
- Multinomial NB - Count features (word counts in text)
- Bernoulli NB - Binary features (word presence/absence)
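The Gaussian variant fits naturally in a few lines: estimate per-class means, variances, and priors, then score classes by log-posterior under the independence assumption. This is a minimal sketch; `gaussian_nb_fit`/`gaussian_nb_predict` and the variance-smoothing constant are illustrative choices:

```python
import math

def gaussian_nb_fit(X, y):
    """Per-class means, variances, and priors. Label set comes from y."""
    stats = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        cols = list(zip(*rows))
        means = [sum(c) / len(rows) for c in cols]
        # Small constant avoids zero variance on degenerate features
        variances = [sum((v - m) ** 2 for v in c) / len(rows) + 1e-9
                     for c, m in zip(cols, means)]
        stats[cls] = (means, variances, len(rows) / len(X))
    return stats

def gaussian_nb_predict(stats, x):
    """Pick the class with the highest log prior + sum of per-feature log-likelihoods."""
    def log_gauss(v, m, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    scores = {
        cls: math.log(prior) + sum(log_gauss(v, m, s)
                                   for v, m, s in zip(x, means, variances))
        for cls, (means, variances, prior) in stats.items()
    }
    return max(scores, key=scores.get)
```

Working in log space is the standard trick here: products of many small densities underflow, while sums of logs do not.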
| Aspect | Details |
|---|---|
| Pros | Extremely fast, works with small data, good for text, probabilistic |
| Cons | Strong independence assumption, can't learn feature interactions |
## K-Nearest Neighbors (KNN)

KNN classifies new points by majority vote of the K nearest training examples. It's a "lazy learner" - there is no training phase; all computation happens at prediction time.

### Key Decisions
- K value - Small K = noisy, large K = smooth but biased. Use cross-validation.
- Distance metric - Euclidean (default), Manhattan, Minkowski, Cosine
- Feature scaling - Essential! Features must be on same scale.
- Weighted voting - Weight closer neighbors more heavily
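The whole predictor is a sort plus a vote; a minimal sketch with Euclidean distance (`knn_predict` is an illustrative name, and it assumes features are already scaled, per the note above):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

The O(n) scan per query is exactly the "slow prediction" cost noted below; KD-trees or approximate nearest-neighbor indexes are the usual fix.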
| Aspect | Details |
|---|---|
| Time | O(1) training, O(n*m) prediction (slow for large datasets) |
| Pros | Simple, no training, works for any decision boundary shape |
| Cons | Slow prediction, curse of dimensionality, needs feature scaling |
## Support Vector Machine (SVM)

SVM finds the hyperplane that maximizes the margin between two classes. The "support vectors" are the closest points to the boundary, and they alone define it.

### Key Concepts

- Maximum Margin - Find the decision boundary with the largest gap between classes
- Kernel Trick - Map to higher dimensions without computing the transformation: RBF, Polynomial, Sigmoid
- Soft Margin - Allow some misclassifications, controlled by the penalty C
- Hinge Loss - `max(0, 1 - y * f(x))` with labels y in {-1, +1}
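The hinge loss is one line of code, and its three regimes are worth internalizing: zero beyond the margin, linear inside it, and growing for misclassifications (`hinge_loss` is an illustrative helper, with y in {-1, +1}):

```python
def hinge_loss(y, score):
    """Hinge loss for one sample: y is the true label in {-1, +1}, score = f(x)."""
    return max(0.0, 1.0 - y * score)
```

A correctly classified point outside the margin (`y * f(x) >= 1`) contributes nothing, which is why only the support vectors shape the solution.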
| Aspect | Details |
|---|---|
| Pros | Effective in high dimensions, memory efficient (only stores support vectors), kernel trick |
| Cons | Slow for large datasets O(n^2-n^3), sensitive to feature scaling, no probabilities by default |
| Don't Use | Large datasets (>100K samples), lots of noise, multi-class (use OvR wrapper) |
## Random Forest

Random Forest builds many decision trees on random subsets of data (bagging) and random subsets of features, then aggregates predictions by voting/averaging.

### Why It Works

- Bootstrap Aggregation (Bagging) - Each tree is trained on a random sample drawn with replacement
- Random Feature Selection - Each split considers only sqrt(m) features (classification) or m/3 (regression)
- Variance Reduction - Averaging many decorrelated trees reduces overfitting
- Out-of-Bag (OOB) Score - Free validation using the samples left out of each tree's bootstrap

### Key Hyperparameters

- `n_estimators` - Number of trees (more is better, with diminishing returns; typically 100-500)
- `max_depth` - Limit on tree depth (None = fully grown, the default)
- `max_features` - Features considered per split (sqrt(m) for classification, m/3 for regression)
- `min_samples_leaf` - Minimum samples required in leaf nodes
| Aspect | Details |
|---|---|
| Pros | Robust, handles missing data, feature importance, minimal tuning, parallelizable |
| Cons | Less interpretable than single tree, slower than linear models, large memory |
| vs XGBoost | RF is simpler and more robust; XGBoost is usually more accurate with tuning |
## AdaBoost

Adaptive Boosting sequentially trains weak learners (usually decision stumps), giving more weight to misclassified samples. The final model is a weighted combination of all learners.

### How It Works

1. Initialize equal weights for all samples
2. Train a weak learner and calculate its weighted error
3. Compute the learner's weight: `alpha = 0.5 * ln((1 - error) / error)`
4. Update sample weights: increase for misclassified, decrease for correct, then renormalize
5. Repeat for T rounds
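Steps 2-4 for a single round can be sketched directly from the formulas above (a minimal sketch assuming labels and predictions in {-1, +1} and `0 < error < 1`; `adaboost_round` is an illustrative name):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: weighted error, learner weight alpha, updated sample weights."""
    error = sum(w for w, p, t in zip(weights, predictions, labels) if p != t)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (p*t = -1) are scaled up, correct ones down
    new_w = [w * math.exp(-alpha * p * t)
             for w, p, t in zip(weights, predictions, labels)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```

With four equally weighted samples and one mistake (error = 0.25), the misclassified sample ends up carrying half of the total weight after one round, which is how the algorithm forces later learners to focus on hard examples.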
| Aspect | Details |
|---|---|
| Pros | Simple, focuses on hard examples, less prone to overfitting than expected |
| Cons | Sensitive to noisy data and outliers, sequential (can't parallelize) |
## XGBoost / LightGBM / CatBoost

Gradient Boosting builds trees sequentially, where each new tree corrects the errors of the ensemble so far. XGBoost adds regularization, second-order optimization, and engineering optimizations.

### XGBoost Key Innovations
- Second-order Taylor expansion - Uses both gradient and Hessian for better split gains
- Regularization - L1/L2 on leaf weights prevents overfitting
- Sparsity-aware - Handles missing values natively
- Column subsampling - Random feature selection like Random Forest
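The second-order split gain mentioned above has a closed form: with G and H the sums of gradients and Hessians on each side of a candidate split, the gain is the improvement in structure score minus the complexity penalty. A small sketch (`lam` plays the role of `reg_lambda` and `gamma` the per-leaf penalty; the L1 term is omitted for brevity):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting a node into (L, R): 0.5*(score_L + score_R - score_parent) - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

A split is only kept when this gain is positive, so raising `gamma` or `lam` directly prunes weak splits - that is how the regularization terms act during tree growth rather than after it.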
### XGBoost vs LightGBM vs CatBoost
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise (faster) | Symmetric (balanced) |
| Categoricals | Need encoding | Native support | Best native support |
| Speed | Good | Fastest | Moderate |
| Overfitting risk | Medium | Higher (leaf-wise) | Lower |
| Best for | General purpose | Large datasets | Categorical-heavy data |
### Key Hyperparameters

```python
# XGBoost typical starting point
params = {
    'max_depth': 6,           # Tree depth (3-10)
    'learning_rate': 0.1,     # Step size (0.01-0.3)
    'n_estimators': 500,      # Number of trees
    'subsample': 0.8,         # Row sampling ratio
    'colsample_bytree': 0.8,  # Column sampling ratio
    'reg_alpha': 0,           # L1 regularization
    'reg_lambda': 1,          # L2 regularization
    'min_child_weight': 1,    # Minimum leaf weight
}
```
## K-Means Clustering

K-Means partitions n observations into K clusters by iteratively assigning points to the nearest centroid and updating each centroid to its cluster's mean.

### Algorithm
```python
import random

def kmeans(data, k, max_iter=100):
    # K-Means++ initialization: spread initial centroids apart
    centroids = [random.choice(data)]
    for _ in range(1, k):
        dists = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
                 for p in data]
        total = sum(dists)
        probs = [d / total for d in dists]
        # Weighted random selection, proportional to squared distance
        r = random.random()
        cumsum = 0.0
        for i, p in enumerate(probs):
            cumsum += p
            if cumsum >= r:
                centroids.append(data[i])
                break
        else:
            centroids.append(data[-1])  # guard against floating-point shortfall
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in data:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to its cluster mean
        new_centroids = []
        for cluster in clusters:
            if cluster:
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(random.choice(data))  # reseed empty clusters
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```
### Choosing K
- Elbow Method - Plot inertia vs K, look for the "elbow"
- Silhouette Score - Measures cluster cohesion vs separation (-1 to 1, higher is better)
- Gap Statistic - Compares clustering to random uniform distribution
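The silhouette score for a single point makes the cohesion-vs-separation idea concrete: a = mean distance to its own cluster, b = mean distance to the nearest other cluster, s = (b - a) / max(a, b). A minimal sketch (`silhouette_point` is an illustrative helper; `own_cluster` is assumed to exclude the point itself):

```python
import math

def silhouette_point(p, own_cluster, other_clusters):
    """Silhouette coefficient for one point: near +1 = well placed, near -1 = misassigned."""
    a = sum(math.dist(p, q) for q in own_cluster) / len(own_cluster)
    b = min(sum(math.dist(p, q) for q in other) / len(other)
            for other in other_clusters)
    return (b - a) / max(a, b)
```

The dataset-level score in the bullet above is just the mean of this value over all points.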
| Aspect | Details |
|---|---|
| Time | O(n * K * m * iterations) - fast in practice |
| Pros | Simple, fast, scales well, easy to interpret |
| Cons | Must specify K, assumes spherical clusters, sensitive to initialization and outliers |
| Alternative | DBSCAN for arbitrary shapes, GMM for probabilistic |
## DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape by grouping dense regions separated by sparse areas, and automatically identifies noise/outliers.

### Key Parameters
- eps (epsilon) - Radius of neighborhood
- min_samples - Minimum points in neighborhood to form a core point
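The core/border/noise logic fits in a short brute-force O(n^2) sketch (a minimal illustration, not a production implementation; `dbscan` returns a cluster id per point, with -1 marking noise):

```python
import math

def dbscan(points, eps=0.5, min_samples=3):
    """Label each point with a cluster id; -1 marks noise. Brute-force neighbor search."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # Neighborhood includes the point itself, per the standard definition
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # only core points expand the cluster
    return labels
```

Note that K never appears: the number of clusters falls out of the density structure, which is the algorithm's main selling point.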
| Aspect | Details |
|---|---|
| Pros | No need to specify K, finds arbitrary shapes, identifies outliers |
| Cons | Sensitive to eps/min_samples, struggles with varying density clusters |
## Gaussian Mixture Model (GMM)

A probabilistic model that assumes data is generated from a mixture of K Gaussian distributions. It uses the EM algorithm to fit the parameters and provides soft (probabilistic) cluster assignments.

### EM Algorithm Steps

1. E-step - Compute responsibilities (the probability that each point belongs to each Gaussian)
2. M-step - Update means, covariances, and mixing coefficients using the responsibilities
3. Repeat until the log-likelihood converges
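The E-step for a single 1-D point is just Bayes' rule over the components (a minimal sketch; `params` as (weight, mean, std) triples is an illustrative representation):

```python
import math

def responsibilities(x, params):
    """E-step for 1-D Gaussians: posterior probability of each component given x."""
    def pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    weighted = [w * pdf(x, mu, s) for w, mu, s in params]
    total = sum(weighted)
    return [wd / total for wd in weighted]
```

A point equidistant from two identical components gets responsibility 0.5 from each - exactly the "soft assignment" that distinguishes GMM from K-Means' hard assignment.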
| Aspect | Details |
|---|---|
| vs K-Means | GMM gives probabilistic assignments, handles elliptical clusters, more flexible but slower |
| Use When | Clusters are elliptical, you need probabilities, data is Gaussian-like |
## Principal Component Analysis (PCA)

PCA finds orthogonal directions (principal components) of maximum variance and projects the data onto them, reducing dimensionality while preserving as much information as possible.

### Algorithm

1. Center the data (subtract the mean)
2. Compute the covariance matrix
3. Compute its eigenvalues and eigenvectors
4. Sort by eigenvalue (largest = most variance)
5. Project the data onto the top-K eigenvectors
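The five steps above map line-for-line onto NumPy (a minimal sketch; `pca` is an illustrative name, and `eigh` is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # 1. center
    cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # 4. sort eigenvalues descending
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]  # 5. project onto top-K eigenvectors
```

In practice SVD on the centered data is the numerically preferred route, but the eigendecomposition version mirrors the textbook steps most directly.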
### How Many Components?

- Plot cumulative explained variance - pick the K that explains ~95% of the variance
- Use a scree plot - look for the elbow in the eigenvalue magnitudes
| Aspect | Details |
|---|---|
| Pros | Reduces overfitting, speeds up training, removes multicollinearity |
| Cons | Loses interpretability, assumes linear relationships, sensitive to scaling |
| Alternative | t-SNE/UMAP for visualization (non-linear), LDA for supervised reduction |
## t-SNE

t-Distributed Stochastic Neighbor Embedding - a non-linear dimensionality reduction technique specialized for visualizing high-dimensional data in 2D or 3D.

### How It Works

It computes pairwise similarities in high-dimensional space using Gaussian kernels, builds a matching distribution in low-dimensional space using Student's t-distribution (whose heavier tails prevent crowding), and minimizes the KL divergence between the two distributions via gradient descent.
### Pure Python Core

```python
import numpy as np

def tsne_similarity(X, perplexity=30.0):
    """Compute pairwise affinities with perplexity-based bandwidth."""
    n = X.shape[0]
    distances = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        # Binary search for the sigma that achieves the target perplexity
        lo, hi = 1e-10, 1e4
        for _ in range(50):
            sigma = (lo + hi) / 2
            pi = np.exp(-distances[i] / (2 * sigma ** 2))
            pi[i] = 0
            pi /= pi.sum() + 1e-8
            entropy = -np.sum(pi * np.log2(pi + 1e-8))
            if entropy > np.log2(perplexity):
                hi = sigma  # distribution too flat: shrink the bandwidth
            else:
                lo = sigma  # distribution too peaked: widen the bandwidth
        P[i] = pi
    return (P + P.T) / (2 * n)

# Low-dimensional affinities use a Student-t kernel:
#   q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum
# Gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)
```
| Aspect | Details |
|---|---|
| Complexity | O(n^2) time and space - quadratic in the number of points |
| Key Params | perplexity (5-50), learning_rate (10-1000), n_iter (1000+) |
| Pros | Excellent local structure preservation, reveals clusters beautifully |
| Cons | Slow on large data, non-deterministic, inter-cluster distances not meaningful |
| Applications | Single-cell RNA-seq, document embeddings, image dataset exploration |
## UMAP

Uniform Manifold Approximation and Projection - preserves both local and global data structure, offering a faster and more scalable alternative to t-SNE.

### How It Works

UMAP is grounded in Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex (a weighted graph) in high-dimensional space, then optimizes a low-dimensional layout that preserves that topological structure using a cross-entropy loss.
### Key Advantages over t-SNE

- Speed - roughly O(n^1.14) empirically vs O(n^2) for t-SNE
- Global structure - preserves relative cluster positions, not just local neighborhoods
- Transform support - can embed new, unseen data points
- General purpose - usable as preprocessing for ML models, not just visualization
| Aspect | Details |
|---|---|
| Complexity | ~O(n^1.14) time, O(n×k) space where k = n_neighbors |
| Key Params | n_neighbors (5-50), min_dist (0.0-0.99), n_components, metric |
| Pros | Fast, preserves global structure, supports transform, scalable |
| Cons | Stochastic, hyperparameter-sensitive, harder to interpret theoretically |
| Applications | Single-cell genomics, image embeddings, clustering pipelines, feature exploration |
## Dimensionality Reduction Comparison
| Method | Type | Speed | Global Structure | Transform New Data | Best For |
|---|---|---|---|---|---|
| PCA | Linear | Very Fast | Yes | Yes | Feature reduction, preprocessing |
| t-SNE | Non-linear | Slow | No | No | Small dataset visualization |
| UMAP | Non-linear | Fast | Partial | Yes | Large-scale visualization + preprocessing |
## Algorithm Comparison Chart

Quick reference for choosing the right algorithm based on your requirements.
| Algorithm | Type | Data Size | Interpretable? | Speed | Best For |
|---|---|---|---|---|---|
| Linear Regression | Regression | Any | High | Very Fast | Linear relationships |
| Logistic Regression | Classification | Any | High | Very Fast | Binary classification baseline |
| Decision Tree | Both | Small-Med | Very High | Fast | Explainability |
| Random Forest | Both | Any | Medium | Medium | General purpose, minimal tuning |
| XGBoost | Both | Any | Low-Med | Medium | Tabular data, competitions |
| SVM | Classification | Small-Med | Low | Slow | High-dimensional, clear margins |
| KNN | Classification | Small | Medium | Slow (pred) | Simple problems, recommendations |
| Naive Bayes | Classification | Any | Medium | Very Fast | Text classification, spam |
| K-Means | Clustering | Any | High | Fast | Customer segmentation |
| DBSCAN | Clustering | Small-Med | Medium | Medium | Arbitrary shapes, outliers |
| PCA | Dim. Reduction | Any | Low | Fast | Feature reduction, visualization |
| t-SNE | Dim. Reduction | Small | Low | Slow | Cluster visualization |
| UMAP | Dim. Reduction | Any | Low | Fast | Visualization + preprocessing |