## Linear Regression

Linear regression fits a line (or hyperplane) that minimizes the sum of squared residuals between predicted and actual values. It's the foundation of predictive modeling.

### Key Concepts

- Ordinary Least Squares (OLS) - Closed-form solution: `w = (X^T X)^(-1) X^T y`
- Ridge Regression (L2) - Adds an L2 penalty to prevent overfitting: `loss + lambda * ||w||_2^2`
- Lasso Regression (L1) - Adds an L1 penalty for feature selection: `loss + lambda * ||w||_1`
- Elastic Net - Combines L1 and L2 regularization
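As a sanity check on the closed-form solution, here is a minimal NumPy sketch that solves the normal equations directly (using `solve` rather than an explicit inverse, which is more numerically stable). `ols_fit` and its intercept handling are illustrative, not a library API:

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: w = (X^T X)^(-1) X^T y, solved without an explicit inverse."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w[0], w[1:]                          # (bias, weights)
```

On noise-free data `y = 2x + 1` this recovers the bias 1 and slope 2 exactly.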
### Pure Python Implementation

```python
def linear_regression(X, y, lr=0.01, epochs=1000):
    """Gradient descent linear regression from scratch."""
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0
    for _ in range(epochs):
        for i in range(n):
            # Prediction and residual for one sample
            pred = sum(weights[j] * X[i][j] for j in range(m)) + bias
            error = pred - y[i]
            # Per-sample update, scaled by 1/n
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n
    return weights, bias
```
### Complexity & Trade-offs
| Aspect | Details |
|---|---|
| Time Complexity | O(n*m) per epoch for SGD, O(m^3) for OLS closed-form |
| Space Complexity | O(m) for weights |
| Pros | Interpretable, fast, good baseline, well-understood theory |
| Cons | Assumes linearity, sensitive to outliers, can't capture non-linear patterns |
| Use When | Linear relationship expected, interpretability needed, small-medium data |
| Don't Use When | Complex non-linear relationships, lots of categorical features |
## Logistic Regression

Despite the name, logistic regression is a classification algorithm. It applies the sigmoid function to a linear combination of features to output probabilities.

### Key Concepts

- Sigmoid Function - `sigma(z) = 1 / (1 + e^(-z))` maps any real value to (0, 1)
- Log Loss - Cross-entropy loss: `-[y*log(p) + (1-y)*log(1-p)]`
- Decision Boundary - Linear boundary in feature space at p = 0.5
- Regularization - L1 for sparse features, L2 for multicollinearity
- Multi-class - One-vs-Rest (OvR) or Softmax for multiple classes
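The log loss above can be evaluated directly; a tiny illustrative helper (the clipping epsilon is an assumption to avoid `log(0)`, not part of the formula):

```python
import math

def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy for one sample; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

Note how a confident correct prediction is cheap (`log_loss(1, 0.9)` is about 0.105) while a confident wrong one is expensive (`log_loss(0, 0.9)` is about 2.303).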
### Pure Python Implementation

```python
import math

def sigmoid(z):
    # Clamp z to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-max(-500, min(500, z))))

def logistic_regression(X, y, lr=0.1, epochs=500):
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    bias = 0.0
    for _ in range(epochs):
        for i in range(n):
            z = sum(weights[j] * X[i][j] for j in range(m)) + bias
            pred = sigmoid(z)
            error = pred - y[i]
            for j in range(m):
                weights[j] -= lr * error * X[i][j] / n
            bias -= lr * error / n
    return weights, bias
```
### Complexity & Trade-offs
| Aspect | Details |
|---|---|
| Time | O(n*m) per epoch |
| Pros | Probabilistic output, interpretable coefficients, fast training, good baseline |
| Cons | Assumes linear decision boundary, can't capture complex interactions |
| Use When | Binary classification, need probabilities, interpretability required |
| Don't Use When | Non-linear decision boundaries, image/text data |
## Stochastic Gradient Descent (SGD)

SGD updates model parameters using the gradient of the loss computed on a single sample (or mini-batch) at a time, rather than on the full dataset.

### Variants
- Batch GD - Uses entire dataset per update (stable but slow)
- Mini-batch SGD - Uses small batches (best of both worlds)
- SGD with Momentum - Accelerates convergence with velocity term
- Adam - Adaptive learning rates per parameter (most popular for deep learning)
- AdamW - Adam with decoupled weight decay (recommended for transformers)
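The momentum variant can be sketched in a few lines. This is a minimal 1-D illustration using the common formulation `v = beta*v + grad`; `sgd_momentum` and the quadratic test objective are illustrative choices, not a library API:

```python
def sgd_momentum(grad_fn, x, lr=0.1, beta=0.9, steps=200):
    """Minimize a 1-D objective with momentum: velocity accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(x)  # exponentially decayed sum of gradients
        x -= lr * v                # step along the velocity, not the raw gradient
    return x
```

On a quadratic like `f(x) = x^2` (gradient `2x`), momentum overshoots and oscillates briefly but converges faster than plain SGD on ill-conditioned problems.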
### Learning Rate Schedule

```python
import math

# Cosine annealing with linear warmup
def cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr=0.0):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```
## Perceptron

The simplest neural network - a single-layer linear classifier that learns a decision boundary using the perceptron learning rule. It converges only for linearly separable data.

### Algorithm
```python
def perceptron(X, y, lr=0.01, epochs=100):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for i in range(len(X)):
            # Step activation: predict 1 if the weighted sum is positive
            pred = 1 if sum(w * x for w, x in zip(weights, X[i])) + bias > 0 else 0
            error = y[i] - pred
            for j in range(len(weights)):
                weights[j] += lr * error * X[i][j]
            bias += lr * error
    return weights, bias
```
## Decision Trees (CART)

CART (Classification and Regression Trees) recursively splits data on feature thresholds to create a tree of decisions. Each leaf node represents a prediction.

### Splitting Criteria

- Gini Impurity - `1 - sum(p_i^2)` - measures the probability of misclassifying a randomly chosen sample
- Entropy / Information Gain - `-sum(p_i * log(p_i))` - measures disorder
- MSE - Mean squared error, used for regression trees
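For intuition, the two classification criteria evaluate like this on a vector of class probabilities (a tiny illustrative helper; entropy here is in bits, i.e. base-2 log):

```python
import math

def gini(probs):
    """Gini impurity of a class-probability vector: 0 = pure, 0.5 = 50/50 binary split."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits: 0 = pure, 1 = 50/50 binary split."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Both are maximized by an even class mix and zero for a pure node, which is why either works as a split criterion.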
### Pure Python Implementation

```python
def gini_impurity(groups, classes):
    """Weighted Gini impurity of a candidate split."""
    total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:
            continue
        score = 0.0
        for cls in classes:
            # Proportion of this class in the group (label is the last column)
            p = sum(1 for row in group if row[-1] == cls) / size
            score += p * p
        gini += (1.0 - score) * (size / total)
    return gini

def best_split(dataset):
    """Exhaustively try every (feature, threshold) pair."""
    classes = list(set(row[-1] for row in dataset))
    best_feature, best_value, best_score, best_groups = None, None, float('inf'), None
    for col in range(len(dataset[0]) - 1):
        for row in dataset:
            left = [r for r in dataset if r[col] < row[col]]
            right = [r for r in dataset if r[col] >= row[col]]
            gini = gini_impurity([left, right], classes)
            if gini < best_score:
                best_feature, best_value, best_score = col, row[col], gini
                best_groups = (left, right)
    return {'feature': best_feature, 'value': best_value, 'groups': best_groups}
```
### Pruning Strategies

- Pre-pruning - Set `max_depth`, `min_samples_split`, `min_samples_leaf`
- Post-pruning - Grow the full tree, then prune based on validation error
- Cost-complexity pruning - Balance tree size vs. accuracy with an alpha parameter
| Aspect | Details |
|---|---|
| Pros | Highly interpretable, handles mixed feature types, no scaling needed |
| Cons | Prone to overfitting, unstable (small data changes = different tree), high variance |
| Fix | Use ensemble methods (Random Forest, Gradient Boosting) to reduce variance |
## Naive Bayes

Applies Bayes' theorem with the "naive" assumption of feature independence. Surprisingly effective for text classification despite the simplifying assumption.

### Variants
- Gaussian NB - Continuous features assumed normally distributed
- Multinomial NB - Count features (word counts in text)
- Bernoulli NB - Binary features (word presence/absence)
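The Gaussian variant fits naturally in a few lines: estimate per-class means, variances, and priors, then score classes by log-posterior under the independence assumption. This is a minimal sketch; `gaussian_nb_fit`/`gaussian_nb_predict` and the variance-smoothing constant are illustrative choices:

```python
import math

def gaussian_nb_fit(X, y):
    """Per-class means, variances, and priors. Label set comes from y."""
    stats = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        cols = list(zip(*rows))
        means = [sum(c) / len(rows) for c in cols]
        # Small constant avoids zero variance on degenerate features
        variances = [sum((v - m) ** 2 for v in c) / len(rows) + 1e-9
                     for c, m in zip(cols, means)]
        stats[cls] = (means, variances, len(rows) / len(X))
    return stats

def gaussian_nb_predict(stats, x):
    """Pick the class with the highest log prior + sum of per-feature log-likelihoods."""
    def log_gauss(v, m, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    scores = {
        cls: math.log(prior) + sum(log_gauss(v, m, s)
                                   for v, m, s in zip(x, means, variances))
        for cls, (means, variances, prior) in stats.items()
    }
    return max(scores, key=scores.get)
```

Working in log space is the standard trick here: products of many small densities underflow, while sums of logs do not.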
| Aspect | Details |
|---|---|
| Pros | Extremely fast, works with small data, good for text, probabilistic |
| Cons | Strong independence assumption, can't learn feature interactions |
## K-Nearest Neighbors (KNN)

KNN classifies new points by majority vote of the K nearest training examples. It's a "lazy learner" - there is no training phase; all computation happens at prediction time.

### Key Decisions
- K value - Small K = noisy, large K = smooth but biased. Use cross-validation.
- Distance metric - Euclidean (default), Manhattan, Minkowski, Cosine
- Feature scaling - Essential! Features must be on same scale.
- Weighted voting - Weight closer neighbors more heavily
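The whole predictor is a sort plus a vote; a minimal sketch with Euclidean distance (`knn_predict` is an illustrative name, and it assumes features are already scaled, per the note above):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

The O(n) scan per query is exactly the "slow prediction" cost noted below; KD-trees or approximate nearest-neighbor indexes are the usual fix.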
| Aspect | Details |
|---|---|
| Time | O(1) training, O(n*m) prediction (slow for large datasets) |
| Pros | Simple, no training, works for any decision boundary shape |
| Cons | Slow prediction, curse of dimensionality, needs feature scaling |
## Support Vector Machine (SVM)

SVM finds the hyperplane that maximizes the margin between two classes. The "support vectors" are the closest points to the boundary, and they alone define it.

### Key Concepts

- Maximum Margin - Find the decision boundary with the largest gap between classes
- Kernel Trick - Map to higher dimensions without computing the transformation: RBF, Polynomial, Sigmoid
- Soft Margin - Allow some misclassifications, controlled by the penalty C
- Hinge Loss - `max(0, 1 - y * f(x))` with labels y in {-1, +1}
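The hinge loss is one line of code, and its three regimes are worth internalizing: zero beyond the margin, linear inside it, and growing for misclassifications (`hinge_loss` is an illustrative helper, with y in {-1, +1}):

```python
def hinge_loss(y, score):
    """Hinge loss for one sample: y is the true label in {-1, +1}, score = f(x)."""
    return max(0.0, 1.0 - y * score)
```

A correctly classified point outside the margin (`y * f(x) >= 1`) contributes nothing, which is why only the support vectors shape the solution.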
| Aspect | Details |
|---|---|
| Pros | Effective in high dimensions, memory efficient (only stores support vectors), kernel trick |
| Cons | Slow for large datasets O(n^2-n^3), sensitive to feature scaling, no probabilities by default |
| Don't Use | Large datasets (>100K samples), lots of noise, multi-class (use OvR wrapper) |
## Random Forest

Random Forest builds many decision trees on random subsets of data (bagging) and random subsets of features, then aggregates predictions by voting/averaging.

### Why It Works

- Bootstrap Aggregation (Bagging) - Each tree is trained on a random sample drawn with replacement
- Random Feature Selection - Each split considers only sqrt(m) features (classification) or m/3 (regression)
- Variance Reduction - Averaging many decorrelated trees reduces overfitting
- Out-of-Bag (OOB) Score - Free validation using the samples left out of each tree's bootstrap

### Key Hyperparameters

- `n_estimators` - Number of trees (more is better, with diminishing returns; typically 100-500)
- `max_depth` - Limit on tree depth (None = fully grown, the default)
- `max_features` - Features considered per split (sqrt(m) for classification, m/3 for regression)
- `min_samples_leaf` - Minimum samples required in leaf nodes
| Aspect | Details |
|---|---|
| Pros | Robust, handles missing data, feature importance, minimal tuning, parallelizable |
| Cons | Less interpretable than single tree, slower than linear models, large memory |
| vs XGBoost | RF is simpler and more robust; XGBoost is usually more accurate with tuning |
## AdaBoost

Adaptive Boosting sequentially trains weak learners (usually decision stumps), giving more weight to misclassified samples. The final model is a weighted combination of all learners.

### How It Works

1. Initialize equal weights for all samples
2. Train a weak learner and calculate its weighted error
3. Compute the learner's weight: `alpha = 0.5 * ln((1 - error) / error)`
4. Update sample weights: increase for misclassified, decrease for correct, then renormalize
5. Repeat for T rounds
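Steps 2-4 for a single round can be sketched directly from the formulas above (a minimal sketch assuming labels and predictions in {-1, +1} and `0 < error < 1`; `adaboost_round` is an illustrative name):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: weighted error, learner weight alpha, updated sample weights."""
    error = sum(w for w, p, t in zip(weights, predictions, labels) if p != t)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (p*t = -1) are scaled up, correct ones down
    new_w = [w * math.exp(-alpha * p * t)
             for w, p, t in zip(weights, predictions, labels)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```

With four equally weighted samples and one mistake (error = 0.25), the misclassified sample ends up carrying half of the total weight after one round, which is how the algorithm forces later learners to focus on hard examples.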
| Aspect | Details |
|---|---|
| Pros | Simple, focuses on hard examples, less prone to overfitting than expected |
| Cons | Sensitive to noisy data and outliers, sequential (can't parallelize) |
## XGBoost / LightGBM / CatBoost

Gradient Boosting builds trees sequentially, where each new tree corrects the errors of the ensemble so far. XGBoost adds regularization, second-order optimization, and engineering optimizations.

### XGBoost Key Innovations
- Second-order Taylor expansion - Uses both gradient and Hessian for better split gains
- Regularization - L1/L2 on leaf weights prevents overfitting
- Sparsity-aware - Handles missing values natively
- Column subsampling - Random feature selection like Random Forest
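The second-order split gain mentioned above has a closed form: with G and H the sums of gradients and Hessians on each side of a candidate split, the gain is the improvement in structure score minus the complexity penalty. A small sketch (`lam` plays the role of `reg_lambda` and `gamma` the per-leaf penalty; the L1 term is omitted for brevity):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting a node into (L, R): 0.5*(score_L + score_R - score_parent) - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

A split is only kept when this gain is positive, so raising `gamma` or `lam` directly prunes weak splits - that is how the regularization terms act during tree growth rather than after it.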
### XGBoost vs LightGBM vs CatBoost
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise (faster) | Symmetric (balanced) |
| Categoricals | Need encoding | Native support | Best native support |
| Speed | Good | Fastest | Moderate |
| Overfitting risk | Medium | Higher (leaf-wise) | Lower |
| Best for | General purpose | Large datasets | Categorical-heavy data |
### Key Hyperparameters

```python
# XGBoost typical starting point
params = {
    'max_depth': 6,           # Tree depth (3-10)
    'learning_rate': 0.1,     # Step size (0.01-0.3)
    'n_estimators': 500,      # Number of trees
    'subsample': 0.8,         # Row sampling ratio
    'colsample_bytree': 0.8,  # Column sampling ratio
    'reg_alpha': 0,           # L1 regularization
    'reg_lambda': 1,          # L2 regularization
    'min_child_weight': 1,    # Minimum leaf weight
}
```
## K-Means Clustering

K-Means partitions n observations into K clusters by iteratively assigning points to the nearest centroid and updating each centroid to its cluster's mean.

### Algorithm
```python
import random

def kmeans(data, k, max_iter=100):
    # K-Means++ initialization: spread initial centroids apart
    centroids = [random.choice(data)]
    for _ in range(1, k):
        dists = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
                 for p in data]
        total = sum(dists)
        probs = [d / total for d in dists]
        # Weighted random selection, proportional to squared distance
        r = random.random()
        cumsum = 0.0
        for i, p in enumerate(probs):
            cumsum += p
            if cumsum >= r:
                centroids.append(data[i])
                break
        else:
            centroids.append(data[-1])  # guard against floating-point shortfall
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in data:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to its cluster mean
        new_centroids = []
        for cluster in clusters:
            if cluster:
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(random.choice(data))  # reseed empty clusters
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```
### Choosing K
- Elbow Method - Plot inertia vs K, look for the "elbow"
- Silhouette Score - Measures cluster cohesion vs separation (-1 to 1, higher is better)
- Gap Statistic - Compares clustering to random uniform distribution
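The silhouette score for a single point makes the cohesion-vs-separation idea concrete: a = mean distance to its own cluster, b = mean distance to the nearest other cluster, s = (b - a) / max(a, b). A minimal sketch (`silhouette_point` is an illustrative helper; `own_cluster` is assumed to exclude the point itself):

```python
import math

def silhouette_point(p, own_cluster, other_clusters):
    """Silhouette coefficient for one point: near +1 = well placed, near -1 = misassigned."""
    a = sum(math.dist(p, q) for q in own_cluster) / len(own_cluster)
    b = min(sum(math.dist(p, q) for q in other) / len(other)
            for other in other_clusters)
    return (b - a) / max(a, b)
```

The dataset-level score in the bullet above is just the mean of this value over all points.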
| Aspect | Details |
|---|---|
| Time | O(n * K * m * iterations) - fast in practice |
| Pros | Simple, fast, scales well, easy to interpret |
| Cons | Must specify K, assumes spherical clusters, sensitive to initialization and outliers |
| Alternative | DBSCAN for arbitrary shapes, GMM for probabilistic |
## DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape by grouping dense regions separated by sparse areas, and automatically identifies noise/outliers.

### Key Parameters
- eps (epsilon) - Radius of neighborhood
- min_samples - Minimum points in neighborhood to form a core point
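The core/border/noise logic fits in a short brute-force O(n^2) sketch (a minimal illustration, not a production implementation; `dbscan` returns a cluster id per point, with -1 marking noise):

```python
import math

def dbscan(points, eps=0.5, min_samples=3):
    """Label each point with a cluster id; -1 marks noise. Brute-force neighbor search."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # Neighborhood includes the point itself, per the standard definition
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # only core points expand the cluster
    return labels
```

Note that K never appears: the number of clusters falls out of the density structure, which is the algorithm's main selling point.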
| Aspect | Details |
|---|---|
| Pros | No need to specify K, finds arbitrary shapes, identifies outliers |
| Cons | Sensitive to eps/min_samples, struggles with varying density clusters |
## Gaussian Mixture Model (GMM)

A probabilistic model that assumes data is generated from a mixture of K Gaussian distributions. It uses the EM algorithm to fit the parameters and provides soft (probabilistic) cluster assignments.

### EM Algorithm Steps

1. E-step - Compute responsibilities (the probability that each point belongs to each Gaussian)
2. M-step - Update means, covariances, and mixing coefficients using the responsibilities
3. Repeat until the log-likelihood converges
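The E-step for a single 1-D point is just Bayes' rule over the components (a minimal sketch; `params` as (weight, mean, std) triples is an illustrative representation):

```python
import math

def responsibilities(x, params):
    """E-step for 1-D Gaussians: posterior probability of each component given x."""
    def pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    weighted = [w * pdf(x, mu, s) for w, mu, s in params]
    total = sum(weighted)
    return [wd / total for wd in weighted]
```

A point equidistant from two identical components gets responsibility 0.5 from each - exactly the "soft assignment" that distinguishes GMM from K-Means' hard assignment.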
| Aspect | Details |
|---|---|
| vs K-Means | GMM gives probabilistic assignments, handles elliptical clusters, more flexible but slower |
| Use When | Clusters are elliptical, you need probabilities, data is Gaussian-like |
## Principal Component Analysis (PCA)

PCA finds orthogonal directions (principal components) of maximum variance and projects the data onto them, reducing dimensionality while preserving as much information as possible.

### Algorithm

1. Center the data (subtract the mean)
2. Compute the covariance matrix
3. Compute its eigenvalues and eigenvectors
4. Sort by eigenvalue (largest = most variance)
5. Project the data onto the top-K eigenvectors
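The five steps above map line-for-line onto NumPy (a minimal sketch; `pca` is an illustrative name, and `eigh` is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # 1. center
    cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # 4. sort eigenvalues descending
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]  # 5. project onto top-K eigenvectors
```

In practice SVD on the centered data is the numerically preferred route, but the eigendecomposition version mirrors the textbook steps most directly.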
### How Many Components?

- Plot cumulative explained variance - pick the K that explains ~95% of the variance
- Use a scree plot - look for the elbow in the eigenvalue magnitudes
| Aspect | Details |
|---|---|
| Pros | Reduces overfitting, speeds up training, removes multicollinearity |
| Cons | Loses interpretability, assumes linear relationships, sensitive to scaling |
| Alternative | t-SNE/UMAP for visualization (non-linear), LDA for supervised reduction |
## t-SNE

t-Distributed Stochastic Neighbor Embedding - a non-linear dimensionality reduction technique specialized for visualizing high-dimensional data in 2D or 3D.

### How It Works

It computes pairwise similarities in high-dimensional space using Gaussian kernels, builds a matching distribution in low-dimensional space using Student's t-distribution (whose heavier tails prevent crowding), and minimizes the KL divergence between the two distributions via gradient descent.
### Pure Python Core

```python
import numpy as np

def tsne_similarity(X, perplexity=30.0):
    """Compute pairwise affinities with perplexity-based bandwidth."""
    n = X.shape[0]
    distances = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        # Binary search for the sigma that achieves the target perplexity
        lo, hi = 1e-10, 1e4
        for _ in range(50):
            sigma = (lo + hi) / 2
            pi = np.exp(-distances[i] / (2 * sigma ** 2))
            pi[i] = 0
            pi /= pi.sum() + 1e-8
            entropy = -np.sum(pi * np.log2(pi + 1e-8))
            if entropy > np.log2(perplexity):
                hi = sigma  # distribution too flat: shrink the bandwidth
            else:
                lo = sigma  # distribution too peaked: widen the bandwidth
        P[i] = pi
    return (P + P.T) / (2 * n)

# Low-dimensional affinities use a Student-t kernel:
#   q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum
# Gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)
```
| Aspect | Details |
|---|---|
| Complexity | O(n^2) time and space - quadratic in the number of points |
| Key Params | perplexity (5-50), learning_rate (10-1000), n_iter (1000+) |
| Pros | Excellent local structure preservation, reveals clusters beautifully |
| Cons | Slow on large data, non-deterministic, inter-cluster distances not meaningful |
| Applications | Single-cell RNA-seq, document embeddings, image dataset exploration |
## UMAP

Uniform Manifold Approximation and Projection - preserves both local and global data structure, offering a faster and more scalable alternative to t-SNE.

### How It Works

UMAP is grounded in Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex (a weighted graph) in high-dimensional space, then optimizes a low-dimensional layout that preserves that topological structure using a cross-entropy loss.
### Key Advantages over t-SNE

- Speed - roughly O(n^1.14) empirically vs O(n^2) for t-SNE
- Global structure - preserves relative cluster positions, not just local neighborhoods
- Transform support - can embed new, unseen data points
- General purpose - usable as preprocessing for ML models, not just visualization
| Aspect | Details |
|---|---|
| Complexity | ~O(n^1.14) time, O(n×k) space where k = n_neighbors |
| Key Params | n_neighbors (5-50), min_dist (0.0-0.99), n_components, metric |
| Pros | Fast, preserves global structure, supports transform, scalable |
| Cons | Stochastic, hyperparameter-sensitive, harder to interpret theoretically |
| Applications | Single-cell genomics, image embeddings, clustering pipelines, feature exploration |
## Dimensionality Reduction Comparison
| Method | Type | Speed | Global Structure | Transform New Data | Best For |
|---|---|---|---|---|---|
| PCA | Linear | Very Fast | Yes | Yes | Feature reduction, preprocessing |
| t-SNE | Non-linear | Slow | No | No | Small dataset visualization |
| UMAP | Non-linear | Fast | Partial | Yes | Large-scale visualization + preprocessing |
## Algorithm Comparison Chart

Quick reference for choosing the right algorithm based on your requirements.
| Algorithm | Type | Data Size | Interpretable? | Speed | Best For |
|---|---|---|---|---|---|
| Linear Regression | Regression | Any | High | Very Fast | Linear relationships |
| Logistic Regression | Classification | Any | High | Very Fast | Binary classification baseline |
| Decision Tree | Both | Small-Med | Very High | Fast | Explainability |
| Random Forest | Both | Any | Medium | Medium | General purpose, minimal tuning |
| XGBoost | Both | Any | Low-Med | Medium | Tabular data, competitions |
| SVM | Classification | Small-Med | Low | Slow | High-dimensional, clear margins |
| KNN | Classification | Small | Medium | Slow (pred) | Simple problems, recommendations |
| Naive Bayes | Classification | Any | Medium | Very Fast | Text classification, spam |
| K-Means | Clustering | Any | High | Fast | Customer segmentation |
| DBSCAN | Clustering | Small-Med | Medium | Medium | Arbitrary shapes, outliers |
| PCA | Dim. Reduction | Any | Low | Fast | Feature reduction, visualization |
| t-SNE | Dim. Reduction | Small | Low | Slow | Cluster visualization |
| UMAP | Dim. Reduction | Any | Low | Fast | Visualization + preprocessing |