## Quick Selection by Data Type
| Data Type | Problem | Start With | Level Up To |
|---|---|---|---|
| Tabular | Classification | Logistic Regression | XGBoost / LightGBM |
| Tabular | Regression | Linear Regression | XGBoost / LightGBM |
| Text | Classification | TF-IDF + LogReg | BERT / Fine-tuned LLM |
| Text | Generation | Pre-trained LLM | Fine-tuned LLM (LoRA) |
| Text | Q&A / Search | BM25 | RAG with embeddings |
| Images | Classification | Pre-trained CNN (ResNet) | Vision Transformer (ViT) |
| Images | Object Detection | YOLO | DETR (Transformer-based) |
| Images | Generation | GAN | Diffusion Models |
| Time Series | Forecasting | ARIMA / Prophet | LSTM / Temporal Fusion Transformer |
| Graph | Node classification | GCN | GAT / GraphSAGE |
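As a concrete instance of a "start with" entry from the table, here is a minimal TF-IDF + logistic regression text-classification baseline using scikit-learn. The corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive review, 0 = negative review
texts = [
    "great product, works well",
    "terrible, broke after a day",
    "really happy with this",
    "awful quality, do not buy",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a linear classifier: the standard
# "start with" baseline before reaching for BERT or an LLM
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["works great, very happy"]))
```

A baseline like this trains in milliseconds and gives you a score to beat before committing to a fine-tuned Transformer.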
## By Business Constraint
| Constraint | Recommended Approach |
|---|---|
| Must explain predictions (healthcare, finance) | Linear/Logistic Regression, Decision Trees, SHAP |
| Real-time predictions (< 10ms) | Linear models, Naive Bayes, quantized models |
| Limited labeled data (< 100 samples) | Transfer learning, few-shot, pre-trained models |
| No labeled data at all | Clustering, dimensionality reduction, self-supervised |
| Frequently changing knowledge | RAG (easy to update document store) |
| Consistent style/format | Fine-tuning (SFT with LoRA) |
| Minimum engineering effort | CatBoost (tabular), pre-trained models, AutoML |
| Edge deployment (mobile, IoT) | Quantized models, Knowledge Distillation |
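For the explainability constraint, a shallow decision tree is a sketch of the idea: the entire model can be printed as human-readable rules. This example uses scikit-learn's built-in breast-cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

# A depth-limited tree stays small enough to audit by hand,
# which is the point in regulated domains like healthcare or finance
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text renders the full decision logic as if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```

When a single tree is too weak, SHAP values can provide per-prediction explanations for stronger models, at the cost of more tooling.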
## Common Mistakes
| Mistake | Why It's Wrong | Better Approach |
|---|---|---|
| Deep learning for small tabular data | Will overfit; gradient boosting is better | XGBoost / LightGBM / CatBoost |
| KNN on large datasets | O(n) per-query prediction is too slow | Tree-based models or an ANN index |
| t-SNE for dimensionality reduction | Designed for visualization only | PCA or UMAP for preprocessing |
| SVM on millions of rows | O(n²) to O(n³) training doesn't scale | LightGBM or neural networks |
| RNNs for NLP in 2025 | Transformers are superior | BERT, GPT, or other Transformers |
| Fine-tuning when RAG would work | Expensive, knowledge becomes stale | RAG for dynamic factual knowledge |
| GANs for image generation | Diffusion models produce better results | Stable Diffusion |
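The t-SNE row deserves a concrete illustration: scikit-learn's `TSNE` has no `transform` method for new data, so it cannot sit in a preprocessing pipeline, while PCA can. A minimal sketch on the built-in digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# PCA learns a projection on the training set and applies the same
# projection to unseen data; t-SNE can only embed the data it was fit on,
# so it is a visualization tool, not a preprocessing step
pipe = make_pipeline(PCA(n_components=30), LogisticRegression(max_iter=2000))
pipe.fit(X_tr, y_tr)
print(round(pipe.score(X_te, y_te), 3))
```

UMAP (via the separate `umap-learn` package) does support transforming new points, which is why the table lists it alongside PCA as a preprocessing option.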
## The Progressive Complexity Ladder
### Rule of Thumb
Don't move to the next step unless the current one is clearly insufficient. Each step adds complexity, cost, and maintenance. Many production systems run perfectly well on Step 2 or 3.
1. Baseline: Simple model (Linear/Logistic Regression, Naive Bayes)
2. Standard ML: Tree-based ensemble (Random Forest, XGBoost)
3. Optimized ML: Hyperparameter-tuned gradient boosting + feature engineering
4. Deep Learning: Neural networks (only if Step 3 is insufficient)
5. State-of-the-art: Pre-trained models, fine-tuning, ensembles