Quick Selection by Data Type

| Data Type | Problem | Start With | Level Up To |
|---|---|---|---|
| Tabular | Classification | Logistic Regression | XGBoost / LightGBM |
| Tabular | Regression | Linear Regression | XGBoost / LightGBM |
| Text | Classification | TF-IDF + LogReg | BERT / Fine-tuned LLM |
| Text | Generation | Pre-trained LLM | Fine-tuned LLM (LoRA) |
| Text | Q&A / Search | BM25 | RAG with embeddings |
| Images | Classification | Pre-trained CNN (ResNet) | Vision Transformer (ViT) |
| Images | Object Detection | YOLO | DETR (Transformer-based) |
| Images | Generation | GAN | Diffusion Models |
| Time Series | Forecasting | ARIMA / Prophet | LSTM / Temporal Fusion Transformer |
| Graph | Node classification | GCN | GAT / GraphSAGE |
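The BM25 baseline from the Text / Q&A row is simple enough to sketch in pure Python. This is a minimal sketch, not a production implementation: the parameters k1=1.5 and b=0.75 are common defaults, tokenization is naive whitespace splitting, and the toy corpus is made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "gradient boosting works well on tabular data",
]
print(bm25_scores("tabular data", docs))
```

Only once this kind of lexical ranking proves insufficient (synonyms, paraphrases) does it pay to level up to embedding-based RAG.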

By Business Constraint

| Constraint | Recommended Approach |
|---|---|
| Must explain predictions (healthcare, finance) | Linear/Logistic Regression, Decision Trees, SHAP |
| Real-time predictions (< 10 ms) | Linear models, Naive Bayes, quantized models |
| Limited labeled data (< 100 samples) | Transfer learning, few-shot, pre-trained models |
| No labeled data at all | Clustering, dimensionality reduction, self-supervised learning |
| Frequently changing knowledge | RAG (easy to update the document store) |
| Consistent style/format | Fine-tuning (SFT with LoRA) |
| Minimum engineering effort | CatBoost (tabular), pre-trained models, AutoML |
| Edge deployment (mobile, IoT) | Quantized models, knowledge distillation |
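To give the edge-deployment row some flavor, here is a minimal sketch of post-training quantization in plain Python: weights are mapped to 8-bit integers with a single per-tensor scale (symmetric quantization). The weight values are made up for illustration; real toolchains (e.g. per-channel scales, calibration) are considerably more involved.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to int8 range."""
    scale = max(abs(w) for w in weights) / 127  # map largest |w| to 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

The payoff is a 4x size reduction versus float32 (and faster integer arithmetic on many chips) at the cost of a bounded precision loss, which is exactly the trade edge deployment usually wants.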

Common Mistakes

| Mistake | Why It's Wrong | Better Approach |
|---|---|---|
| Deep learning for small tabular data | Will overfit; gradient boosting is better | XGBoost / LightGBM / CatBoost |
| KNN on large datasets | O(n) prediction time is too slow | Tree-based models or an ANN index |
| t-SNE for dimensionality reduction | Designed for visualization only | PCA or UMAP for preprocessing |
| SVM on millions of rows | O(n²) to O(n³) training doesn't scale | LightGBM or neural networks |
| RNNs for NLP in 2025 | Transformers are superior | BERT, GPT, or other Transformers |
| Fine-tuning when RAG would work | Expensive; knowledge becomes stale | RAG for dynamic factual knowledge |
| GANs for image generation | Diffusion models produce better results | Stable Diffusion |
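To make the t-SNE row concrete: PCA as a preprocessing step is only a few lines of NumPy via SVD on centered data. This is a minimal sketch (the random matrix stands in for real features), but it shows why PCA, unlike t-SNE, gives a reusable linear projection you can apply to new data.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)            # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T    # scores in component space

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
Z = pca_transform(X, n_components=10)
print(Z.shape)  # (200, 10)
```

Because SVD returns singular values in descending order, the components come out sorted by explained variance, so truncating to the first `n_components` keeps the most informative directions.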

The Progressive Complexity Ladder

Rule of Thumb
Don't move to the next step unless the current one is clearly insufficient. Each step adds complexity, cost, and maintenance burden. Many production systems run perfectly well on Step 2 or 3.
  1. Baseline: Simple model (Linear/Logistic Regression, Naive Bayes)
  2. Standard ML: Tree-based ensemble (Random Forest, XGBoost)
  3. Optimized ML: Hyperparameter-tuned gradient boosting + feature engineering
  4. Deep Learning: Neural networks (only if Step 3 is insufficient)
  5. State-of-the-art: Pre-trained models, fine-tuning, ensembles
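Step 1 can be even simpler than a linear model: a majority-class baseline sets the accuracy floor that every later step must clearly beat. A minimal sketch, with made-up labels for illustration:

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

train_labels = ["spam", "ham", "ham", "ham", "spam"]
predict = majority_baseline(train_labels)

test = [("cheap pills", "spam"), ("meeting at 3", "ham"), ("lunch?", "ham")]
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"baseline accuracy: {accuracy:.2f}")  # prints "baseline accuracy: 0.67"
```

If a tuned XGBoost model at Step 3 only edges out this kind of baseline by a point or two, that is a signal about the data, not a reason to jump to deep learning.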