The challenge of modern AI often starts with the immense computational cost of Training. This section explores breakthroughs in model efficiency that focus on optimization strategies applied before or during the training process, specifically targeting the reduction of time and hardware resources required to achieve a high-performing model.
We explore methods categorized by their phase of implementation (Applied During Training) and their primary goal (reducing Training Time). Discover techniques that fundamentally change how parameters are updated and stored, making the development of massive models practical and accessible. See the list of methods below,
| Category (Implementation) | Method | Primary Benefit | Rationale / Key Trade-off |
|---|---|---|---|
| Applied During Training | Efficient Optimizers | Training Time | Optimizers (e.g., Sophia, AdamW) converge faster or use less memory, directly cutting training resources. |
| Applied During Training | Distributed Training | Training Time | Techniques (FSDP, ZeRO) distribute memory/compute across devices to allow for faster, larger training runs. |
| Applied During Training | PEFT (LoRA, etc.) | Training Time | Drastically reduces the number of parameters that need to be trained/updated during the fine-tuning process. |
| Applied During Training | Efficient Architectures (MoE, SSMs) | Both | Architectures like MoE and Mamba are inherently more efficient, improving throughput in both training and inference. |
| Applied During Training | Knowledge Distillation (KD) | Inference Time | The goal is to generate a smaller, faster model (the student) that is cheap to run for prediction. |
| Applied During Training | Quantization-Aware Training (QAT) | Inference Time | The training is modified only to ensure the resulting low-precision model performs well during inference. |
| Applied During Training | Gradient-based Neural Architecture Search (NAS) | Inference Time (Net) | Trade-off: NAS significantly increases the total training time (search cost) to find an architecture that maximizes inference speed. |
| Applied During Training | Mixed Precision (MP) | Training Time | Uses lower precision (e.g., FP16/BF16) for non-critical calculations during the training loop to reduce memory and accelerate throughput. |
