A central challenge of modern AI is scaling models without letting operational (inference) costs skyrocket. This section explores breakthroughs in model efficiency that focus on optimization after (or during) training, specifically targeting deployment.

We categorize methods by their phase of implementation (e.g., Applied During Training vs. Applied Post-Training) and by their primary goal (reducing training time vs. inference time). These techniques decouple performance from prohibitive resource consumption, making large, high-performing AI models practical for everything from mobile devices to large-scale data center serving. The methods are summarized in the table below; minimal code sketches of each technique follow it.

| Category (Implementation) | Method | Primary Benefit | Rationale / Key Trade-off |
| --- | --- | --- | --- |
| Applied Post-Training / Inference | Post-Training Quantization (PTQ) | Inference Time | Weights are reduced in precision after training to immediately shrink model size and speed up prediction. |
| Applied Post-Training / Inference | Pruning | Inference Time | Removes redundant structure after the full model is trained, yielding a smaller, faster deployment model. |
| Applied Post-Training / Inference | Low-Rank Factorization (LRF) | Inference Time | Decomposes weight matrices post-training to reduce parameters and FLOPs for deployment. |
| Applied Post-Training / Inference | Model Compilers (TVM, XLA) | Inference Time | Software-level optimization of the computational graph, tailored to the target deployment hardware. |
| Applied Post-Training / Inference | Neural Architecture Search (NAS) | Inference Time (net) | Trade-off: NAS significantly increases total training time (search cost) in exchange for an architecture that maximizes inference speed. |
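To make the ideas concrete, here is a minimal sketch of post-training quantization: symmetric per-tensor int8 quantization of a trained weight matrix. The random `weights` array stands in for a real trained layer, and the scale choice (map the largest absolute value to 127) is one common convention, not the only one.

```python
import numpy as np

# PTQ sketch: quantize a trained float32 weight matrix to int8 after training.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)  # stand-in for trained weights

# Per-tensor symmetric scale: map the largest absolute weight to the int8 range [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time the weights are either dequantized on the fly or used by integer kernels.
deq_weights = q_weights.astype(np.float32) * scale

print("size ratio:", q_weights.nbytes / weights.nbytes)        # 0.25: int8 vs float32 storage
print("max abs error:", np.abs(weights - deq_weights).max())   # the accuracy cost of rounding
```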
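Pruning can be sketched just as simply. The snippet below applies unstructured magnitude pruning to an already-trained weight matrix: weights with the smallest absolute values are zeroed, and the sparsity target is an assumed hyperparameter (real deployments often prune structured blocks and fine-tune afterwards).

```python
import numpy as np

# Post-training magnitude pruning sketch: zero out the smallest-magnitude weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)  # stand-in for trained weights

sparsity = 0.5  # assumed target: fraction of weights to remove
threshold = np.quantile(np.abs(weights), sparsity)
mask = np.abs(weights) >= threshold
pruned = weights * mask  # sparse storage formats and sparse kernels exploit the zeros

print("fraction of weights kept:", mask.mean())  # ~0.5
```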
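Low-rank factorization replaces one large dense weight matrix with two thin factors. A minimal sketch using a truncated SVD is shown below; the rank `r` is an assumed target, and the approximation error is large here only because the stand-in matrix is random (trained weight matrices typically have faster-decaying spectra).

```python
import numpy as np

# LRF sketch: factor an m x n weight matrix W into A (m x r) @ B (r x n) via truncated SVD.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)  # stand-in for a trained weight matrix

r = 64  # assumed rank: trades accuracy for parameter count and FLOPs
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # shape (1024, r)
B = Vt[:r, :]          # shape (r, 1024)

x = rng.normal(size=(1024,)).astype(np.float32)
y_full = W @ x
y_lr = A @ (B @ x)     # two small matmuls replace one large one at inference time

print("parameter ratio:", (A.size + B.size) / W.size)  # 0.125 for r = 64
print("relative error:", np.linalg.norm(y_full - y_lr) / np.linalg.norm(y_full))
```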
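Model compilers operate on the computational graph rather than the weights. As one concrete (and minimal) illustration, `jax.jit` hands a traced function to XLA, which fuses and specializes the operations for the hardware it runs on; the tiny MLP here is purely illustrative.

```python
import jax
import jax.numpy as jnp

# Graph compilation sketch: XLA (via jax.jit) fuses the matmuls, bias adds, and ReLU
# into a program specialized for the deployment device (CPU/GPU/TPU).
def mlp(params, x):
    w1, b1, w2, b2 = params
    h = jax.nn.relu(x @ w1 + b1)
    return h @ w2 + b2

mlp_compiled = jax.jit(mlp)  # compiled on first call for the given shapes

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (
    jax.random.normal(k1, (256, 512)),
    jnp.zeros(512),
    jax.random.normal(k2, (512, 10)),
    jnp.zeros(10),
)
x = jax.random.normal(k3, (32, 256))
print(mlp_compiled(params, x).shape)  # (32, 10), produced by the compiled XLA program
```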
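Finally, the NAS trade-off can be caricatured as a search loop: extra compute is spent up front evaluating candidate architectures so that only the best accuracy/latency balance is deployed. The accuracy and latency functions below are hypothetical proxies; in real NAS each evaluation is a (partial) training run, which is where the large search cost comes from.

```python
import numpy as np

# Toy NAS sketch: search over candidate layer widths, score each one, deploy the winner.
def evaluate(widths):
    # Hypothetical proxies standing in for a real train-and-measure step.
    params = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    latency = params * 1e-6                  # proxy: latency grows with parameter count
    accuracy = 0.9 - 0.5 / np.log(params)    # proxy: diminishing returns from size
    return accuracy, latency

candidates = [
    [784, w1, w2, 10]
    for w1 in (128, 256, 512)
    for w2 in (64, 128, 256)
]

# The search itself is the "training-time" cost; only the selected architecture ships.
scored = [(evaluate(c), c) for c in candidates]
best = max(scored, key=lambda s: s[0][0] - 0.1 * s[0][1])  # accuracy minus a latency penalty
print("selected architecture:", best[1])
```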