Methods for Neural Networks Post-training Optimization

The challenge of modern AI lies in scaling models without skyrocketing operational costs (Inference). This section explores breakthroughs in model efficiency that focus on optimization after (or during) the training process, specifically targeting deployment.

We categorize methods by their phase of implementation (e.g., Applied During Training vs. Applied Post-Training) and their primary goal (reducing Training Time vs. Inference Time). Discover techniques that decouple performance from prohibitive resource consumption, making large, high-performing AI models practical for everything from mobile devices to large-scale data center serving. See the list of methods below,

Category (Implementation)	Method	Primary Benefit	Rationale / Key Trade-off
Applied Post-Training / Inference	Post-Training Quantization (PTQ)	Inference Time	Weights are reduced in precision after training to immediately reduce model size and speed up prediction.
Applied Post-Training / Inference	Pruning	Inference Time	Removes redundant structure after the full model is trained to achieve a smaller, faster deployment model.
Applied Post-Training / Inference	Low-Rank Factorization (LRF)	Inference Time	Decomposes weight matrices post-training to reduce parameters and FLOPs for deployment.
Applied Post-Training / Inference	Model Compilers (TVM, XLA)	Inference Time	Software-level optimization of the computational graph tailored for specific deployment hardware.
Applied Post-Training / Inference	Neural Architecture Search (NAS)	Inference Time (Net)	Trade-off: NAS significantly increases the total training time (search cost) to find an architecture that maximizes inference speed.