GCP Architecture for Generative AI Applications: A Practical Guide - Part 1

A comprehensive guide to designing production-ready generative AI applications on Google Cloud Platform, covering architectural patterns, service selection, autoscaling strategies, and cost optimization.

October 2025 · Saeed Mehrang
The What, When, and Why of Mixture-of-Experts (MoE)

Unpacking Mixture-of-Experts (MoE) in LLMs: A Foundational Dive

This blog post demystifies Mixture-of-Experts (MoE) layers, a key innovation for scaling Large Language Models efficiently. We’ll trace its origins, delve into the mathematical underpinnings, and build a foundational MoE block in PyTorch, mirroring the architecture from its initial conception.

October 2025 · Saeed Mehrang
Multi-head Latent Attention (MLA)

Multi-head Latent Attention (MLA): Making Transformers More Efficient

This blog post explains Multi-head Latent Attention (MLA) and provides a minimal working implementation in PyTorch.

October 2025 · Saeed Mehrang
KV Caching Simply

KV-Caching in LLMs: The Optimization That Makes Inference Practical

Learn how KV-caching makes ChatGPT respond in seconds instead of minutes. This comprehensive guide explains the quadratic complexity problem in transformers, how caching keys and values solves it with 10-100x speedups, and the memory trade-offs, complete with full PyTorch implementations, benchmarks, and interactive visualizations.

October 2025 · Saeed Mehrang
SwiGLU Activation Function

SwiGLU: The Activation Function Powering Modern LLMs

Discover why SwiGLU has replaced ReLU and GELU in modern transformers. This post explains the mathematical foundations, the evolution from sigmoid gates to Swish gates, and why this innovation delivers 5-8% performance gains - complete with Python implementations and interactive visualizations.

October 2025 · Saeed Mehrang
Grouped-Query Attention

Understanding Grouped-Query Attention: A Practical Guide with PyTorch Implementation

This blog post explains Grouped-Query Attention in simple terms, showing how it differs from vanilla multi-head attention in modern language models.

October 2025 · Saeed Mehrang
RoPE vs Sinusoidal Embeddings

Understanding Rotary Position Embeddings (RoPE): A Visual Guide

This blog post explains RoPE in simple terms, showing how it differs from sinusoidal embeddings and why it’s become the standard for modern language models.

October 2025 · Saeed Mehrang