
Taming the Transformer: How Perceiver IO and PaCa-ViT Conquer Quadratic Complexity

A deep dive into two novel architectures, Perceiver IO and PaCa-ViT, that break the O(N^2) barrier in Transformers, enabling them to process massive inputs efficiently.

October 2025 · Saeed Mehrang

Swin Transformer: Shifting Windows to Build Hierarchical Vision Models

This post provides a minimal PyTorch implementation of the Swin Transformer for a simple image classification task.

October 2025 · Saeed Mehrang

Computational Drug Discovery Part 5 (Part 3/3): Generative Models for De Novo Drug Design - Transformers

From prediction to creation (Part 3/3): how AI generates novel drug molecules optimized for multiple objectives using autoregressive transformer architectures.

October 2025 · Saeed Mehrang

Computational Drug Discovery Part 3 (Subpart 1/3): AlphaFold Overview

How DeepMind’s AlphaFold2 solved the 50-year grand challenge in biology – the protein folding problem – using transformers, evolutionary information, and geometric reasoning, and what it means for drug discovery - Subpart 1/3

October 2025 · Saeed Mehrang

Multi-head Latent Attention (MLA): Making Transformers More Efficient

This blog post explains Multi-head Latent Attention (MLA) and provides minimal working code in PyTorch.

October 2025 · Saeed Mehrang

KV-Caching in LLMs: The Optimization That Makes Inference Practical

Learn how KV-caching makes ChatGPT respond in seconds instead of minutes. This comprehensive guide explains the quadratic complexity problem in transformers, how caching Keys and Values solves it with 10-100x speedups, and the memory trade-offs involved - complete with full PyTorch implementations, benchmarks, and interactive visualizations.

October 2025 · Saeed Mehrang

SwiGLU: The Activation Function Powering Modern LLMs

Discover why SwiGLU has replaced ReLU and GELU in modern transformers. This post explains the mathematical foundations, the evolution from sigmoid gates to Swish gates, and why this innovation delivers 5-8% performance gains - complete with Python implementations and interactive visualizations.

October 2025 · Saeed Mehrang

Understanding Rotary Position Embeddings (RoPE): A Visual Guide

This blog post explains RoPE in simple terms, showing how it differs from sinusoidal embeddings and why it has become the standard for modern language models.

October 2025 · Saeed Mehrang