Building a Masked Autoencoder (MAE) from Scratch in PyTorch

Learn how the masked autoencoder (MAE) works and how to implement it in PyTorch

November 2025 · Saeed Mehrang
ViTDet

ViTDet: Plain Vision Transformer Backbones for Object Detection

ViTDet demonstrates that plain, non-hierarchical Vision Transformers can compete with hierarchical backbones for object detection through a few simple adaptations.

October 2025 · Saeed Mehrang
ViT

The Image is a Sequence: Dissecting the Vision Transformer (ViT)

An in-depth look at 'An Image is Worth 16x16 Words,' the paper that introduced the pure Vision Transformer: its architecture, novelty, and limitations, and how modern models like the Swin Transformer evolved from it.

October 2025 · Saeed Mehrang