Mech Interp Puzzles

Monthly mechanistic interpretability challenges on small toy models.

Each month we release a new puzzle: a small neural network trained on a toy algorithmic task. Your goal is to reverse-engineer the algorithm the model has learned.

Inspired by Callum McDougall's ARENA Monthly Algorithmic Challenges. Each puzzle ships with a starter Colab notebook and pre-trained weights hosted on Hugging Face. The code and training scripts are available on GitHub.

Puzzles

Given a sequence of 10 tokens drawn from a vocabulary of 10 symbols, predict the number of distinct symbols in the sequence.
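
To make the task concrete, here is a minimal sketch of the ground-truth function the model has to approximate. It assumes symbols are encoded as the integers 0 to 9 and that sequences are sampled uniformly at random; the actual encoding and sampling are defined by the training scripts, and the names `count_distinct` and `sample_example` are hypothetical.

```python
import random

SEQ_LEN = 10     # sequence length stated in the puzzle description
VOCAB_SIZE = 10  # vocabulary size stated in the puzzle description


def count_distinct(tokens):
    """Return the number of distinct symbols in the sequence."""
    return len(set(tokens))


def sample_example(rng):
    """Draw one (tokens, label) pair.

    Assumption: tokens are integers in [0, VOCAB_SIZE) sampled uniformly.
    The real dataset may use a different distribution.
    """
    tokens = [rng.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN)]
    return tokens, count_distinct(tokens)


if __name__ == "__main__":
    tokens, label = sample_example(random.Random(0))
    print(tokens, "->", label)
```

With 10 tokens over 10 symbols, the label always lies between 1 (one symbol repeated) and 10 (every symbol appears exactly once).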

Given a list of numbers, predict the maximum. Two variants: a 1-layer transformer on single-digit inputs, and a 2-layer transformer on two-digit inputs with digit-level tokenization.
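
As a rough illustration of what digit-level tokenization could look like in the two-digit variant, the sketch below splits each number into a tens token and a ones token. The zero-padding of single-digit values, the absence of separator tokens, and the function names are all assumptions; the puzzle's notebook defines the real scheme.

```python
def tokenize_two_digit(numbers):
    """Flatten numbers in [0, 99] into digit tokens: tens digit, then ones digit.

    Assumption: values below 10 are zero-padded (7 becomes 0, 7) and no
    separator tokens are used.
    """
    tokens = []
    for n in numbers:
        if not 0 <= n <= 99:
            raise ValueError(f"expected a value in [0, 99], got {n}")
        tokens.extend(divmod(n, 10))  # divmod(42, 10) == (4, 2)
    return tokens


def detokenize_two_digit(tokens):
    """Inverse of tokenize_two_digit: pair up digits and rebuild the numbers."""
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of digit tokens")
    return [10 * tokens[i] + tokens[i + 1] for i in range(0, len(tokens), 2)]


def max_label(numbers):
    """Ground-truth target: the maximum of the input list."""
    return max(numbers)


if __name__ == "__main__":
    numbers = [7, 42, 90]
    print(tokenize_two_digit(numbers))  # [0, 7, 4, 2, 9, 0]
    print(max_label(numbers))           # 90
```

Under this scheme the model cannot read a number from a single position: it has to combine a tens token with the following ones token before comparing values, which is one plausible reason the two-digit variant uses a second layer.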