Monthly mechanistic interpretability challenges on small toy models. Learn by doing.
Each month we release a new puzzle: a small neural network trained on a toy algorithmic task. The model is as simple as possible while achieving perfect accuracy. Your goal is to reverse-engineer the algorithm the model has learned.
Inspired by Callum McDougall's ARENA Monthly Algorithmic Challenges, these puzzles are an opportunity to get hands-on experience with mechanistic interpretability. Practice the methods you've learned — whether from the ARENA tutorials or elsewhere — on models small enough to understand.
These puzzles are also a chance to get familiar with nnsight for accessing model internals, and to be part of a growing interpretability community. They're meant to be educational and fun.
Each puzzle comes with a starter Colab notebook, pre-trained model weights on HuggingFace, and everything you need to get going. Poke around the weights, visualize attention patterns, formulate hypotheses, and write up what you find. When you're ready, submit your write-up via the provided form.
All code and training scripts are open source on GitHub.
Given a list of numbers, predict the maximum. Two variants: a 1-layer transformer on single-digit inputs, and a 2-layer transformer on two-digit inputs with digit-level tokenization.
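The exact data format lives in the starter notebook and the GitHub repo; as a rough sketch of the task setup (function names, sequence length, and seeding here are illustrative, not the repo's actual code):

```python
import random


def make_example_single(n=6, seed=None):
    """One example for the single-digit variant: a sequence of
    single-digit tokens, labeled with their maximum."""
    rng = random.Random(seed)
    xs = [rng.randint(0, 9) for _ in range(n)]
    return xs, max(xs)


def tokenize_two_digit(numbers):
    """Digit-level tokenization for the two-digit variant: each
    number 0-99 becomes a tens digit followed by a ones digit."""
    tokens = []
    for x in numbers:
        tokens.extend([x // 10, x % 10])
    return tokens


xs, label = make_example_single(seed=0)
print(xs, "-> max =", label)
print(tokenize_two_digit([42, 7, 99]))  # [4, 2, 0, 7, 9, 9]
```

Note that with digit-level tokenization the model never sees a two-digit number as one token, so the 2-layer variant has to compose information across digit positions to compare magnitudes.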