A Barebones Guide to Mechanistic Interpretability Prerequisites by Neel Nanda

Published by Neel Nanda on November 29, 2022 on The Effective Altruism Forum. Co-authored by Neel Nanda and Jess Smith. Crossposted on the suggestion of Vasco Grilo.

Why does this exist?

People often get intimidated when trying to get into AI or AI Alignment research, and often think that the gulf between where they are and where they need to be is huge. This presents practical concerns for people trying to change fields: we all have limited time and energy. And for the most part, people wildly overestimate the actual core skills required. This guide is our take on the essential skills required to understand, write code for, and ideally contribute useful research to mechanistic interpretability. We hope that it's useful and unintimidating. :)

Core Skills:

Maths:

Linear Algebra: 3Blue1Brown or Linear Algebra Done Right
Core goals - to deeply & intuitively understand these concepts:
- Basis
- Change of basis
- That a vector space is a geometric object that doesn't necessarily have a canonical basis
- That a matrix is a linear map between two vector spaces (or from a vector space to itself)
Bonus things that it's useful to understand:
- What's singular value decomposition? Why is it useful?
- What are orthogonal/orthonormal matrices, and how is changing to an orthonormal basis importantly different from just any change of basis?
- What are eigenvalues and eigenvectors, and what do these tell you about a linear map?

Probability basics
- Basics of distributions: expected value, standard deviation, normal distributions
- Log likelihood
- Maximum value estimators
- Random variables
- Central limit theorem

Calculus basics
- Gradients
- The chain rule
- The intuition for what backprop is - in particular, grokking the idea that backprop is just the chain rule on multivariate functions

Coding:

Python Basics
- The "how to learn coding" market is pretty saturated - there's a lot of good stuff out there, and not really a clear best option.
- Zac Hatfield-Dodds recommends Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both readable for free on inventwithpython.com, or purchasable as books); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

NumPy Basics
- Try to do the first ~third of these. Bonus points for doing them in PyTorch on tensors :)

ML:

Rough grounding in ML
- fast.ai is a good intro, but a fair bit more effort than is necessary. For an 80/20, focus on Andrej Karpathy's new video explaining neural nets.

PyTorch basics
- Don't go overboard here. You'll pick up what you need over time - learning to google things when you get confused or stuck is most of the real skill in programming.
- One goal: build a linear regression that runs in Google Colab on a GPU (see the sketch below).
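
The following is not from the original post - just a minimal sketch of that "linear regression on a GPU" goal, assuming PyTorch; the synthetic data, learning rate, and step count are arbitrary placeholder choices.

```python
# Minimal sketch (not from the original post): linear regression in PyTorch on a GPU.
# The synthetic data, learning rate, and number of steps are illustrative only.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic data: y = 3x + 2 plus a little noise
x = torch.rand(1000, 1, device=device)
y = 3 * x + 2 + 0.1 * torch.randn(1000, 1, device=device)

model = torch.nn.Linear(1, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1000):
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()   # backprop: the chain rule applied over the whole computation graph
    optimizer.step()

print(model.weight.item(), model.bias.item())  # should end up close to 3 and 2
```

If this runs in a Colab notebook with a GPU runtime and recovers roughly the right weight and bias, you've covered the PyTorch basics the post is pointing at: tensors on a device, a module, a loss, and an optimizer loop.
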
Transformers
- Probably the biggest way mechanistic interpretability differs from normal ML is that it's really important to deeply understand the architectures of the models you use, all of the moving parts inside of them, and how they fit together. In this case, the main architecture that matters is a transformer! (This is useful in normal ML too, but you can often get away with treating the model as a black box.)
- Check out the illustrated transformer.
- Note that you can pretty much ignore the stuff on encoder vs decoder transformers - we mostly care about autoregressive decoder-only transformers like GPT-2, which means that each token can only see tokens before it, and they learn to predict the next token.
- Good (but hard) exercise: code your own tiny GPT-2 and train it. If you can do this, I'd say that you basically fully understand the transformer architecture. (A sketch of the core attention mechanism follows at the end of this section.)
  - Example of basic training boilerplate and train script
  - The EasyTransformer codebase is probably good to riff off of here
- An ...
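
As a companion to the "code your own tiny GPT-2" exercise above (and not taken from the post or the EasyTransformer codebase), here is a minimal sketch of the masked ("causal") self-attention head at the heart of a decoder-only transformer like GPT-2; the dimensions and names (d_model, d_head) are illustrative choices.

```python
# Minimal sketch (illustrative, not the EasyTransformer codebase): one masked
# self-attention head, the core moving part of a decoder-only transformer.
import torch
import torch.nn.functional as F

class MaskedSelfAttentionHead(torch.nn.Module):
    def __init__(self, d_model: int = 64, d_head: int = 16):
        super().__init__()
        self.W_Q = torch.nn.Linear(d_model, d_head, bias=False)
        self.W_K = torch.nn.Linear(d_model, d_head, bias=False)
        self.W_V = torch.nn.Linear(d_model, d_head, bias=False)
        self.W_O = torch.nn.Linear(d_head, d_model, bias=False)
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]
        q, k, v = self.W_Q(x), self.W_K(x), self.W_V(x)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # [batch, seq, seq]
        # Causal mask: each position may only attend to itself and earlier positions
        seq = x.shape[1]
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        pattern = scores.softmax(dim=-1)                         # attention pattern
        return self.W_O(pattern @ v)                             # [batch, seq, d_model]

head = MaskedSelfAttentionHead()
out = head(torch.randn(2, 10, 64))   # output shape: [2, 10, 64]
```

A full tiny GPT-2 adds token and positional embeddings, multiple heads per layer, MLP blocks, layer norm, and an unembedding to vocabulary logits, but if the masked attention above makes sense, the rest of the architecture is mostly bookkeeping.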