Mechanistic interpretability, often shortened to mech interp, mechinterp, or MI, is a subfield of explainable artificial intelligence. It aims to understand neural networks by examining the internal computations they perform, much as reverse engineers study compiled binary programs to work out what they do.
History
The term "mechanistic interpretability" was coined by Chris Olah to describe his work on circuit analysis, which differed from the common approaches in interpretable AI at the time. Circuit analysis aimed to fully reverse-engineer the components and circuits inside models, whereas the broader field concentrated on gradient-based methods such as saliency maps.
Before circuit analysis, research in this area combined feature visualization, dimensionality reduction, and attribution with human-computer interaction techniques to study models such as the vision model Inception v1.
Key concepts
Mechanistic interpretability seeks to understand how machine learning models process information in terms of their internal components, such as learned representations and algorithms. This distinguishes it from earlier methods, which mainly explained how inputs relate to outputs.
The linear representation hypothesis holds that high-level concepts are represented as linear directions in a network's activation space. Studies of word embeddings and large language models support this idea, although it does not hold universally.
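The classic word-embedding evidence for this hypothesis is vector arithmetic such as "king - man + woman ≈ queen". A minimal sketch of that test, using invented toy vectors rather than real embeddings:

```python
# Toy illustration of the linear representation hypothesis: if "gender"
# is a linear direction shared across embeddings, then subtracting "man"
# from "king" and adding "woman" should land near "queen".
# The vectors below are hand-crafted for illustration, not real embeddings.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(analogy, emb[w]))
print(nearest)  # -> queen
```

Real embeddings are high-dimensional and noisy, so the analogy only holds approximately, but the linear structure is the same.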
Methods
Mechanistic interpretability uses methods that establish cause and effect between a model's internal components and its outputs. These methods often draw on formal tools from causal inference, such as interventions that modify internal activations and measure the resulting change in behavior.
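One such intervention is activation patching: cache an activation from a "clean" run, splice it into a "corrupted" run, and see how much of the clean behavior is restored. A minimal sketch on a toy two-layer network (all weights and inputs are invented; only the intervention pattern matters):

```python
# Activation patching on a toy network: overwrite one hidden unit's
# activation with its value from a clean run, then measure how much
# the output moves relative to the corrupted baseline.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 1))

def forward(x, patch=None):
    h = np.tanh(x @ W1)            # hidden activations
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value             # causal intervention on one unit
    return float(h @ W2)

clean, corrupted = rng.normal(size=4), rng.normal(size=4)
h_clean = np.tanh(clean @ W1)

baseline = forward(corrupted)
for i in range(4):
    patched = forward(corrupted, patch=(i, h_clean[i]))
    print(f"unit {i}: output moves by {patched - baseline:+.3f}")
```

Units whose patch moves the output strongly toward the clean output are candidates for carrying the causally relevant information.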
In AI safety, mechanistic interpretability helps researchers understand and audit the behavior of complex AI systems, and to identify possible risks, such as systems that do not act as intended.
A sparse autoencoder (SAE) is a model trained to decompose a network's neuron activations into a sparse set of components. The features it learns often correspond to basic concepts that humans can interpret. Anthropic used this technique to study how large language models work.
A circuit in a neural network is a chain of causal links between feature activations. By identifying which circuits lead to particular results, and by ablating or activating them, researchers can study how a neural network such as a large language model produces specific outputs from given inputs.
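Ablation can be sketched with a toy model whose output is the sum of two independent paths: zeroing one path out reveals how much of the output that path was responsible for. The components and weights below are invented for illustration.

```python
# Toy circuit ablation: the model's output is a sum over component
# paths; zero-ablating a path removes its contribution, and the
# difference from the full output measures that path's causal effect.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
A = rng.normal(size=(4, 4))    # path 1, e.g. one chain of features
B = rng.normal(size=(4, 4))    # path 2

def model(x, ablate=()):
    out = np.zeros(4)
    for name, W in {"A": A, "B": B}.items():
        if name not in ablate:          # zero-ablation: drop this path
            out += np.tanh(x @ W)
    return out

full = model(x)
without_A = model(x, ablate=("A",))
print("effect of path A:", np.round(full - without_A, 3))
```

In a real language model the "paths" are components such as attention heads or MLP layers, and the same subtraction localizes which components a given behavior runs through.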