12th March 2025
Our recent paper Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs builds on the new field of activation engineering (also known as representation engineering). So far, work has looked at controlling concepts like truth and honesty, as well as model refusal.
A common way to find a direction in activation space that corresponds to a concept (e.g. gender) is to take the difference in activations between a contrastive pair of inputs, which gives you a steering vector. These steering vectors are really a collection of vectors, one per layer of the model. The vector is added to (or subtracted from) the activations at every layer, or at the layers of your choosing. We also apply a scalar to these vectors, which is where the 'dial' element comes in: we can set these bias vectors to increase or decrease bias in the outputs, and by varying degrees. The results were pretty wild (though also crazy racist in some scenarios, so those were not included in the paper). There's a nice diagram of how this works on the first page of this paper.
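If you want a feel for the mechanics, here is a minimal sketch of contrastive-pair steering on a Hugging Face Llama/Mistral-style model. To be clear, this is not the paper's pipeline: the model name, prompt pair, layer and scale are placeholders picked for illustration, and our actual extraction (PCA over optimized contrastive datasets, applied across layers) is richer than a single difference vector.

```python
# Minimal sketch of contrastive-pair steering, not the paper's exact pipeline.
# The model name, prompt pair, layer and scale below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # any Llama/Mistral-style model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

LAYER = 14   # which decoder layer to steer (illustrative)
ALPHA = 4.0  # the scalar 'dial': positive adds the concept, negative removes it

def last_token_activation(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final prompt token at the given decoder layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer L's output is index L + 1
    return out.hidden_states[layer + 1][0, -1, :]

# A contrastive pair: identical except for the concept of interest (here, gender).
h_a = last_token_activation("The doctor said that he", LAYER)
h_b = last_token_activation("The doctor said that she", LAYER)
steering_vec = h_a - h_b  # the full method keeps one such vector per layer

def add_steering(module, inputs, output):
    """Forward hook: add the scaled steering vector to the residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steering_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = tok("The nurse walked in and", return_tensors="pt").to(model.device)
    generated = model.generate(**prompt, max_new_tokens=30)
    print(tok.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()
```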
Existing work has carefully selected either a single contrastive pair or a set of contrastive pairs, and calculated the mean difference in activations at each layer (or, in our case, a PCA at each layer). One of the first interesting things we do is dynamically create 50 different contrastive datasets for each of the 9 axes in the BBQ dataset, improving the generated datasets with Bayesian optimization until we get a contrastive dataset that is optimized to reduce a particular bias (e.g. age bias or racial bias). This cute diagram visualises it for you:
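In code, the search loop looks roughly like the sketch below. Everything here is schematic rather than our actual pipeline: the dataset generator and the bias metric are hypothetical stubs, and scikit-optimize stands in as an example Bayesian optimization library.

```python
# Schematic sketch of optimizing a generated contrastive dataset with Bayesian
# optimization (not the paper's code). The generator and bias metric are
# hypothetical stubs; scikit-optimize stands in as the BO library.
import random
from skopt import gp_minimize
from skopt.space import Integer, Real

def generate_contrastive_dataset(temperature: float, n_pairs: int):
    """Hypothetical: ask an LLM for n_pairs contrastive sentence pairs for one
    bias axis (e.g. age), sampled at the given temperature."""
    return [("placeholder stereotyped sentence", "placeholder counter-sentence")] * n_pairs

def bias_score(dataset) -> float:
    """Hypothetical: build a steering vector from the dataset, steer the model,
    and measure the remaining bias on a held-out BBQ slice (lower is better)."""
    return random.random()

def objective(params):
    temperature, n_pairs = params
    return bias_score(generate_contrastive_dataset(temperature, int(n_pairs)))

# Search over generation settings for one bias axis; repeat for each of the 9 axes.
result = gp_minimize(
    objective,
    dimensions=[Real(0.5, 1.2, name="temperature"),
                Integer(10, 50, name="n_pairs")],
    n_calls=50,  # e.g. 50 candidate datasets per axis
    random_state=0,
)
print("best bias score:", result.fun, "with params:", result.x)
```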
The second (and cooler) discovery was that if you add all these vectors together, in a lot of cases this performs better than any individual steering vector - we call this a Steering Vector Ensemble (SVE). In the table below, we show the difference in BBQ and MMLU scores for various baselines: average ISV (the average score of the individual steering vectors), merged datasets (simply concatenating all the contrastive datasets used to compute the ISVs), and SVE (the steering vector ensemble).
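Conceptually, the ensemble itself is nothing fancier than vector addition. Here's a minimal sketch at a single layer, with random placeholder vectors standing in for the real per-axis ISVs:

```python
# Minimal sketch of a Steering Vector Ensemble (SVE) at one layer: sum the
# individual steering vectors (ISVs) computed for each bias axis, then apply
# the combined vector exactly as a single steering vector would be applied.
import torch

HIDDEN_SIZE = 4096
axes = ["age", "gender", "race", "religion"]  # a subset of the 9 BBQ axes

# Placeholder ISVs; in practice each comes from that axis's optimized contrastive dataset.
isvs = {axis: torch.randn(HIDDEN_SIZE) for axis in axes}

# The ensemble is simply the sum of the individual vectors.
sve = torch.stack(list(isvs.values())).sum(dim=0)

# Applied like any other steering vector: scaled and added to the residual stream.
alpha = 1.0
def steer(hidden_states: torch.Tensor) -> torch.Tensor:
    return hidden_states + alpha * sve
```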
I won't give everything away here, but our experiments across Mistral, Llama, and Qwen show promising improvements in fairness without compromising language understanding, though I believe there is a lot more work to do in this area. For more details and future-work recommendations, check out the code in our GitHub repository and, of course, read the paper.