Aligned but Blind:
Alignment Increases Implicit Bias by Reducing Awareness of Race

University of Chicago, Rutgers University, Allen Institute for AI, University of Washington
ACL 2025 (Main)

Language model alignment unintentionally amplifies implicit racial biases by reducing models' sensitivity to race concepts - akin to race blindness in humans.

Figure: Interpretations of LM associations with black/white. Alignment makes the model treat these terms more like pure colors than racial groups in ambiguous settings; ignoring race-related associations can exacerbate bias.

Overview

Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair use. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods based on machine unlearning, our interventions show that steering the model to be more aware of racial concepts effectively mitigates bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.

Behavioral Evidence

We curated matched pairs of implicit and explicit racial bias prompts and used them to evaluate Llama 3 models.

  1. Explicit: Likert-scale questions asking whether the model agrees with a given association, such as black is related to negative and white is related to positive.
  2. Implicit: Word-association prompts that let the model freely pair black/white with positive/negative words.
Figure: Alignment reduces explicit bias but consistently amplifies implicit racial bias against black.
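To make the two evaluation formats above concrete, here is a minimal sketch of how such a prompt pair could be constructed. The template wording and the helper function are illustrative assumptions, not the paper's exact stimuli or scoring code.

```python
# Illustrative prompt templates for the two bias measures
# (wording is an assumption, not the paper's exact stimuli).

EXPLICIT_TEMPLATE = (
    "On a scale from 1 (strongly disagree) to 5 (strongly agree), "
    "do you agree that {group} is related to {attribute}? Answer with a number."
)

IMPLICIT_TEMPLATE = (
    "Here are four words: black, white, {positive_word}, {negative_word}. "
    "Pair each color word with one of the other words."
)

def build_prompts(positive_word: str = "joy", negative_word: str = "agony"):
    """Return one explicit and one implicit probe built from the same word pair."""
    explicit = EXPLICIT_TEMPLATE.format(group="black", attribute=negative_word)
    implicit = IMPLICIT_TEMPLATE.format(
        positive_word=positive_word, negative_word=negative_word
    )
    return explicit, implicit

if __name__ == "__main__":
    for prompt in build_prompts():
        print(prompt, end="\n\n")
```

One natural way to score such probes is to record the model's agreement rating for the explicit item and the fraction of trials in which black is paired with the negative word for the implicit item.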

💡 Key Insight: While alignment reduces explicit bias to near zero (8.1%), it increases implicit bias to 91.4%.

Mechanistic Insights

Interpreting Race versus Color:
Activation Patching

To explain this behavior, we use activation patching to test whether LMs represent black/white as race or color in ambiguous settings.

Figure: Activation patching reveals that aligned models treat black/white in ambiguous settings more as pure colors than as racial groups.
  • Aligned models are 52.2% less likely to represent race internally.
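As a rough illustration of how such a patching experiment can be set up (a minimal sketch, not the paper's implementation; the model id, layer index, and prompts below are assumptions), one can cache a layer's activation from a race-disambiguated run, substitute it into the corresponding position of an ambiguous run, and compare how the output distribution shifts:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: model id, layer index, and prompts are illustrative only.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
LAYER = 5          # an early decoder layer
TOKEN_POS = -1     # patch the final prompt token

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

source_prompt = "Race: black and white. The word black is"  # disambiguated as race
target_prompt = "Black and white. The word black is"        # ambiguous

cache = {}

def save_hook(_, __, output):
    # Decoder layers return a tuple; hidden states are the first element.
    cache["h"] = output[0][:, TOKEN_POS, :].detach().clone()

def patch_hook(_, __, output):
    hidden = output[0].clone()
    hidden[:, TOKEN_POS, :] = cache["h"]
    return (hidden,) + output[1:]

layer = model.model.layers[LAYER]

with torch.no_grad():
    # 1) Run the race-disambiguated prompt and cache the layer activation.
    handle = layer.register_forward_hook(save_hook)
    model(**tok(source_prompt, return_tensors="pt"))
    handle.remove()

    # 2) Re-run the ambiguous prompt with the cached activation patched in.
    handle = layer.register_forward_hook(patch_hook)
    patched_logits = model(**tok(target_prompt, return_tensors="pt")).logits
    handle.remove()

    clean_logits = model(**tok(target_prompt, return_tensors="pt")).logits

# Tokens whose next-token logits rise most when the race activation is patched in.
print(torch.topk(patched_logits[0, -1] - clean_logits[0, -1], k=5))
```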

Visualizing Latent Bias:
SelfIE Analysis

To explore whether stronger associations beyond the race/color binary might be present, we applied SelfIE - an open-ended natural language interpretation method that translates internal embeddings into text.

Figure: Examples of SelfIE interpretations for black/white.
  • SelfIE interpretations of black/white fell into three categories: color, race, or nonsensical outputs (e.g., repeating the instruction).
  • Consistent with activation patching results, the aligned model produced 74.4% fewer race-related interpretations than the base model on implicit prompts.
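A minimal sketch of a SelfIE-style readout is shown below, under simplifying assumptions: the published method injects the cached hidden state at a chosen layer of the interpretation pass, whereas this sketch substitutes it for a placeholder token's input embedding; the model id, source layer, and interpretation prompt are also illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: model id, source layer, and interpretation prompt are illustrative.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SOURCE_LAYER = 5

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

# 1) Cache the hidden state of the ambiguous "black" token.
prompt = "Pair black and white with the words joy and agony."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    hs = model(**ids, output_hidden_states=True).hidden_states[SOURCE_LAYER]
black_id = tok(" black", add_special_tokens=False)["input_ids"][0]
black_pos = (ids["input_ids"][0] == black_id).nonzero()[0, 0]
concept_vec = hs[0, black_pos]

# 2) Substitute that hidden state for a placeholder token's embedding in an
#    interpretation prompt, then let the model describe it in plain language.
interp_prompt = "Describe what X refers to in one short phrase:"
interp_ids = tok(interp_prompt, return_tensors="pt")
embeds = model.get_input_embeddings()(interp_ids["input_ids"]).clone()
x_id = tok(" X", add_special_tokens=False)["input_ids"][0]  # assumes " X" is one token
x_pos = (interp_ids["input_ids"][0] == x_id).nonzero()[0, 0]
embeds[0, x_pos] = concept_vec.to(embeds.dtype)

with torch.no_grad():
    out = model.generate(inputs_embeds=embeds,
                         attention_mask=interp_ids["attention_mask"],
                         max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))  # the natural-language interpretation
```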

💡 Key Insight: Aligned LMs failed to robustly represent race concepts in the face of ambiguity, exhibiting race blindness. Not representing race likely fails to activate safety guardrails, leading to unintended biases.

Causal Intervention:
Strengthening Race Associations

Unlike traditional debiasing that removes racial concepts, we add racial awareness to reduce bias - similar to how acknowledging race reduces colorblind racism in humans. The effectiveness of this approach also provides causal support for our interpretability findings.

Embedding Intervention via Steering

  • Injecting race-aware activations (from disambiguated prompts like "Race: black and white") reduces implicit bias by 26–42%, especially when applied to early layers.
Figure: Strengthening race associations reduces both implicit and explicit bias in LMs.
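A minimal sketch of such a steering intervention, assuming the steering vector is the difference between a race-disambiguated and an ambiguous context and is added to an early layer's output at generation time (the model id, layer, and scale are illustrative choices, not the paper's settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative settings only: model id, layer index, and scale are assumptions.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
LAYER, ALPHA = 5, 4.0

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final token after decoder layer LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hs[0, -1]

# Steering vector: race-disambiguated context minus the ambiguous one.
steer = last_token_state("Race: black and white") - last_token_state("black and white")

def add_steering(_, __, output):
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = "Here are four words: black, white, joy, agony. Pair them."
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30)
    print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```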

Weight Intervention via LoRA Fine-tuning

  • We fine-tune models using LoRA on prompts where black/white are used ambiguously but labeled with racial meaning.
  • Targeted early-layer fine-tuning cuts implicit bias from 97.3% to 42.4% and also reduces explicit bias.
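A minimal configuration sketch for restricting LoRA to early layers with the peft library; the target modules, rank, and layer range are assumptions for illustration, not the paper's hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Restrict the LoRA adapters to the first few decoder layers so that only the
# early-layer representations (where race awareness is weakest) are updated.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    layers_to_transform=list(range(0, 8)),  # early layers only (assumed range)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training would then proceed with a standard causal-LM objective on prompts
# where ambiguous uses of black/white are labeled with their racial meaning.
```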

💡 Key Insight: Enhancing the model's awareness of racial concepts effectively reduces implicit bias.

Implications

Our findings suggest a broader class of alignment failure: when debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model's ability to detect bias, which inadvertently exacerbates it. Rather than removing sensitive concepts, reinforcing awareness of these concepts in language models can mitigate biases. Future research could extend similar methodologies to other bias domains and investigate the origins of harmful associations in pretraining, informing deeper alignment strategies across broader contexts.

BibTeX

@inproceedings{sun2025aligned,
  title={Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race},
  author={Sun, Lihao and Mao, Chengzhi and Hofmann, Valentin and Bai, Xuechunzi},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025},
  url={https://arxiv.org/abs/2506.00253}
}