Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity

Rheeya Uppaal1, Apratim Dey2, Yiting He3, Yiqiao Zhong1, Junjie Hu1
1University of Wisconsin-Madison 2Stanford University 3University of Science and Technology of China

ICLR 2025
Note: You may also find another version of this paper titled DeTox: Toxic Subspace Projection for Model Editing. This is the same paper—just an earlier iteration. Apologies for any confusion!

A Tale of Two Communities

Large language models have become extraordinarily capable, yet building systems that behave safely and remain aligned with human intent is still an unsolved challenge. Even models that have undergone several rounds of reinforcement learning and fine-tuning can often be jailbroken with surprisingly simple prompts—by reframing an unsafe request as a “thought experiment”, disguising intent through encoding, or role-playing (“pretend you’re a malicious AI executing this task”). These failures reveal more than weaknesses in filtering: they expose a deeper fragility in how alignment is represented within the model, where unsafe patterns remain latent and easily reactivated.

Community A: Alignment through Tuning

The dominant approach has been to treat alignment as a training problem. Methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO), and Reinforcement Learning from AI Feedback (RLAIF) adjust model parameters so that outputs better match human preferences. These techniques have clearly improved controllability, but they are data- and compute-intensive and often opaque. Even heavily aligned models can regress when the distribution shifts or when an adversarial prompt exploits overlooked patterns—suggesting that training may polish surface behaviour more than it reshapes the underlying geometry of unsafe representations.

Community B: Alignment through Editing

A smaller but steadily growing line of work takes a different view. If problematic behaviour is stored somewhere within the model’s representations, perhaps we can intervene directly and "delete" the behaviour from the model. Model editing treats alignment as an intervention problem: identify and modify the directions in weight or activation space that give rise to harmful outputs. Editing is fast, requires little data, and—unlike gradient-based tuning—makes the change explicit. Because the modified directions can be inspected, it offers a transparency that conventional fine-tuning rarely provides.

Bridging the Communities

Despite its appeal, editing remains a niche practice, partly because its broader side effects are not yet well understood. Tuning methods are evaluated through large-scale preference benchmarks, while editing work is usually framed as an interpretability problem. The two strands have therefore evolved in parallel, with little shared vocabulary. The result is a fragmented landscape in which tuning papers compare only with tuning baselines and editing papers with editing baselines, even though both aim to steer the model’s internal space towards desirable behaviours.

This divide motivates our study. If both training and editing attempt to reshape a model’s latent geometry to encourage some directions and suppress others, then it should be possible to describe them within a common framework. Can an edit be understood as the limit of a training update? And if so, what might that reveal about when alignment generalises—and when it quietly fails?

Where Toxicity Lives: Mapping Subspaces in Model Representations

If a model can be persuaded to generate harmful text, then at some level it must represent the concept of harm internally. The question, of course, is how and where. A long-standing intuition in the interpretability community—often referred to as the linear representation hypothesis—suggests that many abstract concepts in large neural models correspond to roughly linear directions in activation space. Features such as sentiment, politeness, or toxicity appear to occupy low-dimensional subspaces that can be identified by contrasting the data that activate them. Although this view is an idealisation, it provides a useful framework for reasoning about how behaviour is encoded and how it might be modified.

Examples of toxic and non-toxic text.
Figure 1: Examples of paired data—toxic and non-toxic continuations to the same prompt. Tokens have been censored for readability.

Our work builds on this idea to ask whether the toxic tendencies of a model can be isolated geometrically. The procedure is straightforward: we begin with a small set of paired examples, a toxic and a non-toxic continuation of the same prompt (Figure 1). We then record the model's internal embeddings for each, denoted \( X^{+}_{\ell} \) and \( X^{-}_{\ell} \in \mathbb{R}^{N \times D} \) respectively, where \( \ell \) indexes the layer, \( N \) is the number of datapoints and \( D \) is the model's hidden dimension. (We take embeddings from the last token position.) The row-wise difference between the two embedding matrices gives a rough, per-example approximation of the toxic direction at that layer:

\( T^{0}_{\ell} := X^{+}_{\ell} - X^{-}_{\ell} \)

By collecting such differences across multiple examples, we obtain a matrix of contrastive representations that captures how the model's activations shift when moving from non-toxic to toxic text. Applying singular value decomposition (SVD) to this matrix lets us identify the principal axes of variation that consistently separate the two classes:

\( U \Sigma V^{\top} = T^{0}_{\ell} \)

The top orthogonal directions from this operation, our toxic subspace, carry the strongest toxic signal, while the later directions tend to reflect frequent or contextual features that are not inherently harmful. But how do we know this for sure? To interpret what each direction represents, we apply a simple diagnostic known as the logit lens: each singular vector is fed through the model's output layer to see which tokens it would most strongly predict if treated as a hidden state. Table 1 shows the results: the top few singular vectors encode toxicity (their top-ranked words are toxic), while later directions encode non-toxic concepts.

Interpreting the top singular vectors of the difference of preference data embeddings.
Table 1: Interpreting the top singular vectors of the difference of preference data embeddings. Tokens have been censored for readability.
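To make the recipe concrete, here is a minimal sketch (not the paper's released code) of the contrastive-SVD step and the logit-lens diagnostic. It assumes you have already collected last-token hidden states `X_pos` and `X_neg` for the paired continuations at one layer, the model's unembedding matrix `W_U`, and a `tokenizer`; these names are placeholders.

```python
# Minimal sketch, assuming X_pos and X_neg are (N, D) tensors of last-token
# hidden states for the toxic / non-toxic continuations at one layer, W_U is
# the (V, D) unembedding matrix, and `tokenizer` decodes token ids.
import torch

def toxic_directions(X_pos, X_neg, k=10):
    T0 = X_pos - X_neg                                 # contrastive matrix T0 at layer l
    U, S, Vh = torch.linalg.svd(T0, full_matrices=False)
    return Vh[:k]                                      # top-k right singular vectors, (k, D)

def logit_lens(direction, W_U, tokenizer, top=10):
    # Treat the direction as a hidden state and read off its top-scoring tokens.
    # (In practice one may apply the model's final layer norm first.)
    logits = W_U @ direction                           # (V,)
    top_ids = torch.topk(logits, top).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]

# Inspect each candidate direction, as in Table 1:
# for i, v in enumerate(toxic_directions(X_pos, X_neg)):
#     print(i, logit_lens(v, W_U, tokenizer))
```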

This observation makes the alignment problem more tangible. If undesirable behaviours occupy identifiable regions in representation space, alignment need not rely solely on iteratively adjusting output distributions. Instead, we can intervene directly on the model’s internal geometry—removing or re-orienting the very directions that give rise to those behaviours. The real challenge lies in precision: how to act on a subspace without disrupting the surrounding manifold of useful capabilities. It is this question that motivates our projection-based method, ProFS, to which we now turn.

ProFS: Projection Filter for Subspaces

We now know where toxicity lives inside a model’s representations. The next challenge is to remove it without damaging the model’s broader capabilities. Our key intuition is that effective editing requires the purest possible toxic subspace—a space that captures toxicity and only toxicity. Any overlap with directions encoding syntax or semantics risks corrupting the model’s general fluency or reasoning ability.

Step 1: Removing the mean direction

Table 1 reveals an interesting pattern: the mean vector across non-toxic examples tends to encode common tokens (stopwords, punctuation, and grammatical markers) that are fundamental to coherent text generation. If left uncorrected, these high-frequency directions can leak into our toxic subspace estimate, causing the edit to damage the model's sentence-forming ability. To prevent this, we first centre the contrastive matrix by removing the mean direction. Formally, if \( T^{0}_{\ell} \) is the initial matrix of toxic–non-toxic differences and \( \mu \) is its mean, we compute:

\( T_{\ell} = T^{0}_{\ell}\,(I - P_{\mu}), \qquad P_{\mu} := \frac{\mu\,\mu^{\top}}{\|\mu\|^{2}} \)

This centring step isolates the components that vary specifically with toxicity, ensuring that subsequent edits leave linguistic structure intact.
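As a sketch, and under the assumption that \( \mu \) is estimated as the mean row of \( T^{0}_{\ell} \) (a \( D \)-dimensional vector), the centring step is a few lines:

```python
# Step 1 sketch: remove the estimated frequent-token direction from each row.
# Assumes mu is taken to be the mean row of T0 (an assumption of this sketch).
import torch

def center_contrastive(T0):
    mu = T0.mean(dim=0)                                   # (D,)
    P_mu = torch.outer(mu, mu) / mu.dot(mu)               # rank-1 projector onto mu
    I = torch.eye(T0.shape[1], dtype=T0.dtype, device=T0.device)
    return T0 @ (I - P_mu)                                # T_l = T0_l (I - P_mu)
```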

Step 2: Extracting the toxic subspace

We then decompose the centred matrix using singular value decomposition (SVD), as before, and retain only the top-r singular vectors identified as toxic through logit-lens inspection (see Table 1 and Figure 2). These vectors define the model’s true toxic subspace:

\( U \Sigma V^{\top} = T_{\ell}, \qquad P_{\text{toxic}} = \sum_{i=1}^{r} v_{i} v_{i}^{\top} \)

This subspace represents the geometric directions most strongly associated with harmful or offensive language, purified of frequency and topic effects. The remaining orthogonal directions capture normal linguistic and semantic behaviour and should therefore be preserved.
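A corresponding sketch of Step 2, with the rank `r` chosen by inspecting the singular vectors with the logit lens as above:

```python
# Step 2 sketch: the toxic projector from the top-r right singular vectors
# of the centred contrastive matrix.
import torch

def toxic_projector(T, r):
    _, _, Vh = torch.linalg.svd(T, full_matrices=False)
    V_r = Vh[:r]                                          # (r, D) toxic directions
    return V_r.T @ V_r                                    # P_toxic = sum_i v_i v_i^T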

Step 3: Projecting out toxicity

With the toxic subspace in hand, we edit the model directly. For each MLP block, we modify only the second (value) matrix, which prior interpretability work has shown to store factual and conceptual knowledge. The projection is a single, analytic operation:

\( W_{\ell}^{\text{edited}} = (I - P_{\text{toxic}})\, W_{\ell}^{\text{original}} \)

This removes the component of each edited layer's output that lies in the toxic subspace, filtering out harmful representations while leaving the remaining knowledge largely untouched. Because the edit is linear and applied once, the procedure is fast, transparent, and free from the instabilities of gradient-based tuning.
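In code, the edit itself is correspondingly small. The sketch below assumes the weight is stored with its output dimension first, so that left-multiplying by \( I - P_{\text{toxic}} \) filters the layer's output directions; real implementations need to respect each framework's weight layout (some store this matrix transposed).

```python
# Step 3 sketch: one analytic edit per chosen MLP-value matrix.
import torch

@torch.no_grad()
def edit_weight(W, P_toxic):
    # W: (D, D_in) weight with output dimension first (layout assumption).
    I = torch.eye(P_toxic.shape[0], dtype=W.dtype, device=W.device)
    return (I - P_toxic) @ W
```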

Schematic illustration of projection filtering in ProFS.
Figure 2: Schematic of ProFS. Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact.

How does this perform?

Empirically, ProFS performs on par with DPO—the widely used optimisation-based alignment method. Figure 3 shows that both approaches achieve comparable reductions in toxicity while preserving general capability across diverse benchmarks such as BoolQ, RTE, and HellaSwag. The projection does not degrade fluency or factual accuracy, confirming that removing the purified toxic subspace leaves other functions unaffected.

Comparison of toxicity reduction and capability preservation across models.
Figure 3: ProFS reduces toxicity to the same degree as DPO while maintaining task performance. Further, ProFS achieves DPO-level toxicity reduction with roughly four times fewer preference pairs.

Even more interestingly, Figure 4 shows that layer-wise probability contributions change in nearly identical patterns for ProFS and DPO, suggesting that the two methods influence similar internal mechanisms. Yet ProFS reaches this point with a fraction of the data: as Figure 3 illustrates, it requires roughly one-quarter of the preference pairs used by DPO to achieve the same toxicity reduction (in the paper, we show it needs as few as 50 datapoints). This remarkable sample efficiency follows from its geometric construction: ProFS removes the subspace directly instead of learning to approximate it through noisy preference gradients (more on this below).

Layer-wise contributions to token probabilities for DPO and ProFS.
Figure 4: Contribution of individual layers to token probabilities. DPO and ProFS show similar patterns, incrementally suppressing the probability of toxic tokens.

In summary, ProFS mirrors DPO in its effects but achieves them through a single projection rather than thousands of optimisation steps. The two appear empirically equivalent in their outcomes, yet differ profoundly in efficiency and transparency. This naturally raises a deeper question: are they theoretically connected? Can ProFS be derived as a principled, closed-form limit of DPO?

When Geometry Meets Optimisation: A Factor Analysis View of ProFS and DPO

Empirically, ProFS behaves remarkably like DPO. Both reduce toxicity while preserving linguistic fluency, and both shift token probabilities in nearly identical ways. Yet the two methods could not be more different in spirit: DPO is an optimisation procedure, while ProFS is a single geometric projection. To understand why they converge to such similar outcomes, we take a step back and construct a simple theoretical model.

A minimal model for layer activations

We begin with a factor-analytic view of what happens inside a language model’s hidden layers. Recall our earlier setting: we have toxic and non-toxic sentence embeddings, \( X^{+} \) and \( X^{-} \in \mathbb{R}^{N \times D} \). Each pair of activations \( x_i^{+} \) and \( x_i^{-} \) can be seen as a combination of several independent components: a general linguistic mean, a toxic component, a contextual component, and some residual noise. Formally, we can write:

\( x_i^{+} = a^{+}\mu + Bf_i + \tilde{B}\tilde{f}_i + u_i^{+}, \quad x_i^{-} = a^{-}\mu + \tilde{B}\tilde{f}_i + u_i^{-} \)

Here, \( \mu \) denotes the mean direction of the language, capturing stopwords, punctuation, and other high-frequency tokens essential for syntax. \( Bf_i \) is the toxic component, \( \tilde{B}\tilde{f}_i \) represents topic-dependent variation (e.g. subject matter), and \( u_i^{\pm} \) is a noise term the model cannot account for.

From differences to the centred toxic subspace

ProFS constructs an approximate toxic subspace by taking the difference between toxic and non-toxic activations and then centring it, removing the mean direction \( \mu \). Applying this to the decomposition above, we obtain:

\( (I - P_{\mu})(x_i^{+} - x_i^{-}) = (I - P_{\mu})Bf_i + (I - P_{\mu})(u_i^{+} - u_i^{-}) \)

This centred difference consists of two components: a low-rank signal term involving the toxic directions \( B \), and a residual noise term.

For convenience, let us define:

\( g_i := (I - P_{\mu})(u_i^{+} - u_i^{-}), \qquad B^{*} := (I - P_{\mu})B \)

Here, \( g_i \) represents the centred noise for the i-th sample, and \( B^{*} \) denotes the centred toxic subspace—the component of \( B \) that remains after removing the linguistic mean direction. Substituting these back gives:

\( (I - P_{\mu})(x_i^{+} - x_i^{-}) = B^{*}f_i + g_i \)

Stacking these differences across all \( N \) examples produces a matrix of contrastive representations, \( T_{\ell} \), such that:

\( T_{\ell} = [\, B^{*}f_1 + g_1,\; B^{*}f_2 + g_2,\; \ldots,\; B^{*}f_N + g_N \,]^{\top} \)

Equivalently, this can be written compactly as:

\( T_{\ell} = F(B^{*})^{\top} + G \)

where \( F \) collects the latent factors \( f_i \) as rows, and \( G \) collects the noise terms \( g_i \). The first term, \( F(B^{*})^{\top} \), represents the structured, low-rank signal that encodes toxicity; the second term, \( G \), represents unstructured noise. This decomposition is important, and we'll come back to it soon.
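Although we return to this point below, a quick toy simulation already illustrates why the decomposition matters: the top-\( r \) right singular vectors of \( F(B^{*})^{\top} + G \) recover the span of \( B^{*} \) from a modest number of noisy contrastive rows. All sizes and noise levels in the sketch are made up for illustration.

```python
# Synthetic check of the factor-model claim (dimensions are illustrative only).
import torch

torch.manual_seed(0)
N, D, r = 200, 768, 3                         # examples, hidden size, toxic rank
Q, _ = torch.linalg.qr(torch.randn(D, r))     # orthonormal "true" toxic basis B*
F = torch.randn(N, r)                         # latent toxic factors
G = 0.1 * torch.randn(N, D)                   # unstructured noise
T = F @ Q.T + G                               # T = F (B*)^T + G

_, _, Vh = torch.linalg.svd(T, full_matrices=False)
V_r = Vh[:r].T                                # estimated (D, r) toxic basis

# Subspace overlap in [0, 1]; chance level is about r / D here.
overlap = (Q.T @ V_r).pow(2).sum() / r
print(f"subspace overlap: {overlap.item():.3f}")   # close to 1 in this toy setting
```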

DPO through the same lens

We can now re-express DPO within this same framework. Assume a simple logistic model \( \pi_W(y|x_i) \) whose conditional probability of the next token is:

\( \pi_W(y|x_i) = Z_W^{-1} \exp(w_y^{\top}Wx_i) \)

Under this formulation, the gradient of the DPO loss at initialisation can be written as:

\( \nabla_W \mathcal{L}_{\text{DPO}} \propto \sum_i \big(w_{y_i}(x_i^{+})^{\top} - w_{y_i}(x_i^{-})^{\top}\big) \)

Notice the structural similarity: this gradient depends on differences of activations—just like the construction of \( T_{\ell} \) in ProFS. Hence, we can apply the same factor-analysis model to DPO as well. Both methods are, at their core, recovering the same toxic signal \( F(B^{*})^{\top} \) while attenuating the noise term \( G \).

Two voices, One song

(I expect no one to get the reference to this title. If you do, we need to be friends!)

When ProFS applies singular value decomposition (SVD) to \( T_{\ell} \), it effectively separates these two components, recovering \( B^{*} \) as the dominant low-rank subspace. Because SVD performs an implicit denoising, ProFS can identify the correct subspace cleanly with only a few hundred paired examples.

DPO, on the other hand, performs the same operation implicitly through optimisation. Its stochastic gradient updates also move the model along the direction of \( x_i^{+} - x_i^{-} \), but the noise terms only cancel out after averaging over many training iterations. In this sense, DPO performs statistical denoising, whereas ProFS achieves the same result geometrically and in closed form.

This parallel reveals a deeper connection: both methods act on the same underlying structure in activation space. DPO smooths the toxic signal by averaging across a large dataset, while ProFS extracts it directly by computing a clean projection in one step. The difference lies not in what they change, but in how they approach it.

Under this model, we can view ProFS as a denoised, closed-form variant of a single DPO step. Where DPO gradually converges to the correct subspace through iterative updates, ProFS reaches it analytically by projecting out the noisy directions. This perspective bridges the gap between training-based alignment and editing-based alignment, suggesting that both are, fundamentally, different routes to reshaping the same internal geometry.

Empirical Evidence

ProFS and DPO are heading in the same direction; ProFS just gets there in one clean move instead of thousands of gradient steps. To check whether that actually happens in practice, we looked at the DPO gradients themselves. Specifically, we took the first-step DPO gradient \( G \) with respect to each layer's MLP-value matrix and asked: how much of each gradient falls inside the toxic subspace that ProFS identifies?

To measure this overlap, we used a simple metric: the ratio \( \| P_{\text{toxic}} G \|_{F} / \| G \|_{F} \), which tells us how much of DPO’s update lives in the same space that ProFS would project out. We repeated this with different numbers of training pairs—8, 32, and 128—and compared it against a random baseline where the same projection is applied to a random matrix.
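The metric itself is a two-line computation. In the sketch below, `grad` stands for the first-step DPO gradient with respect to one MLP-value matrix (same shape as the weight, output dimension first) and `P_toxic` is the projector from ProFS; both names are placeholders.

```python
# How much of a gradient update lies inside the ProFS toxic subspace.
import torch

def subspace_fraction(P_toxic, grad):
    return torch.linalg.norm(P_toxic @ grad) / torch.linalg.norm(grad)

# Random baseline: the same projection applied to a random matrix of equal shape.
# baseline = subspace_fraction(P_toxic, torch.randn_like(grad))
```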

Correlation between DPO gradients and the toxic subspace
Figure 5: Correlation between DPO gradients and the toxic subspace identified by ProFS. Larger sample sizes increase alignment, especially in higher layers.

The pattern is striking. As shown in Figure 5, DPO’s gradients line up with the ProFS subspace far more than chance. The correlation gets stronger in later layers — exactly where edits tend to be most effective — and it grows with sample size. In other words, DPO slowly “learns” the same subspace that ProFS finds instantly. With enough data, both approaches end up moving weights in almost the same direction; ProFS just skips the noisy averaging and computes that direction directly. That’s what makes it a kind of denoised shortcut to DPO.

Wider Applicability of ProFS

Preferences Beyond Toxicity

So far, we’ve focused on toxicity because it’s a clean and measurable case — the signal is easy to spot and the model’s responses can be evaluated automatically. But real alignment problems rarely look that neat. Concepts like harmfulness or ethical sensitivity are far more subtle, often defined by context rather than keywords. So, can the same geometric intuition that worked for toxicity extend to these higher-level preferences?

It turns out, yes. When we apply ProFS to the HH-Golden safety dataset — which targets a broader notion of harmlessness — the same subspace structure appears. The difference is simply that the rank of the subspace (denoted by r) is larger, reflecting that these behaviours are more complex and multidimensional than lexical toxicity. Table 2 shows that after aligning both models to the HH-Golden dataset, ProFS achieves a higher win rate over DPO according to GPT-4 evaluations. In other words, even for nuanced safety preferences, a simple geometric projection performs on par with or better than full-scale preference optimisation.

ProFS vs DPO win rates on HH-Golden dataset
Table 2: Win rate of ProFS over DPO (evaluated by GPT-4) on the HH-Golden safety dataset. The higher subspace rank reflects the complexity of the underlying behaviour.

Robustness to Noisy Labels

Real datasets are messy. Human annotations are inconsistent, and preference data often carries noise — especially when collected at scale. With toxicity, for instance, mislabelled data can make a model more toxic if it learns from the wrong examples. So how does ProFS handle label noise compared to a training-based method like DPO?

To test this, we randomly flipped the labels of a fraction of the dataset and measured how both methods performed. As shown in Figure 6, DPO's effectiveness drops steadily as noise increases, whereas ProFS remains almost completely stable even when half the dataset is incorrectly labelled. This robustness comes from the mathematics of SVD: the right singular vectors of \( T_{\ell} \) are exactly the eigenvectors of the Gram matrix \( T_{\ell}^{\top} T_{\ell} \), and flipping the sign of any row of \( T_{\ell} \) leaves \( T_{\ell}^{\top} T_{\ell} \) unchanged. In simpler terms, ProFS looks at the overall geometry of the data rather than individual labels, so noisy examples don't distort its direction.
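The invariance is easy to verify numerically; the snippet below flips the sign of a random half of the rows of a toy matrix and checks that the Gram matrix is unchanged.

```python
# Sign-flip invariance of the Gram matrix (toy sizes, illustration only).
import torch

torch.manual_seed(0)
T = torch.randn(64, 32)
signs = torch.randint(0, 2, (64,)).float() * 2 - 1   # +/-1, flips ~half the rows
T_flipped = signs[:, None] * T

print(torch.allclose(T.T @ T, T_flipped.T @ T_flipped, atol=1e-5))   # True
```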

Label noise robustness comparison between ProFS and DPO
Figure 6: ProFS remains robust to labelling noise, while DPO performance degrades as more examples are mislabelled.

Wrapping Up

You made it to the end! Well done — go reward yourself with a snack.

If your mind wandered while reading this (mine did too, and I wrote the whole thing), or if some of it felt hard to digest — that’s completely normal. This is a different way of thinking about model alignment than the usual machine learning story most of us are used to. So, here’s a quick recap.

Summary

We introduced ProFS — a projection-based editing method that reduces unwanted behaviours in language models by identifying and removing the subspaces that encode them. Instead of relying on large-scale preference data or long optimisation runs, ProFS works in a single step: it finds the directions in the model’s activations that correspond to toxic behaviour, defines a low-dimensional “toxic subspace”, and filters this out directly from the weights. Despite its simplicity, ProFS performs on par with methods like DPO, achieving similar alignment outcomes with just a fraction of the data and compute.

Key Takeaways

In short: toxicity occupies a low-dimensional, interpretable subspace in a model's activations; a single projection removing that subspace (ProFS) matches DPO's toxicity reduction while preserving general capability; theoretically, ProFS can be read as a denoised, closed-form approximation of a single DPO step; and practically, it is far more sample efficient and more robust to label noise than preference optimisation.

Limitations

Of course, ProFS isn’t perfect. It depends heavily on how we choose the singular vectors that define the subspace. Select too few, and the edit won’t fully remove toxicity; select too many, and the model may lose useful capabilities or subtle stylistic traits — a common pitfall in unsupervised subspace editing. Also, our method currently edits only the MLP-value layers. While these layers encode much of the model’s “knowledge”, attention heads are equally important for how that knowledge is applied. Extending projection-based editing to attention mechanisms is an exciting next step.

Open Questions

At the end of the day, our hope with ProFS is simple: alignment doesn’t have to be a mysterious black box. Sometimes, it’s just a matter of understanding the shape of your model’s space — and knowing exactly where to project.

Let’s end on a lighter note: here are a few examples of ProFS-edited models politely refusing harmful requests (and doing it with a bit of personality). Notice how the model can still discuss these topics safely when asked in the right context — proof that editing doesn’t mean forgetting.

Examples of ProFS models refusing harmful prompts politely
Figure 7: ProFS-edited models sometimes like adding personality to their refusals of harmful prompts. The first response is from an unaligned model, which leaks unsafe information. The second response is after ProFS editing.
Examples of ProFS models discussing harmful topics safely in context
Figure 8: Editing with ProFS doesn’t make the model forget a concept — it just steers it toward safe, contextual use.

Thanks for reading!