New AI Steering Method Exposes Flaws and Potential Improvements

Researchers find a way to manipulate specific concepts in large language models, leading to more reliable and efficient training but also uncovering vulnerabilities.

Published on Feb. 27, 2026

A team of researchers has developed a new method to steer the output of large language models (LLMs) by manipulating specific concepts within these models. The approach could lead to more reliable, efficient, and less computationally expensive training of LLMs, but it also exposes potential vulnerabilities, such as the ability to jailbreak an LLM or boost political bias and conspiracy theories.

Why it matters

This research is important because it provides a way to open up the black box of LLMs, allowing researchers to better understand how these models arrive at their outputs and identify potential issues. The ability to steer LLMs could lead to significant improvements in performance and safety, but the same technique could also be used maliciously to bypass an LLM's safeguards or spread misinformation.

The details

The researchers, led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, were able to locate specific concepts within LLMs and mathematically increase or decrease their influence on the model's output. This builds on their previous work on Recursive Feature Machines, which identified patterns in the mathematical operations inside LLMs that encode specific concepts. The researchers tested their steering approach on large open-source LLMs such as Llama and DeepSeek, identifying and influencing 512 concepts across five classes, including fears, moods, and locations. They found that steering could improve performance on narrow tasks, but that it could also be used to jailbreak an LLM or boost political bias and conspiracy theories. A rough sketch of what this kind of concept steering can look like in code appears after the notes below.

  • The research findings were published in the Feb. 19, 2026 issue of the journal Science.
  • The work builds on a 2024 Science paper led by Belkin and Radhakrishnan, in which they described predictive algorithms known as Recursive Feature Machines.
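The paper's exact mathematics is not reproduced here, but the general family of techniques it belongs to is often called activation or concept steering, and the core idea can be sketched briefly. The snippet below is an illustrative sketch only: the model name, layer index, steering strength, and prompts are assumptions, and the concept direction is estimated with a simple contrast between prompts rather than with the Recursive Feature Machine approach described in the Science paper.

```python
# Illustrative sketch of concept steering in a transformer LLM (not the authors' method).
# Assumes a Hugging Face causal LM; model name, layer, prompts, and strength are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"   # assumed open-source model
LAYER = 15                                # which decoder layer to steer (assumption)
ALPHA = 4.0                               # steering strength: >0 amplifies the concept, <0 suppresses it

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Return the mean hidden state of `prompt` at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Estimate a "concept direction" as the difference between activations on prompts
# that express the concept and prompts that do not (here: a fear-related concept).
concept_prompts = ["I am terrified of spiders.", "The dark hallway filled me with dread."]
neutral_prompts = ["I am thinking about spiders.", "The hallway was long and quiet."]
direction = (torch.stack([hidden_at_layer(p) for p in concept_prompts]).mean(0)
             - torch.stack([hidden_at_layer(p) for p in neutral_prompts]).mean(0))
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    """Add the scaled concept direction to the layer's hidden states during generation."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Describe walking into an old house at night."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**ids, max_new_tokens=60)[0], skip_special_tokens=True))
finally:
    handle.remove()   # detach the hook so later generations are unsteered
```

In this sketch, a positive ALPHA nudges generations toward the chosen concept while a negative value pushes them away, which mirrors the "increase or decrease" framing in the researchers' description; the actual paper identifies its concept directions with Recursive Feature Machines rather than the prompt contrast used here.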

The players

Mikhail Belkin

A professor in the Halıcıoğlu Data Science Institute, which is part of the School of Computing, Information and Data Sciences at the University of California San Diego.

Adit Radhakrishnan

A researcher at the Massachusetts Institute of Technology and the Broad Institute.

Llama

One of the largest open-source LLMs, used in the steering experiments.

DeepSeek

Another large open-source LLM used in the experiments.


What they’re saying

“We found that we could mathematically modify these patterns with math that is surprisingly simple.”

— Mikhail Belkin, Professor

What’s next

The researchers plan to continue refining the steering method so it can adapt to specific inputs and applications, and to explore whether the technique can deliver fundamental performance and safety improvements in large language models.

The takeaway

This research highlights both the potential benefits and risks of being able to steer the output of large language models. While the technique could lead to more reliable and efficient training, it also exposes vulnerabilities that could be exploited to bypass an LLM's safeguards or spread misinformation. Continued research and development in this area will be crucial to ensuring the safe and responsible use of these powerful AI systems.