New AI Steering Method Exposes Flaws and Potential Improvements
Researchers find a way to manipulate specific concepts in large language models, leading to more reliable and efficient training but also uncovering vulnerabilities.
Published on Feb. 27, 2026
A team of researchers has developed a new method to steer the output of large language models (LLMs) by manipulating specific concepts within these models. The approach could make training LLMs more reliable and less computationally expensive, but it also exposes potential vulnerabilities, such as the ability to jailbreak an LLM or amplify political bias and conspiracy theories.
Why it matters
This research is important because it provides a way to open up the black box of LLMs, allowing researchers to better understand how these models arrive at their outputs and identify potential issues. The ability to steer LLMs could lead to significant improvements in performance and safety, but the same technique could also be used maliciously to bypass an LLM's safeguards or spread misinformation.
The details
The researchers, led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, were able to locate specific concepts within LLMs and mathematically increase or decrease their importance in the model's output. This builds on their previous work on Recursive Feature Machines, which identified patterns in the mathematical operations inside LLMs that encode specific concepts. The researchers tested their steering approach on large open-source LLMs like Llama and Deepseek, identifying and influencing 512 concepts across five classes, including fears, moods, and locations. They found that steering could improve performance on narrow tasks, but it could also be used to jailbreak an LLM or boost political bias and conspiracy theories.
- The research findings were published in the Feb. 19, 2026 issue of the journal Science.
- The work builds on a 2024 Science paper led by Belkin and Radhakrishnan, in which they described predictive algorithms known as Recursive Feature Machines.
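The article does not reproduce the authors' actual operations, but the general idea of locating a concept inside a model and dialing its influence up or down can be sketched as scaling a hidden-state vector's component along a "concept direction." Everything below — the function name, the projection formula, the toy vectors — is an illustrative assumption, not the researchers' published method:

```python
import numpy as np

def steer(hidden, concept_dir, alpha):
    """Scale the component of `hidden` along `concept_dir` by `alpha`.

    alpha > 1 amplifies the concept, 0 <= alpha < 1 suppresses it,
    and alpha < 0 inverts it. The rest of the vector is untouched.
    """
    d = concept_dir / np.linalg.norm(concept_dir)  # unit concept direction
    proj = hidden @ d                              # component along the concept
    return hidden + (alpha - 1.0) * proj * d

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # a toy hidden-state vector
c = rng.normal(size=8)   # a toy "concept" direction
boosted = steer(h, c, alpha=3.0)     # triple the concept's component
suppressed = steer(h, c, alpha=0.0)  # remove the concept's component
```

In a real setting the hidden state would come from a transformer layer and the concept direction from an analysis method such as the authors' Recursive Feature Machines; the arithmetic of amplifying or zeroing out one direction, however, is this simple.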
The players
Mikhail Belkin
A professor in the Halıcıoğlu Data Science Institute, which is part of the School of Computing, Information and Data Sciences at the University of California San Diego.
Adit Radhakrishnan
A researcher at the Massachusetts Institute of Technology and the Broad Institute.
Llama
A large open-source language model used in the experiments.
Deepseek
Another large open-source language model used in the experiments.
What they’re saying
“We found that we could mathematically modify these patterns with math that is surprisingly simple.”
— Mikhail Belkin, Professor
What’s next
Researchers plan to continue improving the steering method to adapt to specific inputs and applications, as well as explore the technique's potential to lead to fundamental performance and safety improvements in large language models.
The takeaway
This research highlights both the potential benefits and risks of being able to steer the output of large language models. While the technique could lead to more reliable and efficient training, it also exposes vulnerabilities that could be exploited to bypass an LLM's safeguards or spread misinformation. Continued research and development in this area will be crucial to ensuring the safe and responsible use of these powerful AI systems.