An attack class that passes every current LLM filter

*A Novel Attack Method for Large Language Models*

A recent paper published on Shaping Rooms has revealed a new attack method for Large Language Models (LLMs) that evades current security filters. The research, titled "postural manipulation," demonstrates that ordinary language can be used to subtly influence an AI's decision-making process without any detectable signature of an attack.

The Methodology

The author of the paper has developed a technique to inject "postures" into a conversation, which can alter the AI's reasoning and decision-making process. These postures are not malicious in nature and do not contain any obvious attack signatures. They are simply ordinary language that is buried within prior context, making them difficult to detect.

Reproducibility and Impact

The author has demonstrated the effectiveness of this attack method by running matched controls and documenting binary decision reversals across four frontier models. The results show that the same question can yield two different answers depending on the context in which it is asked. This effect is particularly concerning when it comes to agentic systems, where a posture installed early in one agent can survive summarization and arrive at a downstream agent as independent expert judgment.

Limitations and Coordination

It's worth noting that the author's methodology is observational, and no internal access to the models was required. The paper's limitations are stated plainly, and the author is not claiming to have all the answers. The research was coordinated with several major companies in the AI industry, including Anthropic, OpenAI, Google, xAI, CERT/CC, and OWASP, before publication.

Testing and Discussion

The demos for this research are available at https://shapingrooms.com/demos, and they work against any frontier model without requiring any setup. The author is happy to discuss the findings and answer any questions.

This research highlights the need for further investigation into the security of LLMs and the potential for subtle influence. As AI becomes increasingly integrated into our lives, it's essential to understand the vulnerabilities of these systems and develop effective countermeasures to mitigate the risks.

An attack class that passes every current LLM filter - no payload, no injection signature, no log trace

The Methodology

Reproducibility and Impact

Limitations and Coordination

Testing and Discussion

Ricardo