*Building a Multi-Agent System for Geopolitical Crisis Outcomes*
In a recent experiment, I built a pipeline in which five AI models independently assess the probabilities of more than 30 geopolitical crisis scenarios twice a day. Each model works in isolation, without access to the others' outputs, and an orchestrator synthesizes their reasoning into final projections. The setup has already surfaced some revealing patterns in how these models behave.
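The overall shape of the pipeline can be sketched as follows. This is a minimal illustration, not the actual system: the model list, the `query_model` helper, and the median-based synthesis rule are all assumptions made for the example.

```python
from statistics import median

# Illustrative model roster; the real pipeline's lineup may differ.
MODELS = ["claude", "gpt-4o", "gemini", "grok", "llama"]

def query_model(model: str, scenario: str) -> float:
    """Stand-in for an isolated API call; each model sees only the scenario."""
    # In a real pipeline this would call the model's API with the scenario
    # prompt and parse a 0-100 probability from the response.
    raise NotImplementedError

def assess_scenario(scenario: str, query=query_model) -> dict:
    # Each model is queried independently -- no model sees another's output.
    estimates = {m: query(m, scenario) for m in MODELS}
    # The orchestrator then synthesizes the independent estimates; a median
    # is one simple, outlier-resistant choice for this sketch.
    return {
        "scenario": scenario,
        "estimates": estimates,
        "synthesis": median(estimates.values()),
    }
```

Keeping the per-model calls isolated is what makes the later disagreement analysis meaningful: each estimate reflects only that model's reading of the scenario.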
*Model Behavior and Disagreement*
One of the most striking observations from the first 15 days of continuous operation is how often the models disagree. In some cases the probability assessments diverged by more than 25 points: Claude and GPT-4o might assign a scenario a 60% probability while Gemini and Grok put it closer to 35%. The disagreement itself is unsurprising, given each model's different training data and approach.
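A disagreement check like the one described above is simple to express. The helper names and the max-minus-min spread metric are assumptions for this sketch; the source doesn't specify how disagreement is actually measured.

```python
def probability_spread(estimates: dict) -> float:
    """Max minus min estimate across models, in percentage points."""
    values = list(estimates.values())
    return max(values) - min(values)

def flag_disagreements(scenarios: dict, threshold: float = 25.0) -> list:
    """Return names of scenarios whose model spread exceeds the threshold.

    `scenarios` maps a scenario name to a {model: probability} dict.
    """
    return [name for name, est in scenarios.items()
            if probability_spread(est) > threshold]
```

With the example figures from the text (60/60 vs. 35/35), the spread is exactly 25 points, right at the boundary described.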
*Model Biases and Anchoring*
Another notable phenomenon was anchoring: the models tended to repeat their previous outputs rather than update their probabilities in light of new information. To mitigate this, I made the models "blind" to their own prior outputs, forcing them to reassess each scenario from scratch on every run. The change produced noticeably more dynamic and responsive estimates.
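The "blinding" step amounts to a choice made at prompt-assembly time. This is a hypothetical sketch: the prompt template, the `prior_outputs` format, and the `blind` flag are all illustrative assumptions, not the actual prompts used.

```python
def build_prompt(scenario: str, prior_outputs: list = None,
                 blind: bool = True) -> str:
    """Assemble an assessment prompt, optionally withholding past outputs."""
    parts = [f"Assess the probability (0-100%) of: {scenario}"]
    if prior_outputs and not blind:
        # Anchoring-prone mode: the model sees its own earlier estimates
        # and tends to stay close to them.
        parts.append("Your previous assessments: " + "; ".join(prior_outputs))
    parts.append("Base your estimate only on current evidence.")
    return "\n".join(parts)
```

Running every assessment with `blind=True` means each day's probabilities are driven by fresh evidence rather than by yesterday's numbers.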
*Model Shorthands and Hallucinations*
Named rules in the prompts became a shorthand the models would cite in place of a thorough explanation of their reasoning, likely an artifact of optimizing output for the prompt rather than genuinely working through the scenario. And despite Google Search grounding intended to prevent source hallucination, the models still fabricated content: in one case a fictional $138 oil price, while correctly citing Bloomberg as the source.
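Fabrications like the $138 oil price can sometimes be caught with a post-hoc consistency check: extract the figures a model cites and flag any that appear in none of the retrieved sources. The regex and the comparison logic below are simplifying assumptions for illustration, not part of the described system.

```python
import re

# Matches dollar figures such as "$138" or "$78.40".
DOLLAR_RE = re.compile(r"\$\d+(?:\.\d+)?")

def unsupported_figures(model_output: str, source_texts: list) -> set:
    """Dollar figures the model cites that appear in no retrieved source."""
    claimed = set(DOLLAR_RE.findall(model_output))
    grounded = set()
    for text in source_texts:
        grounded |= set(DOLLAR_RE.findall(text))
    # Anything claimed but not grounded is a candidate fabrication.
    return claimed - grounded
```

A check like this would have flagged the $138 figure even though the Bloomberg attribution itself looked plausible, which is exactly what makes such hallucinations hard to spot by eye.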
*Insights and Observations*
The experiment has shed light on several key aspects of AI model behavior:
* The importance of designing a system that encourages diverse perspectives and prevents anchoring effects.
* The need for careful prompt engineering to prevent models from relying on shorthands and instead providing nuanced explanations.
* The ongoing challenge of preventing content hallucinations, even with grounding mechanisms in place.
By continuing to refine this multi-agent system, we can deepen our understanding of AI model behavior and build more robust, accurate prediction tools for complex geopolitical scenarios.
[The devblog at /blog covers in detail the prompt-engineering insights and mistakes I've encountered along the way](http://link)