murmur


Murmur is a synthetic forecasting swarm. It runs dozens of AI personas with diverse expertise through a structured analytical process, clusters their predictions into natural scenarios, and surfaces the key disagreements that matter.

Why forecasting is hard

Most predictions fail for predictable reasons. People anchor on a single narrative. They confuse confidence with accuracy. They ignore base rates. They don't update when evidence changes. And when asked "what's the probability?", they round to 0%, 50%, or 100%.

Philip Tetlock's Good Judgment Project spent decades studying what separates accurate forecasters from everyone else. The answer wasn't domain expertise or intelligence — it was a specific set of cognitive habits: thinking in probabilities, breaking questions into components, balancing inside and outside views, and actively looking for reasons you might be wrong.[1]

The problem is that these habits are hard to maintain. Even trained superforecasters regress when they're tired, rushed, or emotionally invested. Murmur automates the discipline.

How Murmur works

Your question → Clarification → Web search → Multi-model swarm → Clustering → Debate → Scenarios → Assumptions

1. Clarification

Vague questions produce vague forecasts. Before anything runs, Murmur asks you 2–4 clarifying questions based on Tetlock's commandments: What's the timeframe? What would count as resolution? What's your prior estimate? What's the strongest force for and against?

Your answers get synthesized into a precise, falsifiable question with explicit resolution criteria — the kind of question that can actually be scored right or wrong later.

2. Web search grounding

Before any forecasting begins, Murmur searches the web for recent context on your question. This grounds the forecast in current reality — not just the models' training data, which may be months or years stale.

The search results are injected into every persona's prompt as background context, with explicit anti-anchoring instructions: the models are told to use the results as one input among many, not as the primary evidence. This prevents the common failure mode where a model over-indexes on a single recent news article instead of doing structured analytical reasoning.
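As a sketch, this grounding step amounts to simple prompt assembly. The instruction wording, function name, and snippet format below are illustrative assumptions, not Murmur's actual prompt:

```python
# Illustrative anti-anchoring preamble; the real wording is not shown in this document.
ANTI_ANCHORING = (
    "Use the search results below as one input among many. "
    "Do not treat any single article as primary evidence; "
    "ground your estimate in base rates and structured reasoning."
)

def ground_prompt(persona_prompt, search_results):
    """Append anti-anchoring instructions plus search snippets to a persona prompt."""
    context = "\n".join(f"- {snippet}" for snippet in search_results)
    return f"{persona_prompt}\n\n{ANTI_ANCHORING}\n\nRecent context:\n{context}"
```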

Why this matters: In early benchmarks, Murmur's two worst failures were both factual errors — the models confidently believed things that hadn't actually happened. Web search grounding cut the Brier score on those questions dramatically by correcting wrong priors before the analysis began.

3. The multi-model swarm

Murmur sends your question to multiple AI personas, each with distinct expertise, analytical frameworks, known cognitive biases, and blind spots. A CEO thinks about market timing. An actuary thinks about tail risk. A red teamer thinks about how assumptions break under pressure. An artist thinks about cultural adoption. A philosopher questions the hidden assumptions everyone else takes for granted.

The system intelligently selects the most relevant personas based on the question's domain — always including an adversarial challenger and a humanistic perspective. Each persona runs multiple times with varied parameters — different temperatures, evidence emphases, and temporal anchors — producing dozens of independent forecasts.

Critically, these runs are split across multiple AI models — currently Claude and DeepSeek. This provides genuine model diversity: different training data, different priors, different blind spots. When Claude and DeepSeek disagree, that disagreement is real signal about uncertainty, not just temperature variation within the same model.
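The fan-out described above is essentially a cross-product of personas, sampling parameters, and models. A minimal sketch, where the persona roster, temperature values, and model identifiers are illustrative assumptions rather than Murmur's real configuration:

```python
import itertools

# Illustrative values only; the actual roster and settings are not specified here.
PERSONAS = ["ceo", "actuary", "red_teamer", "artist", "philosopher"]
TEMPERATURES = [0.4, 0.7, 1.0]
MODELS = ["claude", "deepseek"]

def build_runs(personas, temperatures, models):
    """Cross personas with sampling temperatures and models to get many independent runs."""
    return [
        {"persona": p, "temperature": t, "model": m}
        for p, t, m in itertools.product(personas, temperatures, models)
    ]

runs = build_runs(PERSONAS, TEMPERATURES, MODELS)  # 5 personas x 3 temps x 2 models = 30 runs
```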

Every forecast follows a structured reasoning chain:

  1. Base rate anchoring. Identify the reference class. What's the historical rate for events like this? Start from the outside view.
  2. Decomposition. Break the question into 2–4 independent sub-questions. Estimate each one separately.
  3. Inside view adjustment. What case-specific factors push the probability up or down from the base rate?
  4. Counterargument. State the strongest case against your position. What evidence would change your mind?
  5. Final estimate. Synthesize into a single probability.

Why structured single-pass, not multi-round? Research on LLM self-revision shows a well-documented failure mode: when you ask a model to reconsider its own estimate, it regresses toward 50%. It hedges instead of genuinely self-correcting.[3] You'd spend 2–3x the API calls to make estimates less sharp. The swarm's power comes from aggregating many sharp, diverse opinions — not from making each opinion individually more cautious. Structured single-pass prompts that force decomposition and base rate anchoring consistently outperform simpler approaches.[5]
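A single-pass prompt of this shape can be assembled mechanically from the five steps. The wording below is an illustrative sketch, not Murmur's actual template:

```python
# Illustrative phrasing of the five-step reasoning chain described above.
REASONING_STEPS = [
    "Base rate anchoring: identify the reference class and its historical rate; start from the outside view.",
    "Decomposition: break the question into 2-4 independent sub-questions and estimate each separately.",
    "Inside view adjustment: list case-specific factors pushing the probability up or down from the base rate.",
    "Counterargument: state the strongest case against your position and what evidence would change your mind.",
    "Final estimate: synthesize the steps above into a single probability between 0 and 1.",
]

def build_prompt(question):
    """Assemble one single-pass prompt that forces all five steps in a single response."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(REASONING_STEPS, 1))
    return f"Question: {question}\n\nAnswer with numbered steps:\n{steps}"
```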

4. Clustering

Dozens of probability estimates don't speak for themselves. Murmur clusters them into 2–3 natural scenarios using a combination of DBSCAN (density-based clustering that finds natural groupings) and k-means with silhouette score optimization. The cap at 3 scenarios is deliberate — more than 3 produces blurry, overlapping futures that don't help decision-making.
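For intuition, the k-means-with-silhouette half of this can be sketched in plain NumPy on 1-D probability estimates (Murmur's actual pipeline also uses DBSCAN, which this sketch omits):

```python
import numpy as np

def kmeans_1d(x, k, iters=50, seed=0):
    """Plain k-means on 1-D probability estimates."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False).astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels

def mean_silhouette(x, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    scores = []
    for i, xi in enumerate(x):
        same = x[labels == labels[i]]
        a = np.abs(same - xi).sum() / max(len(same) - 1, 1)
        b = min(np.abs(x[labels == c] - xi).mean()
                for c in set(labels.tolist()) if c != labels[i])
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return float(np.mean(scores))

def cluster_forecasts(probs, max_k=3):
    """Try k = 2..max_k (capped at 3) and keep the labelling with the best silhouette."""
    x = np.asarray(probs, dtype=float)
    best_labels, best_score = None, -1.0
    for k in range(2, max_k + 1):
        labels = kmeans_1d(x, k)
        if len(set(labels.tolist())) < 2:
            continue  # degenerate: everything collapsed into one cluster
        score = mean_silhouette(x, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```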

The result isn't a single number. It's a map of possible futures: "35% of forecasters think gradual augmentation, 28% think rapid displacement, 22% think hybrid equilibrium." Each cluster represents a coherent, distinct story about what could happen.

5. Aggregation: two numbers, not one

Murmur shows two aggregate probabilities, not one, because there is honest uncertainty about how best to combine the estimates.

Panel mean is the simple average across all forecasters. This is the right number if you believe the personas share systematic biases from the same base model — which they do. They all read the same training data. When they agree, it might reflect genuine evidence or a shared blind spot. The mean treats their agreement cautiously.

Extremized aggregate uses Tetlock's formula from the Good Judgment Project: take the geometric mean of the odds, then push the result away from 50% by raising those odds to the exponent d = 2.5.[2] The intuition: if independent forecasters mostly agree, the true probability is probably more extreme than the average. This was validated on genuinely independent human superforecasters in the IARPA tournament.

The honest caveat: d=2.5 was calibrated on humans with different life experiences, information sources, and reasoning styles. LLM personas sharing a base model are less independent than that. The truth likely falls between the mean and the extremized number. Murmur shows both so you can reason about the range rather than anchoring on a false precision.
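Both aggregates are cheap to compute. A minimal sketch, using the extremize-the-odds form with d = 2.5 described above:

```python
import math

def panel_mean(probs):
    """Simple average of all individual probability estimates."""
    return sum(probs) / len(probs)

def extremized(probs, d=2.5):
    """Geometric mean of odds, pushed away from 50% by the exponent d."""
    odds = [p / (1 - p) for p in probs]
    geo_mean_odds = math.exp(sum(math.log(o) for o in odds) / len(odds))
    ext_odds = geo_mean_odds ** d
    return ext_odds / (1 + ext_odds)
```

Note how the transform is symmetric: estimates averaging above 50% are pushed up, estimates averaging below 50% are pushed down.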

6. Cross-persona debate

After clustering, Murmur identifies the two scenarios with the highest disagreement and picks a "champion" persona from each — the persona whose viewpoint dominates that cluster. Then it runs a structured debate: each champion sees the other's strongest argument and must rebut it.
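One plausible implementation of champion selection, under two assumptions not stated in this document: that "highest disagreement" means the largest gap between cluster centers, and that the dominant persona is the one with the most runs in the cluster. Data shapes and persona names are illustrative:

```python
from collections import Counter
from itertools import combinations

def pick_debate(clusters):
    """clusters: list of {"center": float, "members": [(persona, prob), ...]}.
    Returns the champion persona from each of the two most-divergent clusters."""
    a, b = max(combinations(clusters, 2),
               key=lambda pair: abs(pair[0]["center"] - pair[1]["center"]))

    def champion(cluster):
        # Persona with the most forecast runs inside this cluster.
        return Counter(p for p, _ in cluster["members"]).most_common(1)[0][0]

    return champion(a), champion(b)
```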

The debate doesn't revise the numbers. Its purpose is to surface the core analytical tension — the structural disagreement that explains why the forecasters diverge. This is often the most useful output: not "38% probability" but "the real question is whether regulatory friction or market pressure wins."

Why debate only post-clustering? Multi-agent forecasting research[4] found that debate adds value only when there's genuine structural disagreement — different analytical frameworks or information, not just different random draws. If two personas disagree because one got bullish evidence and the other got bearish evidence, debate just averages them out. The clustering already handles that. Debate matters when the Red Teamer says "the technology doesn't actually work" and the VC says "the market doesn't care if it works yet."

7. Scenario narratives

Each cluster gets a narrative: a vivid 2–3 sentence description of what this future looks like, the key assumption it depends on, and the condition that would break it. This turns statistical clusters into stories you can reason about.

Every scenario is also expandable — you can drill into the reasoning of individual forecasters to see their base rate estimate, inside view adjustment, sub-question decomposition, and what specific evidence would change their mind. This transparency lets you evaluate why the number is what it is, not just what the number is.

8. Assumption extraction

The final step is often the most valuable. Murmur examines all the scenarios and extracts the load-bearing assumptions — the specific, falsifiable claims about the world that must be true for each scenario to play out.

For each assumption, Murmur identifies which scenarios depend on it, how it could be falsified, and when it can next be checked.

Critically, Murmur also identifies shared assumptions — assumptions that appear across multiple scenarios. These are the highest-leverage monitoring targets, because if a shared assumption breaks, it doesn't just shift one scenario — it reshuffles the entire forecast.

The linchpin assumption is the single assumption whose reversal would cause the largest redistribution of probability across all scenarios. This is the thing to watch. If you're going to monitor one signal to know whether the forecast is still valid, it's this one.
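Finding the linchpin reduces to an argmax over total probability redistribution. A sketch, assuming a hypothetical mapping from each assumption to the probability shifts its reversal would cause per scenario:

```python
def linchpin(assumption_impacts):
    """assumption_impacts: {assumption_name: {scenario: probability shift if it reverses}}.
    The linchpin is the assumption whose reversal redistributes the most probability."""
    def redistribution(shifts):
        return sum(abs(delta) for delta in shifts.values())
    return max(assumption_impacts, key=lambda a: redistribution(assumption_impacts[a]))
```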

Why this matters: A forecast that says "35% probability" is useful for about a week. A forecast that says "35% probability, and here are the 3 assumptions it depends on, here's when you can check each one, and here's the one that matters most" — that's useful for months. The assumptions are the monitoring system for the forecast itself.

Knowing when to be skeptical

Murmur's most important output isn't the probability — it's the metadata that tells you when to trust it and when to bring your own judgment.

Consensus warning

When 80% or more of individual forecast runs agree on the same side, Murmur flags it prominently. High consensus from personas sharing a base model can mean genuine evidence strength — or it can mean a shared blind spot in the training data. The warning surfaces the shared assumption driving the consensus so you can evaluate it yourself.
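The 80% trigger itself is straightforward. A minimal sketch of the check (the real warning also surfaces the shared assumption, which this omits):

```python
def consensus_warning(probs, threshold=0.8):
    """Flag when >= threshold of forecast runs fall on the same side of 50%.
    Returns (flagged, share_on_majority_side)."""
    above = sum(p > 0.5 for p in probs)
    share = max(above, len(probs) - above) / len(probs)
    return share >= threshold, share
```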

In benchmarking, the consensus warning correctly fired on both of Murmur's worst prediction failures. It also fired on several correct predictions. The warning doesn't mean the forecast is wrong — it means this is where your domain knowledge matters most.

Dissenting views

When the swarm reaches consensus but a minority of personas disagree, those dissenting voices are surfaced explicitly. Murmur identifies which personas dissent, how far they diverge from the consensus, and their reasoning.

Dissent is classified by strength: weak (one persona, could be noise), moderate (two distinct personas with coherent counter-reasoning), or strong (three or more personas, likely seeing something the majority misses). Strong dissent against consensus is the most valuable signal the tool produces.
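This classification maps directly to code; the persona names in the test data are illustrative:

```python
def dissent_strength(dissenting_personas):
    """Classify dissent by how many distinct personas hold it."""
    n = len(set(dissenting_personas))
    if n >= 3:
        return "strong"    # likely seeing something the majority misses
    if n == 2:
        return "moderate"  # coherent counter-reasoning from two perspectives
    if n == 1:
        return "weak"      # a single persona; could be noise
    return "none"
```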

Cost

Each forecast costs approximately $0.45.

At 10 forecasts per day, that's roughly $4.50/day or $135/month. The biggest cost driver is Claude's swarm runs. DeepSeek provides genuine model diversity at negligible marginal cost.

The personas

Murmur ships with a diverse roster of personas spanning cybersecurity, technology, business, policy, finance, and humanities. Each has distinct expertise, an analytical framework, and documented cognitive biases and blind spots.

The diversity is the point. A CEO and an actuary will forecast the same question through completely different lenses. That's not noise — it's signal. The scenarios that emerge from clustering many different perspectives are richer than any single expert's prediction.

What Murmur is not

Murmur is not an oracle. It's a structured thinking tool. The output is not "the answer" — it's a map of plausible futures weighted by probability, with the key assumptions and breaking conditions made explicit.

The value isn't the point estimate. It's the decomposition: what are the real sub-questions? Where do smart people disagree, and why? What specific evidence would change the picture?

Use it to think better, not to think less.

References

[1] Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. The foundational work on what makes forecasters accurate. Wikipedia · Good Judgment Project

[2] Baron, J. et al. (2014). Two Reasons to Make Aggregated Probability Forecasts More Extreme. Decision Analysis, 11(2), 133–145. The empirical basis for extremized aggregation with d=2.5. doi:10.1287/deca.2014.0293

[3] Halawi, D. et al. (2024). Approaching Human-Level Forecasting with Language Models. arXiv preprint. Demonstrates structured prompting improves LLM forecasting accuracy by up to 41% over baseline. arXiv:2402.18563

[4] Schoenegger, P. et al. (2024). AI Superforecasting: Can AI Beat Human Forecasters? Multi-agent experiments showing independent analysis followed by selective debate outperforms consensus-seeking approaches. arXiv:2409.08322

[5] Zou, A. et al. (2024). Forecasting with Large Language Models. arXiv preprint. Structured single-pass prompts with decomposition and base rate anchoring outperform multi-round self-revision. arXiv:2402.01426