murmur

how it works

Murmur is a synthetic forecasting swarm. It runs dozens of AI personas with diverse expertise through a structured analytical process, clusters their predictions into natural scenarios, and surfaces the key disagreements that matter.

Why forecasting is hard

Most predictions fail for predictable reasons. People anchor on a single narrative. They confuse confidence with accuracy. They ignore base rates. They don't update when evidence changes. And when asked "what's the probability?", they round to 0%, 50%, or 100%.

Philip Tetlock's Good Judgment Project spent decades studying what separates accurate forecasters from everyone else. The answer wasn't domain expertise or intelligence — it was a specific set of cognitive habits: thinking in probabilities, breaking questions into components, balancing inside and outside views, and actively looking for reasons you might be wrong.[1]

The problem is that these habits are hard to maintain. Even trained superforecasters regress when they're tired, rushed, or emotionally invested. Murmur automates the discipline.

How Murmur works

Your question → Clarification → Swarm → Clustering → Debate → Scenarios → Assumptions

1. Clarification

Vague questions produce vague forecasts. Before anything runs, Murmur asks you 2–4 clarifying questions based on Tetlock's commandments: What's the timeframe? What would count as resolution? What's your prior estimate? What's the strongest force for and against?

Your answers get synthesized into a precise, falsifiable question with explicit resolution criteria — the kind of question that can actually be scored right or wrong later.

2. The swarm

Murmur sends your question to multiple AI personas, each with distinct expertise, analytical frameworks, known cognitive biases, and blind spots. A CEO thinks about market timing. An actuary thinks about tail risk. A red teamer thinks about how assumptions break under pressure. An artist thinks about cultural adoption. A philosopher questions the hidden assumptions everyone else takes for granted.

The system intelligently selects the most relevant personas based on the question's domain — always including an adversarial challenger and a humanistic perspective. Each persona runs multiple times with varied parameters — different temperatures, evidence emphases, and temporal anchors — producing dozens of independent forecasts.
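The run fan-out described above can be sketched as a Cartesian product over sampling parameters. The persona names and parameter values below are illustrative, not Murmur's actual configuration:

```python
from itertools import product

# Illustrative sketch: expand each selected persona into many runs with
# varied temperature, evidence emphasis, and temporal anchor. All values
# here are assumptions for illustration.
personas = ["CEO", "Actuary", "Red Teamer", "Artist", "Philosopher"]
temperatures = [0.3, 0.7, 1.0]
emphases = ["bullish evidence", "bearish evidence", "neutral"]
anchors = ["recent trend", "decade-long base rate"]

runs = [
    {"persona": p, "temperature": t, "emphasis": e, "anchor": a}
    for p, t, e, a in product(personas, temperatures, emphases, anchors)
]
# 5 personas x 3 temperatures x 3 emphases x 2 anchors = 90 forecast runs
```

Each run is an independent draw; the diversity of parameters is what gives the swarm its spread of opinions.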

Every forecast follows a structured reasoning chain:

  1. Base rate anchoring. Identify the reference class. What's the historical rate for events like this? Start from the outside view.
  2. Decomposition. Break the question into 2–4 independent sub-questions. Estimate each one separately.
  3. Inside view adjustment. What case-specific factors push the probability up or down from the base rate?
  4. Counterargument. State the strongest case against your position. What evidence would change your mind?
  5. Final estimate. Synthesize into a single probability.

Why structured single-pass, not multi-round? Research on LLM self-revision shows a well-documented failure mode: when you ask a model to reconsider its own estimate, it regresses toward 50%. It hedges instead of genuinely self-correcting.[3] You'd spend 2–3x the API calls to make estimates less sharp. The swarm's power comes from aggregating many sharp, diverse opinions — not from making each opinion individually more cautious. Structured single-pass prompts that force decomposition and base rate anchoring consistently outperform simpler approaches.[5]
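One way to formalize steps 1, 3, and 5 of the chain above is to anchor on the base rate and apply inside-view adjustments in log-odds space, where roughly independent pieces of evidence add. A hedged sketch, with illustrative adjustment values:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forecast(base_rate, adjustments):
    """Outside view first: anchor on the reference-class base rate,
    then shift by case-specific evidence in log-odds space."""
    x = logit(base_rate)          # step 1: base rate anchoring
    for adj in adjustments:       # step 3: inside view adjustments
        x += adj
    return sigmoid(x)             # step 5: final probability

# 20% historical rate, one bullish (+0.8) and one bearish (-0.3) factor
p = forecast(0.20, [+0.8, -0.3])  # ends above 20% but well short of 50%
```

Working in log-odds keeps adjustments symmetric around the anchor and prevents the estimate from ever leaving (0, 1).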

3. Clustering

Dozens of probability estimates don't speak for themselves. Murmur clusters them into 2–3 natural scenarios using a combination of DBSCAN (density-based clustering that finds natural groupings) and k-means with silhouette score optimization. The cap at 3 scenarios is deliberate — more than 3 produces blurry, overlapping futures that don't help decision-making.
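A sketch of the clustering step, assuming scikit-learn and made-up probability estimates; the `eps` value and candidate cluster counts are illustrative choices, not Murmur's tuned parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

# Illustrative swarm output: one probability per forecast run
probs = np.array([0.12, 0.15, 0.18, 0.41, 0.44, 0.47, 0.72, 0.75]).reshape(-1, 1)

# DBSCAN finds natural density-based groupings (eps assumed here)
db = DBSCAN(eps=0.06, min_samples=2).fit(probs)

# k-means with silhouette-score selection, capped at 3 scenarios
best_k, best_score = 2, -1.0
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(probs)
    score = silhouette_score(probs, labels)
    if score > best_score:
        best_k, best_score = k, score
```

On well-separated data like this, the silhouette score favors three clusters; the hard cap at 3 then turns those groupings into the scenario map.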

The result isn't a single number. It's a map of possible futures: "35% of forecasters think gradual augmentation, 28% think rapid displacement, 22% think hybrid equilibrium." Each cluster represents a coherent, distinct story about what could happen.

4. Aggregation: two numbers, not one

Murmur shows two aggregate probabilities, not one, because there is genuine uncertainty about the right way to combine the estimates.

Panel mean is the simple average across all forecasters. This is the right number if you believe the personas share systematic biases from the same base model — which they do. They all read the same training data. When they agree, it might reflect genuine evidence or a shared blind spot. The mean treats their agreement cautiously.

Extremized aggregate uses Tetlock's formula from the Good Judgment Project: geometric mean of odds, then push away from 50% by a factor of d=2.5.[2] The intuition: if independent forecasters mostly agree, the true probability is probably more extreme than the average. This was validated on genuinely independent human superforecasters in the IARPA tournament.

The honest caveat: d=2.5 was calibrated on humans with different life experiences, information sources, and reasoning styles. LLM personas sharing a base model are less independent than that. The truth likely falls between the mean and the extremized number. Murmur shows both so you can reason about the range rather than anchoring on a false precision.
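Both aggregates can be computed directly; the forecast values below are illustrative. Extremization raises the pooled odds to the power d = 2.5, which is the same as multiplying the mean log-odds by d:

```python
import math

def panel_mean(ps):
    """Simple average across all forecasters."""
    return sum(ps) / len(ps)

def extremized(ps, d=2.5):
    """Geometric mean of odds, pushed away from 50% by factor d."""
    log_odds = [math.log(p / (1 - p)) for p in ps]
    mean_lo = sum(log_odds) / len(log_odds)  # log of geometric-mean odds
    odds = math.exp(d * mean_lo)             # extremize in log-odds space
    return odds / (1 + odds)

forecasts = [0.55, 0.60, 0.62, 0.58]  # illustrative swarm estimates
low, high = panel_mean(forecasts), extremized(forecasts)
```

Note that extremization only amplifies agreement: a panel split evenly around 50% stays at 50%, while a panel that leans one way gets pushed further in that direction.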

5. Cross-persona debate

After clustering, Murmur identifies the two scenarios with the highest disagreement and picks a "champion" persona from each — the persona whose viewpoint dominates that cluster. Then it runs a structured debate: each champion sees the other's strongest argument and must rebut it.

The debate doesn't revise the numbers. Its purpose is to surface the core analytical tension — the structural disagreement that explains why the forecasters diverge. This is often the most useful output: not "38% probability" but "the real question is whether regulatory friction or market pressure wins."

Why debate only post-clustering? Multi-agent forecasting research[4] found that debate adds value only when there's genuine structural disagreement — different analytical frameworks or information, not just different random draws. If two personas disagree because one got bullish evidence and the other got bearish evidence, debate just averages them out. The clustering already handles that. Debate matters when the Red Teamer says "the technology doesn't actually work" and the VC says "the market doesn't care if it works yet."

6. Scenario narratives

Each cluster gets a narrative: a vivid 2–3 sentence description of what this future looks like, the key assumption it depends on, and the condition that would break it. This turns statistical clusters into stories you can reason about.

Every scenario is also expandable — you can drill into the reasoning of individual forecasters to see their base rate estimate, inside view adjustment, sub-question decomposition, and what specific evidence would change their mind. This transparency lets you evaluate why the number is what it is, not just what the number is.

7. Assumption extraction

The final step is often the most valuable. Murmur examines all the scenarios and extracts the load-bearing assumptions — the specific, falsifiable claims about the world that must be true for each scenario to play out.

For each assumption, Murmur identifies which scenarios depend on it, what observable signal would confirm or falsify it, and roughly when that signal can be checked.

Critically, Murmur also identifies shared assumptions — assumptions that appear across multiple scenarios. These are the highest-leverage monitoring targets, because if a shared assumption breaks, it doesn't just shift one scenario — it reshuffles the entire forecast.

The linchpin assumption is the single assumption whose reversal would cause the largest redistribution of probability across all scenarios. This is the thing to watch. If you're going to monitor one signal to know whether the forecast is still valid, it's this one.
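The shared-assumption and linchpin logic can be sketched as membership counts plus an impact ranking. The scenario names, assumption names, and redistribution magnitudes below are all invented for illustration:

```python
from collections import Counter

# Hypothetical scenarios and the assumptions each depends on
scenarios = {
    "gradual augmentation": {"cheap compute", "regulatory delay"},
    "rapid displacement":   {"cheap compute", "enterprise adoption"},
    "hybrid equilibrium":   {"regulatory delay"},
}

# Shared assumptions: those appearing in two or more scenarios
counts = Counter(a for deps in scenarios.values() for a in deps)
shared = {a for a, n in counts.items() if n >= 2}

# Assumed impact model: probability mass that would shift if the
# assumption flipped; the linchpin maximizes this redistribution
redistribution = {"cheap compute": 0.40,
                  "regulatory delay": 0.25,
                  "enterprise adoption": 0.15}
linchpin = max(redistribution, key=redistribution.get)
```

In this toy example the linchpin is also a shared assumption, which matches the intuition above: claims that prop up multiple scenarios move the most probability when they break.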

Why this matters: A forecast that says "35% probability" is useful for about a week. A forecast that says "35% probability, and here are the 3 assumptions it depends on, here's when you can check each one, and here's the one that matters most" — that's useful for months. The assumptions are the monitoring system for the forecast itself.

The personas

Murmur ships with a diverse roster of personas spanning cybersecurity, technology, business, policy, finance, and humanities. Each has its own domain expertise, analytical framework, known cognitive biases, and blind spots.

The diversity is the point. A CEO and an actuary will forecast the same question through completely different lenses. That's not noise — it's signal. The scenarios that emerge from clustering many different perspectives are richer than any single expert's prediction.

What Murmur is not

Murmur is not an oracle. It's a structured thinking tool. The output is not "the answer" — it's a map of plausible futures weighted by probability, with the key assumptions and breaking conditions made explicit.

The value isn't the point estimate. It's the decomposition: what are the real sub-questions? Where do smart people disagree, and why? What specific evidence would change the picture?

Use it to think better, not to think less.

References

[1] Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. The foundational work on what makes forecasters accurate. Wikipedia · Good Judgment Project

[2] Baron, J. et al. (2014). Two Reasons to Make Aggregated Probability Forecasts More Extreme. Decision Analysis, 11(2), 133–145. The empirical basis for extremized aggregation with d=2.5. doi:10.1287/deca.2014.0293

[3] Halawi, D. et al. (2024). Approaching Human-Level Forecasting with Language Models. arXiv preprint. Demonstrates structured prompting improves LLM forecasting accuracy by up to 41% over baseline. arXiv:2402.18563

[4] Schoenegger, P. et al. (2024). AI Superforecasting: Can AI Beat Human Forecasters? Multi-agent experiments showing independent analysis followed by selective debate outperforms consensus-seeking approaches. arXiv:2409.08322

[5] Zou, A. et al. (2024). Forecasting with Large Language Models. arXiv preprint. Structured single-pass prompts with decomposition and base rate anchoring outperform multi-round self-revision. arXiv:2402.01426