Murmur is a synthetic forecasting swarm. It runs dozens of AI personas with diverse expertise through a structured analytical process, clusters their predictions into natural scenarios, and surfaces the disagreements that matter most.
Most predictions fail for predictable reasons. People anchor on a single narrative. They confuse confidence with accuracy. They ignore base rates. They don't update when evidence changes. And when asked "what's the probability?", they round to 0%, 50%, or 100%.
Philip Tetlock's Good Judgment Project spent decades studying what separates accurate forecasters from everyone else. The answer wasn't domain expertise or intelligence — it was a specific set of cognitive habits: thinking in probabilities, breaking questions into components, balancing inside and outside views, and actively looking for reasons you might be wrong.[1]
The problem is that these habits are hard to maintain. Even trained superforecasters regress when they're tired, rushed, or emotionally invested. Murmur automates the discipline.
Vague questions produce vague forecasts. Before anything runs, Murmur asks you 2–4 clarifying questions based on Tetlock's commandments: What's the timeframe? What would count as resolution? What's your prior estimate? What's the strongest force for and against?
Your answers get synthesized into a precise, falsifiable question with explicit resolution criteria — the kind of question that can actually be scored right or wrong later.
Before any forecasting begins, Murmur searches the web for recent context on your question. This grounds the forecast in current reality — not just the models' training data, which may be months or years stale.
The search results are injected into every persona's prompt as background context, with explicit anti-anchoring instructions: the models are told to use the results as one input among many, not as the primary evidence. This prevents the common failure mode where a model over-indexes on a single recent news article instead of doing structured analytical reasoning.
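As an illustration of this injection pattern, here is a minimal sketch of prompt assembly with anti-anchoring framing. The function name and all wording are hypothetical, not Murmur's actual prompt:

```python
def build_persona_prompt(persona, question, search_snippets):
    """Assemble a persona prompt that injects web search context with
    anti-anchoring framing. Wording is illustrative only."""
    context = "\n".join(f"- {s}" for s in search_snippets)
    return (
        f"You are {persona}.\n\n"
        f"Question: {question}\n\n"
        "Recent web context. Treat it as one input among many, not as "
        "the primary evidence, and do not anchor on any single item:\n"
        f"{context}\n\n"
        "Reason from base rates first, then adjust for this question's specifics."
    )
```

The anti-anchoring instruction travels with the evidence itself, so every persona sees it regardless of how the rest of its prompt varies.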
Murmur sends your question to multiple AI personas, each with distinct expertise, analytical frameworks, known cognitive biases, and blind spots. A CEO thinks about market timing. An actuary thinks about tail risk. A red teamer thinks about how assumptions break under pressure. An artist thinks about cultural adoption. A philosopher questions the hidden assumptions everyone else takes for granted.
Murmur selects the most relevant personas based on the question's domain — always including an adversarial challenger and a humanistic perspective. Each persona runs multiple times with varied parameters — different temperatures, evidence emphases, and temporal anchors — producing dozens of independent forecasts.
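One way to sketch that run matrix (model names are from the text below; the temperature values are illustrative, and the evidence-emphasis and temporal-anchor axes are omitted for brevity):

```python
from itertools import product

def swarm_configs(personas, models=("claude", "deepseek"),
                  temperatures=(0.4, 0.8, 1.1)):
    """Expand a persona list into individual run configurations,
    split across models and temperatures (values illustrative)."""
    return [{"persona": p, "model": m, "temperature": t}
            for p, m, t in product(personas, models, temperatures)]
```

Two personas under these defaults already yield 12 independent runs; the real grid presumably also varies evidence emphasis and temporal anchoring, multiplying further.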
Critically, these runs are split across multiple AI models — currently Claude and DeepSeek. This provides genuine model diversity: different training data, different priors, different blind spots. When Claude and DeepSeek disagree, that disagreement is real signal about uncertainty, not just temperature variation within the same model.
Every forecast follows a structured reasoning chain: a base rate estimate (the outside view), an inside-view adjustment for this question's specifics, decomposition into sub-questions, and an explicit statement of what evidence would change the forecaster's mind.
Dozens of probability estimates don't speak for themselves. Murmur clusters them into 2–3 natural scenarios using a combination of DBSCAN (density-based clustering that finds natural groupings) and k-means with silhouette score optimization. The cap at 3 scenarios is deliberate — more than 3 produces blurry, overlapping futures that don't help decision-making.
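A production version would use scikit-learn's DBSCAN and KMeans; as a dependency-free sketch of the k-means-with-silhouette half, here is a minimal 1-D version over raw probability estimates. The function names and the deterministic quantile initialization are my own, and it assumes the estimates have some spread:

```python
def kmeans_1d(xs, k, iters=50):
    """Minimal 1-D k-means; centers start at evenly spaced quantiles
    of the sorted data, so results are deterministic."""
    s = sorted(xs)
    centers = [s[(len(s) - 1) * j // (k - 1)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

def silhouette(xs, labels):
    """Mean silhouette score: (b - a) / max(a, b) per point, where a is
    the mean distance within the point's cluster and b to the nearest other."""
    def mean_dist(x, group):
        return sum(abs(x - y) for y in group) / len(group)
    total = 0.0
    for i, (x, l) in enumerate(zip(xs, labels)):
        own = [y for j, (y, m) in enumerate(zip(xs, labels)) if m == l and j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = mean_dist(x, own)
        b = min(mean_dist(x, [y for y, m in zip(xs, labels) if m == c])
                for c in set(labels) if c != l)
        total += (b - a) / max(a, b)
    return total / len(xs)

def cluster_forecasts(probs, max_k=3):
    """Choose k in {2, 3} by silhouette score; the cap at 3 mirrors
    the scenario limit described in the text."""
    best = max(range(2, max_k + 1),
               key=lambda k: silhouette(probs, kmeans_1d(probs, k)[0]))
    return kmeans_1d(probs, best)
```

On a set of estimates with three tight groups, the silhouette score peaks at k=3 and the labels recover the groups; with only two distinguishable camps, it falls back to k=2.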
The result isn't a single number. It's a map of possible futures: "35% of forecasters think gradual augmentation, 28% think rapid displacement, 22% think hybrid equilibrium." Each cluster represents a coherent, distinct story about what could happen.
Murmur shows two aggregate probabilities, not one, because there is honest uncertainty about the right way to combine the estimates.
Panel mean is the simple average across all forecasters. This is the right number if you believe the personas share systematic biases from the same base model — which they do. They all read the same training data. When they agree, it might reflect genuine evidence or a shared blind spot. The mean treats their agreement cautiously.
Extremized aggregate uses Tetlock's formula from the Good Judgment Project: geometric mean of odds, then push away from 50% by a factor of d=2.5.[2] The intuition: if independent forecasters mostly agree, the true probability is probably more extreme than the average. This was validated on genuinely independent human superforecasters in the IARPA tournament.
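Both aggregates take only a few lines. This sketch assumes the standard log-odds form of extremization, in which the exponent d is applied to the geometric mean of the odds:

```python
import math

def panel_mean(probs):
    """Simple average across all forecast runs."""
    return sum(probs) / len(probs)

def extremized(probs, d=2.5):
    """Geometric mean of odds, pushed away from 50% by exponent d.
    Equivalent to multiplying the mean log-odds by d."""
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    odds = math.exp(d * mean_log_odds)
    return odds / (1 + odds)
```

For three runs at 0.60, 0.65, and 0.70, the panel mean is 0.65 while the extremized aggregate lands near 0.83: agreement among assumed-independent forecasters gets pushed toward certainty, which is exactly why the cautious mean is shown alongside it.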
After clustering, Murmur identifies the two scenarios with the highest disagreement and picks a "champion" persona from each — the persona whose viewpoint dominates that cluster. Then it runs a structured debate: each champion sees the other's strongest argument and must rebut it.
The debate doesn't revise the numbers. Its purpose is to surface the core analytical tension — the structural disagreement that explains why the forecasters diverge. This is often the most useful output: not "38% probability" but "the real question is whether regulatory friction or market pressure wins."
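A sketch of the pairing step, under assumed shapes: clusters map scenario names to (persona, probability) runs, "highest disagreement" is read as the largest gap between cluster means, and the champion is the persona closest to its cluster's mean (Murmur's actual dominance rule may differ):

```python
def pick_debaters(clusters):
    """clusters: {scenario: [(persona, prob), ...]} (assumed shape).
    Return the two scenarios whose mean probabilities diverge most,
    each paired with a champion persona near its cluster mean."""
    means = {s: sum(p for _, p in runs) / len(runs)
             for s, runs in clusters.items()}
    ordered = sorted(means, key=means.get)
    low, high = ordered[0], ordered[-1]  # farthest-apart pair of means
    def champion(s):
        return min(clusters[s], key=lambda run: abs(run[1] - means[s]))[0]
    return (low, champion(low)), (high, champion(high))
```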
Each cluster gets a narrative: a vivid 2–3 sentence description of what this future looks like, the key assumption it depends on, and the condition that would break it. This turns statistical clusters into stories you can reason about.
Every scenario is also expandable — you can drill into the reasoning of individual forecasters to see their base rate estimate, inside view adjustment, sub-question decomposition, and what specific evidence would change their mind. This transparency lets you evaluate why the number is what it is, not just what the number is.
The final step is often the most valuable. Murmur examines all the scenarios and extracts the load-bearing assumptions — the specific, falsifiable claims about the world that must be true for each scenario to play out.
For each assumption, Murmur identifies which scenarios depend on it and what condition would break it.
Critically, Murmur also identifies shared assumptions — assumptions that appear across multiple scenarios. These are the highest-leverage monitoring targets, because if a shared assumption breaks, it doesn't just shift one scenario — it reshuffles the entire forecast.
The linchpin assumption is the single assumption whose reversal would cause the largest redistribution of probability across all scenarios. This is the thing to watch. If you're going to monitor one signal to know whether the forecast is still valid, it's this one.
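A sketch of both extractions, assuming hypothetical data shapes: sets of assumption names per scenario, and signed probability shifts per assumption reversal:

```python
from collections import Counter

def shared_assumptions(scenario_assumptions):
    """Assumptions that appear in two or more scenarios.
    scenario_assumptions: {scenario: set of assumption names}."""
    counts = Counter(a for deps in scenario_assumptions.values()
                     for a in deps)
    return {a for a, n in counts.items() if n >= 2}

def linchpin(assumption_impacts):
    """The assumption whose reversal redistributes the most probability.
    assumption_impacts: {assumption: {scenario: signed shift}}; shifts
    across scenarios sum to roughly zero, so the mass moved is half
    the total absolute shift."""
    def moved_mass(deltas):
        return sum(abs(d) for d in deltas.values()) / 2
    return max(assumption_impacts,
               key=lambda a: moved_mass(assumption_impacts[a]))
```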
Murmur's most important output isn't the probability — it's the metadata that tells you when to trust it and when to bring your own judgment.
When 80% or more of individual forecast runs agree on the same side, Murmur flags it prominently. High consensus from personas sharing a base model can mean genuine evidence strength — or it can mean a shared blind spot in the training data. The warning surfaces the shared assumption driving the consensus so you can evaluate it yourself.
In benchmarking, the consensus warning correctly fired on both of Murmur's worst prediction failures. It also fired on several correct predictions. The warning doesn't mean the forecast is wrong — it means this is where your domain knowledge matters most.
When the swarm reaches consensus but a minority of personas disagree, those dissenting voices are surfaced explicitly. Murmur identifies which personas dissent, how far they diverge from the consensus, and their reasoning.
Dissent is classified by strength: weak (one persona, could be noise), moderate (two distinct personas with coherent counter-reasoning), or strong (three or more personas, likely seeing something the majority misses). Strong dissent against consensus is the most valuable signal the tool produces.
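Both signals reduce to simple checks. This sketch uses the thresholds stated in the text; splitting "sides" at a probability of 0.5 is an assumption:

```python
def consensus_warning(probs, share=0.8):
    """Flag when at least `share` of runs fall on the same side of 50%."""
    above = sum(p > 0.5 for p in probs)
    majority = max(above, len(probs) - above)
    return majority / len(probs) >= share

def dissent_strength(n_dissenting_personas):
    """Classify dissent as described: weak, moderate, or strong."""
    if n_dissenting_personas >= 3:
        return "strong"
    if n_dissenting_personas == 2:
        return "moderate"
    return "weak" if n_dissenting_personas == 1 else "none"
```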
Each forecast costs approximately $0.45. At 10 forecasts per day, that's roughly $4.50/day or $135/month. The biggest cost driver is Claude's swarm runs; DeepSeek provides genuine model diversity at negligible marginal cost.
Murmur ships with a diverse roster of personas spanning cybersecurity, technology, business, policy, finance, and humanities. Each has distinct expertise, an analytical framework, documented cognitive biases, and known blind spots.
The diversity is the point. A CEO and an actuary will forecast the same question through completely different lenses. That's not noise — it's signal. The scenarios that emerge from clustering many different perspectives are richer than any single expert's prediction.
Murmur is not an oracle. It's a structured thinking tool. The output is not "the answer" — it's a map of plausible futures weighted by probability, with the key assumptions and breaking conditions made explicit.
The value isn't the point estimate. It's the decomposition: what are the real sub-questions? Where do smart people disagree, and why? What specific evidence would change the picture?
Use it to think better, not to think less.
[1] Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. The foundational work on what makes forecasters accurate.
[2] Baron, J. et al. (2014). Two Reasons to Make Aggregated Probability Forecasts More Extreme. Decision Analysis, 11(2), 133–145. The empirical basis for extremized aggregation with d=2.5. doi:10.1287/deca.2014.0293
[3] Halawi, D. et al. (2024). Approaching Human-Level Forecasting with Language Models. arXiv preprint. Demonstrates structured prompting improves LLM forecasting accuracy by up to 41% over baseline. arXiv:2402.18563
[4] Schoenegger, P. et al. (2024). AI Superforecasting: Can AI Beat Human Forecasters? Multi-agent experiments showing independent analysis followed by selective debate outperforms consensus-seeking approaches. arXiv:2409.08322
[5] Zou, A. et al. (2024). Forecasting with Large Language Models. arXiv preprint. Structured single-pass prompts with decomposition and base rate anchoring outperform multi-round self-revision. arXiv:2402.01426