How the Optimizely Stats Engine Works

Optimizely's Stats Engine is the statistical framework behind every winner, loser, and "inconclusive" verdict on the Experiment Results page. It exists to solve a specific, expensive problem: practitioners watch experiments in real time and make decisions the moment a result "looks significant," and traditional statistics punishes that behavior with a flood of false positives. Understanding how Stats Engine works is the difference between trusting a result and acting on noise.

What Stats Engine Is and the Problem It Solves

Stats Engine is Optimizely's proprietary statistical methodology for evaluating A/B tests. It is a frequentist engine built on sequential testing combined with false discovery rate (FDR) control. Those two mechanisms are the whole story, and each addresses a distinct failure mode of classical A/B testing.

The first failure mode is the peeking problem. Traditional fixed-horizon statistics (the t-test being the canonical example) are only valid if you commit to a sample size in advance, wait until you reach it, and look exactly once. Every time you check results early and react, you give random noise another chance to cross the significance threshold. An experimenter who peeks repeatedly and stops at the first "significant" reading will report a false "winner" far more often than the stated error rate suggests. Yet waiting passively for a pre-computed sample size is exactly what real experimentation teams do not do.
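To make the peeking problem concrete, here is a minimal simulation sketch. It is not Optimizely code. It runs A/A tests (both arms share the same true conversion rate, so every "winner" is false) and compares two habits: stopping at the first look where a classical pooled z-test is significant, versus looking once at the end. The function names, the 10% conversion rate, the 20 evenly spaced looks, and the sample sizes are all illustrative assumptions.

```python
import math
import random

ALPHA = 0.10  # assumed threshold, matching a 90% significance level


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test, a classical fixed-horizon test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data yet, so no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def run_aa_test(rng, rate, visitors_per_arm, looks):
    """One A/A test. Returns (significant at any look, significant at the final look)."""
    checkpoints = {visitors_per_arm * (i + 1) // looks for i in range(looks)}
    conv_a = conv_b = 0
    flagged_on_any_look = False
    p = 1.0
    for n in range(1, visitors_per_arm + 1):
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n in checkpoints:
            p = two_sided_p_value(conv_a, n, conv_b, n)
            flagged_on_any_look = flagged_on_any_look or p < ALPHA
    # The last checkpoint is the full sample, so p now holds the final-look value.
    return flagged_on_any_look, p < ALPHA


def false_positive_rates(experiments=500, rate=0.10,
                         visitors_per_arm=2000, looks=20, seed=7):
    """Share of A/A tests flagged under each habit: (peeking, single final look)."""
    rng = random.Random(seed)
    any_look = final_look = 0
    for _ in range(experiments):
        peeked, waited = run_aa_test(rng, rate, visitors_per_arm, looks)
        any_look += peeked
        final_look += waited
    return any_look / experiments, final_look / experiments


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"stop at first significant look: {peeking:.1%} false winners")
    print(f"single look at the end:         {fixed:.1%} false winners")
```

The single-look rate should land near the nominal 10%, while the stop-at-first-significant-look rate comes out well above it. That gap is the inflation the paragraph above describes.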

The second failure mode is the multiple comparisons problem. Real experiments rarely track one metric on one variation. Add more variations and more metrics and the chance of at least one false positive climbs quickly, even though each individual test holds its error rate. Worse, the rate that actually matters to a decision-maker, the proportion of false positives among the results you act on, is higher still.
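A quick calculation shows how fast that chance climbs. This sketch assumes the comparisons are independent and that none has a real effect; real metrics are usually correlated, so treat the numbers as an illustration rather than a prediction. The function name is hypothetical.

```python
def chance_of_any_false_positive(alpha, comparisons):
    """Probability of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** comparisons


if __name__ == "__main__":
    # alpha = 0.10 corresponds to a 90% significance threshold.
    for m in (1, 5, 10, 20):
        print(f"{m:>2} comparisons: {chance_of_any_false_positive(0.10, m):.1%}")
```

Under those assumptions, ten comparisons at a 10% per-test error rate already carry roughly a 65% chance of at least one false positive.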

Stats Engine is engineered so that both problems are handled automatically. As Optimizely's documentation states, results are always valid: you can monitor a test continuously and stop as soon as you have a clear winner, without invalidating it, because the engine controls the false discovery rate throughout.

How Sequential Testing Solves the Peeking Problem

Sequential testing is the mechanism that makes continuous monitoring safe. Rather than computing a single p-value at one fixed endpoint, Stats Engine evaluates the experiment as evidence accumulates over time and produces inferences that remain valid no matter when you look.

Optimizely's own framing uses a baking analogy that is worth keeping in mind:

Fixed Horizon : Set a timer before baking. You may only open the oven
                when the timer ends. Open it early and the result is unreliable.
Sequential    : Put the cake in without committing to a time. Open the oven
                whenever you like to check; looking never ruins the result.
                When it looks done, it is done.

Mechanically, Stats Engine does not compute one confidence interval. It computes a series of 100 successive confidence intervals across the experiment's lifetime, each with its own significance value. The numbers you see on the Results page are deliberately conservative summaries of that series:

  • The statistical significance shown is the smallest significance value observed across those sequential intervals, not the average and not the latest.

  • The confidence interval shown is the running intersection of all prior intervals: it tracks the smallest upper limit and the largest lower limit seen during the run.

Because of this, the displayed significance and interval may not exactly match the currently observed conversion rates. That is intentional. It is what makes the result robust to having been observed many times.
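The running intersection described above can be sketched in a few lines. This only illustrates the bookkeeping (largest lower limit, smallest upper limit); it does not show how Stats Engine computes each interval. The function name and the sample intervals are made up for illustration.

```python
def running_intersection(intervals):
    """Yield the intersection of all (lower, upper) intervals seen so far."""
    lower, upper = float("-inf"), float("inf")
    for lo, hi in intervals:
        lower = max(lower, lo)  # largest lower limit seen so far
        upper = min(upper, hi)  # smallest upper limit seen so far
        yield lower, upper


if __name__ == "__main__":
    # Hypothetical successive intervals on relative improvement.
    raw = [(-0.10, 0.30), (-0.02, 0.22), (0.01, 0.25), (-0.01, 0.18)]
    for latest, shown in zip(raw, running_intersection(raw)):
        print(f"latest {latest} -> displayed {shown}")
```

In this made-up run the latest raw interval dips back below zero, but the displayed interval stays at (0.01, 0.18) because it remembers the tighter lower limit from an earlier look. That is one way the displayed values can differ from what the current snapshot alone would suggest.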

Why Significance Climbs (and Occasionally Drops)

In a stable environment, significance should rise in a stepwise, generally increasing fashion as evidence accumulates. Two forms of evidence move it: larger differences between conversion rates, and differences that persist across more visitors. Early on, when the sample is small, large swings are treated conservatively, so you often see a flat line that later rises sharply once real evidence accrues.

Significance can also fall, though Optimizely's analysis indicates this happens in only about 4% of experiments. Small dips of a few percentage points come from time bucketing: Optimizely divides the experiment's runtime into 100 time buckets of equal length, and as the test runs longer each bucket widens, so visitors are reassigned among buckets and the values recomputed, which produces minor fluctuations. Larger drops, potentially all the way to 0%, come from a stats reset, a protective mechanism that triggers when the engine detects that the underlying environment has changed (that is, the assumption that the data are identically distributed over time has been violated). A reset is the engine refusing to stand behind a conclusion the new evidence no longer supports.

False Discovery Rate Control vs Traditional Significance

The second pillar is what makes Stats Engine trustworthy when you run many metrics and variations. The naive error metric is the false positive rate: out of all the comparisons where there is truly no effect, what fraction are wrongly flagged? Optimizely controls something more decision-relevant, the false discovery rate: out of the results you would actually act on (the declared winners and losers), what fraction are wrong?

The distinction matters enormously. Optimizely's worked example: an experiment with ten comparison opportunities reports two winners, one of which is a false winner. Counted against all ten comparisons, as the worked example does, the false positive rate is 1 in 10, or 10%, which sounds acceptable. But you do not implement the eight inconclusive results; you implement the two winners. Among those, your error rate is 1 in 2, or 50%. The false discovery rate captures the risk that actually reaches your roadmap.

False positive rate  = false positives / all comparisons       = 1/10 = 10%
False discovery rate = false positives / declared discoveries  = 1/2  = 50%
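The same arithmetic as a small helper, useful for plugging in your own counts. The function and parameter names are illustrative, and the first figure follows the worked example in dividing by all comparisons.

```python
def error_rates(false_winners, true_winners, comparisons):
    """Return (false positives as a share of all comparisons, false discovery rate)."""
    declared = false_winners + true_winners
    share_of_all = false_winners / comparisons
    # With no declared results there is nothing to act on, so report 0.
    fdr = false_winners / declared if declared else 0.0
    return share_of_all, fdr


if __name__ == "__main__":
    share, fdr = error_rates(false_winners=1, true_winners=1, comparisons=10)
    print(f"share of all comparisons: {share:.0%}")  # 10%
    print(f"false discovery rate:     {fdr:.0%}")    # 50%
```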

To control FDR across many hypotheses, Stats Engine uses a tiered version of the Benjamini-Hochberg procedure. The tiering reflects that not all metrics deserve equal weight:

  • Primary metric (rank 1) — evaluated independently of all others, so it reaches significance as fast as possible and is unaffected by how many other metrics you track.

  • Secondary metrics (ranks 2-5) — their significance threshold is adjusted for the number of metrics and variations. Adding more secondary metrics can slow each of them to significance, but never slows the primary metric.

  • Monitoring metrics (rank 6+) — each given a fractional weight of 1/n, so they have minimal impact on secondary metrics and none on the primary metric.

The practical payoff: Optimizely keeps the false discovery rate low (approximately 10%) while still letting your most important metric reach significance quickly. This is also why Stats Engine uses two-tailed tests, which are required for FDR control.
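For orientation, here is a sketch of the textbook Benjamini-Hochberg step-up procedure. It is not Optimizely's tiered variant: it treats every hypothesis equally and ignores the primary, secondary, and monitoring weighting described above. The function name and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Standard (untiered) BH: return True for each hypothesis rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank k with p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p_values = [0.30, 0.04, 0.03, 0.045]  # hypothetical metric-variation comparisons
    print(benjamini_hochberg(p_values, q=0.10))  # [False, True, True, True]
```

Note the step-up behavior: 0.03 is above its own rank-1 threshold of 0.025, yet it is still rejected because a larger p-value further down the ranking clears its threshold. The thresholds also depend on the total number of hypotheses, which is why adding metrics can slow each one's path to significance.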

One caveat the documentation is explicit about: FDR control is not maintained when you segment results. The deeper you slice, the higher your chance of finding a spurious "significant" segment. Use segments for exploration, not for decisions, and only inspect the most meaningful ones.

How to Read Significance and Confidence Intervals

Statistical significance answers a precise question: how unusual would these results be if the variation and baseline truly performed identically? At 90% significance, you are accepting roughly a 10% false-positive risk on that call. The confidence interval is the estimated range that likely contains the true effect (the true uplift), and Optimizely sets its confidence level to match your project's significance threshold (90% by default).

The single most useful rule for reading the Results page:

A variation reaches significance exactly when its confidence interval stops crossing zero.

  • Confidence interval entirely above 0% means a winning variation.

  • Confidence interval includes 0% means inconclusive (you cannot yet rule out "no effect").

  • Confidence interval entirely below 0% means a losing variation.

Before declaring anything, Stats Engine enforces minimum data thresholds. For binary metrics, it requires at least 100 visitors or sessions and at least 25 conversions in both the baseline and the variation. For numeric metrics such as revenue, it requires at least 100 visitors or sessions but no fixed conversion count. Until those are met, the page reports that more visitors are needed and estimates the wait.
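The reading rule and the data thresholds above can be combined into one small decision helper. This is a sketch of the logic as described here, not of the Results page itself. The class and function names are hypothetical, and applying the visitor minimum to each arm separately is an assumed reading of the thresholds.

```python
from dataclasses import dataclass

MIN_VISITORS = 100    # visitors or sessions
MIN_CONVERSIONS = 25  # applies to binary metrics only


@dataclass
class ArmData:
    visitors: int
    conversions: int = 0


def has_enough_data(baseline, variation, binary_metric=True):
    """Check the minimum-data thresholds for both arms (assumed to apply per arm)."""
    for arm in (baseline, variation):
        if arm.visitors < MIN_VISITORS:
            return False
        if binary_metric and arm.conversions < MIN_CONVERSIONS:
            return False
    return True


def read_result(baseline, variation, ci_lower, ci_upper, binary_metric=True):
    """Map thresholds plus the confidence interval to a verdict string."""
    if not has_enough_data(baseline, variation, binary_metric):
        return "needs more visitors"
    if ci_lower > 0:
        return "winner"        # interval entirely above zero
    if ci_upper < 0:
        return "loser"         # interval entirely below zero
    return "inconclusive"      # interval includes zero


if __name__ == "__main__":
    base = ArmData(visitors=4000, conversions=400)
    var = ArmData(visitors=4000, conversions=460)
    print(read_result(base, var, ci_lower=0.02, ci_upper=0.28))   # winner
    print(read_result(base, var, ci_lower=-0.03, ci_upper=0.28))  # inconclusive
```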

A useful judgment heuristic from the documentation: if the observed mean (the tick mark) sits near the edge of the confidence interval, the engine may be accumulating evidence against its own conclusion, so consider waiting. If the observed mean sits near the center, you can be more confident the call will hold.

For revenue-per-visitor goals, be aware that revenue distributions are heavily skewed, which reduces statistical power. Stats Engine applies skew correction to recover some of that power and to shape the confidence interval correctly, but skewed metrics still generally need more data than binary ones.

Stats Accelerator: A Separate Feature, Not the Engine

Stats Accelerator is frequently conflated with Stats Engine, but they are different things. Stats Engine is the statistical methodology that evaluates results. Stats Accelerator is a traffic-allocation feature that sits on top of it and uses an algorithm from the multi-armed bandit family (a variation on the Upper Confidence Bound strategy) to shorten the time to statistical significance.

Stats Accelerator monitors a running experiment and routes more traffic toward the variation showing the most significant difference from the baseline, regardless of whether that difference is positive or negative, because its goal is to minimize time, not regret. Once a variation reaches significance, it is removed from consideration and traffic is redistributed to the rest. It requires at least three variations (a baseline plus two). It still produces statistical significance, because the underlying engine is still doing the inference.

This is distinct from a true multi-armed bandit (MAB) optimization (formerly "Accelerate Impact"), which minimizes regret by funneling traffic to whichever variation currently performs best on the primary metric. MABs are for temporary, value-maximizing scenarios such as a Black Friday promotion, and crucially MAB optimizations do not generate statistical significance at all.

Feature             Goal                               Produces significance?  Use when
Stats Accelerator   Minimize time to significance      Yes                     You want a reliable winner faster
Multi-armed bandit  Maximize reward / minimize regret  No                      Short-lived campaigns; no permanent decision needed

Because Stats Accelerator changes traffic allocation mid-flight, it risks a sampling bias called Simpson's Paradox when conversion rates vary over time (for example, weekday-vs-weekend seasonality). Optimizely addresses this with the Epoch Stats Engine, which produces a stratified, weighted improvement estimate, comparing baseline and variation within each interval between allocation changes, then combining those intervals by visitor count. This is also why, with Stats Accelerator enabled, the Results page may report both absolute (in percentage points) and relative improvement. For Feature Experimentation, use a user profile service (sticky bucketing) so frequent reallocation does not expose one visitor to multiple variations.
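A small numeric sketch shows why the stratified estimate matters. The data are invented: a weekday epoch with a 50/50 split and a weekend epoch in which most traffic has been shifted to the variation while conversion rates are higher for everyone. Within each epoch the variation is exactly one percentage point better. Weighting each epoch by its total visitor count is an assumed reading of "combining those intervals by visitor count", and the names are hypothetical.

```python
def pooled_difference(epochs):
    """Naive estimate: pool all visitors per arm, then compare conversion rates."""
    base_v = sum(e["baseline"][0] for e in epochs)
    base_c = sum(e["baseline"][1] for e in epochs)
    var_v = sum(e["variation"][0] for e in epochs)
    var_c = sum(e["variation"][1] for e in epochs)
    return var_c / var_v - base_c / base_v


def stratified_difference(epochs):
    """Compare arms within each epoch, then average weighted by epoch visitors."""
    total = 0
    weighted_sum = 0.0
    for e in epochs:
        (base_v, base_c), (var_v, var_c) = e["baseline"], e["variation"]
        epoch_visitors = base_v + var_v  # assumed weighting: all visitors in the epoch
        weighted_sum += epoch_visitors * (var_c / var_v - base_c / base_v)
        total += epoch_visitors
    return weighted_sum / total


if __name__ == "__main__":
    # Each arm is (visitors, conversions). Invented numbers.
    epochs = [
        {"baseline": (5000, 500), "variation": (5000, 550)},   # weekday: 10% vs 11%
        {"baseline": (1000, 200), "variation": (9000, 1890)},  # weekend: 20% vs 21%
    ]
    print(f"pooled:     {pooled_difference(epochs):+.4f}")
    print(f"stratified: {stratified_difference(epochs):+.4f}")
```

The pooled comparison reports a lift of more than five percentage points, almost all of it produced by the variation receiving most of its traffic during the high-converting weekend. The stratified estimate recovers the one-point lift that holds inside each epoch.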

Common Misconceptions

Is the Optimizely Stats Engine Bayesian?

No. This is the most common misconception, and it is worth correcting precisely. Stats Engine is a frequentist sequential method. Optimizely does offer a separate, explicitly Bayesian A/B testing mode (which expresses results as direct probabilities like "90% chance B beats A"), and a separate Frequentist Fixed Horizon mode. But the classic Stats Engine, the one that powers sequential testing, is frequentist. It reports statistical significance and frequentist confidence intervals, not posterior probabilities. The reason a search for "Optimizely Bayesian" surfaces Stats Engine at all is that both Bayesian and sequential methods let you peek and stop early. That shared behavior does not make them the same methodology.

Stats Engine results disagree with my t-test, so they must be wrong

They can legitimately disagree, and Stats Engine is the more trustworthy of the two when you have been monitoring continuously. A t-test uses only the currently observed means and their difference, so each look is a fresh snapshot that ignores how the evidence got there; checked repeatedly, it can cross the threshold on noise alone and cross back later. Stats Engine's intersected intervals are more conservative, less likely to declare a false winner, and less likely to reverse a call later.

I can keep slicing segments until something is significant

You can, but you will be manufacturing false discoveries. FDR control does not extend across segments. Repeated segment-hunting inflates false positives exactly like peeking does.

A stats reset means the tool is broken

The opposite. A reset means the engine detected that the environment changed and is protecting you from standing behind a conclusion the new data no longer supports.

Practical Guidance for Trusting Your Results

  • Rank your metrics deliberately. Put the metric that defines success as the primary metric, ideally measured close to the change in the funnel. It gets independent, fastest-to-significance treatment; everything else is secondary or monitoring.

  • Let the test run to its planned duration even though you can peek. Sequential validity means peeking will not break your stats, but a result that has barely cleared the threshold on thin data is fragile. Treat experimentation as a standardized process, not a dashboard you babysit.

  • Read the confidence interval, not just the significance number. Width tells you precision; position relative to zero tells you direction; the tick mark's position warns you whether a call is at risk of reversing.

  • Use segments to explore, never to decide. If a segment looks interesting, treat it as a hypothesis for a new, properly powered experiment.

  • Match the method to the intent. Use a standard A/B test (sequential Stats Engine) when you need a trustworthy decision; add Stats Accelerator to reach that decision faster; use a multi-armed bandit only for temporary value-maximization where you do not need a statistically defensible winner.
