Hello fellow experimenters! 👋
If you have been working with Optimizely Data for a while, you have probably noticed some puzzling behaviors on the results page that left you scratching your head. Today we are diving deep into the edge cases and gotchas that every practitioner should understand to avoid common interpretation mistakes.
Whether you are dealing with late conversions after pausing an experiment, forced variations affecting your statistics, or complex audience targeting scenarios, this guide helps you navigate the tricky waters of experiment result interpretation with confidence.
Paused Experiments: What Happens to Late Conversions?
One of the most confusing scenarios occurs when you pause an experiment but continue to see conversions trickling in. Many practitioners assume that pausing an experiment immediately stops all conversion attribution, but that is not how Optimizely works.
Conversion Attribution After Pause
Here is the key insight: conversions are still counted even after you pause an experiment. When you call the SDK's track method, Optimizely records a conversion and attributes it to the variations the user has seen, regardless of the experiment's current running status.
This happens because conversion attribution is tied to the user’s exposure to variations, not the experiment’s current state. If a user was bucketed into a variation while the experiment was running, any subsequent conversions from that user will be attributed to their assigned variation until the attribution window expires.
// This conversion is still attributed even if the experiment is paused,
// as long as the user was bucketed into a variation while it was running.
optimizelyClient.track('purchase', 'user123', { location: 'NY' }, {
  revenue: 9999 // Optimizely's reserved revenue tag is an integer in cents ($99.99)
});
Actionable advice: Always check your experiment’s attribution window settings before pausing. If you need to completely stop attribution, consider resetting the experiment results or excluding the post-pause period from your analysis timeframe.
API Response Timing vs UI Display
Another timing gotcha relates to when results actually update. The Results API documentation notes that the requested end time is rounded down to the nearest 5-minute boundary at or before end_time, and that the end_time in the response may be earlier than requested if fresher results are not yet available.
This means there can be up to a 5-minute delay (or longer, when results are stale) between when conversions are sent and when they appear in results. The API response includes an is_stale flag that indicates whether you are looking at the most current data:
{
  "confidence_threshold": 0.9,
  "end_time": "2024-01-01T12:00:00Z",
  "experiment_id": 12345678,
  "is_stale": true,
  "metrics": [...]
}
Actionable advice: When analyzing recent changes, always check the is_stale flag and end_time in API responses. If you are seeing unexpected numbers, wait 5-10 minutes and check again before drawing conclusions.
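To make that check routine, here is a minimal sketch in JavaScript that fetches results and waits while they are stale. It assumes the v2 REST Results endpoint and a personal access token; verify the endpoint path and auth header against the current API reference.
// Minimal sketch: poll the Results API until is_stale is false (or attempts run out).
// Assumes the v2 REST endpoint /experiments/{id}/results and a bearer token;
// verify both against the current Optimizely API reference.
const API_BASE = 'https://api.optimizely.com/v2';

async function getFreshResults(experimentId, token, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(`${API_BASE}/experiments/${experimentId}/results`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    const results = await response.json();

    // end_time tells you how recent the snapshot actually is; is_stale warns
    // that fresher data exists but has not been computed yet.
    if (!results.is_stale) {
      return results;
    }
    console.warn(`Results stale as of ${results.end_time}; retrying in 5 minutes...`);
    await new Promise((resolve) => setTimeout(resolve, 5 * 60 * 1000));
  }
  throw new Error('Results are still stale after the final attempt');
}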
Force Variations: Overriding the Bucket System
Forced decisions can completely change how you should interpret your results, yet many practitioners do not fully understand their implications.
Bucketing Hierarchy and Conflicts
Optimizely follows a strict hierarchy when determining user bucketing. According to the bucketing documentation, the evaluation order is:
- Forced variations (setForcedVariation or Forced Decision methods)
- User allowlisting (specific user overrides)
- User profile service (persistent bucketing)
- Audience targeting (conditional logic)
- Exclusion groups (mutual exclusion)
- Traffic allocation (percentage splits)
The critical rule: If there is a conflict over how a user should be bucketed, then the first user-bucketing method to be evaluated overrides any conflicting method.
This means forced variations trump everything else, including your carefully crafted audience conditions.
// This user will be forced into 'treatment' regardless of audience targeting.
// flagKey identifies the flag; ruleKey identifies the specific experiment rule to override.
optimizelyUserContext.setForcedDecision(
  { flagKey: 'my_flag', ruleKey: 'my_experiment' },
  { variationKey: 'treatment' }
);
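Forced decisions persist on the user context for its lifetime, so it is worth cleaning them up once QA is done. A minimal sketch, assuming your SDK version exposes the removeForcedDecision and removeAllForcedDecisions methods on the user context (check your SDK's reference):
// Remove the override for one flag/rule pair once testing is finished
optimizelyUserContext.removeForcedDecision({ flagKey: 'my_flag', ruleKey: 'my_experiment' });

// Or clear every forced decision held by this user context
optimizelyUserContext.removeAllForcedDecisions();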
Force Variation Impact on Statistics
Here is a critical gotcha: forced variations can skew your statistical calculations. When users are artificially assigned to variations (rather than randomly bucketed), your experiment is no longer a true randomized controlled trial.
This impacts:
- Sample size calculations – Forced users may not represent your target population
- Confidence intervals – Non-random assignment violates statistical assumptions
- Significance testing – P-values may be misleading
Actionable advice: Document all forced variations and consider excluding forced users from statistical analysis. Use forced variations sparingly and primarily for QA purposes, not production traffic.
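One lightweight way to follow that advice is to route every override through a small wrapper that records who was forced into what, so those user IDs can be excluded from downstream analysis. A sketch with hypothetical names (qaLog and forceForQA are illustrative, not part of the Optimizely SDK):
// Hypothetical wrapper: apply a forced decision and keep an audit trail of it.
// qaLog and forceForQA are illustrative names, not part of the Optimizely SDK.
const qaLog = [];

function forceForQA(userContext, userId, flagKey, ruleKey, variationKey) {
  userContext.setForcedDecision({ flagKey, ruleKey }, { variationKey });
  qaLog.push({
    userId,
    flagKey,
    ruleKey,
    variationKey,
    forcedAt: new Date().toISOString(),
  });
}

// Later, export qaLog so these user IDs can be filtered out of your analysis.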
Multi-Attribute Targeting: When Complexity Breaks Results
Complex audience conditions with multiple attributes can create unexpected edge cases, especially when attributes have similar names across projects or when using list attributes.
Attribute Key Conflicts Across Projects
A common gotcha occurs when you have shared user IDs across multiple Optimizely projects with conflicting attribute definitions. If you are using the same user_id value but different attribute keys or value formats, you might see users incorrectly segmented in your results.
For example:
- Project A uses subscription_tier with values [“free”, “premium”, “enterprise”]
- Project B uses subscription_tier with values [1, 2, 3]
If the same user appears in both projects, their attribute values might not match your expected targeting logic.
Actionable advice: Maintain a centralized attribute schema document and use consistent naming conventions across all projects. Consider namespacing attributes by project (e.g., projectA_subscription_tier).
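As an illustration of that convention, the same user might be given namespaced attributes when each project's user context is created. The attribute keys and client names below are examples you would register and configure yourself, not existing identifiers:
// Namespaced attribute keys keep the two projects' definitions from colliding
// downstream. The keys and clients below are illustrative examples.
const userInProjectA = optimizelyClientA.createUserContext('user123', {
  projectA_subscription_tier: 'premium', // Project A: string tiers
});

const userInProjectB = optimizelyClientB.createUserContext('user123', {
  projectB_subscription_tier: 2, // Project B: numeric tiers
});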
List Attributes and Audience Overlap
List attributes create another potential pitfall. When targeting users who are part of an audience you have already defined somewhere outside of Optimizely, you might encounter users who appear in multiple lists or whose list membership has changed since upload.
The results page may show unexpected bucket assignments if:
- List data is stale (user was removed from external list but not from Optimizely)
- Users appear in conflicting lists with different targeting rules
- List upload timing does not align with experiment start
Actionable advice: Regularly audit and refresh your list attributes. Set up monitoring to track when list sizes change dramatically, which could indicate data sync issues.
Environment Scoping: QA Data Contamination
One of the biggest improvements in Feature Experimentation over legacy systems is environment-scoped results, but this change creates confusion for teams migrating from older setups.
Legacy vs Feature Experimentation Results
In Full Stack Experimentation (Legacy), results can only be viewed at the experiment level: experiments run across all environments and share a single results page. That means that when you run quality assurance (QA) on an experiment, the QA events get mixed into the live production results.
Feature Experimentation solves this by scoping experiment rules to an environment, so results are scoped to that environment as well. This eliminates a common pain point in which QA events contaminate live production results.
However, if you are looking at legacy experiment results, you might still be seeing contaminated data from multiple environments mixed together.
When to Reset vs Migrate Results
When encountering contaminated results, you have two options:
Reset Results:
- Choose this when QA data significantly skews your metrics
- Acceptable when you are early in the experiment lifecycle
- Required when test data creates impossible conversion values
Migrate/Filter Results:
- Use when you have significant production data you do not want to lose
- Filter by date range to exclude QA periods
- Exclude known test user IDs from analysis
Actionable advice: Before launching experiments in production, establish clear QA protocols that use separate environments. For legacy experiments with mixed data, document the contamination period and consider time-based filtering in your analysis.
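For the time-based filtering route, one option is to pull results only for the clean window via the Results API. A minimal sketch, assuming the v2 Results endpoint accepts start_time and end_time query parameters (confirm the exact parameter names against the current API reference):
// Sketch: request results only for the period after QA traffic stopped.
// Assumes the v2 Results endpoint and its start_time/end_time query params;
// verify both against the current Optimizely REST API reference.
async function getPostQAResults(experimentId, token, qaEndedAt) {
  const params = new URLSearchParams({
    start_time: qaEndedAt,              // e.g. '2024-01-15T00:00:00Z'
    end_time: new Date().toISOString(), // rounded down server-side to a 5-minute boundary
  });
  const response = await fetch(
    `https://api.optimizely.com/v2/experiments/${experimentId}/results?${params}`,
    { headers: { Authorization: `Bearer ${token}` } }
  );
  return response.json();
}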
Statistical Significance Gotchas
The results page statistical indicators often get misinterpreted, leading to premature or incorrect experiment conclusions.
Confidence Threshold vs Winning Direction
A critical distinction exists between confidence threshold and winning direction. The API response separates these concepts:
"metrics": [{
"conclusion": {
"loser": "string",
"winner": "string"
},
"winning_direction": "increasing",
// other metric data...
}]
The confidence_threshold (typically 0.9 for 90% confidence) indicates the statistical certainty required before a conclusion is drawn, while winning_direction shows whether the metric is expected to increase or decrease for a positive outcome.
A common mistake is declaring a “winner” when:
- The variation shows improvement in the wrong direction (e.g., decrease when increase is desired)
- Confidence threshold is met but practical significance is minimal
- Multiple comparisons have not been accounted for
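One way to guard against the first two mistakes is to check the conclusion, the winning direction, and your own practical-significance bar together before reporting anything. A minimal sketch over the response fields shown above; observedLift and the 1% practical threshold are values you supply, not fields of the API response:
// Sketch: only report a winner when a conclusion exists, the observed change
// moves in the metric's winning_direction, and it clears your own practical bar.
// observedLift is supplied by the analyst (e.g. relative improvement, 0.02 = +2%);
// it is not read from a specific response field here.
function shouldDeclareWinner(metric, observedLift, minPracticalLift = 0.01) {
  const hasWinner = Boolean(metric.conclusion && metric.conclusion.winner);

  const movesInWinningDirection =
    (metric.winning_direction === 'increasing' && observedLift > 0) ||
    (metric.winning_direction === 'decreasing' && observedLift < 0);

  const practicallySignificant = Math.abs(observedLift) >= minPracticalLift;

  return hasWinner && movesInWinningDirection && practicallySignificant;
}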
Reach Calculations and Traffic Allocation
The reach object in results provides crucial context that many practitioners overlook:
"reach": {
"baseline_count": 0,
"baseline_reach": 0.0,
"total_count": 0,
"treatment_count": 0,
"treatment_reach": 0.0,
"variations": {}
}
Comparing baseline_reach with treatment_reach shows the actual traffic split, which might differ from your intended allocation due to:
- Audience targeting reducing eligible users
- User profile service sticky bucketing
- Mid-experiment traffic allocation changes
If your reach numbers are dramatically different from expected allocations, investigate potential targeting issues before trusting the statistical results.
Actionable advice: Always check reach calculations before interpreting significance. Look for patterns like consistently low treatment reach, which might indicate targeting problems or implementation issues.
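As a quick sanity check, here is a sketch that derives the actual split from the reach counts shown above and flags large drift from the split you configured. The 5-percentage-point tolerance is an arbitrary example, not an Optimizely recommendation:
// Sketch: flag experiments whose observed traffic split drifts far from the
// intended allocation. The tolerance of 5 percentage points is an arbitrary example.
function checkReach(reach, expectedTreatmentShare, tolerance = 0.05) {
  if (!reach.total_count) {
    console.warn('No users have reached this experiment yet');
    return false;
  }
  const actualTreatmentShare = reach.treatment_count / reach.total_count;
  const drift = Math.abs(actualTreatmentShare - expectedTreatmentShare);

  if (drift > tolerance) {
    console.warn(
      `Treatment reach is ${(actualTreatmentShare * 100).toFixed(1)}%, ` +
      `expected ${(expectedTreatmentShare * 100).toFixed(1)}%; check targeting, ` +
      'sticky bucketing, or mid-experiment allocation changes before trusting significance.'
    );
    return false;
  }
  return true;
}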
Next Steps
Understanding these results page gotchas helps you avoid common pitfalls and make more confident decisions. Remember to always:
- Check the is_stale flag for recent data
- Document forced variations and their impact on statistics
- Verify environment scoping matches your analysis intent
- Distinguish between statistical and practical significance
- Monitor reach calculations for targeting issues
Have you encountered any of these gotchas in your own experiments? What other results page edge cases have caught you off guard? I would love to hear about your experiences and any additional gotchas you have discovered!