Testing AI Models with Feature Flags: LLM Prompt Optimization

Modern AI applications face a unique challenge: how do you A/B test something as dynamic and complex as a large language model? Whether you're choosing between Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro, optimizing prompt engineering, or balancing cost and quality across model tiers, traditional experimentation approaches fall short. Feature flags provide an elegant solution for testing AI configurations without deploying new code.

AI teams struggle with critical decisions: Which frontier model performs better for customer support—Claude Opus 4.8's coding expertise or GPT-5.5's reasoning prowess? Should you pay $30 per million tokens for Opus 4.8 or use Claude Haiku 4.5 at $6 for simpler queries? Can prompt caching reduce costs by 90%? Without proper experimentation, these decisions are made on gut feeling rather than data, potentially costing thousands in wasted API spend.

This guide demonstrates how to use Optimizely Feature Experimentation to A/B test the latest AI models, prompts, and cost optimization strategies. You'll learn how to set up feature flag variations that route traffic to different AI configurations, leverage cost-saving features like prompt caching and batch APIs, measure quality metrics, and make data-driven decisions that can reduce AI costs by 50-67% while maintaining quality.

The 2026 AI Model Landscape

The AI model landscape has matured significantly with three dominant providers releasing flagship models in late 2025. Each model excels at different tasks, making intelligent model selection critical for both quality and cost optimization.

OpenAI GPT-5.5

GPT-5.5 is OpenAI's most capable model series for professional knowledge work. It comes in three variants: Instant (optimized for speed), Thinking (extended reasoning), and Pro (maximum capability).

Key capabilities include a 400,000-token context window, 128,000-token output capacity, 65% fewer hallucinations compared to GPT-4, and 100% accuracy on the AIME 2025 mathematics olympiad. The model is priced at $5 per million input tokens and $30 per million output tokens, with a 90% discount on cached inputs.

GPT-5.5 is best suited for complex reasoning tasks, mathematics, and general-purpose applications where cost efficiency matters.

Anthropic Claude Opus 4.8 and Sonnet 4.6

Claude Opus 4.8 is described as "the best model in the world for coding, agents, and computer use." It achieves 80.9% on SWE-bench Verified (real GitHub issues), making it the leader for autonomous coding tasks. The model delivers flagship performance at 67% lower cost than the previous Opus 4.

Pricing for Opus 4.8 is $5 per million input tokens and $25 per million output tokens. With prompt caching, cache reads cost only $0.50 per million tokens, representing a 90% savings.

Claude Sonnet 4.6 offers the best balance of intelligence, speed, and cost for most use cases at $3 per million input tokens and $15 per million output tokens. Anthropic recommends starting with Sonnet 4.6 for general-purpose applications.

Claude Haiku 4.5 provides a budget option at $1 per million input tokens and $5 per million output tokens, ideal for high-volume, simple queries.

Google Gemini 3.1 Pro

Gemini 3.1 Pro features the longest context window among the three providers at 1 million tokens, native multimodal understanding for images, video, and audio, and wins user preference rankings for helpfulness.

Pricing for contexts under 200,000 tokens is $2 per million input tokens and $12 per million output tokens. For larger contexts, pricing increases to $4 input and $18 output.

Gemini 3.1 Pro excels at multimodal tasks, longest context requirements, and daily assistance scenarios.

Model Selection Decision Matrix

The following table summarizes when to use each model:

Model	Cost per 1M+1M Tokens	Best For	Key Benchmark
Claude Opus 4.8	$30.00	Coding, agents, autonomous tasks	SWE-bench: 80.9%
GPT-5.5	$35.00	Reasoning, mathematics	AIME 2025: 100%
Gemini 3.1 Pro	$14.00	Multimodal, long context	1M token context
Claude Sonnet 4.6	$18.00	General-purpose, balanced	Production default
Claude Haiku 4.5	$6.00	High-volume, simple queries	Fast, cost-efficient

Intelligent routing can save 40-67% by using the right model tier for each query type.

Why Feature Flags for AI Testing

Feature flags transform how teams deploy and test AI models in production. Rather than deploying new code for every model change, feature flags allow instant configuration changes, gradual rollouts, and rapid rollback if issues arise.

Benefits of the Feature Flag Approach

Testing multiple frontier models simultaneously becomes straightforward. You can run Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro in parallel, routing traffic based on your experimental design.

Experimenting with cost optimization features like prompt caching and batch APIs requires no code changes. Simply update the flag configuration to enable or disable these features for specific user segments.

Instant rollback provides a safety net. If a model underperforms or costs spike unexpectedly, you can switch back to your baseline within seconds, not hours.

Gradual traffic ramps minimize risk. Start with 1% of traffic, validate metrics, then progressively increase to 5%, 25%, 50%, and finally 100%.

Segmentation by query complexity or user tier enables intelligent routing. Send simple queries to Haiku 4.5 and complex queries to Opus 4.8, optimizing both quality and cost.

What You Can Test

Model providers and versions offer the most impactful testing opportunities. Compare Claude Opus 4.8 versus GPT-5.5 versus Gemini 3.1 Pro to determine which performs best for your specific use case.

Model tiers within a provider family also warrant testing. The quality difference between Opus 4.8 ($30), Sonnet 4.6 ($18), and Haiku 4.5 ($6) may or may not justify the cost difference for your workload.

Prompt templates significantly impact output quality. Test system prompts, few-shot examples, and structured output formats.

Model parameters like temperature, top_p, and max_tokens affect response creativity and consistency.

Cost optimization strategies including standard API calls, prompt caching, and batch API usage can reduce costs by 50-90% for eligible workloads.

Intelligent routing rules that classify queries by complexity and route to appropriate model tiers can yield 40-67% cost savings.

Feature Flags as AI Configuration Routers

The core pattern for feature flag-based AI testing routes each request through a decision point that determines the AI configuration:

User Request → Feature Flag Decision → Model Selection → Cost Optimization → LLM API Call → Metrics Tracking

flowchart TB
    A[User Request] --> B[Optimizely Feature Flag]
    B -->|50% Control| C[Claude Opus 4.8]
    B -->|25% Var1| D[GPT-5.5]
    B -->|25% Var2| E[Gemini 3.1 Pro]
    C --> F[Apply Caching]
    D --> F
    E --> F
    F --> G[Track Metrics]
    G --> H[Optimizely Results]

When a user triggers an AI interaction, the feature flag's decide() method returns a variation key such as "opus", "gpt55", or "gemini". The application maps this variation to a complete AI configuration object containing the provider, model, prompt, and parameters. Cost optimization logic checks for cache hits or batch eligibility. Finally, the appropriate LLM API is called and comprehensive metrics are tracked back to Optimizely.

Implementation Guide

This section provides complete code examples for implementing AI model testing with Optimizely feature flags.

Setting Up the Feature Flag

Create a new feature flag in Optimizely named ai-model-selection-2026 with the following variations:

The control variation uses Claude Opus 4.8 with prompt caching as your baseline. Add challenger variations for GPT-5.5, Gemini 3.1 Pro, and an intelligent routing option that dynamically selects models based on query complexity.

Set initial traffic allocation to 40% control, 20% GPT-5.5, 20% Gemini 3.1 Pro, and 20% intelligent routing. This distribution provides sufficient data for each variation while maintaining a substantial control group.

Define a custom event named ai_response_generated with event properties for latency_ms, cost_usd, accuracy_score, cache_hit, and tokens_used.

Configuration Map

Define your AI configurations as a map that the feature flag variations reference:

const AI_CONFIGS = {
  'control': {
    provider: 'anthropic',
    model: 'claude-opus-4-8',
    temperature: 0.7,
    max_tokens: 2048,
    useCache: true,
    systemPrompt: 'You are an expert customer support agent...',
    pricing: { input: 5.00, output: 25.00, cache_read: 0.50 }
  },
  'gpt55': {
    provider: 'openai',
    model: 'gpt-5.5',
    temperature: 0.7,
    max_tokens: 2048,
    systemPrompt: 'You are an expert customer support agent...',
    pricing: { input: 5.00, output: 30.00, cache: 0.50 }
  },
  'gemini': {
    provider: 'google',
    model: 'gemini-3.1-pro',
    temperature: 0.7,
    max_tokens: 2048,
    systemPrompt: 'You are an expert customer support agent...',
    pricing: { input: 2.00, output: 12.00 }
  },
  'haiku': {
    provider: 'anthropic',
    model: 'claude-haiku-4-5',
    temperature: 0.5,
    max_tokens: 1024,
    useCache: true,
    systemPrompt: 'You are a concise customer support agent...',
    pricing: { input: 1.00, output: 5.00, cache_read: 0.10 }
  }
};

Main Request Handler

The main function handles the feature flag decision and routes to the appropriate model:

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
import { GoogleGenerativeAI } from '@google/generative-ai';
import { createInstance } from '@optimizely/optimizely-sdk';

const optimizelyClient = createInstance({
  sdkKey: process.env.OPTIMIZELY_SDK_KEY
});

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const googleAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY);

async function getAIResponse(userId, userMessage, userAttributes = {}) {
  const startTime = Date.now();

  const user = {
    id: userId,
    attributes: {
      user_tier: userAttributes.tier || 'free',
      query_complexity: classifyQueryComplexity(userMessage)
    }
  };

  const decision = optimizelyClient.decide(user, 'ai-model-selection-2026');
  const variation = decision.variationKey;
  const config = AI_CONFIGS[variation];

  let response, tokensUsed, cost, cacheHit = false;

  try {
    if (config.provider === 'anthropic') {
      const result = await callClaude(config, userMessage);
      response = result.response;
      tokensUsed = result.tokensUsed;
      cost = result.cost;
      cacheHit = result.cacheHit;
    } else if (config.provider === 'openai') {
      const result = await callGPT(config, userMessage);
      response = result.response;
      tokensUsed = result.tokensUsed;
      cost = result.cost;
      cacheHit = result.cacheHit;
    } else if (config.provider === 'google') {
      const result = await callGemini(config, userMessage);
      response = result.response;
      tokensUsed = result.tokensUsed;
      cost = result.cost;
    }

    const latencyMs = Date.now() - startTime;

    optimizelyClient.track('ai_response_generated', userId, {
      $opt_event_properties: {
        latency_ms: latencyMs,
        cost_usd: cost,
        tokens_used: tokensUsed,
        cache_hit: cacheHit,
        model_provider: config.provider,
        model_name: config.model,
        variation: variation
      }
    });

    return { response, metadata: { variation, latencyMs, cost, tokensUsed, cacheHit } };

  } catch (error) {
    optimizelyClient.track('ai_error', userId, {
      $opt_event_properties: {
        error_type: error.name,
        model_provider: config.provider,
        variation: variation
      }
    });
    throw error;
  }
}

Claude API Call with Prompt Caching

The Claude implementation enables prompt caching for significant cost savings:

async function callClaude(config, userMessage) {
  const requestParams = {
    model: config.model,
    max_tokens: config.max_tokens,
    temperature: config.temperature,
    messages: [{ role: 'user', content: userMessage }]
  };

  if (config.useCache) {
    requestParams.system = [
      {
        type: 'text',
        text: config.systemPrompt,
        cache_control: { type: 'ephemeral' }
      }
    ];
  } else {
    requestParams.system = config.systemPrompt;
  }

  const result = await anthropic.messages.create(requestParams);

  const inputTokens = result.usage.input_tokens;
  const cachedTokens = result.usage.cache_read_input_tokens || 0;
  const outputTokens = result.usage.output_tokens;
  const cacheHit = cachedTokens > 0;

  const inputCost = (inputTokens - cachedTokens) * (config.pricing.input / 1_000_000);
  const cacheCost = cachedTokens * (config.pricing.cache_read / 1_000_000);
  const outputCost = outputTokens * (config.pricing.output / 1_000_000);
  const totalCost = inputCost + cacheCost + outputCost;

  return {
    response: result.content[0].text,
    tokensUsed: inputTokens + outputTokens,
    cost: totalCost,
    cacheHit: cacheHit
  };
}

GPT-5.5 API Call

The GPT-5.5 implementation with automatic caching detection:

async function callGPT(config, userMessage) {
  const result = await openai.chat.completions.create({
    model: config.model,
    max_tokens: config.max_tokens,
    temperature: config.temperature,
    messages: [
      { role: 'system', content: config.systemPrompt },
      { role: 'user', content: userMessage }
    ]
  });

  const inputTokens = result.usage.prompt_tokens;
  const outputTokens = result.usage.completion_tokens;
  const cachedTokens = result.usage.prompt_tokens_details?.cached_tokens || 0;
  const cacheHit = cachedTokens > 0;

  const uncachedInput = inputTokens - cachedTokens;
  const inputCost = uncachedInput * (config.pricing.input / 1_000_000);
  const cacheCost = cachedTokens * (config.pricing.cache / 1_000_000);
  const outputCost = outputTokens * (config.pricing.output / 1_000_000);
  const totalCost = inputCost + cacheCost + outputCost;

  return {
    response: result.choices[0].message.content,
    tokensUsed: inputTokens + outputTokens,
    cost: totalCost,
    cacheHit: cacheHit
  };
}

Query Complexity Classification

Intelligent routing requires classifying queries by complexity:

function classifyQueryComplexity(userMessage) {
  if (userMessage.length < 50) return 'simple';
  if (userMessage.length > 200) return 'complex';

  const simpleKeywords = [
    'track order', 'shipping status', 'return policy',
    'hours', 'location', 'price'
  ];
  const complexKeywords = [
    'refund', 'damaged', 'defective', 'unauthorized',
    'fraud', 'complaint', 'manager'
  ];

  const lower = userMessage.toLowerCase();

  if (complexKeywords.some(kw => lower.includes(kw))) return 'complex';
  if (simpleKeywords.some(kw => lower.includes(kw))) return 'simple';

  return 'medium';
}

Testing Strategies

Several testing strategies address different optimization goals.

Strategy 1: Model Provider Comparison

Test the hypothesis that Claude Opus 4.8 matches or beats GPT-5.5 on accuracy at a lower blended cost ($30 versus $35 per 1M+1M tokens).

Configure a 50/50 split between Claude Opus 4.8 with prompt caching and GPT-5.5 with prompt caching. Run the test for two weeks with all customer support queries.

Measure accuracy via thumbs up/down rate as the primary metric. Track cost per query, latency P95, and cache hit rate as secondary metrics.

If Claude accuracy matches or exceeds GPT-5.5, keep Claude as the default—it delivers equal or better quality at a lower blended cost ($30 versus $35 per 1M+1M tokens).

Strategy 2: Cost Optimization with Prompt Caching

Test whether enabling prompt caching reduces costs by 40-50% without degrading quality.

Split traffic 50/50 between Claude Opus 4.8 standard and Claude Opus 4.8 with prompt caching enabled.

The primary metric is cost per query, with an expected 40-50% reduction. Accuracy must be maintained at baseline levels to confirm caching does not impact quality.

Expected savings with a 50% cache hit rate: cost drops from $0.015 to $0.008 per query. For 100,000 monthly queries, this represents $700 in monthly savings.

Strategy 3: Intelligent Model Routing

Test the hypothesis that routing simple queries to Haiku 4.5 reduces costs by 70% while maintaining acceptable quality.

In the control group, use Claude Opus 4.8 for all queries. In the variation group, route simple queries (estimated 70% of traffic) to Haiku 4.5 and complex queries (30%) to Opus 4.8.

Expected cost comparison per 100,000 queries:

Approach	Simple Queries (70K)	Complex Queries (30K)	Total Cost
All Opus 4.8	$1,050	$450	$1,500
Intelligent Routing	$210 (Haiku)	$450 (Opus)	$660

The intelligent routing approach yields 56% cost savings. Validate that Haiku accuracy on simple queries reaches at least 90% of Opus accuracy.

Strategy 4: Batch API for Non-Urgent Queries

For workloads that can tolerate 1-24 hour latency, such as email support responses, test batch API pricing.

Split email support queries 50/50 between real-time GPT-5.5 at $35.00 per million input and output tokens and batch API GPT-5.5 at $17.50 per million tokens.

The 50% cost savings from batch processing applies only to workloads where delayed responses are acceptable.

Measuring Success

Define success metrics before launching any experiment.

Example OKR: Cost Optimization

Objective: Reduce AI support costs while maintaining quality.

Key Result 1: Reduce cost per query by 40%, from $0.015 to $0.009.

Key Result 2: Maintain accuracy score at or above 85% (thumbs up rate).

Key Result 3: Keep P95 latency under 3 seconds.

Results Analysis Framework

After gathering sufficient data, analyze results across multiple dimensions.

The winner on accuracy may differ from the winner on cost. A model that costs twice as much but delivers only marginally better accuracy may not be the right choice for cost-sensitive applications.

Consider segment performance. A variation may win overall but underperform for specific user segments or query types.

Calculate quality per dollar: accuracy divided by cost. This composite metric helps identify the best value proposition.

Decision Matrix

Scenario	Recommended Action
Variation wins on cost AND accuracy	Roll out to 100%
Variation wins on cost, loses accuracy by less than 5%	Business decision: evaluate cost vs quality tradeoff
Variation wins on cost, loses accuracy by more than 10%	Reject variation
No statistically significant difference	Extend test or decide based on secondary metrics

Real-World Example: E-commerce Customer Support

ShopCo, an e-commerce company processing 1 million support queries per month, faced unsustainable AI costs of $15,000 monthly using Claude Opus 4.8 for all queries.

Baseline Measurement

During the first week, they measured performance with 100% Claude Opus 4.8 without caching: 87% accuracy, 2.8 second P95 latency, and $0.015 per query.

Phase 1: Enable Prompt Caching

Testing prompt caching with a 50/50 split revealed a 48% cache hit rate, reducing cost to $0.008 per query (47% savings) with no change in accuracy.

Phase 2: Intelligent Routing

The final phase implemented intelligent routing based on query complexity:

Query Type	Volume	Model	Cost per Query	Accuracy	Monthly Cost
Simple	650K	Haiku 4.5	$0.003	85%	$1,950
Medium	250K	Sonnet 4.6	$0.009	88%	$2,250
Complex	100K	Opus 4.8	$0.015	92%	$1,500
Total	1M	Mixed	$0.0057	87%	$5,700

Results Summary

Baseline cost with Opus 4.8 and no caching: $15,000 per month.

Final cost with intelligent routing and caching: $5,700 per month.

Total savings: 62%, representing $9,300 per month or $111,600 annually.

Accuracy remained at 87% with no degradation from the optimization.

flowchart TD
    A[User Support Query] --> B{Classify Complexity}
    B -->|Simple 65%| C[Haiku 4.5 $0.003/query]
    B -->|Medium 25%| D[Sonnet 4.6 $0.009/query]
    B -->|Complex 10%| E[Opus 4.8 $0.015/query]
    C --> F[Apply Caching]
    D --> F
    E --> F
    F --> G[Track Metrics]
    G --> H{Monthly Review}
    H -->|Quality Maintained| I[Continue Strategy]
    H -->|Issues Detected| J[Adjust or Rollback]

Best Practices

Leverage Cost Optimization Features

Cache system prompts and static context that remain consistent across queries. Monitor cache hit rate with a target above 40% for meaningful savings. Avoid caching user-specific data that provides no benefit.

Use batch API for email support, analytics, content generation, and overnight jobs. Avoid batch API for real-time chat where latency matters.

Gradual Rollout Schedule

Week 1: 1% traffic to catch catastrophic failures early.

Week 2: 5% traffic to validate metrics and tune thresholds.

Week 3: 25% traffic to gather statistical significance.

Week 4: 50% traffic to accelerate learning.

Week 5: 90% if winning, otherwise rollback.

Automated Rollback Triggers

Define thresholds that trigger automatic rollback:

const ROLLBACK_THRESHOLDS = {
  error_rate: 0.05,
  latency_increase: 1.5,
  cost_spike: 2.0,
  accuracy_drop: 0.10
};

If error rate exceeds 5%, latency increases by 50%, cost doubles, or accuracy drops by 10 percentage points, trigger an automatic rollback to the control variation.

Model Selection Guidelines

For coding and agent tasks, use Claude Opus 4.8.

For math and reasoning tasks, use GPT-5.5.

For multimodal tasks with images or video, use Gemini 3.1 Pro.

For simple, high-volume queries, use Claude Haiku 4.5.

For general-purpose production applications, start with Claude Sonnet 4.6.

Common Pitfalls

Ignoring Thinking Token Costs

GPT-5.5's Thinking variant generates internal reasoning tokens billed at the output rate of $30 per million tokens. A query with 100 visible output tokens may generate 2,000 hidden reasoning tokens, making the actual cost 21 times higher than expected.

Solution: For cost-sensitive tasks, use the Instant variant. When using Thinking, estimate a 10-20x token multiplier.

Not Tracking Cache Hit Rate

Teams often assume 90% caching savings, but actual cache hit rates may be much lower. Expected cost of $0.005 per query with 90% cache hits becomes $0.013 per query with only 15% cache hits.

Solution: Always track cache_hit as a metric. Investigate low hit rates by examining whether system prompts change too frequently or user queries are too unique for caching to benefit.

Testing Too Many Variables Simultaneously

Changing model, prompt, temperature, and caching in a single test makes it impossible to determine which variable caused observed differences.

Solution: Isolate variables. First test models with identical prompts and settings. Then test prompts with the winning model. Then test caching with the winning configuration.

Conclusion

The 2026 AI landscape offers powerful options for every use case. Claude Opus 4.8 leads for coding, GPT-5.5 excels at reasoning, and Gemini 3.1 Pro dominates multimodal applications.

Feature flags enable safe experimentation without code deployments. Cost optimization through prompt caching and intelligent routing can reduce AI costs by 50-67% while maintaining quality. Data-driven decisions based on proper A/B testing consistently outperform gut-feeling model selection.

For teams deploying AI in production, the recommended path forward is:

Set up your first feature flag comparing two or three flagship models.
Enable prompt caching on system prompts for immediate cost reduction.
Run a two-week A/B test on production traffic.
Implement intelligent routing based on query complexity.
Monitor and iterate, re-testing quarterly as models improve.

With proper experimentation infrastructure, you can confidently deploy the right AI model for each task, optimize costs without sacrificing quality, and adapt quickly as the AI landscape continues to evolve.