A/B Testing Emails at Scale: Statistical Rigor for Engineers
Sample sizes, significance testing, and multi-armed bandits for email
Jake Morrison
Growth Engineer
The Problem with Most Email A/B Tests
Here is a scenario that plays out at thousands of companies every week: a marketer sends variant A to 500 people and variant B to 500 people. Variant A gets a 22% open rate, variant B gets 24%. The team declares variant B the winner and rolls it out. The problem? With those sample sizes, the difference is almost certainly due to random chance. You need roughly 7,000 recipients per variant to detect a 2-percentage-point difference in open rates with 95% confidence and 80% power.
Engineers building email systems need to apply the same statistical rigor they would use in any other experiment. This means calculating required sample sizes before the test, defining success metrics upfront, running tests to completion rather than peeking at results, and applying proper significance testing before declaring a winner.
Calculating Sample Sizes
The minimum sample size for an email A/B test depends on three factors: your baseline conversion rate, the minimum detectable effect (MDE) you care about, and your desired statistical power. The standard formula uses a significance level of 0.05 (95% confidence) and power of 0.80.
```typescript
// Sample size calculator for email A/B tests
function calculateSampleSize(
  baselineRate: number,   // e.g., 0.22 for 22% open rate
  mde: number,            // e.g., 0.02 for 2 percentage points
  alpha: number = 0.05,   // significance level (matches zAlpha below)
  power: number = 0.80    // statistical power (matches zBeta below)
): number {
  const zAlpha = 1.96; // z-score for 95% confidence (two-tailed)
  const zBeta = 0.84;  // z-score for 80% power
  const p1 = baselineRate;
  const p2 = baselineRate + mde;
  const pBar = (p1 + p2) / 2;
  const numerator = (zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2;
  const denominator = (p2 - p1) ** 2;
  return Math.ceil(numerator / denominator);
}

// Example: detecting a 2pp lift from 22% baseline
console.log(calculateSampleSize(0.22, 0.02));
// Output: 6942 per variant
```

This means if your email list has fewer than roughly 14,000 recipients, you cannot reliably detect a 2-percentage-point difference in open rates. You either need to accept a larger MDE (say, 5 percentage points, which requires around 1,160 per variant) or accumulate results across multiple sends.
Adjusting for Click-Through Rates
Click-through rates are typically much lower than open rates (2-5% vs 20-30%), which means that detecting a proportionally meaningful difference requires much larger samples. A test trying to detect a 1-percentage-point improvement in click-through rate from a 3% baseline requires approximately 5,300 recipients per variant. Most teams simply do not have enough volume for click-through A/B tests on individual sends.
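Under the same assumptions as the calculator above (95% confidence, 80% power), the click-through figure can be reproduced standalone; the formula is inlined here so the snippet runs on its own:

```typescript
// Same two-proportion sample-size formula, applied to a click-through
// test: 3% baseline, 1pp minimum detectable effect
const zAlpha = 1.96; // 95% confidence (two-tailed)
const zBeta = 0.84;  // 80% power
const p1 = 0.03;
const p2 = 0.04;
const pBar = (p1 + p2) / 2;
const nPerVariant = Math.ceil(
  ((zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) /
  (p2 - p1) ** 2
);
console.log(nPerVariant); // 5295
```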
Significance Testing Done Right
After collecting your data, apply a two-proportion z-test to determine whether the observed difference is statistically significant. Do not rely on a simple comparison of percentages. The test accounts for the variance inherent in proportions and tells you how likely a difference at least as large as the one you observed would be if the two variants actually performed identically.
```typescript
// Two-proportion z-test for email A/B results
function abTestSignificance(
  visitorsA: number, conversionsA: number,
  visitorsB: number, conversionsB: number
): { zScore: number; pValue: number; significant: boolean } {
  const pA = conversionsA / visitorsA;
  const pB = conversionsB / visitorsB;
  const pPool = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / visitorsA + 1 / visitorsB));
  const z = (pB - pA) / se;
  // Approximate two-tailed p-value
  const p = 2 * (1 - normalCDF(Math.abs(z)));
  return { zScore: z, pValue: p, significant: p < 0.05 };
}

// Standard normal CDF via the Abramowitz & Stegun polynomial approximation
function normalCDF(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 +
    t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - p : p;
}

// Example usage:
const result = abTestSignificance(4000, 880, 4000, 960);
// pA = 22%, pB = 24%
console.log(result);
// ≈ { zScore: 2.13, pValue: 0.034, significant: true }
```

A critical mistake is "peeking" at results before the test reaches its required sample size. Every time you check results and consider stopping early, you inflate your false positive rate. If you check five times during a test, your effective significance level jumps from 5% to roughly 14%. Either commit to a fixed sample size or use sequential testing methods.
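The inflation from peeking is easy to demonstrate empirically. The sketch below is a hypothetical A/A simulation (all parameters are invented): both variants share the same true rate, a z-test is applied at five interim checkpoints, and the test stops the moment any checkpoint looks "significant." It counts how often a winner is declared even though no real difference exists.

```typescript
// Hypothetical A/A simulation: measure the false positive rate when
// stopping at the first of five interim looks with |z| > 1.96
function simulatePeeking(trials: number, nPerVariant: number, trueRate: number): number {
  // Five equally spaced interim looks, ending at the full sample size
  const looks = [0.2, 0.4, 0.6, 0.8, 1.0].map(f => Math.floor(f * nPerVariant));
  let falsePositives = 0;
  for (let t = 0; t < trials; t++) {
    let convA = 0, convB = 0, sent = 0;
    for (const n of looks) {
      // Deliver emails up to this checkpoint (one per variant per step)
      for (; sent < n; sent++) {
        if (Math.random() < trueRate) convA++;
        if (Math.random() < trueRate) convB++;
      }
      // Two-proportion z-test at this checkpoint
      const pPool = (convA + convB) / (2 * n);
      const se = Math.sqrt(pPool * (1 - pPool) * (2 / n));
      if (se > 0 && Math.abs(convB - convA) / n / se > 1.96) {
        falsePositives++; // declared a "winner" that does not exist
        break;
      }
    }
  }
  return falsePositives / trials;
}

console.log(simulatePeeking(5000, 2000, 0.22)); // well above the nominal 0.05
```

Repeated runs typically land in the low teens, consistent with the roughly 14% figure quoted above.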
Multi-Armed Bandits for Email Optimization
Traditional A/B testing has a fundamental limitation: it sacrifices conversions during the test period by sending traffic to the losing variant. Multi-armed bandit algorithms address this by dynamically allocating more sends to the better-performing variant while still exploring alternatives.
The Thompson Sampling approach works well for email. For each variant, maintain a Beta distribution representing your belief about its true conversion rate. Before each send, sample from each distribution and send the variant with the highest sample. Over time, the algorithm naturally converges on the best variant while minimizing the cost of exploration.
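As a concrete sketch before the full implementation (all rates and counts below are invented for illustration), the whole loop fits in a few lines. The Beta draws here use an order-statistic trick that is exact when both parameters are integers, as they are with a Beta(1,1) prior and success/failure counts:

```typescript
// Hypothetical simulation: two variants with true open rates of 20% and
// 30%, allocated by Thompson Sampling over 2,000 sends
function betaDraw(a: number, b: number): number {
  // For integer a and b, the a-th smallest of a + b - 1 uniforms
  // is Beta(a, b)-distributed
  const u = Array.from({ length: a + b - 1 }, () => Math.random());
  u.sort((x, y) => x - y);
  return u[a - 1];
}

const trueRates = [0.20, 0.30];
const arms = trueRates.map(() => ({ successes: 1, failures: 1 })); // Beta(1,1) priors
const sends = [0, 0];

for (let i = 0; i < 2000; i++) {
  // Sample each arm's posterior and send the variant with the highest draw
  const draws = arms.map(arm => betaDraw(arm.successes, arm.failures));
  const pick = draws[1] > draws[0] ? 1 : 0;
  sends[pick]++;
  if (Math.random() < trueRates[pick]) arms[pick].successes++;
  else arms[pick].failures++;
}

console.log(sends); // the 30% variant receives the large majority of sends
```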
```typescript
// Thompson Sampling for email variant selection
class ThompsonBandit {
  private variants: Map<string, { successes: number; failures: number }>;

  constructor(variantIds: string[]) {
    this.variants = new Map();
    variantIds.forEach(id => {
      this.variants.set(id, { successes: 1, failures: 1 }); // Beta(1,1) prior
    });
  }

  selectVariant(): string {
    let bestSample = -1;
    let bestVariant = "";
    for (const [id, stats] of this.variants) {
      const sample = betaSample(stats.successes, stats.failures);
      if (sample > bestSample) {
        bestSample = sample;
        bestVariant = id;
      }
    }
    return bestVariant;
  }

  recordResult(variantId: string, success: boolean): void {
    const stats = this.variants.get(variantId)!;
    if (success) stats.successes++;
    else stats.failures++;
  }
}

// Beta(a, b) draw via order statistics: for integer a and b (guaranteed
// here by the Beta(1,1) prior and integer counts), the a-th smallest of
// a + b - 1 uniforms is Beta(a, b)-distributed. At very high volume,
// swap in a gamma-based sampler to avoid the sort.
function betaSample(a: number, b: number): number {
  const u = Array.from({ length: a + b - 1 }, () => Math.random());
  u.sort((x, y) => x - y);
  return u[a - 1];
}
```

Brew's AI-Driven Testing
Platforms like brew.new implement multi-armed bandit testing natively. When you configure an A/B test in Brew, you can choose between traditional fixed-horizon testing and adaptive bandit mode. In bandit mode, Brew automatically shifts send volume toward the better-performing variant after an initial exploration phase, typically reducing opportunity cost by 30-40% compared to traditional 50/50 splits.
SendGrid offers basic A/B testing with manual winner selection, which works for teams with straightforward testing needs but lacks the adaptive optimization of bandit approaches. For high-volume senders processing millions of emails per month, the efficiency gains from bandit algorithms translate directly to revenue.
Sequential Testing for Continuous Experimentation
If you cannot commit to a fixed sample size upfront, sequential testing methods let you check results at predetermined intervals without inflating your false positive rate. The most common approach uses alpha-spending functions (like the O'Brien-Fleming method) that allocate your significance budget across multiple interim analyses.
For example, with three interim looks and a final analysis, O'Brien-Fleming would use significance thresholds of 0.0001, 0.004, 0.019, and 0.043 at each stage. This means you can stop the test early if the results are overwhelmingly clear, but the bar for early stopping is very high. This approach is particularly useful for high-volume email programs where a single campaign might take days to fully deliver.
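A minimal sketch of such an analysis plan (the function name `evaluateLook` is illustrative; the thresholds are the approximate O'Brien-Fleming values quoted above, and in practice each interim p-value would come from the two-proportion z-test at that look):

```typescript
// Approximate O'Brien-Fleming thresholds for three interim looks
// plus a final analysis
const thresholds = [0.0001, 0.004, 0.019, 0.043];

function evaluateLook(look: number, pValue: number): "stop" | "continue" {
  // look is 0-indexed; stop early only if the interim p-value clears
  // the (much stricter) threshold for that stage
  return pValue < thresholds[look] ? "stop" : "continue";
}

console.log(evaluateLook(0, 0.02)); // "continue": 0.02 would pass a fixed
                                    // test, but not the first interim bar
console.log(evaluateLook(3, 0.03)); // "stop": final analysis, 0.03 < 0.043
```

Note how a p-value of 0.02 that would count as significant in a fixed-horizon test does not justify stopping at the first look.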
Putting It All Together
Before launching any email A/B test, follow this checklist: (1) Define your primary metric (open rate, click-through rate, or conversion rate). (2) Calculate the required sample size for your minimum detectable effect. (3) Verify that you have enough volume to reach that sample size within a reasonable timeframe. (4) Decide between fixed-horizon testing, sequential testing, or multi-armed bandits based on your volume and goals. (5) Commit to the analysis plan before seeing any results.
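Steps (2) and (3) of the checklist translate naturally into a pre-launch guard. The sketch below (the function name `canRunTest` and its parameters are illustrative, not from any library) inlines the same sample-size formula used earlier and refuses to start a test the list cannot support:

```typescript
// Pre-launch volume check: can this list reach the required sample size?
function canRunTest(listSize: number, baselineRate: number, mde: number): boolean {
  const zAlpha = 1.96; // 95% confidence (two-tailed)
  const zBeta = 0.84;  // 80% power
  const p1 = baselineRate;
  const p2 = baselineRate + mde;
  const pBar = (p1 + p2) / 2;
  const perVariant = Math.ceil(
    ((zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
      zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
  );
  return listSize >= 2 * perVariant; // both variants must reach the target
}

console.log(canRunTest(10000, 0.22, 0.02)); // false: ~13,900 recipients needed
console.log(canRunTest(20000, 0.22, 0.02)); // true
```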
For most developer-focused email programs, a pragmatic approach is to use traditional A/B tests for subject line optimization (where open rate differences tend to be large and detectable) and bandit algorithms for content optimization (where differences are smaller and the cost of suboptimal variants is higher). Whichever method you choose, resist the urge to declare winners based on gut feeling. Let the math decide.
Jake builds growth loops with email at the center. He writes about sequences, analytics, and the strategies that move metrics.