What Is In-App A/B Testing?
In-app A/B testing is the practice of showing different versions of a feature, screen, or flow to different user groups and measuring which version performs better. Unlike store listing experiments (which test screenshots and descriptions), in-app tests evaluate the actual product experience.
For example, you might test whether a green or orange "Subscribe" button converts better, whether showing prices with or without a free trial mention affects revenue, or whether a simplified onboarding flow improves 7-day retention.
The key distinction from guesswork: A/B testing gives you statistically valid evidence for product decisions.
Why Mobile A/B Testing Is Different
Mobile experiments have unique constraints compared to web:
- No instant deploys. Unless you use server-driven UI or feature flags, changes require a new app version and store review.
- Smaller sample sizes. Most apps have fewer daily users than websites, so reaching statistical significance takes longer.
- Version fragmentation. Users update at different speeds, so multiple app versions may be running your experiment simultaneously.
- Platform differences. iOS and Android users often behave differently. Always segment results by platform.
Core Concepts
Feature Flags
Feature flags (also called feature toggles) are the foundation of mobile A/B testing. They let you enable or disable features remotely without shipping a new build.
A simple feature flag implementation:
- Your app checks a remote configuration on launch
- The config assigns the user to group A or group B
- The app renders the appropriate variant
- Analytics events track the user's behavior
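The assignment step above is usually done with deterministic hash-based bucketing, so a user lands in the same group on every launch without the server storing anything. A minimal sketch (the function and names here are illustrative, not any specific SDK's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into 'control' or 'variant'.

    Hashing the user ID together with the experiment name keeps
    assignment stable across sessions and independent across
    experiments. Illustrative sketch, not a specific vendor's API.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "control" if bucket < split else "variant"
```

Because the hash is deterministic, the same user sees the same variant on every launch, and changing the experiment name reshuffles assignments for the next test.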
Statistical Significance
Statistical significance tells you whether the difference between variants is real or just random noise. The industry standard threshold is 95% confidence (p-value < 0.05).
To reach significance, you need a sufficient sample size. A rough guide (95% confidence, 80% power, two-sided test):
| Baseline conversion | Minimum detectable effect | Required sample per variant |
|---|---|---|
| 5% | 20% relative lift (5% to 6%) | ~8,200 |
| 10% | 10% relative lift (10% to 11%) | ~14,800 |
| 20% | 5% relative lift (20% to 21%) | ~25,600 |
If your app has 1,000 DAU split 50/50, collecting ~14,800 samples per variant (the 10% baseline row) takes about 30 days. Plan accordingly.
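These rough figures can be computed with the standard two-proportion sample size formula using only the Python standard library (a sketch; dedicated platforms and online calculators do this for you):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for detecting a move from conversion
    rate p1 to p2 with a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2
    return ceil(n)
```

Note how the denominator is the squared absolute difference: halving the detectable effect roughly quadruples the required sample, which is why small apps should test big changes.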
Guardrail Metrics
A guardrail metric is something you do not want to damage while optimizing your primary metric. For example, if you are testing a more aggressive upsell screen that increases conversion, your guardrail might be uninstall rate or session length. If the variant wins on conversion but loses on retention, that is a false victory.
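The decision rule can be made explicit in a few lines (the function name and the 2% tolerance are illustrative, not a standard):

```python
def ship_decision(primary_lift: float, guardrail_lifts: dict,
                  max_guardrail_drop: float = 0.02) -> bool:
    """Ship only if the primary metric improved AND no guardrail
    dropped by more than the tolerance. Lifts are relative deltas
    versus control (e.g. +0.08 means +8%). Illustrative threshold."""
    if primary_lift <= 0:
        return False
    return all(lift >= -max_guardrail_drop
               for lift in guardrail_lifts.values())
```

A variant with +8% conversion but -5% retention is rejected: `ship_decision(0.08, {"d7_retention": -0.05})` returns `False`.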
Running an Experiment: Step by Step
1. Define a Clear Hypothesis
Bad: "Let's test the new onboarding." Good: "Reducing onboarding from 5 screens to 3 will increase completion rate by 15% without reducing 7-day retention."
2. Choose Your Primary Metric
Pick exactly one primary metric. Common choices:
- Conversion rate (trial starts, purchases, sign-ups)
- Retention (day 1, day 7, day 30)
- Engagement (sessions per week, feature adoption)
- Revenue (ARPU, LTV)
3. Calculate Required Sample Size
Use a sample size calculator (Evan Miller's is the standard). Input your baseline rate, minimum detectable effect, and desired confidence level. If the required sample size exceeds what you can collect in 4 weeks, either increase the minimum detectable effect or test on a higher-traffic screen.
4. Implement with Feature Flags
Use your chosen tool to create the experiment, define variants, and set allocation percentages. Most teams start with a 50/50 split. For risky changes, use 90/10 (control/variant) to limit exposure.
5. Monitor in Real Time
Check daily for:
- Sample ratio mismatch (does the observed split match your intended allocation?)
- Crash rate differences between variants
- Guardrail metric degradation
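The first check, sample ratio mismatch, can be automated with a one-degree-of-freedom chi-square test (a sketch; experimentation platforms typically run this for you):

```python
def srm_detected(n_control: int, n_variant: int,
                 expected_control_share: float = 0.5) -> bool:
    """Chi-square test (df=1) for sample ratio mismatch. Returns True
    if the observed split deviates from the intended allocation at
    p < 0.001, the strict threshold conventionally used for SRM."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    return chi2 > 10.83  # chi-square critical value, df=1, alpha=0.001
```

For a 50/50 experiment, a 5,000 vs 5,100 split is within normal variation, while 5,000 vs 5,800 trips the check and suggests a bug in assignment or logging.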
6. Analyze and Decide
Wait until you reach your pre-calculated sample size. Do not peek at results early and stop the test the moment it "looks significant"; early stopping inflates the false positive rate (the peeking problem).
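Once the planned sample is in, a pooled two-proportion z-test gives the p-value for a conversion difference. A standard-library sketch (dedicated platforms compute this automatically):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_pvalue(conversions_a: int, n_a: int,
                          conversions_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion
    rates, using the pooled two-proportion z-test."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A p-value below 0.05 clears the 95% confidence bar discussed earlier, but only if the test actually ran to its pre-calculated sample size.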
Tools for Mobile A/B Testing
Firebase Remote Config + A/B Testing
Free, integrated with Firebase Analytics. Good for simple experiments. Limited statistical rigor compared to dedicated platforms.
Statsig
Full experimentation platform with automatic significance calculations, warehouse-native mode, and session replay. Free tier supports up to 1M metered events.
LaunchDarkly
Enterprise feature flag platform with built-in experimentation. Strong SDK support for iOS, Android, React Native, and Flutter.
Amplitude Experiment
Tight integration with Amplitude Analytics. Good for teams already using Amplitude for product analytics.
Optimizely
Industry veteran with robust statistical engine. Full Stack product supports mobile SDKs.
Common Mistakes
Testing too many things at once. If you change the button color, text, and position simultaneously, you cannot attribute the result to any single change. Change one variable at a time.
Stopping tests early. Reaching 80% confidence on day 3 does not mean the result is real. Wait for your full sample size.
Ignoring platform segmentation. A feature that wins on iOS may lose on Android. Always check platform-level results.
Not accounting for novelty effect. Users sometimes engage more with something simply because it is new. Run experiments for at least 2 weeks to let the novelty wear off.
Testing trivial changes on small audiences. If you have 500 DAU, do not test whether "Sign Up" vs "Get Started" performs better. You will never reach significance. Focus on big, structural changes.