A/B Testing with SHAP: From Black Box to Glass Box
Uncover the True Impact of Your Web Changes Using Explainable AI
In the competitive landscape of e-commerce, understanding the "why" behind user behavior is crucial for success. While traditional A/B testing shows us what works, it often leaves us wondering why. A product page redesign might show a 15% increase in conversions, but which specific changes drove this improvement? Did some changes actually hurt conversion rates? How did different elements work together to influence purchasing decisions?
This is where SHAP enters the scene. Instead of just telling us that our changes worked, SHAP (SHapley Additive exPlanations) acts like a detective, investigating each change's contribution to success. It breaks down that 15% improvement into precise measurements: the new button location added 42%, while having too many images actually reduced success by 15%. Now that's the kind of insight we can actually use.
What You'll Learn
How to go beyond simple "it worked/didn't work" test results
Understanding which changes actually drove conversion rate improvements
Detecting when changes hurt rather than helped
Using SHAP to measure the impact of each change
How to identify and leverage feature interactions
Implementing changes based on data-driven insights
The Challenge: Beyond Simple Conversion Metrics
The Testing Dilemma
In our recent test of a web page, we faced a common dilemma. Like many teams, we wanted to improve fast, so we tested multiple changes at once:
Button Placement: We moved the main action button from the bottom to the top of the page, making it immediately visible
Price Display Style: We experimented with a larger, more prominent price display, including clearer discount information
Mobile-Friendly Improvements: We redesigned the layout to work better on phones, with easier navigation and better touch targets
Image Layout: We adjusted how product images were displayed, testing different sizes and arrangements
Checkout Process: We streamlined the steps needed to complete an action, removing unnecessary fields
Before we dive in, let's clarify some key concepts:
A/B Testing is like giving users two different versions of your website and seeing which one works better. Imagine having two ice cream shops with different layouts - one with the menu at the entrance, another with it above the counter. Which gets more sales? That's A/B testing.
SHAP (SHapley Additive exPlanations) is a tool that helps us understand why something worked. Think of it as a detective that can tell you not just that your new shop layout increased sales, but exactly how much each change (menu position, lighting, seating) contributed to that success.
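The Shapley value behind SHAP comes from cooperative game theory, and for a tiny "game" it can be computed exactly by hand. Here is a minimal sketch, not from the article's code, using a made-up ice cream shop game: the `menu` move is worth 30 extra sales, `lighting` 10, and together they add a 5-sale synergy. All names and numbers are illustrative.

```python
import math
from itertools import combinations

def shapley_values(players, value):
    """Exact Shapley values for a tiny cooperative game.
    `value` maps a set of players to that coalition's payoff."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                s = set(coalition)
                # Shapley weight for a coalition of size r
                weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical ice-cream-shop game: menu move worth 30 sales, lighting 10,
# plus a 5-sale synergy when both are present
def sales(coalition):
    v = 0
    if 'menu' in coalition:
        v += 30
    if 'lighting' in coalition:
        v += 10
    if 'menu' in coalition and 'lighting' in coalition:
        v += 5
    return v

print(shapley_values(['menu', 'lighting'], sales))
# The 5-sale synergy is split fairly: menu gets 32.5, lighting 12.5, summing to 45
```

Note how the synergy is divided evenly between the two changes: that "fair split of interactions" is exactly what SHAP does at scale, per feature and per prediction.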
Now that you know the rules of the game, are you ready? Let's dive into the web page scenario.
How SHAP Helps Understand Results
SHAP (SHapley Additive exPlanations) helps untangle these results. It measures how each change contributes to success, both alone and in combination with other changes. It breaks down complex changes into understandable pieces. Think of it as taking apart a complex machine to see exactly how each gear and lever contributes to the whole. Let’s get our hands dirty by developing a code to create a sample dataset for the case study.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap

# Generate test data
def generate_test_data(n_samples=50000):
    """Simulate web page test data with known effects"""
    data = pd.DataFrame({
        'time_on_page': np.random.normal(60, 20, n_samples),
        'button_location': np.random.choice(['top', 'middle', 'bottom'], n_samples),
        'price_style': np.random.choice(['large', 'medium', 'small'], n_samples),
        'image_count': np.random.randint(1, 10, n_samples),
        'mobile_score': np.random.uniform(0.5, 1.0, n_samples)
    })
    # Define known effects
    success_prob = (
        0.2 +  # baseline
        0.42 * (data['button_location'] == 'top') +
        0.35 * (data['price_style'] == 'large') +
        -0.15 * (data['image_count'] > 5) +
        0.18 * ((data['button_location'] == 'top') &
                (data['price_style'] == 'large')) +
        0.28 * (data['mobile_score'] > 0.8)
    )
    # Convert probabilities into binary success outcomes
    data['success'] = np.random.binomial(1, np.clip(success_prob, 0, 1))
    return data
Understanding Our Test Data Generation
To demonstrate how SHAP analyzes test results, we first need test data. Just like a well-designed scientific experiment, we need to create data that represents real user behavior while controlling for specific variables we want to study. Let's break down how we create this data and the assumptions we make.
Creating Test Variables
First, we simulate different aspects of our web page. Think of this as setting up a controlled experiment where we can measure exactly how each element affects user behavior. Each variable simulates a real metric:
time_on_page: Average time spent is 60 seconds, varying by 20 seconds (normal distribution)
Similar to how real users might spend anywhere from 40 to 80 seconds on a page
We use a normal distribution because that's how real user behavior typically varies
button_location: Button can be at top, middle, or bottom with equal chance
Like testing different positions for the "Buy Now" button
Each position has an equal probability to simulate unbiased testing
price_style: Price display can be large, medium, or small with equal chance
Represents different ways of showing prices to users
Could be font size, color contrast, or prominence of display
image_count: Pages show between 1 and 9 images
Simulates different amounts of visual content
Range chosen based on typical product page layouts
mobile_score: Mobile optimization score ranges from 50% to 100%
Represents how well the page works on mobile devices
Higher scores mean better mobile experience
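The sampling choices above can be sketched in a few lines. This version uses NumPy's newer `default_rng` API (seeded for reproducibility) rather than the article's legacy `np.random` calls, so the exact draws differ but the distributions match:

```python
import numpy as np

# Seeded generator so the sketch is reproducible
rng = np.random.default_rng(42)
n = 10_000

time_on_page = rng.normal(60, 20, n)                          # seconds, ~N(60, 20)
button_location = rng.choice(['top', 'middle', 'bottom'], n)  # uniform over positions
image_count = rng.integers(1, 10, n)                          # 1..9 inclusive
mobile_score = rng.uniform(0.5, 1.0, n)                       # 50%..100%

# Sanity checks on the simulated distributions
print(round(float(time_on_page.mean())),                 # close to 60
      int(image_count.min()), int(image_count.max()),    # 1 and 9
      round(float(mobile_score.mean()), 2))              # close to 0.75
```

Checking means and ranges like this is a cheap way to confirm a simulator matches its stated assumptions before any analysis runs on top of it.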
Calculating Success Probability
Now comes the interesting part. We calculate how likely each user is to succeed (like making a purchase) based on these factors. Let's look at a practical example:
Base case: 20% success chance
With top button: +42% (total: 62%)
With large price: +35% (total: 97%)
With 6 images: -15% (total: 82%)
With both top button and large price: +18% (total: 100%)
With good mobile score: +28% (total: 100%, capped)
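The arithmetic above can be checked in a few lines. A worked example for one hypothetical session that has every effect active at once:

```python
# One hypothetical session: top button, large price, 6 images, mobile score 0.85
p = 0.20          # baseline
p += 0.42         # button at the top
p += 0.35         # large price style
p -= 0.15         # more than 5 images
p += 0.18         # interaction: top button AND large price
p += 0.28         # mobile score above 0.8
p = min(max(p, 0.0), 1.0)   # same effect as np.clip for a single value
print(p)  # 1.0 -- the raw 128% total is capped at a valid probability
```

The cap matters: without it, stacked effects would produce impossible probabilities above 1, and the binomial draw in the next step would fail.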
After calculating these probabilities, we need to convert them into realistic yes/no outcomes that mirror real-world user behavior.
Converting to Actual Success
Finally, we turn these probabilities into actual yes/no outcomes.
np.clip: Ensures probabilities stay between 0 and 1
np.random.binomial: Converts probability into 0 (fail) or 1 (success)
For example, if success_prob is 0.75:
75% chance to get 1 (success)
25% chance to get 0 (fail)
This creates realistic test data where we know exactly which factors influence success and by how much, letting us validate SHAP's findings against true effects.
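Here is a minimal sketch of that conversion step in isolation, using a seeded `default_rng` (an assumption of this sketch; the article's code uses `np.random` directly) and a tiny hand-picked probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw scores can fall outside [0, 1] once several effects stack up
success_prob = np.array([0.75, 0.20, 1.30, -0.10])
clipped = np.clip(success_prob, 0, 1)   # 1.30 -> 1.0, -0.10 -> 0.0

# One coin flip per session: 1 with probability `clipped`, else 0
outcomes = rng.binomial(1, clipped)
print(clipped.tolist(), outcomes.tolist())
```

Sessions with probability 1.0 always succeed and those with 0.0 always fail, while in-between values come out as realistic noisy yes/no outcomes.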
Our Test Setup
To ensure thorough analysis, we tracked multiple metrics beyond just overall success:
Time spent on page
Indicates user engagement level
Helps understand browsing patterns
Mobile optimization score
Measures how well the page performs on mobile
Ranges from 50% to 100% optimization
User interactions
Button placement effects
Price display visibility impact
Image presentation
Number of images shown
Impact on user engagement
Using a dataset of 50,000 simulated sessions (with 1,000 used for detailed SHAP analysis), we can understand how each element contributes to overall success.
# Prepare data for analysis
data = generate_test_data()
X = pd.get_dummies(data.drop('success', axis=1))
y = data['success']

# Store column names OUTSIDE the function
feature_names = X.columns

# Create explainer
def predict(X):
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X, columns=feature_names)  # Use stored feature_names
    return (0.42 * (X['button_location_top'].astype(float)) +
            0.35 * (X['price_style_large'].astype(float)) +
            -0.15 * (X['image_count'].astype(float) > 5) +
            0.28 * (X['mobile_score'].astype(float) > 0.8))

background = shap.sample(X, 100)
explainer = shap.KernelExplainer(predict, background)
shap_values = explainer.shap_values(X[:1000])

# 1. SHAP Summary Plot
plt.figure(figsize=(12, 6))
shap.summary_plot(shap_values, X[:1000], show=False)
plt.title('Impact of Each Change')
plt.tight_layout()
plt.show()
The presented plot is a SHAP summary plot, a powerful visualization that shows how each feature impacts our test results.
The plot tells us three key things. First, feature importance: features are ordered by impact (most important at the top), and the spread of SHAP values shows the magnitude of impact; wider spreads indicate stronger effects.
It also demonstrates the direction of impact: points to the right of center indicate a positive impact on success, points to the left a negative one, and the distance from center shows how strong that impact is.
And last but not least, it shows the value relationships: red dots mark high feature values and blue dots mark low feature values.
Looking at our data, we see:
The button location at the top shows red dots on the right (positive impact)
High image counts show blue dots on the left (negative impact)
Mobile scores show a gradient from blue to red (linear relationship)
This helps understand:
Which changes matter most
How feature values affect outcomes
Where to focus optimization efforts
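The ordering the summary plot uses is simply the mean absolute SHAP value per feature, which is easy to reproduce by hand. A toy sketch with a made-up 3x4 SHAP matrix (in the article this matrix comes from `explainer.shap_values`; the numbers below are illustrative):

```python
import numpy as np

# Made-up stand-in for an (n_samples, n_features) SHAP matrix
feature_names = ['button_location_top', 'image_count', 'mobile_score', 'time_on_page']
shap_matrix = np.array([
    [ 0.30, -0.10,  0.15,  0.01],
    [-0.12,  0.08, -0.09, -0.02],
    [ 0.28, -0.11,  0.18,  0.00],
])

# Mean |SHAP| per feature is the summary plot's importance ordering
importance = np.abs(shap_matrix).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)
```

Taking absolute values first matters: a feature that pushes strongly in both directions would average to near zero otherwise, hiding its real influence.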
# 2. True vs SHAP Effects Comparison
# Step 1: Define true effects
true_effects = {
'Button Location': 0.42,
'Price Style': 0.35,
'Mobile Score': 0.28,
'Image Count': -0.15
}
# Step 2: Calculate SHAP effects
shap_effects = {
'Button Location': float(np.abs(shap_values).mean(0)[X.columns.str.contains('button_location')].sum()),
'Price Style': float(np.abs(shap_values).mean(0)[X.columns.str.contains('price_style')].sum()),
'Mobile Score': float(np.abs(shap_values).mean(0)[X.columns == 'mobile_score'].sum()),
'Image Count': float(np.abs(shap_values).mean(0)[X.columns == 'image_count'].sum())
}
# Step 3: Create comparison plot
x = np.arange(len(true_effects))
width = 0.35  # Width of bars
fig, ax = plt.subplots(figsize=(12, 6))
rects1 = ax.bar(x - width/2, list(true_effects.values()), width, label='True Effect', color='skyblue')
rects2 = ax.bar(x + width/2, list(shap_effects.values()), width, label='SHAP Effect', color='lightgreen')
# Add labels and title
ax.set_ylabel('Effect Size')
ax.set_xlabel('Changes')
ax.set_title('True vs SHAP-discovered Effects')
ax.set_xticks(x)
ax.set_xticklabels(list(true_effects.keys()), rotation=45)
ax.legend()
# Add value labels on bars
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{height:.2f}',
                    xy=(rect.get_x() + rect.get_width()/2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
This code creates a bar chart comparing what we know to be true (our designed effects) against what SHAP discovered. Let's break it down:
Step 1: Define True Effects
At this step, we see the effects we built into our test data.
Step 2: Calculate SHAP Effects
Here we:
Take the mean of absolute SHAP values
Sum effects for categorical variables (like button_location)
Get single values for numeric variables
Step 3: Create Comparison Plot
This builds a grouped bar chart with:
A pair of bars for each change
One bar for the true effect and one for the SHAP-discovered effect
Step 4: Visualization
The resulting plot shows two bars for each change: the blue bars show true effects (what we designed) and the green bars show SHAP effects (what was discovered), with the exact values labeled on each bar.
This visualization helps validate SHAP's effectiveness by comparing its discoveries against known truth, building confidence in its use for real-world analysis where true effects are unknown.
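One way to make that validation concrete is to flag any change whose SHAP estimate drifts too far from its designed effect. A sketch with hypothetical recovered values (the true effects match the dict in the code above; the SHAP estimates here are illustrative, stated as absolute effect sizes):

```python
# Designed effect sizes (absolute values) vs hypothetical SHAP estimates
true_effects = {'Button Location': 0.42, 'Price Style': 0.35,
                'Mobile Score': 0.28, 'Image Count': 0.15}
shap_estimates = {'Button Location': 0.39, 'Price Style': 0.33,
                  'Mobile Score': 0.26, 'Image Count': 0.13}

# Flag any change where the estimate drifts more than 20% from truth
drifted = [name for name, truth in true_effects.items()
           if abs(shap_estimates[name] - truth) / truth > 0.20]
print(drifted)  # [] -- every estimate tracks its designed effect
```

A threshold check like this turns "the bars look close" into a pass/fail criterion you could run automatically on each new analysis.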
# Interaction Analysis
plt.clf()
plt.close('all')
# Create correlation matrix of SHAP values
shap_corr = pd.DataFrame(shap_values, columns=X.columns).corr()
# Fill NA values with 0
shap_corr = shap_corr.fillna(0)
# Create a single figure
fig, ax = plt.subplots(figsize=(12, 8))
# Create heatmap with all values shown
sns.heatmap(shap_corr,
            xticklabels=X.columns,
            yticklabels=X.columns,
            cmap='viridis',
            annot=True,
            fmt='.2f',
            center=0,
            square=True,
            mask=None,  # Don't mask any values
            cbar_kws={'label': 'SHAP Value Correlation'})
plt.title('Change Interaction Analysis')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
As the heatmap shows, most changes work independently: there are only a few strong interactions between features. Design changes (button, price) have minimal interference, and mobile optimization and image count barely interact.
This suggests our changes can be implemented relatively independently without worrying about negative interactions.
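Rather than eyeballing the heatmap, you can pull out the strongest interaction programmatically. A small sketch with a made-up stand-in for the SHAP-value correlation matrix (column names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# Small stand-in for the SHAP-value correlation matrix behind the heatmap
cols = ['button_top', 'price_large', 'mobile_score']
corr = pd.DataFrame([[1.00, 0.18, 0.03],
                     [0.18, 1.00, 0.05],
                     [0.03, 0.05, 1.00]], index=cols, columns=cols)

# Blank the diagonal, then pick the strongest remaining correlation
off_diag = corr.where(~np.eye(len(cols), dtype=bool))
pair = off_diag.abs().stack().idxmax()
print(pair, float(off_diag.loc[pair]))
```

Masking the diagonal is the important step: every feature correlates perfectly with itself, so without it the "strongest interaction" would always be a self-pair.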
# 4. Distribution of SHAP Values
# First, identify most important features by mean absolute SHAP value
mean_shap = np.abs(shap_values).mean(0)
top_features_idx = np.argsort(mean_shap)[-4:] # Get indices of top 4 features
top_features = X.columns[top_features_idx]
plt.figure(figsize=(12, 6))
for i, (idx, col) in enumerate(zip(top_features_idx, top_features)):
    plt.subplot(2, 2, i+1)
    sns.kdeplot(shap_values[:, idx], fill=True)
    plt.title(f'SHAP Distribution\n{col}')
    plt.xlabel('SHAP Value')
    plt.ylabel('Density')
plt.tight_layout()
plt.show()
This plot shows the distribution of SHAP values for the four most impactful features in our analysis. Let's break down each subplot:
Image Count Distribution
Shows two distinct peaks centered around -0.08 and 0.08
The bimodal distribution suggests image count has two common effects:
Negative peak (-0.08): likely when there are too many images (>5)
Positive peak (0.08): when image count is optimal
The symmetrical peaks indicate balanced positive/negative impacts
Mobile Score Distribution
Also bimodal, with peaks around -0.1 and 0.2
Larger positive peak (0.2): shows the strong positive impact of good mobile optimization
Smaller negative peak (-0.1): indicates the downside of poor mobile scores
Wider spread suggests more variable impact than image count
Price Style (Large) Distribution
Similar bimodal pattern with peaks at -0.1 and 0.25
Strong positive effect (0.25) when price is prominently displayed
Negative effect (-0.1) when not using the large price style
The distribution suggests price style has the most consistent positive impact
Button Location (Top) Distribution
Widest range of SHAP values (-0.2 to 0.35)
Strongest positive peak around 0.3
Notable negative impact around -0.15
Most polarized effect among all features, suggesting it's the most influential change
Key Insights
All important features show bimodal distributions
Button location and price style have the largest potential positive impacts
Mobile score shows more balanced positive/negative effects
Image count has the most symmetrical impact distribution
This visualization helps understand not just the average impact of each feature, but how those impacts vary across different scenarios.
# 5. Cumulative Effects Plot
# Sort features by importance for cumulative plot
sorted_idx = np.argsort(mean_shap)
cumulative_effects = np.cumsum(mean_shap[sorted_idx])
plt.figure(figsize=(15, 10))
plt.plot(range(1, len(cumulative_effects) + 1), cumulative_effects,
         marker='o', linewidth=2, markersize=8, color='#1f77b4')
for i, effect in enumerate(cumulative_effects):
    if effect < 0.1:
        y_offset = -40 if i % 2 == 0 else -20
        x_offset = -20 if i % 2 == 0 else 20
    else:
        y_offset = 20
        x_offset = 50
    plt.annotate(f'{X.columns[sorted_idx[i]]}\n{effect:.2f}',
                 (i+1, effect),
                 xytext=(x_offset, y_offset),
                 textcoords='offset points',
                 ha='left' if x_offset > 0 else 'right',
                 va='center',
                 bbox=dict(boxstyle='round,pad=0.5',
                           fc='white',
                           ec='gray',
                           alpha=0.8))
plt.title('Cumulative Impact of Changes', pad=20, size=14)
plt.xlabel('Number of Changes', size=12)
plt.ylabel('Cumulative Effect', size=12)
plt.grid(True, alpha=0.3)
plt.margins(x=0.1)
plt.ylim(-0.1, 0.6)
plt.tight_layout()
plt.show()
This plot visualizes how the total impact builds up as we add features in order of their importance. Each point on the line represents the cumulative effect after adding another feature.
The line moves upward in steps, with each step showing:
The total impact so far
Which feature was just added
How much that feature contributed
The plot begins with a flat line at zero effect through the first five changes: time_on_page, button_location_bottom, button_location_middle, price_style_medium, and price_style_small. This indicates these features had negligible impact on our results when added sequentially. image_count introduces the first noticeable increase, reaching a cumulative effect of 0.07. While modest, this marks where measurable impacts begin to appear in our implementation sequence. The final three changes demonstrate significant effects:
mobile_score: increases cumulative impact to 0.21
price_style_large: further rises to 0.36
button_location_top: reaches a final impact of 0.55
These steep increases in the line's trajectory indicate these three changes were responsible for most of the overall improvement. The button_location_top change shows the largest individual contribution, evidenced by the steepest slope in the plot.
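The step-by-step build-up described above is just a cumulative sum over the per-feature effects. A sketch using values back-solved from the cumulative points in the walkthrough (so the individual effect sizes are approximate, not taken directly from the analysis output):

```python
import numpy as np

# Per-feature mean |SHAP| effects, back-solved from the cumulative points above
effects = {'image_count': 0.07, 'mobile_score': 0.14,
           'price_style_large': 0.15, 'button_location_top': 0.19}

# Running total after each change is added, in importance order
cumulative = np.cumsum(list(effects.values()))
for name, c in zip(effects, cumulative):
    print(f'{name}: cumulative effect {c:.2f}')
```

Reading the differences between successive points recovers each feature's individual contribution, which is why the steepest step (button_location_top) marks the most influential change.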
Analysis and Findings: Understanding the Impact of Each Change
Our SHAP analysis revealed clear insights about how each change affected our test results. Let's break down what we found:
Button Location Impact (42% Effect) The position of the button emerged as our strongest influencer. Moving it to the top of the page showed a consistent 42% improvement. SHAP's distribution plot for this feature shows two distinct clusters - a strong positive effect when placed at the top and a negative effect when placed lower, validating our initial design hypothesis.
Price Display Effectiveness (35% Effect) Making prices more prominent was our second most impactful change. SHAP analysis shows this had a 35% positive effect, with the distribution plot revealing a clear pattern: large price displays consistently improved results while smaller displays sometimes hindered performance.
Mobile Optimization Results (28% Effect) Mobile optimization proved significant with a 28% improvement when scoring above 0.8 on our mobile metrics. The SHAP distribution for mobile scores shows an interesting bimodal pattern - strong positive effects for well-optimized pages and moderate negative effects for poor mobile experiences.
Image Count Findings (-15% Effect) Perhaps our most surprising finding came from image count analysis. Pages with more than 5 images showed a 15% decrease in effectiveness. The SHAP distribution here shows two clear peaks, suggesting a clear threshold where additional images begin to hurt rather than help.
Interaction Effects
The interaction heatmap revealed minimal interference between our changes. The strongest interaction appeared between button placement and price display, though even this was relatively modest. This suggests our changes largely worked independently, allowing for flexible implementation approaches. These findings are particularly reliable because SHAP's discovered effects closely match our known true effects, validating the analysis methodology and providing confidence in our conclusions.
From Analysis to Action: Implementing Test Insights
Our SHAP analysis provided clear direction for practical improvements that can be broken down into specific, measurable actions:
Primary Changes
Based on the 42% improvement from button placement and 35% from price visibility:
Standardize button placement
Move all primary action buttons above the fold
Maintain consistent positioning across all pages
Remove any competing calls to action near the main button
Enhance price visibility
Increase price font size and contrast
Position pricing near the action button
Display any discounts or savings prominently
Performance Optimizations
Following the negative 15% impact of excess images:
Image strategy
Limit pages to a maximum of 5 key images
Implement lazy loading for additional images
Optimize image compression and formats
Mobile experience (28% improvement potential)
Prioritize mobile page speed
Ensure touch targets meet size guidelines
Simplify navigation for mobile users
Implementation Strategy
To maximize impact while minimizing risk:
Start with the highest-impact changes (button and price)
Follow with mobile optimizations
Implement image limits on new pages first
Roll out changes gradually to measure real-world impact
Each change should be monitored with clear metrics to validate that the improvements match our analysis predictions.
Conclusion
Our journey through SHAP analysis reveals more than just technical metrics. It's a testament to the power of understanding, not just what works, but why it works. In the digital landscape, where user experience is king, these insights are more than data points; they're windows into user behavior.
The real magic isn't in blindly implementing changes, but in understanding the nuanced interactions that drive user decisions. Each pixel, each button placement, each design choice tells a story. SHAP helps us read that story with unprecedented clarity.
As we continue to explore the intersection of AI, web design, and user experience, remember:
data isn't just about numbers. It's about people. It's about understanding the human behind the click, the motivation behind the interaction.
Stay curious. Keep exploring. And never stop asking why.
References
Lundberg, S. M., & Lee, S. I. (2017). "A unified approach to interpreting model predictions." Advances in Neural Information Processing Systems, 30.
Molnar, C. (2020). "Interpretable Machine Learning: A Guide for Making Black Box Models Explainable." https://christophm.github.io/interpretable-ml-book/
Lipovetsky, S., & Conklin, M. (2001). "Analysis of regression in game theory approach." Applied Stochastic Models in Business and Industry, 17(4), 319-330.
SHAP (SHapley Additive exPlanations) GitHub Repository: https://github.com/slundberg/shap
Interpretable Machine Learning with SHAP: https://towardsdatascience.com/interpretable-machine-learning-with-shap-61e7c1f53f9d