Free Trial Screener Experiment

Description

At the time of this experiment, Udacity courses had two options on the home page: "start free trial" and "access course materials". If the student clicked "start free trial", they were asked to enter their credit card information and were then enrolled in a free trial of the paid version of the course. After 14 days, they would automatically be charged unless they cancelled first. If the student clicked "access course materials", they could view the videos and take the quizzes for free, but they would not receive coaching support or a verified certificate, and they would not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.

[Screenshot: the free trial screener prompt shown after a student clicks "Start free trial".]

The hypothesis was that this might set clearer expectations for students upfront, reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. Users who do not enroll are not tracked by user-id in the experiment, even if they were signed in when they visited the course overview page.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

# se_binom, required_samples, get_data, sanity_checks, cleanup_data,
# effect_size_tests and sign_tests used below come from a local util module.
from util import *

Design

Metric Choice

There are two kinds of metrics to choose.
Invariant metrics are metrics whose results are expected to be the same in the control and experiment groups (or at least not statistically significantly different); they serve as sanity checks.
Evaluation metrics are the metrics used to judge the outcome of the experiment; for the change to matter, they should show a statistically significant difference between the control and experiment groups.

Invariant Metrics: Number of Cookies, Number of Clicks, Click-through-probability
Evaluation Metrics: Gross Conversion, Retention, Net Conversion

Invariant Metrics

  • Number of Cookies: Number of unique cookies to view the course overview page.
    The change only takes effect after the user clicks the "Start free trial" button, so the number of page views should not differ between the control and experiment groups.
  • Number of Clicks: Number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered).
    Again, the change kicks in only after the click, so this count should not differ between groups, which makes it an appropriate invariant metric.
  • Click-through-probability: Number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page.
    Since both the numerator and the denominator are expected to be invariant, the probability should be invariant as well.
In [2]:
invariants = ["num_cookies", "num_clicks", "click_thru_prob"]

Evaluation Metrics

  • Gross Conversion: Number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button.
    We expect a negative change here: the gross conversion of the experiment group should be lower than that of the control group, because the goal of the change is to discourage users from enrolling without considering the time they have available.
  • Retention: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout.
    We expect retention to increase: we want enrollments that persist, rather than cancellations as soon as the 14-day free trial ends.
  • Net Conversion: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button.
    We expect a non-decreasing change (i.e. it stays the same or increases) in net conversion; that is, the change should not reduce the number of students who stay beyond the free trial and make at least one payment. (All three ratios are summarized symbolically below.)
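In terms of the counts tracked in the experiment, the three evaluation metrics are simply

$$\text{Gross Conversion} = \frac{\text{enrollments}}{\text{clicks}}, \qquad \text{Retention} = \frac{\text{payments}}{\text{enrollments}}, \qquad \text{Net Conversion} = \frac{\text{payments}}{\text{clicks}}.$$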
In [3]:
eval_metrics = ["gross_conversion", "retention", "net_conversion"]

Unused Metrics

  • Number of user-ids: Number of users who enroll in the free trial.
    This is not suitable as an invariant metric because it is expected to change. It could serve as an evaluation metric, but I would rather use a normalised fraction such as a probability or rate, so Gross Conversion is used instead of Number of user-ids.

Measuring Variability

Here we measure the variability of each metric analytically, based on predefined baseline values, which can be found in this spreadsheet.

In [4]:
# Baseline Conversion Rates
COOKIES_PER_DAY = 40000
CLICKS_PER_DAY  = 3200
ENROLLS_PER_DAY = 660
CLICK_THRU_PROB = 0.08
P_ENROLL_CLICK  = 0.20625
P_PAY_ENROLL    = 0.53
P_PAY_CLICK     = 0.1093125

# BCR, short for Baseline Conversion Rate
metric_bcr = pd.Series([P_ENROLL_CLICK, P_PAY_ENROLL, P_PAY_CLICK], eval_metrics)

# Fraction of pageviews that form each metric's unit of analysis (denominator):
# clicks for gross/net conversion, enrollments for retention.
unit_analysis = pd.Series([CLICK_THRU_PROB,
                           P_ENROLL_CLICK * CLICK_THRU_PROB,
                           CLICK_THRU_PROB],
                          eval_metrics)
In [5]:
SAMPLE_SIZE = 5000

metrics_sd = se_binom(metric_bcr, SAMPLE_SIZE * unit_analysis)
pd.DataFrame({"Standard Deviation": metrics_sd})
Out[5]:
Standard Deviation
gross_conversion 0.020231
retention 0.054949
net_conversion 0.015602
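se_binom is defined in util and not shown above. A minimal sketch, assuming it returns the analytic standard error of a binomial proportion p measured over n units of analysis:

def se_binom(p, n):
    # Analytic standard error of a binomial proportion:
    # SE = sqrt(p * (1 - p) / n); works element-wise on pandas Series.
    return np.sqrt(p * (1 - p) / n)

With the baseline rates above and n = SAMPLE_SIZE * unit_analysis (400 clicks for the two conversion metrics, 82.5 enrollments for retention), this formula reproduces the standard deviations in the table.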

The analytic estimates are likely to be comparable to the empirical estimates when the unit of analysis matches the unit of diversion, which is the case for gross conversion and net conversion. For retention, whose unit of analysis is an enrollment rather than a cookie, we might want to collect an empirical estimate if time allows.

Sizing

Number of Samples vs. Power

I won't be using the Bonferroni correction for my analysis.
For the number of page views needed, we calculate the requirement for each evaluation metric and then take the maximum as the overall page view requirement.
I use $\alpha = 0.05$ and $\beta = 0.2$ to calculate the sample sizes.
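required_samples is defined in util and not reproduced here. As a hedged sketch of what such a helper typically computes (some implementations use an exact power calculation instead, so exact figures may differ slightly), the per-group sample size for detecting a change of $d_{min}$ from a baseline rate $p$ at significance level $\alpha$ with power $1-\beta$ is approximately

$$n \approx \frac{\left(z_{1-\alpha/2}\,\sqrt{2\,p(1-p)} + z_{1-\beta}\,\sqrt{p(1-p) + (p+d_{min})(1-p-d_{min})}\right)^{2}}{d_{min}^{2}},$$

where $z_q$ denotes the $q$-quantile of the standard normal distribution. The cell below doubles this per-group figure (control plus experiment) and divides by unit_analysis, the fraction of pageviews that reach each metric's denominator, to convert sample sizes into pageviews.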

In [6]:
# Practical significance levels
d_min = pd.Series([0.01, 0.01, 0.0075], eval_metrics)

pageviews_needed = 2 * required_samples(metric_bcr, d_min, alpha=0.05, beta=0.2) / unit_analysis
pd.DataFrame({"Pageviews Required": pageviews_needed.sort_values()})
Out[6]:
Pageviews Required
gross_conversion 645875.000000
net_conversion 685325.000000
retention 4741212.121212

That is, we need at least 4741213 page views to have enough power for every metric.

Duration vs. Exposure

This experiment isn't particularly risky: the change only adds a screener question before checkout, and we don't expect it to reduce the number of students who continue past the free trial and pay. We can therefore divert more than half of the total traffic to the experiment (split evenly between control and experiment); here I choose 80%. Given this, we can estimate the duration of the experiment.

In [7]:
TRAFFIC_FRACTION = 0.8

duration = pageviews_needed / (COOKIES_PER_DAY * TRAFFIC_FRACTION)
pd.DataFrame({"Duration in Days": duration})
Out[7]:
Duration in Days
gross_conversion 20.183594
retention 148.162879
net_conversion 21.416406

As a result, if we want enough power for every metric, we need 149 days to run the experiment, which is really long. In fact, even if we diverted 100% of the traffic to the experiment, we would still need 119 days. We might therefore drop retention from the evaluation metrics so that the experiment doesn't take too long.

In [8]:
eval_metrics = ["gross_conversion", "net_conversion"]
for series in [metric_bcr, unit_analysis, metrics_sd, d_min, pageviews_needed, duration]:
    series.drop("retention", inplace=True)

Now that we only use gross conversion and net conversion as our evaluation metrics, the minimum number of page views required drops to 685325, and the duration comes down to about 22 days, which is much more reasonable.

Analysis

The experiment data can be found in this spreadsheet.

In [9]:
# cont and exp hold the day-by-day control/experiment data;
# conts and exps are their summaries.
cont, conts, exp, exps = get_data()

Sanity Checks

For each invariant metric, we calculate the 95% confidence interval around the value we expect to observe, then check whether the observed value actually falls between the lower and upper bounds.

In [10]:
sanity_checks(conts, exps, invariants,
              metric_types=["sum", "sum", "prob"],
              units=["Pageviews", "Clicks", ("Clicks", "Pageviews")])
Out[10]:
Expected Lower Bound Upper Bound Observed Pass
num_cookies 0.5 0.49882 0.50118 0.50064 True
num_clicks 0.5 0.495885 0.504115 0.500467 True
click_thru_prob 0 -0.00129566 0.00129566 5.66271e-05 True
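sanity_checks is defined in util. A minimal sketch of the two kinds of checks it presumably performs, assuming cookies are assigned to each group with probability 0.5 and the click-through-probability difference is expected to be zero:

def count_sanity_check(n_cont, n_exp, alpha=0.05):
    # Counts (pageviews, clicks): the control share of the total should be
    # binomially distributed around 0.5.
    se = np.sqrt(0.5 * 0.5 / (n_cont + n_exp))
    z = st.norm.ppf(1 - alpha / 2)
    observed = n_cont / (n_cont + n_exp)
    return 0.5 - z * se, 0.5 + z * se, observed

def prob_sanity_check(x_cont, n_cont, x_exp, n_exp, alpha=0.05):
    # Click-through-probability: the difference between groups should be zero;
    # use the pooled probability for the standard error of the difference.
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    z = st.norm.ppf(1 - alpha / 2)
    observed = x_exp / n_exp - x_cont / n_cont
    return -z * se, z * se, observed

A metric passes when the observed value lands inside its interval, as in the table above.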

Result Analysis

In [11]:
cont, conts, exp, exps = cleanup_data(cont, exp)
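cleanup_data is also part of util. A plausible sketch, assuming the raw daily tables contain trailing days for which Enrollments and Payments have not been recorded yet, and that conts and exps are column totals:

def cleanup_data(cont, exp):
    # Keep only the days with enrollment and payment data available,
    # then rebuild the summary totals over those days.
    cont = cont.dropna(subset=["Enrollments", "Payments"])
    exp = exp.dropna(subset=["Enrollments", "Payments"])
    return cont, cont.sum(numeric_only=True), exp, exp.sum(numeric_only=True)

This restricts the effect-size and sign tests below to the days where all three counts are known.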

Effect Size Tests

For each metric, I calculate the 95% confidence interval around the difference between the experiment and control groups, then check whether 0 and the practical significance level $d_{min}$ fall inside the interval. If 0 lies outside the interval, the difference is statistically significant; if the interval also lies entirely beyond $d_{min}$ in magnitude, the difference is practically significant as well.

In [12]:
effect_size_tests(conts, exps, eval_metrics,
                  unit_X=["Enrollments", "Payments"],
                  unit_N=["Clicks",      "Clicks"],
                  d_min=d_min)
Out[12]:
Lower Bound Upper Bound Statistical Significance Practical Significance
gross_conversion -0.029123 -0.011987 True True
net_conversion -0.011605 0.001857 False False
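effect_size_tests lives in util. A minimal sketch of the interval it presumably builds for one metric, using the pooled standard error for the difference of two proportions (X successes out of N units in each group):

def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, alpha=0.05):
    # Pooled probability and pooled standard error of the difference.
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont   # observed difference
    z = st.norm.ppf(1 - alpha / 2)
    return d_hat - z * se_pool, d_hat + z * se_pool

Statistical and practical significance are then read off the interval as described above.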

Sign Tests

In [13]:
sign_tests(cont, exp, eval_metrics,
           unit_X=["Enrollments", "Payments"],
           unit_N=["Clicks",      "Clicks"])
Out[13]:
p-value Statistical Significance
gross_conversion 0.00259948 True
net_conversion 0.677639 False
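The sign test is a day-by-day robustness check: count the days on which the experiment group's rate beats the control group's rate, and compute the probability of a split at least that lopsided under a fair coin. sign_tests is defined in util; a minimal sketch, assuming cont and exp are the cleaned day-level tables sharing the same index:

def sign_test(cont, exp, x_col, n_col):
    # Days on which the experiment's daily rate exceeds the control's.
    exp_wins = (exp[x_col] / exp[n_col]) > (cont[x_col] / cont[n_col])
    k, n = exp_wins.sum(), len(exp_wins)
    # Two-sided exact binomial p-value under H0: P(win) = 0.5.
    return min(1.0, 2 * st.binom.cdf(min(k, n - k), n, 0.5))

For example, sign_test(cont, exp, "Enrollments", "Clicks") would correspond to the gross conversion p-value above.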

Summary

In the analysis I didn't use the Bonferroni correction, because it isn't very useful in our case. The Bonferroni correction reduces false positives (Type I errors) when the launch criterion is that any one of the metrics comes up significant; with many evaluation metrics there would otherwise be a higher probability of some metric looking significant purely by chance. In our case, however, we require all of the metrics to be significant, so the Bonferroni correction isn't appropriate here.
The sanity checks all passed: every observed statistic falls inside the 95% confidence interval around its expected value.
As for the experiment results, gross conversion turned out to be both statistically and practically significant. Net conversion, on the other hand, was neither statistically nor practically significant.

Recommendation

The experiment shows a significant change in gross conversion, meaning significantly fewer students enroll in the free trial, which is what we hoped to see. For net conversion there is no significant difference, which is also what we hoped for: the change does not significantly reduce the number of students who continue past the free trial. However, the confidence interval for net conversion includes the negative of the practical significance level, so it is possible that net conversion dropped by an amount that would matter to the business, which is an important risk. Based on this, I would recommend not launching the change, or running another experiment.

Follow-up Experiment

Speaking for myself, I sometimes over- or under-estimate the time I'll need for a course. It would probably be a good idea to offer a warm-up course or project that is simple enough to complete in about 5 hours and also covers the prerequisite knowledge; if a student can complete it within a week, they will have a more realistic sense of the time commitment.

Hypothesis

The hypothesis for this experiment is that the warm-up project gives students a more realistic sense of the time commitment, thus reducing early cancellations due to lack of time.

Unit of Diversion

The unit of diversion will be the user-id, because the experiment takes place after the student hits the "Start free trial" button and enrolls, at which point students are tracked by user-id.

Invariant Metrics

The invariant metrics for this experiment will be Number of Cookies, Number of User-ids, Number of Clicks and Click-through-probability. The three metrics already used in the previous experiment remain invariant for the same reasons as before. Number of User-ids is also invariant here, because this experiment begins only after users have enrolled in the free trial, so the number of users who enroll should not differ between groups.

Evaluation Metrics

I'll reuse one of the metrics from the previous experiment, Retention, as the evaluation metric for this new experiment, since the goal is to keep students enrolled beyond the free trial; retention measures exactly how likely students are to stay past the free trial and make at least one payment.