At the time of this experiment, Udacity courses have two options on the home page: "start free trial" and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial of the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not be able to submit their final project for feedback.
In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.
This screenshot shows what the experiment looks like:
The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.
import pandas as pd
import numpy as np
import scipy.stats as st
from util import *  # local helpers: se_binom, required_samples, get_data, cleanup_data, sanity_checks, effect_size_tests, sign_tests
There are two types of metrics to choose. The first is invariant metrics: metrics whose values are expected to be (roughly) equal in the control and experiment groups, i.e. any difference should not be statistically significant. The second is evaluation metrics: metrics used to evaluate the result of the experiment, which should show a statistically significant difference between the control and experiment groups.
Invariant Metrics: Number of Cookies, Number of Clicks, Click-through-probability
Evaluation Metrics: Gross Conversion, Retention, Net Conversion
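For reference, the evaluation metrics are ratios of the tracked events, consistent with the baseline probabilities used below: Gross Conversion $= \frac{\text{enrollments}}{\text{clicks}}$, Retention $= \frac{\text{payments}}{\text{enrollments}}$, and Net Conversion $= \frac{\text{payments}}{\text{clicks}}$, where "clicks" are clicks on the "start free trial" button.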
invariants = ["num_cookies", "num_clicks", "click_thru_prob"]
eval_metrics = ["gross_conversion", "retention", "net_conversion"]
Here we measure the variability of each metric analytically, based on predefined baseline values, which can be found in this spreadsheet.
# Baseline Conversion Rates
COOKIES_PER_DAY = 40000
CLICKS_PER_DAY = 3200
ENROLLS_PER_DAY = 660
CLICK_THRU_PROB = 0.08
P_ENROLL_CLICK = 0.20625
P_PAY_ENROLL = 0.53
P_PAY_CLICK = 0.1093125
# BCR, short for Baseline Conversion Rate
metric_bcr = pd.Series([P_ENROLL_CLICK, P_PAY_ENROLL, P_PAY_CLICK], eval_metrics)

# Fraction of pageviews that correspond to one unit of analysis (the metric's
# denominator): clicks for gross/net conversion, enrollments for retention
unit_analysis = pd.Series([CLICK_THRU_PROB,
                           P_ENROLL_CLICK * CLICK_THRU_PROB,
                           CLICK_THRU_PROB],
                          eval_metrics)
SAMPLE_SIZE = 5000  # pageviews assumed for the analytic variability estimate
metrics_sd = se_binom(metric_bcr, SAMPLE_SIZE * unit_analysis)
pd.DataFrame({"Standard Deviation": metrics_sd})
The analytic estimates are likely to be comparable to the empirical estimates when the unit of analysis equals the unit of diversion, which is the case for gross conversion and net conversion. For retention, we might want to collect an empirical estimate if we have time.
I won't be using Bonferroni correction for my analysis.
To determine the number of page views needed, we calculate the requirement for each evaluation metric, then take the maximum as the overall page view requirement.
Here I'm using $\alpha = 0.05, \beta = 0.2$ to calculate the sample size.
# Practical significance levels
d_min = pd.Series([0.01, 0.01, 0.0075], eval_metrics)
# 2 groups (control + experiment); dividing by unit_analysis converts the
# required number of analysis units into pageviews
pageviews_needed = 2 * required_samples(metric_bcr, d_min, alpha=0.05, beta=0.2) / unit_analysis
pd.DataFrame({"Pageviews Required": pageviews_needed.sort_values()})
That is, we need at least 4,741,213 page views to have enough power for every metric.
This experiment isn't particularly risky, since we don't expect the change to drastically reduce the number of enrollments. We can therefore divert more than half of the traffic to the experiment; here I choose 80%. Given this, we can now estimate the duration of the experiment.
TRAFFIC_FRACTION = 0.8  # fraction of daily pageviews diverted to the experiment
duration = pageviews_needed / (COOKIES_PER_DAY * TRAFFIC_FRACTION)
pd.DataFrame({"Duration in Days": duration})
As a result, if we want enough power for every metric, we'll need 149 days to run the experiment, which is really long. In fact, even if we divert 100% of the traffic to the experiment, we would still need 119 days. We might want to drop retention from our evaluation metrics so that the experiment doesn't take too long.
eval_metrics = ["gross_conversion", "net_conversion"]
for series in [metric_bcr, unit_analysis, metrics_sd, d_min, pageviews_needed, duration]:
    series.drop("retention", inplace=True)
Now that we only use gross conversion and net conversion as our evaluation metrics, the required number of page views drops to 685,325 and the duration drops to 22 days, which is much more reasonable.
The experiment data can be found in this spreadsheet
# cont and exp hold the daily control/experiment data; conts and exps are their summaries (column totals)
cont, conts, exp, exps = get_data()
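`get_data` is a `util` helper as well; a minimal sketch, assuming the two sheets were exported to `control.csv` and `experiment.csv` (hypothetical file names) with columns Date, Pageviews, Clicks, Enrollments, Payments:

```python
import pandas as pd

def get_data():
    """Load the daily control/experiment data and their column-wise totals."""
    cont = pd.read_csv("control.csv")
    exp = pd.read_csv("experiment.csv")
    conts = cont.sum(numeric_only=True)   # NaNs in Enrollments/Payments are skipped
    exps = exp.sum(numeric_only=True)
    return cont, conts, exp, exps
```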
For each invariant metric, we're going to calculate the 95% confidence interval for the value we expect to observe, then check if the observed value is actually between the lower and upper bound.
sanity_checks(conts, exps, invariants,
              metric_types=["sum", "sum", "prob"],
              units=["Pageviews", "Clicks", ("Clicks", "Pageviews")])
cont, conts, exp, exps = cleanup_data(cont, exp)
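`cleanup_data` is assumed to drop the trailing days for which Enrollments and Payments were not recorded and to recompute the summaries; a minimal sketch:

```python
def cleanup_data(cont, exp):
    """Keep only the days with complete Enrollments/Payments data."""
    cont = cont.dropna(subset=["Enrollments", "Payments"])
    exp = exp.dropna(subset=["Enrollments", "Payments"])
    return cont, cont.sum(numeric_only=True), exp, exp.sum(numeric_only=True)
```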
For each evaluation metric, I'm going to calculate the 95% confidence interval around the difference between the experiment and control groups, then check whether 0 and the practical significance level $d_{min}$ fall inside the interval. If neither does, we can be confident that there is a meaningful difference.
effect_size_tests(conts, exps, eval_metrics,
                  unit_X=["Enrollments", "Payments"],
                  unit_N=["Clicks", "Clicks"],
                  d_min=d_min)
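`effect_size_tests` builds these intervals for each metric; a sketch of the pooled-standard-error interval it is assumed to compute for the difference $\hat{p}_{exp} - \hat{p}_{cont}$:

```python
import numpy as np
import scipy.stats as st

def effect_size_ci(x_cont, n_cont, x_exp, n_exp, alpha=0.05):
    """CI for the difference in proportions (experiment minus control)."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    z = st.norm.ppf(1 - alpha / 2)
    return diff - z * se_pool, diff + z * se_pool
```

The difference is statistically significant if 0 lies outside the interval, and practically significant if the whole interval also lies beyond $d_{min}$ (or below $-d_{min}$).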
sign_tests(cont, exp, eval_metrics,
           unit_X=["Enrollments", "Payments"],
           unit_N=["Clicks", "Clicks"])
In the experiment analysis I didn't use the Bonferroni correction, because it isn't useful for our case. The Bonferroni correction reduces false positives (Type I errors) when the criterion for launching is that any one of the metrics turns out significant: with many evaluation metrics, there is a higher probability that some metric looks significant purely by chance, and the correction guards against that. In our case, however, we require all of the metrics to be significant before launching, so the Bonferroni correction isn't appropriate here.
All the sanity checks passed: every observed statistic lies inside the 95% confidence interval for the corresponding invariant metric's expected value.
For the experiment results, gross conversion turned out to be both statistically and practically significant. Net conversion, on the other hand, was neither statistically nor practically significant.
The results show a significant change in gross conversion, meaning significantly fewer students enroll in the free trial, which is what we want to see. For net conversion there is no significant difference, which is also what we want: the change does not significantly reduce the number of students who continue past the free trial. However, the confidence interval for net conversion includes the negative of the practical significance boundary, so it is possible that this number went down by an amount that would matter to the business, which is a serious concern. Based on this, I would recommend not launching the change, or running another experiment first.
For myself, I sometimes over- or underestimate the time I'll need for a course. It might be a good idea to offer a warm-up course or project that is simple enough to complete in about 5 hours and that also covers the prerequisite knowledge; if a student can finish it within a week, they will have a more practical sense of the time commitment.
The hypothesis for this experiment is that students will gain a more practical sense of the time commitment, thus reducing early cancellations due to lack of time.
The unit of diversion will be the user-id, because the experiment takes effect after the student clicks the "Start free trial" button, at which point students are tracked by user-id.
The invariant metrics for this experiment will be Number of Cookies, Number of User-ids, Number of Clicks, and Click-through-probability. The three metrics carried over from the previous experiment are invariant for the same reasons as before. Number of User-ids is also invariant here, because this experiment begins only after users have enrolled in the free trial, so the number of users who enroll should not differ between groups.
I'll use one of the metrics from the previous experiment, Retention, as the evaluation metric for this new experiment, since we want to keep students beyond the free trial: Retention measures how likely enrolled students are to stay past the free trial.