
Which version of the website should you use?

📖 Background

You work for an early-stage startup in Germany. Your team has been working on a redesign of the landing page. The team believes a new design will increase the number of people who click through and join your site.

They have been testing the changes for a few weeks, and now they want to measure the impact of the change. They need you to determine whether the observed increase could be due to random chance or whether it is statistically significant.

💾 The data

The team assembled the following file:

Redesign test data
  • "treatment" - "yes" if the user saw the new version of the landing page, no otherwise.
  • "new_images" - "yes" if the page used a new set of images, no otherwise.
  • "converted" - 1 if the user joined the site, 0 otherwise.

The control group is those users with "no" in both columns: the old version with the old set of images.

import pandas as pd
X_full = pd.read_csv('./data/redesign.csv')
X_full.head()
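Before any modeling, the raw conversion rates for the four groups (task 1 below) can be read straight off the data. A minimal sketch, assuming the file loads as above:

# Conversion rate and sample size for each treatment/new_images combination
rates = X_full.groupby(['treatment', 'new_images'])['converted'].agg(['mean', 'count'])
rates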

💪 Challenge

Complete the following tasks:

  1. Analyze the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.
  2. Can the increases observed be explained by randomness? (Hint: think A/B test; a sketch of one such test follows this list.)
  3. Which version of the website should they use?
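For task 2, one standard approach is a two-proportion z-test comparing a variant against the control group. A hedged sketch using statsmodels (the variable names here are illustrative, not part of the notebook cells below):

import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv('./data/redesign.csv')
# Control: old page, old images; variant: new page, old images (one of three comparisons)
control = df[(df['treatment'] == 'no') & (df['new_images'] == 'no')]['converted']
variant = df[(df['treatment'] == 'yes') & (df['new_images'] == 'no')]['converted']

# Two-proportion z-test: is the variant's conversion rate significantly
# different from the control's?
stat, p_value = proportions_ztest([variant.sum(), control.sum()],
                                  [len(variant), len(control)])
print(f'z = {stat:.3f}, p = {p_value:.3f}')

The same test can be repeated for the other two variants against the control; a p-value below the chosen significance level (commonly 0.05) suggests the difference is unlikely to be due to random chance.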

🧑‍⚖️ Judging criteria

We will randomly select ten winners from the correct submissions for this challenge.

The winners will receive DataCamp merchandise.

✅ Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your answers.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

import seaborn as sns

# Quick visual overview; since "converted" is the only numeric column,
# pairplot shows just its distribution
sns.pairplot(X_full)
from sklearn.model_selection import train_test_split

# Separate the target from the features
y = X_full['converted']
X_full.drop(['converted'], axis=1, inplace=True)

# Break off a validation set from the training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y,
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
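As a quick check of what the preprocessor emits (purely illustrative, assuming the objects defined above), one can fit-transform the training features and inspect the shape; with two binary categorical columns, one-hot encoding should yield four indicator columns:

# Peek at the encoded feature matrix produced by the ColumnTransformer
encoded = preprocessor.fit_transform(X_train)
print(encoded.shape)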
# Threshold the regression outputs at a heuristic cutoff of 0.60 to turn
# predicted conversion scores into binary 0/1 labels
def normalize_output(predictions):
    return [0 if prediction < 0.60 else 1 for prediction in predictions]
from xgboost import XGBRegressor

# Gradient-boosted regressor with hand-picked hyperparameters
def get_model():
    model = XGBRegressor(n_estimators=400, learning_rate=0.02, n_jobs=4)
    return model
# Define model
model = get_model()

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get binary predictions
preds = normalize_output(clf.predict(X_valid))

# With 0/1 values on both sides, MAE equals the misclassification rate
print('MAE:', mean_absolute_error(y_valid, preds))
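As a sanity check (assuming the cells above), the model's accuracy can be compared with a majority-class baseline that always predicts the more common label:

from sklearn.metrics import accuracy_score

# Compare model accuracy with the majority-class baseline
print('Model accuracy:   ', accuracy_score(y_valid, preds))
print('Baseline accuracy:', max(y_valid.mean(), 1 - y_valid.mean()))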
# Build one row for each of the four treatment/new_images combinations
data = {'treatment': ['yes', 'yes', 'no', 'no'], 'new_images': ['yes', 'no', 'yes', 'no']}

# Create DataFrame of variants to score
to_predict = pd.DataFrame(data)

# Refit on the full data set, then predict a conversion rate for each variant
clf.fit(X_full, y)
preds = clf.predict(to_predict)

# Collect the raw (unthresholded) predicted conversion rates per variant
results = to_predict.copy()
results["converted"] = preds
results
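To turn these scores into a recommendation (task 3), one option is simply to rank the four variants by predicted conversion rate; a minimal sketch assuming the results frame above:

# Rank the four variants by predicted conversion rate, highest first
results.sort_values("converted", ascending=False)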