Competition - website redesign

    Which version of the website should you use?

    📖 Background

    You work for an early-stage startup in Germany. Your team has been working on a redesign of the landing page. The team believes a new design will increase the number of people who click through and join your site.

    They have been testing the changes for a few weeks and now want to measure the impact. Your task is to determine whether the observed increase in conversions could be due to random chance or is statistically significant.

    💾 The data

    The team assembled the following file:

    Redesign test data
    • "treatment" - "yes" if the user saw the new version of the landing page, no otherwise.
    • "new_images" - "yes" if the page used a new set of images, no otherwise.
    • "converted" - 1 if the user joined the site, 0 otherwise.

    The control group consists of the users with "no" in both columns: they saw the old version of the page with the old set of images.

    import pandas as pd

    # Load the redesign test data
    X_full = pd.read_csv('./data/redesign.csv')
    X_full.head()

    💪 Challenge

    Complete the following tasks:

    1. Analyze the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.
    2. Can the observed increases be explained by randomness? (Hint: think A/B test; see the sketch after this list.)
    3. Which version of the website should they use?
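
    For task 2, one common approach is a two-proportion z-test comparing a variant against the control group. The sketch below uses statsmodels (an assumption: it may need to be installed separately); the variant chosen here, new design plus new images, is only an illustration, and the same test applies to the other groups.

    # Two-proportion z-test: is the variant's conversion rate different from control's?
    # (Assumes the column names from the data description above.)
    from statsmodels.stats.proportion import proportions_ztest

    control = X_full[(X_full['treatment'] == 'no') & (X_full['new_images'] == 'no')]
    variant = X_full[(X_full['treatment'] == 'yes') & (X_full['new_images'] == 'yes')]

    successes = [variant['converted'].sum(), control['converted'].sum()]
    n_obs = [len(variant), len(control)]

    z_stat, p_value = proportions_ztest(successes, n_obs)
    # A p-value below the usual 0.05 threshold would suggest the lift is not just chance
    print(f'z = {z_stat:.3f}, p = {p_value:.4f}')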

    🧑‍⚖️ Judging criteria

    We will randomly select ten winners from the correct submissions for this challenge.

    The winners will receive DataCamp merchandise.

    ✅ Checklist before publishing

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your answers.
    • Check that all the cells run without error.

    ⌛️ Time is ticking. Good luck!

    import seaborn as sns

    # Quick visual overview; note that pairplot only draws numeric columns,
    # so with two categorical flags this effectively shows only 'converted'
    sns.pairplot(X_full)
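
    The pairplot above is of limited use here, since two of the three columns are categorical. The per-group conversion rates asked for in task 1 can be read off directly with a groupby; a minimal sketch, assuming X_full still contains the 'converted' column at this point:

    # The mean of the 0/1 'converted' column per group is that group's conversion rate
    rates = (
        X_full.groupby(['treatment', 'new_images'])['converted']
              .agg(conversion_rate='mean', n_users='count')
    )
    print(rates)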
    from sklearn.model_selection import train_test_split

    y = X_full['converted']
    # Drop the target from the feature matrix (note: this mutates X_full in place)
    X_full.drop(['converted'], axis=1, inplace=True)
    # Break off validation set from training data
    X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                    train_size=.8, test_size=.2,
                                                                    random_state=0)
    # "Cardinality" means the number of unique values in a column
    # Select categorical columns with relatively low cardinality (convenient but arbitrary)
    categorical_cols = [cname for cname in X_train_full.columns if
                        X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]
    
    # Select numerical columns
    numerical_cols = [cname for cname in X_train_full.columns if 
                    X_train_full[cname].dtype in ['int64', 'float64']]
    
    # Keep selected columns only
    my_cols = categorical_cols + numerical_cols
    X_train = X_train_full[my_cols].copy()
    X_valid = X_valid_full[my_cols].copy()
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.metrics import mean_absolute_error
    
    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='median')
    
    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    def normalize_output(predictions):
        # Threshold the regressor's continuous outputs at 0.60 to get 0/1 labels
        return [0 if prediction < .60 else 1 for prediction in predictions]
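
    A quick check of the thresholding helper; the 0.60 cutoff is the author's choice, and values at or above it map to 1:

    print(normalize_output([0.25, 0.59, 0.60, 0.87]))  # -> [0, 0, 1, 1]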
    from xgboost import XGBRegressor

    def get_model():
        # Gradient-boosted trees, used here as a regressor on the 0/1 target
        model = XGBRegressor(n_estimators=400, learning_rate=0.02, n_jobs=4)
        return model
    # Define model
    model = get_model()
    
    # Bundle preprocessing and modeling code in a pipeline
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                         ])
    
    # Preprocessing of training data, fit model 
    clf.fit(X_train, y_train)
    
    # Preprocessing of validation data, get predictions
    preds = normalize_output(clf.predict(X_valid))
    
    print('MAE:', mean_absolute_error(y_valid, preds))
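
    Since both y_valid and the thresholded predictions are 0/1, the MAE above is simply the misclassification rate. A sanity check against an always-predict-0 baseline gives it context (a sketch, reusing the variables above):

    # Baseline: always predict "not converted"; its MAE equals the base conversion rate.
    # The model should clearly beat this to be adding any value.
    print('Baseline MAE:', mean_absolute_error(y_valid, [0] * len(y_valid)))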
    # The four possible page variants to score
    data = {'treatment': ['yes', 'yes', 'no', 'no'], 'new_images': ['yes', 'no', 'yes', 'no']}

    # Create DataFrame
    to_predict = pd.DataFrame(data)
    
    # Refit on the full dataset, then score each of the four variants
    clf.fit(X_full, y)
    preds = clf.predict(to_predict)
    results = to_predict
    results["converted"] = preds
    results
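
    Because the final predictions come straight from the regressor, the "converted" column above holds continuous estimates of each variant's conversion rate rather than 0/1 labels. The recommended version (task 3) is then simply the variant with the highest estimate; a short follow-up, assuming the results frame above:

    # Rank the four variants by predicted conversion rate, highest first
    results.sort_values('converted', ascending=False)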