1. DRY: Don't repeat yourself
Have you ever started a data analysis only to end up with repetitive code? Our colleague Brenda, who works as a Product Analyst, has found herself in this situation and has asked us for help. She's written a script to pull Net Promoter Score (NPS) data from various sources. NPS works by asking, "How likely is it that you would recommend our product to a friend or colleague?" with a rating scale of 0 to 10. Brenda has set up this NPS survey in various ways, including emails and pop-ups on the mobile app and website. To compile the data from the different sources, she's written the following code:
```python
# Read the NPS email responses into a DataFrame
email = pd.read_csv("datasets/2020Q4_nps_email.csv")
# Add a column to record the source
email['source'] = 'email'

# Repeat the same steps for the mobile and web responses
mobile = pd.read_csv("datasets/2020Q4_nps_mobile.csv")
mobile['source'] = 'mobile'
web = pd.read_csv("datasets/2020Q4_nps_web.csv")
web['source'] = 'web'

# Combine the DataFrames
q4_nps = pd.concat([email, mobile, web])
```
This results in the DataFrame `q4_nps`, which looks like this:
| | response_date | user_id | nps_rating | source |
|---|---|---|---|---|
| 0 | 2020-10-29 | 36742 | 2 | email |
| 1 | 2020-11-26 | 31851 | 10 | email |
| 2 | 2020-10-27 | 44299 | 10 | email |
| … | … | … | … | … |
This code works, but it violates the Don't Repeat Yourself (DRY) programming principle: Brenda repeats the same code for email, mobile, and web, with only the variable names and file names changed. While copying and pasting is often quicker, it also makes it easier to introduce errors, and if you ever need to edit one of those lines, you have to do it in multiple places. Enter functions! Repeated code is a sign that we need them. Let's write a function for Brenda.
```python
# Import pandas with the usual alias
import pandas as pd

# Write a function that matches the docstring
def convert_csv_to_df(csv_name, source_type):
    """Converts an NPS CSV into a DataFrame with a column for the source.

    Args:
        csv_name (str): The name of the NPS CSV file.
        source_type (str): The source of the NPS responses.

    Returns:
        A DataFrame with the CSV data and a column, source.
    """
    df = pd.read_csv(csv_name)
    df['source'] = source_type
    return df

# Test the function on the mobile data
convert_csv_to_df("datasets/2020Q4_nps_mobile.csv", "mobile")
```
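With this function in hand, Brenda's original script could be reduced to three calls plus the concatenation. Here's a sketch of the equivalent rewrite:

```python
# Each repeated block becomes a single, self-documenting call
email = convert_csv_to_df("datasets/2020Q4_nps_email.csv", "email")
mobile = convert_csv_to_df("datasets/2020Q4_nps_mobile.csv", "mobile")
web = convert_csv_to_df("datasets/2020Q4_nps_web.csv", "web")
q4_nps = pd.concat([email, mobile, web])
```

Now any fix to the reading logic happens in exactly one place: inside the function.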
2. Verifying the files with the "with" keyword
Excellent, we have a function that reads and cleans Brenda's CSVs precisely the way she needs! She can call this function in the future for as many different sources as she wants. Before we combine the NPS DataFrames, we want to add a function that verifies that the input files are valid. Each of these NPS dataset files should have three columns: `response_date`, `user_id`, and `nps_rating`. Previously, Brenda would check this manually by opening each file.
Let's write a function that uses the context manager `with open()` so that we properly close the files we open, even if an exception is raised. If we didn't use the `with` keyword with `open()`, we would need to call `close()` after we're done with the file. Even then, it's risky, because an error might be raised before the `close()` call is reached.
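To see the difference, here's a minimal sketch of the two approaches (the file name is hypothetical):

```python
# Without "with": if readline() raises an exception, close() never runs
f = open("some_file.csv")
first_line = f.readline()
f.close()

# With "with": the file is closed automatically, even if an exception is raised
with open("some_file.csv") as f:
    first_line = f.readline()
```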
The function will return `True` if the file contains the right columns. Otherwise, it will return `False`. To test the function, we'll use `datasets/corrupted.csv` to simulate a corrupted, invalid NPS file.
```python
# Define a function check_csv which takes csv_name
def check_csv(csv_name):
    """Checks if a CSV has three columns: response_date, user_id, nps_rating

    Args:
        csv_name (str): The name of the CSV file.

    Returns:
        Boolean: True if the CSV is valid, False otherwise
    """
    # Open csv_name as f using with open
    with open(csv_name) as f:
        first_line = f.readline()
        # Return True if the CSV has the three specified columns
        if first_line == "response_date,user_id,nps_rating\n":
            return True
        # Otherwise, return False
        else:
            return False

# Test the function on a corrupted NPS file
check_csv('./datasets/corrupted.csv')
```
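As a design note, comparing the raw header line is strict: a file whose columns are valid but appear in a different order would be rejected. If we wanted a more lenient check, one possible variant (a sketch, not the function we use below) reads only the header row with pandas:

```python
def check_csv_lenient(csv_name):
    """Checks for the three required columns, ignoring their order."""
    # nrows=0 reads just the header, so the data itself is never loaded
    header = pd.read_csv(csv_name, nrows=0)
    return set(header.columns) == {"response_date", "user_id", "nps_rating"}
```

For Brenda's files, which all share the same column order, the strict check is enough.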
3. Putting it together with nested functions
Alright, we now have one function that verifies that the CSVs are valid and another that converts them into the DataFrame format Brenda needs. What's left? Looking at the script, this is the last line we haven't covered: `q4_nps = pd.concat([email, mobile, web])`. We could keep this line of code, but we'd see more code repetition if we got CSVs from other sources or time periods.

To make sure our code is scalable, we're going to write a function called `combine_nps_csvs()` that takes in a dictionary. Python dictionaries have key:value pairs. In our case, the CSV's name and source type will be the key and value, respectively. That way, we can define a dictionary with as many NPS files as we have and run it through `combine_nps_csvs()`. For each file, we'll check that it's valid using `check_csv()`, and if it is, we'll use `convert_csv_to_df()` to convert it into a DataFrame. At the start of the function, we'll define an empty DataFrame called `combined`, and every time a CSV is successfully converted, we'll concatenate it to `combined`.
```python
# Write a function combine_nps_csvs() with the arg csvs_dict
def combine_nps_csvs(csvs_dict):
    # Define combined as an empty DataFrame
    combined = pd.DataFrame()
    # Iterate over csvs_dict to get the name and source of the CSVs
    for csv_name, source_type in csvs_dict.items():
        # Check if the CSV is valid using check_csv()
        if check_csv(csv_name):
            # Convert the CSV using convert_csv_to_df() and assign it to temp
            temp = convert_csv_to_df(csv_name, source_type)
            # Concatenate combined and temp and assign it to combined
            combined = pd.concat([combined, temp])
        # If the file is not valid, print a message with the CSV's name
        else:
            print(csv_name + " is not a valid file and will not be added.")
    # Return the combined DataFrame
    return combined

my_files = {
    "datasets/2020Q4_nps_email.csv": "email",
    "datasets/2020Q4_nps_mobile.csv": "mobile",
    "datasets/2020Q4_nps_web.csv": "web",
    "datasets/corrupted.csv": "social_media"
}

# Test the function on the my_files dictionary
combine_nps_csvs(my_files)
```
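Since `datasets/corrupted.csv` fails `check_csv()`, this call should print that it isn't a valid file and return a DataFrame built from the other three sources only.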
4. Detractors, Passives, and Promoters
We've summarized our colleague's script into one function: `combine_nps_csvs()`! Let's move on to analyzing the NPS data, such as actually calculating NPS. As a reminder, NPS works by asking, "How likely is it that you would recommend our product to a friend or colleague?" with a rating scale of 0 to 10.

NPS ratings are categorized into three groups: ratings from 0 to 6 are detractors, ratings of 7 or 8 are passives, and ratings of 9 or 10 are promoters. There's more to analyzing NPS, but remember, functions should be small in scope and should just "do one thing". So before we get ahead of ourselves, let's write a simple function that takes an NPS rating and categorizes it into the appropriate group.
```python
def categorize_nps(x):
    """Takes an NPS rating and outputs whether it is a "promoter",
    "passive", "detractor", or "invalid" rating. "invalid" is
    returned when the rating is not between 0-10.

    Args:
        x: The NPS rating

    Returns:
        String: the NPS category or "invalid".
    """
    # Write the rest of the function to match the docstring
    if 0 <= x <= 6:
        return 'detractor'
    elif 7 <= x <= 8:
        return 'passive'
    elif 9 <= x <= 10:
        return 'promoter'
    else:
        return 'invalid'

# Test the function
categorize_nps(8)
```
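A few more spot checks (the input values here are chosen arbitrarily):

```python
categorize_nps(0)   # 'detractor'
categorize_nps(10)  # 'promoter'
categorize_nps(11)  # 'invalid'
```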
5. Applying our function to a DataFrame
So we have a function that takes a score and outputs which NPS response group it belongs to. It would be great to have this as a column in our NPS DataFrames, similar to the `source` column we added earlier. Since we've modularized our code with functions, all we need to do is edit our `convert_csv_to_df()` function and nest `categorize_nps()` into it. However, the way we'll nest `categorize_nps()` will be different from previous times. The `pandas` library has a handy method called `apply()`, which lets us apply a function to every value of a column, or to each column or row of a DataFrame.
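As a quick illustration of how `apply()` works on a column, here's a minimal sketch using a throwaway Series and the `categorize_nps()` function from Task 4:

```python
import pandas as pd

ratings = pd.Series([2, 8, 10])
# apply() calls categorize_nps on each value and returns a new Series
ratings.apply(categorize_nps)
# 0    detractor
# 1      passive
# 2     promoter
# dtype: object
```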
```python
def convert_csv_to_df(csv_name, source_type):
    """Converts an NPS CSV into a DataFrame with columns for the source and NPS group.

    Args:
        csv_name (str): The name of the NPS CSV file.
        source_type (str): The source of the NPS responses.

    Returns:
        A DataFrame with the CSV data and columns: source and nps_group.
    """
    df = pd.read_csv(csv_name)
    df['source'] = source_type
    # Define a new column nps_group which applies categorize_nps to nps_rating
    df['nps_group'] = df['nps_rating'].apply(categorize_nps)
    return df

# Test the updated function with mobile data
convert_csv_to_df("datasets/2020Q4_nps_mobile.csv", "mobile")
```
6. Calculating the Net Promoter Score
If we hadn't broken down our code into functions earlier, we would've had to edit our code in multiple places to add an `nps_group` column, increasing the chance of introducing errors. It also helps that each of our functions has a single responsibility, keeping our code flexible and easier to edit and debug.

Now we're in a good place to calculate the Net Promoter Score! It's calculated by subtracting the percentage of detractor ratings from the percentage of promoter ratings; in other words:
$ \text{NPS} = \frac{\text{\# of Promoter Ratings} - \text{\# of Detractor Ratings}}{\text{Total \# of Respondents}} \times 100 $
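For instance, in a hypothetical batch of 100 responses with 50 promoters, 30 passives, and 20 detractors, the score would be $ \frac{50 - 20}{100} \times 100 = 30 $.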
We want to calculate the NPS across all sources, so we'll use `combine_nps_csvs()` from Task 3 to consolidate the source files. As expected, that will output a DataFrame, which we'll use as the input for a new function we're going to write, `calculate_nps()`.
```python
# Define a function calculate_nps that takes a DataFrame
def calculate_nps(df):
    # Count the ratings in each NPS group
    ratings_counts = df['nps_group'].value_counts()
    ratings_promoters = ratings_counts['promoter']
    ratings_detractors = ratings_counts['detractor']
    total_respondents = ratings_counts.sum()
    # Subtract the percentage of detractors from the percentage of promoters
    nps = (ratings_promoters - ratings_detractors) / total_respondents * 100
    # Return the NPS score
    return nps

my_files = {
    "datasets/2020Q4_nps_email.csv": "email",
    "datasets/2020Q4_nps_web.csv": "web",
    "datasets/2020Q4_nps_mobile.csv": "mobile",
}

# Test the function on the my_files dictionary
q4_nps = combine_nps_csvs(my_files)
calculate_nps(q4_nps)
```
7. Breaking down NPS by source
Is it good to have an NPS score of around 10? The worst NPS score you can get is -100, when all respondents are detractors, and the best is 100, when all respondents are promoters. Average NPS scores vary a lot depending on your product or service's industry, but a negative score is a bad sign because it means you have more unhappy customers than happy ones. Typically, a score over 50 is considered excellent, and one above 75 is considered best in class.

Although our score is above 0, it's still on the lower end of the spectrum. The product team concludes that significantly increasing NPS across our customer base is a priority. Luckily, we have NPS data that we can analyze further to find data-driven solutions. A good start would be breaking down the NPS score by source type. For instance, if people are rating lower on the web than on mobile, that's evidence we need to improve the browser experience.
```python
# Define a function calculate_nps_by_source that takes a DataFrame
def calculate_nps_by_source(df):
    # Group the DataFrame by source and apply calculate_nps() to each group
    nps_by_source = df.groupby(['source']).apply(calculate_nps)
    # Return a Series with the NPS scores broken down by source
    return nps_by_source

my_files = {
    "datasets/2020Q4_nps_email.csv": "email",
    "datasets/2020Q4_nps_web.csv": "web",
    "datasets/2020Q4_nps_mobile.csv": "mobile",
}

# Test the function on the my_files dictionary
q4_nps = combine_nps_csvs(my_files)
calculate_nps_by_source(q4_nps)
```
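Under the hood, `groupby('source').apply(calculate_nps)` hands each source's sub-DataFrame to `calculate_nps()` and collects the results into a Series indexed by source - conceptually equivalent to this sketch:

```python
# Hypothetical equivalent of q4_nps.groupby('source').apply(calculate_nps)
pd.Series({source: calculate_nps(group)
           for source, group in q4_nps.groupby('source')})
```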
8. Adding docstrings
Interesting! The mobile responses have an NPS score of about -15, which is noticeably lower than the other two sources. There are a few ways we could continue our analysis from here. We could use the `user_id` column to reach out to users who rated poorly on the mobile app for user interviews. We could also break down NPS by source and date to see if there was a date where NPS clearly started decreasing - perhaps around the same time a bug or feature was released. With the functions we created, Brenda is in a good place to continue this work!
The last thing we'll discuss is docstrings. In Tasks 1, 2, 4, and 5, we included docstrings for `convert_csv_to_df()`, `check_csv()`, and `categorize_nps()`. However, we should include docstrings for all the functions we write so that others can better re-use and maintain our code. Docstrings tell readers what a function does, its arguments, its return value(s) if any, and any other useful information you want to include, such as error handling. There are several standards for writing docstrings in Python, including Numpydoc, Google-style (the style chosen in this notebook), and reStructuredText.
To make sure Brenda and other colleagues can follow our work, we are going to add docstrings to the remaining undocumented functions: `combine_nps_csvs()`, `calculate_nps()`, and `calculate_nps_by_source()`. It's up to you how you want to write the docstrings - we'll only check that a docstring exists for each of these functions.
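For example, a Google-style docstring for `combine_nps_csvs()` could look like this (a sketch; the exact wording is up to you):

```python
def combine_nps_csvs(csvs_dict):
    """Combines several NPS CSVs into one DataFrame, skipping invalid files.

    Args:
        csvs_dict (dict): Keys are CSV file names, values are their source types.

    Returns:
        DataFrame: The valid CSVs concatenated, with source and nps_group columns.
    """
    ...
```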