Analyzing Google Play Store Apps and Reviews (Project)
Mobile apps have become an indispensable part of our lives in a short span of time. Their convenience comes from the services and entertainment they provide to users around the world, often free of charge. In this analysis, we will explore the Android app market by examining over ten thousand apps available on Google Play across different categories. Our goal is to extract insights from the data to devise strategies for driving growth and enhancing user retention.
1. Understanding the Dataset
Let's take a look at the data, which consists of two files:
- apps.csv: Contains details of various applications on Google Play, including features such as category, rating, reviews, size, installs, and price.
- user_reviews.csv: Contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.
Let's get started
First, we will read the data and take a look at it. We will also drop the duplicate rows and count the number of apps.
# Read in dataset
import pandas as pd
apps_with_duplicates = pd.read_csv("datasets/apps.csv")
# info() prints its summary directly and returns None, so it needs no display()
apps_with_duplicates.info()
# Drop duplicates from apps_with_duplicates
apps = apps_with_duplicates.drop_duplicates()
# Print the total number of apps
print("Total number of apps in the dataset is", apps["App"].count())
# Have a look at a random sample of 5 rows
print("Below is a sample from the data")
display(apps.sample(5))
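As a side note on the duplicate removal above, here is a minimal sketch of how drop_duplicates behaves on a small made-up frame. The toy rows are an assumption for illustration only and are not taken from apps.csv. By default a row is dropped only when every column matches an earlier row, so two listings of the same app that differ in any column would both survive.

```python
import pandas as pd

# Toy data standing in for apps.csv (values are made up for illustration)
toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Beta"],
    "Category": ["TOOLS", "TOOLS", "GAME", "FAMILY"],
})

# Default behavior: drop a row only if all of its columns match an earlier row
exact = toy.drop_duplicates()

# Stricter alternative: keep the first row seen for each app name
by_name = toy.drop_duplicates(subset="App", keep="first")

print("rows:", len(toy), "after exact dedup:", len(exact), "after name dedup:", len(by_name))
```

Whether the stricter subset="App" variant is appropriate depends on what a duplicate should mean in the analysis. The code above this note uses the default.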
2. Data cleaning
There are 9659 entries and 14 columns in our apps dataset, but it is not ready for analysis yet. Data cleaning is one of the most essential subtasks in any data science project. Although it can be a very tedious process, its worth should never be underestimated.
We start by identifying and removing any inconsistencies or special characters present in certain columns. For example, columns like Installs and Price have a few special characters ("+", "," and "$") due to the way the numbers have been represented. These characters prevent us from applying numeric operations to them. We will remove those characters so that the remaining text can be parsed as numbers.
We now proceed to clean our data. We will replace each character with an empty string using apply and a lambda function as we loop through each column and character.
# List of characters to remove
chars_to_remove = ["+", ",", "$"]
# List of column names to clean
cols_to_clean = ["Installs", "Price"]
# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Replace the character with an empty string
        apps[col] = apps[col].apply(lambda x: x.replace(char, ""))
# Print a summary of the apps dataframe
apps.info()
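The loop above relies on apply with a lambda. As a possible alternative, the same cleanup can be sketched with the vectorized string accessor in pandas. The toy values below are made up to mimic the format of Installs and Price and are not taken from the dataset.

```python
import pandas as pd

# Made-up values in the same style as the Installs and Price columns
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

for col in ["Installs", "Price"]:
    for char in ["+", ",", "$"]:
        # regex=False treats "+" and "$" as literal characters, not regex operators
        toy[col] = toy[col].str.replace(char, "", regex=False)

print(toy)
```

One practical difference: .str.replace leaves missing values as missing, whereas the lambda version would raise an error on a non-string entry.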
3. Correcting data types
From the previous task we noticed that Installs and Price were categorized as the object data type (and not int or float, as we would expect if the data contained digits only). This is because these two columns originally had mixed input types: digits and special characters.
The four features that we will be working with most frequently from here on are Installs, Size, Rating and Price. While Size and Rating are both float (i.e. purely numerical data types), we still need to work on Installs and Price to make them numeric. We will convert the data type using the astype method of a pandas Series.
import numpy as np
# Convert Installs to float data type
apps["Installs"] = apps["Installs"].astype("float64")
# Convert Price to float data type
apps["Price"] = apps["Price"].astype("float64")
# Checking dtypes of the apps dataframe
print(apps.dtypes)
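astype raises an error if any value cannot be parsed as a number. As a hedged alternative, pd.to_numeric with errors="coerce" turns unparseable entries into NaN so they can be inspected afterwards. The stray "Free" value below is hypothetical and only illustrates the behavior; we do not claim it occurs in the dataset.

```python
import pandas as pd

# Already-cleaned strings plus one hypothetical stray value ("Free")
cleaned = pd.Series(["10000", "500", "Free"])

# errors="coerce" converts anything unparseable to NaN instead of raising
as_numbers = pd.to_numeric(cleaned, errors="coerce")

print(as_numbers)
print("Values that failed to convert:", as_numbers.isna().sum())
```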
4. Exploring app categories
With more than 1 billion active users in 190 countries around the world, Google Play continues to be an important distribution platform to build a global audience. Google Play categorizes apps into various categories, facilitating easier discovery for users. We aim to answer several questions regarding app categories, such as:
- Which category has the highest share of (active) apps in the market?
- Is any specific category dominating the market?
- Which categories have the fewest number of apps?
We will use the Plotly and Pandas libraries to answer these questions.
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
# Print the total number of unique categories
num_categories = len(apps["Category"].unique())
print('Number of categories = ', num_categories)
# Count the number of apps in each 'Category'.
num_apps_in_category = apps["Category"].value_counts()
# Sort num_apps_in_category in descending order based on the count of apps in each category
sorted_num_apps_in_category = num_apps_in_category.sort_values(ascending=False)
# Bar chart of the sorted counts
data = [go.Bar(
    x=sorted_num_apps_in_category.index,   # index = category name
    y=sorted_num_apps_in_category.values,  # value = count
)]
plotly.offline.iplot(data)
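The bar chart shows raw counts. To answer the question about market share more directly, the counts can be turned into percentages. The sketch below uses a made-up Category column, so the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up category labels: 5 FAMILY, 3 GAME, 2 TOOLS
toy = pd.DataFrame({"Category": ["FAMILY"] * 5 + ["GAME"] * 3 + ["TOOLS"] * 2})

# normalize=True returns fractions of the total; multiply by 100 for percentages
share = toy["Category"].value_counts(normalize=True).mul(100).round(1)

print(share)
```

Applied to apps["Category"], the same call would give each category's share of all listed apps, sorted from largest to smallest.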
5. Distribution of app ratings
We can see that there are 33 unique app categories present in our dataset. Family and Game apps have the highest market prevalence. Interestingly, Tools, Business and Medical apps are also near the top.
App ratings (on a scale of 1 to 5) are as significant as market share when comparing categories. Ratings are a key performance indicator of an app and play an important role in user perception and app discoverability. We therefore analyze the distribution of app ratings to understand how users perceive different apps.
# Average rating of apps
avg_app_rating = apps["Rating"].mean()
print('Average app rating = ', avg_app_rating)
# Distribution of apps according to their ratings
data = [go.Histogram(
x = apps['Rating']
)]
# Vertical dashed line to indicate the average app rating
layout = {'shapes': [{
'type' :'line',
'x0': avg_app_rating,
'y0': 0,
'x1': avg_app_rating,
'y1': 1000,
'line': { 'dash': 'dashdot'}
}]
}
plotly.offline.iplot({'data': data, 'layout': layout})
6. Does Size Matter?
Above, we observed that the average app rating across all app categories is 4.17. The histogram is skewed to the left, indicating that the majority of apps are highly rated, with only a few low-rated exceptions.
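A left skew can also be checked numerically rather than by eye: the mean falls below the median and the sample skewness is negative. The ratings below are made up to mimic a mostly high-rated distribution with a few low outliers.

```python
import pandas as pd

# Made-up ratings: mostly high, with two low-rated exceptions
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.5, 4.7, 2.0, 1.5])

print("mean:  ", ratings.mean())
print("median:", ratings.median())
# Negative skewness indicates a longer tail on the low side (left skew)
print("skew:  ", ratings.skew())
```

Running the same three calls on apps["Rating"] would be one way to confirm what the histogram suggests.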
We now explore the potential factors influencing user ratings: app size and app price. For size, if the mobile app is too large, it may be difficult and/or expensive for users to download. Lengthy download times could turn users off before they even experience your mobile app. Plus, each user's device has a finite amount of disk space. For price, some users expect their apps to be free or inexpensive. These problems compound if the developing world is part of your target market, especially because of internet speeds, earning power and exchange rates.
How can we effectively come up with strategies to size and price our app?
- Does the size of an app affect its rating?
- Do users really care about system-heavy apps or do they prefer lightweight apps?
- Does the price of an app affect its rating?
- Do users always prefer free apps over paid apps?
We will use the Seaborn and Matplotlib libraries to visualize the data and answer these questions.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import warnings
warnings.filterwarnings("ignore")
# Select rows where both 'Rating' and 'Size' values are present (ie. the two values are not null)
apps_with_size_and_rating_present = apps[(~apps["Rating"].isnull()) & (~apps["Size"].isnull())]
# Subset for categories with at least 250 apps
large_categories = apps_with_size_and_rating_present.groupby(["Category"]).filter(lambda x: len(x) >= 250)
# Plot size vs. rating
plt1 = sns.jointplot(x = large_categories["Size"], y = large_categories["Rating"])
# adding another layer of regression line onto the plot
plt1.plot_joint(sns.regplot, color="gray", line_kws=dict(color="r"))
print("Joint Plot of 'Size' and 'Rating' of the Apps with a Regression Line")
The majority of top-rated apps (ratings above 4) range in size from 2 MB to 20 MB. Additionally, the regression line relating app size to rating is nearly flat, indicating a negligible correlation between the two features.
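A flat regression line corresponds to a Pearson correlation coefficient close to zero, and that coefficient can be computed directly. The sketch below uses made-up Size and Rating values, so the number it prints is not a result from the dataset.

```python
import pandas as pd

# Made-up sizes (MB) and ratings with no clear trend between them
toy = pd.DataFrame({
    "Size": [2.0, 5.0, 10.0, 20.0, 50.0, 90.0],
    "Rating": [4.2, 4.5, 4.1, 4.4, 4.2, 4.3],
})

# Series.corr defaults to the Pearson correlation coefficient
r = toy["Size"].corr(toy["Rating"])

print(f"Pearson r = {r:.2f}")
```

On the real data, large_categories["Size"].corr(large_categories["Rating"]) would put a number on the relationship shown in the joint plot.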
# Select apps whose 'Type' is 'Paid'
paid_apps = apps_with_size_and_rating_present[apps_with_size_and_rating_present["Type"] == "Paid"]
# Plot price vs. rating
plt2 = sns.jointplot(x = paid_apps["Price"], y= paid_apps["Rating"])
# adding another layer of regression line onto the plot
plt2.plot_joint(sns.regplot, color="gray", line_kws=dict(color="red"))
We also observed a weak negative relationship between app price and rating. However, the presence of extreme outliers (apps priced at over 300 USD) has stretched the price axis, while a large number of data points are clustered below $50. This biases the fit and makes the plot difficult to interpret. To get a better view of the distribution, we will eliminate these outliers.
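One simple way to remove such outliers is a boolean mask on Price. The sketch below uses a made-up frame, and the cutoff of 100 USD is an assumption chosen for illustration; the text only establishes that apps above 300 USD are extreme.

```python
import pandas as pd

# Made-up paid apps, including one extreme price
paid = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Price": [0.99, 4.99, 29.99, 399.99],
    "Rating": [4.4, 4.1, 3.9, 3.6],
})

# Assumed cutoff in USD; adjust to whatever counts as an outlier in the analysis
PRICE_CAP = 100

# Keep only the rows priced below the cap
under_cap = paid[paid["Price"] < PRICE_CAP]

print("rows before:", len(paid), "rows after:", len(under_cap))
```

The filtered frame can then be passed to the same jointplot call to see the price distribution without the stretched axis.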