Investigating the Android App Market on Google Play
This is my third project. I want to showcase my skills in exploratory data analysis (EDA) using Python's data science tools.
The main objective is to explore the Google Play Store apps and reviews comprehensively.
The business context is the following:
Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed.
Therefore, I will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories.
I want to draw insights from the data to devise strategies that drive growth and retention.
The data consist of two files:
- apps.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
- user_reviews.csv: contains 100 reviews for each app, most helpful first.
The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.
These three new features will help with a quick sentiment analysis of each app through its reviews.
Let's get started!
1. Google Play Store apps and reviews
Let's take a look at the data!
# Read in dataset
import pandas as pd
apps_with_duplicates = pd.read_csv('datasets/apps.csv')
# Drop duplicates
apps = apps_with_duplicates.drop_duplicates()
# Print the total number of apps
print('Total number of apps in the dataset = ', apps.shape[0])
# Print a concise summary of the apps dataframe (info() prints directly and returns None)
apps.info()
# Have a look at a random sample of n rows
n = 15
apps.sample(n)
2. Data cleaning
The four features that I will be working with most frequently henceforth are Installs, Size, Rating and Price.
The info() method (from the previous step) told me that the Installs and Price columns are of type object and not int64 or float64 as I would expect.
This is because these columns contain characters other than the digits 0-9.
Ideally, I would want these columns to be numeric, as their names suggest.
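To see which characters stand in the way of a numeric conversion, a quick check like the one below can help. This is only a minimal sketch: the toy DataFrame and the helper non_numeric_chars are made up for illustration and are not part of the dataset or of pandas.

```python
import pandas as pd

# Toy stand-in for the real columns; these values are invented for illustration.
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

def non_numeric_chars(series):
    """Hypothetical helper: sorted characters that are neither digits nor a decimal point."""
    chars = set("".join(series.astype(str)))
    return sorted(c for c in chars if not (c.isdigit() or c == "."))

# Map each column name to the characters that would block a numeric conversion
offending = {col: non_numeric_chars(toy[col]) for col in toy.columns}
print(offending)
```

Running the same helper on the real Installs and Price columns should list the special characters that need to be stripped.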
Hence, I now proceed to clean the data and prepare it for my analysis later.
Specifically, the presence of special characters (, $ +) in the Installs and Price columns makes their conversion to a numerical data type difficult.
I get rid of them and convert the object columns to numeric types:
# List of characters to remove
chars_to_remove = ["+", ",", "$"]
# List of column names to clean
cols_to_clean = ["Installs", "Price"]
# Loop for each column
for col in cols_to_clean:
    # Replace each character with an empty string
    # (regex=False treats "+" and "$" as literal characters, not regex syntax)
    for char in chars_to_remove:
        apps[col] = apps[col].astype(str).str.replace(char, '', regex=False)
    # Convert col to numeric
    apps[col] = pd.to_numeric(apps[col])
# Check the cleaned dataset
apps.sample(15)
3. Exploring app categories
With more than 1 billion active users in 190 countries around the world, Google Play continues to be an important distribution platform to build a global audience.
For businesses to get their apps in front of users, it's important to make them more quickly and easily discoverable on Google Play.
To improve the overall search experience, Google has introduced the concept of grouping apps into categories.
This brings me to the following questions:
- Which category has the highest share of (active) apps in the market?
- Is any specific category dominating the market?
- Which categories have the fewest apps?
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
# Print the total number of unique categories
num_categories = len(apps['Category'].unique())
print('Number of categories = ', num_categories)
# Count the number of apps in each 'Category' and sort them in descending order
num_apps_in_category = apps['Category'].value_counts().sort_values(ascending = False)
data = [go.Bar(
    x = num_apps_in_category.index, # index = category name
    y = num_apps_in_category.values, # value = count
)]
plotly.offline.iplot(data)
We see that there are 33 unique app categories present in the dataset.
Family and Game apps have the highest market prevalence.
Interestingly, Tools, Business and Medical apps are also at the top.
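To answer the question about market share more directly, the counts can be turned into percentages. The snippet below is a small sketch on made-up category labels (toy_categories is a hypothetical name, and the values are not taken from apps.csv); the same call should work on apps['Category'].

```python
import pandas as pd

# Made-up category labels, only to illustrate the calculation.
toy_categories = pd.Series(
    ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "TOOLS", "BUSINESS", "MEDICAL"],
    name="Category",
)

# normalize=True returns fractions of the total instead of raw counts
share = toy_categories.value_counts(normalize=True) * 100

# Category with the largest share of apps
top_category = share.idxmax()

print(share.round(1))
print("Largest category:", top_category)
```

Looking at shares rather than raw counts makes it easier to judge whether a single category really dominates the market.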
4. Distribution of app ratings
Having seen the market share of each app category, let's see how all these apps perform on average.
App ratings (on a scale of 1 to 5) impact the discoverability and conversion of apps, as well as the company's overall brand image.
Ratings are a key performance indicator of an app.
# Average rating of apps
avg_app_rating = apps['Rating'].mean()
print('Average app rating = ', avg_app_rating)
# Distribution of apps according to their ratings
data = [go.Histogram(
    x = apps['Rating']
)]
# Vertical dashed line to indicate the average app rating
layout = {'shapes': [{
    'type': 'line',
    'x0': avg_app_rating,
    'y0': 0,
    'x1': avg_app_rating,
    'y1': 1000,
    'line': {'dash': 'dashdot'}
}]
}
plotly.offline.iplot({'data': data, 'layout': layout})
From my analysis, I found that the average rating across all apps is 4.17.
The histogram is skewed to the left, indicating that the majority of apps are highly rated, with only a few low-rated exceptions.
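The left skew can also be checked numerically rather than only by eye. Here is a minimal sketch on invented ratings (toy_ratings is a hypothetical stand-in for apps['Rating']): a negative skewness and a mean below the median both point to a long tail of low ratings.

```python
import pandas as pd

# Invented ratings: mostly high, with a couple of low outliers.
toy_ratings = pd.Series([4.5, 4.4, 4.3, 4.6, 4.2, 4.7, 4.1, 4.5, 2.0, 1.5])

mean_rating = toy_ratings.mean()
median_rating = toy_ratings.median()

# Sample skewness; a negative value indicates a longer tail on the left
skewness = toy_ratings.skew()

print("mean =", round(mean_rating, 2))
print("median =", round(median_rating, 2))
print("skewness =", round(skewness, 2))
```

By default these pandas methods skip missing values, so apps without a rating would not need to be removed first.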
5. Size and price of an app
Let's now examine app size and app price.
For size, if the mobile app is too large, it may be difficult and/or expensive for users to download. Lengthy download times could turn users off before they even experience your mobile app. Plus, each user's device has a finite amount of disk space.
For price, some users expect their apps to be free or inexpensive.
These problems compound if the developing world is part of your target market, especially because of internet speeds, earning power and exchange rates.
How can we effectively come up with strategies to size and price our app?
- Does the size of an app affect its rating?
- Do users really care about system-heavy apps or do they prefer lightweight apps?
- Does the price of an app affect its rating?
- Do users always prefer free apps over paid apps?
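One possible first pass at these questions is to look at simple correlations and at the mean rating of free versus paid apps. The sketch below uses a made-up table; the values, the assumption that Size is in megabytes, and the is_paid column are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up apps; Size is assumed to be in MB and Price in dollars.
toy = pd.DataFrame({
    "Size": [5.0, 12.0, 25.0, 48.0, 80.0, 3.0],
    "Price": [0.0, 0.0, 1.99, 0.0, 4.99, 0.99],
    "Rating": [4.1, 4.3, 4.4, 4.0, 4.6, 3.9],
})

# Pearson correlation of rating with size and with price
size_rating_corr = toy["Size"].corr(toy["Rating"])
price_rating_corr = toy["Price"].corr(toy["Rating"])

# Hypothetical flag separating paid apps from free ones
toy["is_paid"] = toy["Price"] > 0
mean_rating_by_paid = toy.groupby("is_paid")["Rating"].mean()

print("Size vs Rating correlation:", round(size_rating_corr, 2))
print("Price vs Rating correlation:", round(price_rating_corr, 2))
print(mean_rating_by_paid)
```

A correlation only describes a linear association, so it would hint at a relationship between size or price and rating without showing that one causes the other.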