1. Google Play Store apps and reviews
Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.
Let's take a look at the data, which consists of two files:
- apps.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
- user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.
# Read in dataset
import pandas as pd
# Assuming 'apps.csv' is in the 'datasets' folder of your current working directory
apps_with_duplicates = pd.read_csv('datasets/apps.csv')
# Drop duplicates from apps_with_duplicates
apps = apps_with_duplicates.drop_duplicates()
# Print the total number of apps
print('Total number of apps in the dataset = ', len(apps))
# Have a look at a random sample of 5 rows
print(apps.sample(5))
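To make the effect of drop_duplicates() concrete, here is a minimal sketch on a made-up dataframe. The names toy_apps and toy_deduped and the rows themselves are assumptions for illustration only; they are not taken from apps.csv.

```python
import pandas as pd

# Toy data (hypothetical, not from apps.csv): the second and third rows are identical
toy_apps = pd.DataFrame({
    'App': ['Notes', 'Camera', 'Camera', 'Maps'],
    'Category': ['TOOLS', 'PHOTOGRAPHY', 'PHOTOGRAPHY', 'TRAVEL'],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept)
num_duplicates = toy_apps.duplicated().sum()

# drop_duplicates() returns a new dataframe without the flagged rows
toy_deduped = toy_apps.drop_duplicates()

print('Rows before =', len(toy_apps))
print('Duplicate rows =', num_duplicates)
print('Rows after =', len(toy_deduped))
```

Running the same duplicated().sum() check on apps_with_duplicates would show how many rows the deduplication step removed.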
Conclusion: Clean data leads to trustworthy analyses, essential in a data-rich environment like the Google Play Store.
2. Data cleaning
Data cleaning is one of the most essential subtasks of any data science project. Although it can be a very tedious process, its worth should never be underestimated.
By looking at a random sample of the dataset rows (from the above task), we observe that some entries in columns like Installs and Price contain a few special characters ('+', ',' and '$') due to the way the numbers have been represented. This prevents the columns from being purely numeric, making it difficult to use them in subsequent mathematical calculations. Ideally, as their names suggest, we would want these columns to contain only digits from [0-9] (plus a decimal point in the case of Price).
Hence, we now proceed to clean our data. Specifically, the special characters ',' and '+' present in the Installs column and '$' present in the Price column need to be removed.
It is also good practice to print a summary of your dataframe after completing data cleaning. We will use the info() method to achieve this.
# List of characters to remove
chars_to_remove = ['+', ',', '$']
# List of column names to clean
cols_to_clean = ['Installs', 'Price']
# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Replace the character with an empty string
        apps[col] = apps[col].apply(lambda x: x.replace(char, ''))
# Print a summary of the apps dataframe (info() prints directly and returns None)
apps.info()
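As an alternative to the nested loop, the same three characters can be removed in one pass per value. The sketch below is an assumption-laden illustration, not the notebook's method: clean_numeric_string and REMOVE_TABLE are hypothetical names, and the sample strings are made up.

```python
# Hypothetical helper, not part of the original notebook.
# Translation table that deletes '+', ',' and '$' in a single pass.
REMOVE_TABLE = str.maketrans('', '', '+,$')

def clean_numeric_string(value):
    """Return value without '+', ',' or '$'; non-string values pass through unchanged."""
    if isinstance(value, str):
        return value.translate(REMOVE_TABLE)
    return value

# Made-up examples in the style of the Installs and Price columns
cleaned_installs = clean_numeric_string('1,000,000+')
cleaned_price = clean_numeric_string('$4.99')
print(cleaned_installs, cleaned_price)
```

Such a helper could be applied with apps[col].apply(clean_numeric_string). Because non-string values are returned unchanged, a missing value would not raise the AttributeError that calling x.replace on a float would.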
Conclusion: Precision in data formatting directly influences the feasibility of quantitative analysis.
3. Correcting data types
From the previous task we noticed that Installs and Price were categorized as the object data type (and not int or float, as we would like). This is because these two columns originally had mixed input types: digits and special characters. To learn more about pandas data types, see the pandas documentation.
The four features that we will be working with most frequently henceforth are Installs, Size, Rating and Price. While Size and Rating are both float (i.e. purely numerical data types), we still need to work on Installs and Price to make them numeric.
import numpy as np
# Convert Installs to float data type
apps['Installs'] = apps['Installs'].astype(float)
# Convert Price to float data type
apps['Price'] = apps['Price'].astype(float)
# Checking dtypes of the apps dataframe
print(apps.dtypes)
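Note that astype(float) raises a ValueError if any value in the column cannot be parsed as a number. A more forgiving option, sketched here on made-up values (raw_prices is an assumption, not data from apps.csv), is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be inspected afterwards.

```python
import pandas as pd

# Toy values (hypothetical): two clean numeric strings and one that is not a number
raw_prices = pd.Series(['0', '4.99', 'Everyone'])

# errors='coerce' replaces anything that cannot be parsed with NaN
numeric_prices = pd.to_numeric(raw_prices, errors='coerce')

print(numeric_prices.dtype)
print('Unparseable entries =', numeric_prices.isna().sum())
```

Counting the NaN values after conversion is a quick way to check whether the cleaning step missed any characters.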
Conclusion: Accurate data typing is pivotal in enabling complex, numerical computations.
4. Exploring app categories
With more than 1 billion active users in 190 countries around the world, Google Play continues to be an important distribution platform to build a global audience. For businesses to get their apps in front of users, it's important to make them more quickly and easily discoverable on Google Play. To improve the overall search experience, Google has introduced the concept of grouping apps into categories.
This brings us to the following questions:
- Which category has the highest share of (active) apps in the market?
- Is any specific category dominating the market?
- Which categories have the fewest number of apps?
We will see that there are 33 unique app categories present in our dataset. Family and Game apps have the highest market prevalence. Interestingly, Tools, Business and Medical apps are also at the top.
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
# Print the total number of unique categories
num_categories = apps['Category'].nunique()
print('Number of categories = ', num_categories)
# Count the number of apps in each 'Category'
num_apps_in_category = apps['Category'].value_counts()
# Sort num_apps_in_category in descending order based on the count of apps in each category
sorted_num_apps_in_category = num_apps_in_category.sort_values(ascending=False)
data = [go.Bar(
    x = sorted_num_apps_in_category.index, # index = category name
    y = sorted_num_apps_in_category.values, # value = count
)]
plotly.offline.iplot(data)
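The bar chart shows raw counts. To answer the question about market share directly, the counts can be expressed as percentages. This is a minimal sketch on a made-up category column; toy_categories and its values are assumptions, not the real distribution in apps.csv.

```python
import pandas as pd

# Toy category column (hypothetical values)
toy_categories = pd.Series(
    ['FAMILY', 'FAMILY', 'GAME', 'TOOLS', 'FAMILY', 'GAME', 'MEDICAL', 'TOOLS']
)

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
category_share = toy_categories.value_counts(normalize=True) * 100

# idxmax() returns the index label (category name) with the largest share
top_category = category_share.idxmax()

print(category_share.round(1))
print('Largest category =', top_category)
```

The same two calls on apps['Category'] would give each category's share of the dataset.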
Conclusion: The app market is diverse, with certain categories like Family and Games leading in popularity.
5. Distribution of app ratings
After having seen the market share for each category of apps, let's see how all these apps perform on average. App ratings (on a scale of 1 to 5) affect the discoverability and conversion of apps as well as the company's overall brand image. Ratings are a key performance indicator of an app.
From our research, we found that the average rating across all app categories is 4.17. The histogram is skewed to the left, indicating that the majority of the apps are highly rated, with only a few low-rated exceptions.
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
# Average rating of apps
avg_app_rating = apps['Rating'].mean()
print('Average app rating = ', avg_app_rating)
# Distribution of apps according to their ratings
data = [go.Histogram(
    x = apps['Rating']
)]
# Vertical dashed line to indicate the average app rating
layout = {
    'shapes': [{
        'type': 'line',
        'x0': avg_app_rating,
        'y0': 0,
        'x1': avg_app_rating,
        'y1': 1000,
        'line': {'dash': 'dashdot'}
    }]
}
plotly.offline.iplot({'data': data, 'layout': layout})
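The left skew read off the histogram can also be checked numerically. In a left-skewed distribution the mean sits below the median and the sample skewness is negative. The sketch below uses made-up ratings (toy_ratings is an assumption, not the real Rating column).

```python
import pandas as pd

# Toy ratings (hypothetical): mostly high, with two low-rated outliers
toy_ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 4.5, 2.0, 1.5])

mean_rating = toy_ratings.mean()
median_rating = toy_ratings.median()

# skew() returns the sample skewness; a negative value means a longer left tail
skewness = toy_ratings.skew()

print('Mean =', mean_rating)
print('Median =', median_rating)
print('Left-skewed =', skewness < 0)
```

Applying mean(), median() and skew() to apps['Rating'] would give the corresponding figures for the real data; all three skip missing ratings by default.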
Conclusion: The prevalence of high ratings suggests a market standard where quality is typically well-maintained.