1. Google Play Store Apps Data
This dataset consists of web-scraped data on more than 10,000 Google Play Store apps and 60,000 app reviews. apps_data.csv contains data about the apps, such as category, number of installs, and price. review_data.csv holds reviews of the apps, including the text of each review and its sentiment scores.
Background
You are working for an app developer who is brainstorming a new app. They want to ensure that their next app earns a high rating on the app store, as this can lead to the app being featured on the store homepage. They would like you to analyze what factors increase the rating an app will receive. They would also like to know what impact reviews have on the final score.
Import Libraries
# Import data manipulation tool
import pandas as pd
import numpy as np
# Import visualisation tool
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as px
import plotly.graph_objects as go
import plotly
import matplotlib.pyplot as plt
Read in data sets
apps_with_duplicate = pd.read_csv("apps_data.csv")
# Drop duplicate rows; copy so later column assignments modify an independent dataframe
apps = apps_with_duplicate.drop_duplicates().copy()
apps.sample(5)
review_dat = pd.read_csv('review_data.csv')
review_dat.head()
2. Cleaning the data
Data cleaning is an essential task in any data science project. Though it can be tedious and time-consuming, its value should never be underestimated.
By looking at a sample of the data set, we can see that some entries in columns like Price and Installs contain non-numeric characters. This prevents us from performing numeric operations on these columns. Ideally, Price and Installs should contain only the digits 0-9 (plus a decimal point for prices). We will proceed to eliminate these characters from the columns mentioned above.
# Characters that prevent the columns from being parsed as numbers
char_to_remove = ['+', ',', '$', 'M']
col_to_clean = ['Price', 'Installs', 'Reviews']
for col in col_to_clean:
    # Loop over each char in char_to_remove
    for char in char_to_remove:
        # Replace the character with an empty string (literal match, not a regex)
        apps[col] = apps[col].str.replace(char, '', regex=False)
# Print a summary of the apps dataframe (info() prints directly and returns None)
apps.info()
Data cleaning (II)
Though we have filtered the Price and Installs columns to remove non-numeric characters, further analysis reveals that the columns contain entries one might not expect, such as "Free" and "Everyone". One way to search a data frame for odd entries that do not correspond with the rest of the entries in a column is to use the unique method.
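As a minimal sketch of that check, the snippet below calls unique on each column and keeps only the values that cannot be parsed as numbers. The toy_apps dataframe and the is_number helper are hypothetical stand-ins for illustration; they are not part of the original notebook, and the values are not taken from apps_data.csv.

```python
import pandas as pd

# Toy stand-in for the apps dataframe (illustrative values only)
toy_apps = pd.DataFrame({
    "Price": ["0", "4.99", "Everyone", "0"],
    "Installs": ["1000", "Free", "500000", "1000"],
})

def is_number(text):
    """Return True if the value can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# unique() lists each distinct entry once, which makes odd values easy to spot
odd_entries = {
    col: [value for value in toy_apps[col].unique() if not is_number(value)]
    for col in ["Price", "Installs"]
}
print(odd_entries)
```

On the real data, the same loop could be pointed at apps instead of toy_apps to list every entry that still blocks a numeric conversion.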
target = ['Free', 'Everyone']
col_to_clean = ['Price', 'Installs']
for col in col_to_clean:
    for char in target:
        # Replace the unexpected text entry with '0' so the column can be cast to a number
        apps[col] = apps[col].apply(lambda x: x.replace(char, '0'))
apps.sample(8)
3. Correct The Data Type
From the previous task we noticed that Installs and Price were stored as the object data type, not as int or float as we would like. This is because these two columns originally had mixed input types: digits and special characters.
The features that we will be focusing on the most are Installs, Rating and Price. We still need to work on Installs and Price to make them numeric.
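A more defensive variant of this conversion, sketched here under the assumption that some stray text may still be present, is pd.to_numeric with errors="coerce". It turns anything unparseable into NaN instead of raising an error, so the leftover rows can be counted and inspected. The toy_apps dataframe is a hypothetical stand-in, not the real data.

```python
import pandas as pd

# Toy stand-in for the apps dataframe (illustrative values only)
toy_apps = pd.DataFrame({
    "Installs": ["1000", "500000", "Free"],
    "Price": ["0", "4.99", "Everyone"],
})

converted = toy_apps.copy()
for col in ["Installs", "Price"]:
    # Unparseable strings become NaN rather than stopping the conversion
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
# Number of entries per column that could not be converted
print(converted.isna().sum())
```

Unlike astype(float), this does not fail on the first bad value, at the cost of silently producing missing values that then need to be handled.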
# Convert Installs to float
apps['Installs'] = apps['Installs'].astype(float)
# Convert Reviews to float
apps['Reviews'] = apps['Reviews'].astype(float)
# Convert Price to float data type
apps['Price'] = apps['Price'].astype(float)
# Checking dtypes of the apps dataframe
print(apps.dtypes)
4. Exploring App Categories
With more than 1 billion active users in 190 countries around the world, Google Play continues to be an important distribution platform to build a global audience. For businesses to get their apps in front of users, it's important to make them more quickly and easily discoverable on Google Play. To improve the overall search experience, Google has introduced the concept of grouping apps into categories.
This brings us to the following questions:
- Which category has the highest share of (active) apps in the market?
- Is any specific category dominating the market?
- Which categories have the fewest number of apps?
- What factors contribute to an app receiving a high content rating?
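The first three questions can be approached with a single value_counts call. The sketch below uses a hypothetical toy_apps dataframe (illustrative values, not the real data) and normalize=True to express each category as a percentage of all apps; the head and the tail of the result give the largest and smallest categories.

```python
import pandas as pd

# Toy stand-in for the apps dataframe (illustrative values only)
toy_apps = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "TOOLS"],
})

# value_counts sorts from most to least frequent; normalize=True returns fractions
share = toy_apps["Category"].value_counts(normalize=True).mul(100).round(1)

print(share.head(2))  # categories with the largest share of apps
print(share.tail(1))  # categories with the fewest apps
```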
# Filter for the 10 categories with the highest share
num_app_catagor = apps['Category'].value_counts().head(10)
# Sort ascending so the largest category is drawn at the top of the horizontal bar chart
num_app_catagor = num_app_catagor.sort_values(ascending=True)
# Plot the categories with the highest share
fig = px.bar(num_app_catagor, y=num_app_catagor.index, x=num_app_catagor.values, title="Share of Each Category", width=828, height=600, labels={'index': "Categories", 'x': 'Share'})
fig.show()
What determines which apps have the most reviews?
Though apps in the Family category are the most abundant on the Play Store, apps in the Game category have the most reviews. People tend to interact more with these apps than with any others. Given recent trends in online gaming and gaming tournaments, it's no surprise that these apps lead in installs, ratings, and reviews. On that basis, gaming looks like the most promising market for a software developer looking to deploy on the Google Play Store, followed by social media and communication apps.
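One way to check a comparison like this is to aggregate per category. The following is a minimal sketch on a hypothetical toy_apps dataframe (illustrative values, not results from apps_data.csv); it assumes the Reviews, Installs, and Rating columns are already numeric.

```python
import pandas as pd

# Toy stand-in for the cleaned apps dataframe (illustrative values only)
toy_apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "FAMILY", "FAMILY", "SOCIAL"],
    "Reviews": [1000.0, 3000.0, 100.0, 200.0, 300.0, 1500.0],
    "Installs": [50000.0, 100000.0, 1000.0, 5000.0, 10000.0, 50000.0],
    "Rating": [4.5, 4.1, 4.0, 3.8, 4.2, 4.3],
})

# One row per category: app count, total reviews and installs, mean rating
summary = (
    toy_apps.groupby("Category")
    .agg(
        apps=("Rating", "size"),
        total_reviews=("Reviews", "sum"),
        total_installs=("Installs", "sum"),
        mean_rating=("Rating", "mean"),
    )
    .sort_values("total_reviews", ascending=False)
)
print(summary)
```

Sorting by total_reviews puts the most-reviewed category first, while the apps column shows whether the most abundant category is also the most reviewed.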