1. Introduction
Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular keeps expanding as an operating system and has captured more than 74% of the total market[1].
The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.
The dataset you will use here was scraped from the Google Play Store in September 2018 and was published on Kaggle. It comes as two files. Here are the details:
`datasets/apps.csv`
- App: Name of the app
- Category: Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.
- Rating: The current average rating (out of 5) of the app on Google Play
- Reviews: Number of user reviews given on the app
- Size: Size of the app in MB (megabytes)
- Installs: Number of times the app was downloaded from Google Play
- Type: Whether the app is paid or free
- Price: Price of the app in US$
- Last Updated: Date on which the app was last updated on Google Play
`datasets/user_reviews.csv`
- App: Name of the app on which the user review was provided. Matches the `App` column of the `apps.csv` file
- Review: The pre-processed user review text
- Sentiment Category: Sentiment category of the user review - Positive, Negative or Neutral
- Sentiment Score: Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.
From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.
# Import Statements
import pandas as pd
# Read in the apps.csv file
apps = pd.read_csv('datasets/apps.csv')
apps.info()
apps.head()
# Create a list of characters to remove
chars_to_remove = [',', '+']
# Remove characters in list from all rows of the Installs column
for char in chars_to_remove:
    apps['Installs'] = apps['Installs'].apply(lambda x: x.replace(char, ''))
# Explicitly cast the installs column as int type
apps['Installs'] = apps['Installs'].astype('int')
# Show info and first rows of apps dataframe with cleaned installs column
apps.info()
apps.head()
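The character-stripping loop above can also be written as one vectorized call. This is only a minimal sketch, assuming the `Installs` values are strings such as `'10,000+'`; the `toy` frame below is made up for illustration and is not taken from `apps.csv`.

```python
import pandas as pd

# Hypothetical sample standing in for the Installs column of apps.csv
toy = pd.DataFrame({'Installs': ['10,000+', '500,000+', '1,000,000+', '0']})

# Strip every comma and plus sign in one pass with a regex character class
cleaned = toy['Installs'].str.replace(r'[,+]', '', regex=True)

# Cast the cleaned strings to integers
toy['Installs'] = cleaned.astype('int64')

print(toy['Installs'].tolist())  # prints [10000, 500000, 1000000, 0]
```

Both versions should give the same result; the regex form simply avoids one pass over the column per character.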
# Group the apps df by the Category column and aggregate three values per category:
# the count of apps, the average price, and the average rating
# Assign the result to a new df, app_category_info
app_category_info = apps.groupby('Category').agg({'App': 'count', 'Price': 'mean', 'Rating': 'mean'})
# Rename the columns in app_category_info to be more readable and intuitive
app_category_info = app_category_info.rename(columns={'App': 'Number of apps', 'Price': 'Average price', 'Rating': 'Average rating'})
app_category_info
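As an alternative to aggregating and then renaming, pandas named aggregation (available since pandas 0.25) can set the readable column names inside the `agg` call itself. The sketch below runs on a handful of made-up rows, so the `toy` frame and its values are assumptions and not part of the dataset.

```python
import pandas as pd

# Hypothetical rows; the real values come from apps.csv
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Category': ['FINANCE', 'FINANCE', 'COMICS', 'COMICS'],
    'Price': [0.0, 2.0, 0.0, 0.0],
    'Rating': [4.0, 5.0, 3.5, None],  # None becomes NaN, as with unrated apps
})

# Column names containing spaces are passed by unpacking a dict of
# output_name -> (input_column, aggregation_function)
summary = toy.groupby('Category').agg(**{
    'Number of apps': ('App', 'count'),
    'Average price': ('Price', 'mean'),
    'Average rating': ('Rating', 'mean'),
})

print(summary)
```

Note that `mean` skips missing ratings, so a category average is based only on the apps that actually have a rating.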
# Read in user_reviews.csv file
user_reviews = pd.read_csv('datasets/user_reviews.csv')
# Display information about and first rows of user_reviews
# Note that the Sentiment Score column has NaN values
user_reviews.info()
user_reviews.head()
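Because the sentiment column contains missing values, it can help to count them and to see how they affect an average. A small sketch on invented reviews follows; `toy_reviews` and its contents are hypothetical and only mirror the column names described earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews; one row has no sentiment score
toy_reviews = pd.DataFrame({
    'App': ['A', 'A', 'B', 'B'],
    'Review': ['great', None, 'bad', 'fine'],
    'Sentiment Score': [0.8, np.nan, -0.5, 0.1],
})

# Count the rows where the score is missing
missing = toy_reviews['Sentiment Score'].isna().sum()

# The mean ignores NaN by default, so app A is averaged over one review only
mean_by_app = toy_reviews.groupby('App')['Sentiment Score'].mean()

print(missing)
print(mean_by_app)
```

Since NaN rows are skipped rather than treated as zero, they do not drag an app's average toward neutral.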
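# Select the apps in the FINANCE category
finance_apps = apps[apps.Category == 'FINANCE']
finance_apps.head()
# Keep only the free finance apps
free_finance_apps = finance_apps[finance_apps.Type == 'Free']
free_finance_apps.head()
# Join the free finance apps with their user reviews on the shared App column
free_finance_apps_user_reviews = pd.merge(free_finance_apps, user_reviews, on='App')
# Average the sentiment score per app and sort from highest to lowest
free_finance_apps_user_reviews = free_finance_apps_user_reviews.groupby('App').agg({'Sentiment Score': 'mean'}).sort_values('Sentiment Score', ascending=False)
# Take the 10 apps with the highest average sentiment score
top_10_user_feedback = free_finance_apps_user_reviews[:10]
top_10_user_feedback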
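One detail of the merge step is worth keeping in mind: `pd.merge` performs an inner join by default, so apps without any review drop out of the ranking. The sketch below illustrates this on invented data; `toy_apps`, `toy_reviews`, and their values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical apps: B is free but has no reviews, C has reviews but is paid
toy_apps = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Category': ['FINANCE', 'FINANCE', 'FINANCE'],
    'Type': ['Free', 'Free', 'Paid'],
})
toy_reviews = pd.DataFrame({
    'App': ['A', 'A', 'C'],
    'Sentiment Score': [0.5, 0.1, 0.9],
})

# Keep the free apps, then inner-join them with the reviews
free = toy_apps[toy_apps['Type'] == 'Free']
merged = pd.merge(free, toy_reviews, on='App', how='inner')

# Average sentiment per app, highest first
ranking = (
    merged.groupby('App')['Sentiment Score']
    .mean()
    .sort_values(ascending=False)
)

print(ranking)
```

Only app A survives here: B is lost in the join because it has no reviews, and C was filtered out before the merge because it is paid.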