
1. Introduction

Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular is expanding as an operating system and has captured more than 74% of the total market[1].

The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.

The dataset you will use here was scraped from Google Play Store in September 2018 and was published on Kaggle. Here are the details:

datasets/apps.csv
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
  • App: Name of the app
  • Category: Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.
  • Rating: The current average rating (out of 5) of the app on Google Play
  • Reviews: Number of user reviews given on the app
  • Size: Size of the app in MB (megabytes)
  • Installs: Number of times the app was downloaded from Google Play
  • Type: Whether the app is paid or free
  • Price: Price of the app in US$
  • Last Updated: Date on which the app was last updated on Google Play
datasets/user_reviews.csv
This file contains a random sample of 100 [most helpful first](https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/) user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
  • App: Name of the app on which the user review was provided. Matches the `App` column of the `apps.csv` file
  • Review: The pre-processed user review text
  • Sentiment Category: Sentiment category of the user review - Positive, Negative or Neutral
  • Sentiment Score: Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.

From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.

# QUESTION 1 
# Read "apps.csv" file, check it was imported 
import numpy as np
import pandas as pd

apps = pd.read_csv("datasets/apps.csv")
print(apps.head())
# Explore data (column names and types)
print(apps.columns)
apps.dtypes 
# "Installs" column is an object, we'll make sure it's a string 
print(apps["Installs"].head())
print((apps["Installs"]+"ok").head())
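# Need to remove "+" and commas in order to convert to an integer datatype
# Drop the last character (assumes every value ends in "+")
apps["Installs"] = apps.Installs.str[:-1]
# Rebuild the number from its three-digit groups, skipping the comma positions
apps["Installs"] =  apps.Installs.str[-15:-12] + apps.Installs.str[-11:-8] + \
                    apps.Installs.str[-7:-4] + apps.Installs.str[-3:]
print(apps["Installs"].head())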
# Convert apps["Installs"] to numeric
apps["Installs"] = pd.to_numeric(apps["Installs"])
apps.dtypes
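
The slicing above works for values shaped like `10,000+`, but it depends on the position of every comma and on the last character being a `+`. A minimal sketch of an alternative, assuming the raw values only ever contain digits, commas and an optional trailing `+`. The sample values and the `clean_installs` helper are made up for illustration and are not part of the notebook:

```python
import pandas as pd

def clean_installs(raw: pd.Series) -> pd.Series:
    """Strip commas and "+" signs, then convert to numbers (hypothetical helper)."""
    stripped = (
        raw.str.replace(",", "", regex=False)
           .str.replace("+", "", regex=False)
    )
    # errors="coerce" turns anything that still isn't numeric into NaN
    return pd.to_numeric(stripped, errors="coerce")

# Made-up sample in the same format as the "Installs" column
sample = pd.Series(["10,000+", "5,000,000+", "0", "1,000,000,000+"])
cleaned = clean_installs(sample)
print(cleaned)
```

Unlike dropping the last character, this leaves a value such as `0` intact instead of emptying it, so the result may differ from the slicing approach for entries that have no trailing `+`.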

## Trouble area: I can only get it to convert to a float; every attempt at int doesn't work (the column still contains a NaN, which an integer dtype can't hold)
# Look at all "Installs" values 
print(apps["Installs"].unique())                  # Need to convert the NaN to a number 
# Look at all rows where the Installs column is not filled out
view_nan_installs = apps[apps["Installs"].isna()]
print(view_nan_installs)                                # Only 1 row has NaN, at Index 8028
# Since it only returns one row, I looked it up at the app store. It has 1 million+ installs 
apps["Installs"] = apps["Installs"].fillna(1000000)

print(apps.iloc[8028])
print(apps["Installs"].unique())
# Convert Installs column to integer
apps["Installs"] = apps["Installs"].astype(int)
apps.dtypes
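
The float result in the trouble area above is expected: a NumPy-backed integer column cannot store NaN, so pandas keeps the column as float64 while a value is missing. A small sketch with made-up values shows this, along with pandas' nullable `Int64` dtype as one possible way to keep integers without filling the gap first. The values and the fill value are assumptions for illustration only:

```python
import pandas as pd

# Made-up install counts with one missing entry (stored as NaN)
installs = pd.Series([1000.0, None, 500000.0])
print(installs.dtype)  # float, because of the NaN

try:
    installs.astype(int)  # a plain integer dtype cannot represent NaN
except ValueError as err:
    print("astype(int) failed:", err)

# Option 1: nullable integer dtype keeps the gap as <NA>
as_nullable = installs.astype("Int64")
print(as_nullable)

# Option 2: fill the gap first (arbitrary fill value here), then convert
filled = installs.fillna(0).astype(int)
print(filled)
```

Filling first, as the notebook does, is reasonable when the true value is known; the nullable dtype is an option when it is not.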
# QUESTION 2 
apps.columns    
# Use the columns "Category", "Price", and "Rating"