
Google Play Store Apps Data

This dataset consists of web-scraped data on more than 10,000 Google Play Store apps and 60,000 app reviews. apps_data.csv holds data about the apps, such as category, number of installs, and price. review_data.csv holds reviews of the apps, including the text of each review and its sentiment scores. You can join the two tables on the App column.

Not sure where to begin? Scroll to the bottom to find challenges!

Source of dataset.

import pandas as pd

# Load both tables into named DataFrames so they can be reused below
apps_data = pd.read_csv("apps_data.csv")
review_data = pd.read_csv("review_data.csv")
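The join on the App column mentioned above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `join_apps_reviews` and the two tiny inline frames are made up so that the block runs without the CSV files. With the real data you would pass in the frames returned by `pd.read_csv`.

```python
import pandas as pd


def join_apps_reviews(apps: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Attach app-level columns to every review via the shared App column."""
    # how="inner" keeps only reviews whose app also appears in the apps table.
    return reviews.merge(apps, on="App", how="inner")


# Made-up rows for illustration only; they are not taken from the dataset.
apps_demo = pd.DataFrame({"App": ["A", "B"], "Category": ["GAME", "TOOLS"]})
reviews_demo = pd.DataFrame(
    {"App": ["A", "A", "C"], "Sentiment": ["Positive", "Negative", "Neutral"]}
)

joined_demo = join_apps_reviews(apps_demo, reviews_demo)
print(joined_demo)
```

If the apps table contains duplicate app names, each review is repeated once per duplicate, so it may be worth checking `apps["App"].duplicated().sum()` before merging.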

Data Dictionary

apps_data.csv

| variable | class | description |
|---|---|---|
| App | character | The application name |
| Category | character | The category the app belongs to |
| Rating | numeric | Overall user rating of the app |
| Reviews | numeric | Number of user reviews for the app |
| Size | character | The size of the app |
| Installs | character | Number of user installs for the app |
| Type | character | Either "Paid" or "Free" |
| Price | character | Price of the app |
| Content Rating | character | The age group the app is targeted at - "Children" / "Mature 21+" / "Adult" |
| Genres | character | Possibly multiple genres the app belongs to |
| Last Updated | character | The date the app was last updated |
| Current Ver | character | The current version of the app |
| Android Ver | character | The Android version needed for this app |

review_data.csv

| variable | class | description |
|---|---|---|
| App | character | The application name |
| Translated_Review | character | User review (translated to English) |
| Sentiment | character | The sentiment of the user - Positive/Negative/Neutral |
| Sentiment_Polarity | character | The sentiment polarity score |
| Sentiment_Subjectivity | character | The sentiment subjectivity score |
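
Several columns that describe quantities, such as Installs and Price, are listed as character, so they would need converting before any numeric analysis. The sketch below assumes, without confirmation from the data dictionary, that Installs looks like "10,000+" and Price like "$4.99". The helper name `to_number` and the sample values are hypothetical.

```python
import pandas as pd


def to_number(column: pd.Series) -> pd.Series:
    """Strip '$', ',' and '+' characters and convert to a numeric dtype.

    Values that still cannot be parsed become NaN instead of raising.
    """
    cleaned = column.astype(str).str.replace(r"[$,+]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values in the ASSUMED formats; inspect the real columns first.
demo = pd.DataFrame({"Installs": ["10,000+", "500+"], "Price": ["$4.99", "0"]})
demo["Installs"] = to_number(demo["Installs"])
demo["Price"] = to_number(demo["Price"])
print(demo.dtypes)
```

Using `errors="coerce"` means unexpected strings surface as missing values, which can then be counted with `isna().sum()` rather than silently breaking the conversion.
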
# Exploring the dataset
apps_data = pd.read_csv("apps_data.csv")
apps_data.info()
# Descriptive statistics on Rating column
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())

Ratings should range from 1 to 5, but the maximum rating here is 19, which is clearly an invalid data point.

# Filtering out the invalid rating; rows with a missing Rating are kept (NaN != 19.0 is True)
apps_data = apps_data[apps_data["Rating"] != 19.0].copy()
print(apps_data["Rating"].max())
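
Filtering on the single value 19 removes this one outlier, but a range check would catch any out-of-range rating. A minimal sketch, assuming valid ratings lie in the 1 to 5 range stated above: `keep_valid_ratings` is a hypothetical helper and the sample values are made up. Missing ratings are kept on purpose so they can be handled separately.

```python
import numpy as np
import pandas as pd


def keep_valid_ratings(
    df: pd.DataFrame, low: float = 1.0, high: float = 5.0
) -> pd.DataFrame:
    """Keep rows whose Rating is missing or lies within [low, high]."""
    rating = df["Rating"]
    mask = rating.isna() | rating.between(low, high)
    # .copy() returns an independent frame, so later column assignments are safe.
    return df[mask].copy()


# Made-up values for illustration: 19.0 is out of range, NaN is missing.
demo = pd.DataFrame({"Rating": [4.1, 19.0, np.nan, 3.5]})
cleaned = keep_valid_ratings(demo)
print(cleaned["Rating"].max())
```
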
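# Visualizing Rating distribution (missing ratings are excluded explicitly)
import matplotlib.pyplot as plt

plt.hist(apps_data["Rating"].dropna(), bins=25)
plt.xlabel("Rating")
plt.ylabel("Number of apps")
plt.show()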
# Checking for missing data
apps_data.isnull().sum()

More than 10% of the observations have a missing value in the "Rating" column, so dropping those rows would discard too much data. Because the distribution is left-skewed, the mean is pulled down by the low ratings, so I suggest imputing with the median instead.
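
Both claims in this paragraph can be checked numerically. The sketch below is an illustration on made-up values, not output from the dataset: the share of missing values comes from `isna().mean()`, and a negative `skew()` indicates a left-skewed distribution. `summarize_rating` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def summarize_rating(rating: pd.Series) -> dict:
    """Return the missing share (in percent), skewness, mean and median."""
    return {
        "missing_pct": rating.isna().mean() * 100,
        "skew": rating.skew(),  # NaN values are skipped by default
        "mean": rating.mean(),
        "median": rating.median(),
    }


# Made-up ratings: mostly high values with a tail of low ones, plus two NaN.
demo = pd.Series([4.5, 4.4, 4.6, 4.3, 4.5, 4.2, 2.0, 1.5, np.nan, np.nan])
summary = summarize_rating(demo)
print(summary)
```

In a left-skewed column the mean sits below the median, which is why the median is usually the more representative fill value.
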

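# Imputing missing values with the median
# (plain reassignment instead of a chained in-place fillna, which newer pandas versions warn about)
rating_median = apps_data["Rating"].median()
apps_data = apps_data.assign(Rating=apps_data["Rating"].fillna(rating_median))
apps_data.info()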
# Checking descriptives again
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())
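
When comparing the descriptives before and after imputation, keep in mind that filling with the median leaves the median unchanged but tends to lower the standard deviation, because the filled values sit at the centre of the distribution. A small sketch on made-up values (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing entries.
before = pd.Series([3.0, 4.0, 4.5, 5.0, np.nan, np.nan])
after = before.fillna(before.median())

# Side-by-side descriptives; describe() ignores NaN in the "before" column.
comparison = pd.DataFrame(
    {"before": before.describe(), "after": after.describe()}
)
print(comparison)
```
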