Google Play Store Apps Data

This dataset consists of web-scraped data on more than 10,000 Google Play Store apps and 60,000 app reviews. apps_data.csv contains data about the apps, such as category, number of installs, and price. review_data.csv holds reviews of the apps, including the text of each review and its sentiment scores. You can join the two tables on the App column.
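A minimal sketch of that join, using two tiny made-up tables in place of the real files (the rows are hypothetical; only the column names come from the data dictionary). The `validate` argument is an optional safety check I added, not something the dataset requires.

```python
import pandas as pd

# Toy stand-ins for apps_data.csv and review_data.csv (values are made up).
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["TOOLS", "GAME", "FAMILY"],
    "Rating": [4.1, 3.8, 4.5],
})
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})

# Inner join: keeps only apps that have at least one review.
# validate="one_to_many" raises a MergeError if an app name repeats in `apps`.
merged = apps.merge(reviews, on="App", how="inner", validate="one_to_many")
print(merged)
```

If the same app name appears more than once in the apps table, the validation fails. In that case, either drop the duplicates first or remove the `validate` argument and accept that each review is repeated once per duplicate.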

Not sure where to begin? Scroll to the bottom to find challenges!

Source of dataset.

import pandas as pd

pd.read_csv("apps_data.csv")

pd.read_csv('review_data.csv')

Data Dictionary

apps_data.csv

variable	class	description
App	character	The application name
Category	character	The category the app belongs to
Rating	numeric	Overall user rating of the app
Reviews	numeric	Number of user reviews for the app
Size	character	The size of the app
Installs	character	Number of user installs for the app
Type	character	Either "Paid" or "Free"
Price	character	Price of the app
Content Rating	character	The age group the app is targeted at - "Children" / "Mature 21+" / "Adult"
Genres	character	Possibly multiple genres the app belongs to
Last Updated	character	The date the app was last updated
Current Ver	character	The current version of the app
Android Ver	character	The Android version needed for this app
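Installs and Price are stored as character columns, so they need converting before any numeric analysis. The sketch below assumes values formatted like "10,000+" and "$4.99". That format is my assumption, not something the data dictionary states, so inspect the actual values first. The `_num` column names are hypothetical.

```python
import pandas as pd

# Made-up rows; the "10,000+" and "$4.99" formats are assumptions.
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Strip thousands separators and the trailing "+", then convert.
toy["Installs_num"] = pd.to_numeric(
    toy["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
)
# Strip the currency symbol, then convert.
toy["Price_num"] = pd.to_numeric(
    toy["Price"].str.replace("$", "", regex=False), errors="coerce"
)
print(toy.dtypes)
```

With `errors="coerce"`, any value that does not parse becomes NaN instead of raising an error, so a malformed row shows up in the missing-data check rather than stopping the conversion.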

review_data.csv

variable	class	description
App	character	The application name
Translated_Review	character	User review (translated to English)
Sentiment	character	The sentiment of the user - Positive/Negative/Neutral
Sentiment_Polarity	character	The sentiment polarity score
Sentiment_Subjectivity	character	The sentiment subjectivity score

# Exploring the dataset
apps_data = pd.read_csv("apps_data.csv")
apps_data.info()

Data preparation

https://www.kdnuggets.com/publications/sheets/Data_Cleaning_with_Python_Cheat_Sheet_Anello.pdf

# Descriptive statistics on Rating column
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())

Ratings should range from 1 to 5, but the maximum rating here is 19, which is clearly an invalid data point.

# Filtering false data
apps_data = apps_data[apps_data["Rating"] != 19.0]
print(apps_data["Rating"].max())
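A more general variant of this filter keeps every rating inside the valid range instead of excluding the single value 19. This is a sketch on made-up ratings, not the cell I actually ran. It keeps missing ratings on purpose, because those are imputed later.

```python
import pandas as pd

# Toy ratings (made up): one out-of-range value and one missing value.
toy = pd.DataFrame({"Rating": [4.5, 19.0, None, 1.0, 5.0]})

# Keep rows whose rating is within [1, 5] or missing;
# missing ratings are handled separately by imputation.
in_range = toy["Rating"].between(1, 5)
cleaned = toy[in_range | toy["Rating"].isna()]
print(cleaned["Rating"].max())
```

`between` is inclusive on both ends by default and returns False for NaN, which is why the `isna()` mask is added explicitly.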

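# Visualizing Rating distribution
import matplotlib.pyplot as plt

# Drop missing ratings explicitly so they do not affect the histogram
plt.hist(apps_data["Rating"].dropna(), bins=25)
plt.xlabel("Rating")
plt.ylabel("Number of apps")
plt.show()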

# Checking for missing data
apps_data.isnull().sum()

Missing values make up more than 10% of the "Rating" column, so dropping those rows would discard a substantial share of the data. Because the distribution is left-skewed, the median is a more robust fill value than the mean, so I impute with the median.
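Both claims can be checked directly: the share of missing values, and the direction of the skew. The sketch below uses a small made-up series instead of the real column, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy ratings (made up) with a long left tail and two missing values.
ratings = pd.Series(
    [4.5, 4.4, 4.6, 4.3, 4.7, 4.5, 2.0, 1.5, None, None], name="Rating"
)

missing_share = ratings.isna().mean()  # fraction of missing values
skewness = ratings.skew()              # negative means left-skewed
print(f"missing: {missing_share:.1%}, skew: {skewness:.2f}")
print(f"mean: {ratings.mean():.2f}, median: {ratings.median():.2f}")
```

In a left-skewed distribution the low outliers pull the mean below the median, so filling with the mean would place the imputed values lower than a typical rating.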

# Imputing missing values
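# Assign the result back: calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions
apps_data["Rating"] = apps_data["Rating"].fillna(apps_data["Rating"].median())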
apps_data.info()

# Checking descriptives again
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())

