Google Play Store Apps Data
This dataset consists of web-scraped data on more than 10,000 Google Play Store apps and 60,000 app reviews. apps_data.csv contains app-level data such as category, number of installs, and price. review_data.csv holds the reviews of the apps, including the review text and sentiment scores. You can join the two tables on the App column.
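As a rough illustration of the join on the App column, here is a minimal sketch. The two small DataFrames are made-up stand-ins for apps_data.csv and review_data.csv (only a few of the real columns are included), and the choice of a left join is an assumption about what you want to keep, not something the dataset prescribes.

```python
import pandas as pd

# Toy stand-ins for apps_data.csv and review_data.csv (values are made up).
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["TOOLS", "GAME", "TOOLS"],
    "Rating": [4.1, 3.8, 4.6],
})
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Sentiment": ["Positive", "Negative", "Positive"],
})

# A left join keeps every app, including apps with no reviews ("Gamma" here),
# and repeats the app-level columns once per matching review.
joined = apps.merge(reviews, on="App", how="left")
print(joined)
```

Because one app can have many reviews, the joined table has one row per review, so app-level columns such as Rating should be aggregated with care after the join.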
Not sure where to begin? Scroll to the bottom to find challenges!
Source of dataset.
import pandas as pd
pd.read_csv("apps_data.csv")
pd.read_csv("review_data.csv")

Data Dictionary
apps_data.csv
| variable | class | description |
|---|---|---|
| App | character | The application name |
| Category | character | The category the app belongs to |
| Rating | numeric | Overall user rating of the app |
| Reviews | numeric | Number of user reviews for the app |
| Size | character | The size of the app |
| Installs | character | Number of user installs for the app |
| Type | character | Either "Paid" or "Free" |
| Price | character | Price of the app |
| Content Rating | character | The age group the app is targeted at - "Children" / "Mature 21+" / "Adult" |
| Genres | character | Possibly multiple genres the app belongs to |
| Last Updated | character | The date the app was last updated |
| Current Ver | character | The current version of the app |
| Android Ver | character | The Android version needed for this app |
review_data.csv
| variable | class | description |
|---|---|---|
| App | character | The application name |
| Translated_Review | character | User review (translated to English) |
| Sentiment | character | The sentiment of the user - Positive/Negative/Neutral |
| Sentiment_Polarity | character | The sentiment polarity score |
| Sentiment_Subjectivity | character | The sentiment subjectivity score |
# Exploring the dataset
apps_data = pd.read_csv("apps_data.csv")
apps_data.info()

# Descriptive statistics on Rating column
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())

Ratings should range from 1 to 5, but the maximum rating here is 19, which is clearly an erroneous data point.
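Since the valid range is known, a range check is one way to catch any impossible rating, not only the value 19. The sketch below uses a made-up toy DataFrame rather than the real file, and it treats missing ratings as a separate case so they are not dropped at this stage. It is an alternative to filtering on a single value, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up): one impossible value and one missing value.
ratings = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, 19.0, np.nan, 1.0],
})

# between() is inclusive on both ends; NaN evaluates to False, so missing
# ratings are tracked separately instead of being counted as out of range.
in_range = ratings["Rating"].between(1, 5)
is_missing = ratings["Rating"].isna()

out_of_range = ratings[~in_range & ~is_missing]  # rows to inspect or drop
cleaned = ratings[in_range | is_missing]         # valid or still-missing rows

print(out_of_range)
print(cleaned["Rating"].max())
```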
# Filtering false data
apps_data = apps_data[apps_data["Rating"] != 19.0]
print(apps_data["Rating"].max())

# Visualizing Rating distribution
import matplotlib.pyplot as plt
plt.hist(apps_data["Rating"], bins=25)

# Checking for missing data
apps_data.isnull().sum()

More than 10% of the observations are missing a "Rating" value, so dropping those rows would discard too much data. Because the distribution is left-skewed, a few low ratings pull the mean down, which makes the median the more representative fill value. I therefore suggest imputing with the median.
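The reasoning behind median imputation can be checked numerically. The sketch below uses a small made-up Series, not the real Rating column, so the numbers are purely illustrative. It computes the share of missing values, the skewness, and the gap between mean and median before filling.

```python
import numpy as np
import pandas as pd

# Toy left-skewed ratings (made up) with two missing values.
rating = pd.Series(
    [4.5, 4.4, 4.3, 4.6, 4.2, 4.5, 2.0, 1.5, np.nan, np.nan],
    name="Rating",
)

missing_share = rating.isna().mean()  # fraction of rows with no rating
skewness = rating.skew()              # negative values indicate left skew
mean, median = rating.mean(), rating.median()

print(f"missing share: {missing_share:.0%}")
print(f"skewness: {skewness:.2f}")
print(f"mean: {mean:.2f}, median: {median:.2f}")

# With left skew the mean sits below the median, so filling with the
# median keeps imputed values closer to the bulk of the ratings.
filled = rating.fillna(median)
```

Note that any single-value imputation shrinks the variance of the column, so the descriptive statistics after filling will look tighter than the observed ratings alone.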
# Imputing missing values
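# Assign the result back: fillna(..., inplace=True) on a selected column is
# deprecated in recent pandas and may leave apps_data unchanged
apps_data["Rating"] = apps_data["Rating"].fillna(apps_data["Rating"].median())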
apps_data.info()

# Checking descriptives again
print(apps_data["Rating"].describe())
print(apps_data["Rating"].median())
print(apps_data["Rating"].mode())