As a Data Scientist at a mobile app company, you regularly apply product analytics to better understand user behavior, uncover patterns, and surface the insights that separate the great features from the not-so-great ones. Recently, the number of negative reviews on Google Play has increased, and as a consequence the app's rating has been falling. The team has asked you to analyze the situation and make sense of the negative reviews.
It's up to you to apply K-means clustering from scikit-learn, along with NLP techniques from NLTK, to sort the text of negative Google Play Store reviews into categories!
The Data
You have been given a dataset containing a sample of Google Play Store reviews and their respective scores (from 1 to 5). A summary and preview are provided below.
reviews.csv
| Column | Description |
|---|---|
| 'content' | Content (text) of each review. |
| 'score' | Score assigned to the review by the user as an integer (from 1 to 5). |
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

# Keep only the text of negative reviews (scores of 1 or 2)
negative_reviews_text = reviews[reviews['score'].isin([1, 2])]['content']

# Set up English stop words
STOP_WORDS = set(stopwords.words('english'))
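# STOP_WORDS now holds NLTK's English stop-word list (roughly 180 common words
# such as "the", "and", "is"); storing it as a set makes membership checks O(1).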
def preprocess_review(review_text):
    """
    Tokenizes and cleans review text by removing non-alphabetic characters and stop words.

    Parameters:
        review_text (str): The text of the review to preprocess.

    Returns:
        str: The cleaned, tokenized review as a single string.
    """
    # Tokenize the text
    tokens = word_tokenize(review_text)
    # Clean tokens: keep alphabetic words, convert to lowercase, and remove stop words
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in STOP_WORDS]
    return ' '.join(filtered_tokens)
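# Quick sanity check of the cleaning step on an illustrative review
# (the sample text is made up, not taken from the dataset):
sample_review = "This app keeps crashing after the latest update!!!"
print(preprocess_review(sample_review))  # -> app keeps crashing latest update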
# Apply preprocessing to each review and store the results in a `preprocessed_reviews` DataFrame
preprocessed_reviews = pd.DataFrame({
    'cleaned_content': negative_reviews_text.apply(preprocess_review)
})
# Instantiate the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the 'cleaned_content' column to get the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(preprocessed_reviews['cleaned_content'])
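# The result is a sparse document-term matrix: one row per negative review,
# one column per vocabulary term. A quick shape check (illustrative):
print(tfidf_matrix.shape)  # (number of negative reviews, vocabulary size)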
# Set the number of clusters
n_clusters = 5
# Instantiate and fit the KMeans model
clust_kmeans = KMeans(n_clusters=n_clusters, random_state=42)
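# Note: the default for KMeans's `n_init` changed to "auto" in scikit-learn 1.4;
# on versions 1.2/1.3 you may see a FutureWarning unless you set it explicitly,
# e.g. KMeans(n_clusters=n_clusters, n_init=10, random_state=42).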
# Predict the cluster for each review
predicted_labels = clust_kmeans.fit_predict(tfidf_matrix)
# Convert predicted labels to a list and store in `categories`
categories = predicted_labels.tolist()
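# Optionally, keep each review's cluster label next to its cleaned text so
# individual reviews can be inspected per category (this assumes the
# `preprocessed_reviews` DataFrame built above):
preprocessed_reviews['category'] = categories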
# Initialize a list to store the top term information for each cluster
topic_terms_list = []
# Loop through each unique cluster label
for cluster_label in set(categories):
    # Get indices of reviews in the current cluster
    cluster_indices = [i for i, label in enumerate(categories) if label == cluster_label]
    # Sum the TF-IDF scores for each term within the current cluster
    cluster_tfidf_sum = tfidf_matrix[cluster_indices].sum(axis=0)
    # Convert the summed matrix to a 1-dimensional array
    cluster_terms_freq = np.array(cluster_tfidf_sum).ravel()
    # Get the index of the term with the highest summed TF-IDF score
    top_term_index = cluster_terms_freq.argsort()[::-1][0]
    # Extract the top term and its frequency
    top_term = vectorizer.get_feature_names_out()[top_term_index]
    top_frequency = cluster_terms_freq[top_term_index]
    # Append the result as a dictionary to topic_terms_list
    topic_terms_list.append({
        'category': cluster_label,
        'term': top_term,
        'frequency': top_frequency
    })
# Convert the list of dictionaries to a DataFrame
topic_terms = pd.DataFrame(topic_terms_list)
# Display the results of the top terms in each cluster
print(topic_terms)
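# As a final sanity check, print a few cleaned reviews from each category to
# verify that they match the top terms (assumes the `category` column added above):
for cluster_label in sorted(set(categories)):
    print(f"\nCategory {cluster_label}:")
    cluster_reviews = preprocessed_reviews.loc[preprocessed_reviews['category'] == cluster_label, 'cleaned_content']
    for text in cluster_reviews.head(3):
        print("-", text)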