As a Data Scientist at a mobile app company, you regularly apply product analytics to understand user behavior, uncover patterns, and surface insights that separate the great features from the not-so-great ones. Recently, the number of negative reviews on Google Play has increased, and as a consequence the app's rating has been falling. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn, together with NLP techniques from NLTK, to sort the text of negative Google Play Store reviews into categories!

The Data

You have been given a dataset containing a sample of Google Play Store reviews and their respective scores (from 1 to 5). A summary and preview are provided below.

reviews.csv

Column       Description
'content'    Content (text) of each review.
'score'      Score assigned to the review by the user as an integer (from 1 to 5).
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download necessary resources from NLTK:
# punkt -> tokenization models
# stopwords -> stop words list
nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer tables required by word_tokenize on newer NLTK versions
nltk.download("stopwords")
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()
# Filter for negative reviews (scores of 1 or 2) and keep only the review text
negative_reviews_text = reviews[reviews['score'].isin([1, 2])]['content']
# Set up English stop words
STOP_WORDS = set(stopwords.words('english'))

def preprocess_review(review_text):
    """
    Tokenizes and cleans review text by removing non-alphabetic characters and stop words.

    Parameters:
    review_text (str): The text of the review to preprocess.

    Returns:
    str: The cleaned, tokenized review as a single string.
    """
    # Tokenize the text
    tokens = word_tokenize(review_text)
    # Clean tokens: keep alphabetic words, convert to lowercase, and remove stop words
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in STOP_WORDS]
    return ' '.join(filtered_tokens)

# Apply preprocessing to each review and store results in `preprocessed_reviews` DataFrame
preprocessed_reviews = pd.DataFrame({
    'cleaned_content': negative_reviews_text.apply(preprocess_review)
})
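As a quick sanity check on the cleaning step, the sketch below mimics its behavior with a simple regex tokenizer so it runs without the NLTK downloads; `simple_preprocess` and the tiny stop-word set are hypothetical stand-ins for `word_tokenize` and NLTK's full English list, not the project's actual helpers.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (illustration only)
STOP_WORDS_DEMO = {"the", "it", "and", "is", "a"}

def simple_preprocess(review_text):
    # Regex tokenization: runs of letters, or any single non-space character,
    # which approximates word_tokenize for plain English text
    tokens = re.findall(r"[A-Za-z]+|\S", review_text)
    # Keep alphabetic tokens, lowercase them, and drop stop words
    filtered = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS_DEMO]
    return " ".join(filtered)

print(simple_preprocess("The app keeps crashing and it is SO slow!!"))
# -> "app keeps crashing so slow"
```

Punctuation and stop words disappear, and the surviving tokens come back lowercased as one space-joined string, which is exactly the shape the TF-IDF step below expects.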
# Instantiate the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the 'cleaned_content' column to get the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(preprocessed_reviews['cleaned_content'])
# Set the number of clusters
n_clusters = 5

# Instantiate the KMeans model (n_init set explicitly, since its default
# changed to "auto" in newer scikit-learn versions)
clust_kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)

# Fit the model and predict the cluster for each review
predicted_labels = clust_kmeans.fit_predict(tfidf_matrix)

# Convert predicted labels to a list and store in `categories`
categories = predicted_labels.tolist()
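The same `fit_predict` call can be checked on toy data (the points below are made up): K-means should assign one label per point, with points in the same blob sharing a label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points (illustrative data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(points)

# Each blob gets its own label; which blob is 0 vs 1 is arbitrary
print(labels)
```

On the sparse TF-IDF matrix the geometry is high-dimensional rather than 2-D, but the interface is identical: `fit_predict` returns an integer label per row.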
# Initialize a list to store the top term information for each cluster
topic_terms_list = []

# Loop through each unique cluster label
for cluster_label in set(categories):
    # Get indices of reviews in the current cluster
    cluster_indices = [i for i, label in enumerate(categories) if label == cluster_label]
    
    # Sum the TF-IDF scores for each term within the current cluster
    cluster_tfidf_sum = tfidf_matrix[cluster_indices].sum(axis=0)
    
    # Convert the summed matrix to a 1-dimensional array
    cluster_terms_freq = np.array(cluster_tfidf_sum).ravel()
    
    # Get the index of the top term (highest summed TF-IDF score, not a raw count)
    top_term_index = cluster_terms_freq.argsort()[::-1][0]
    
    # Extract the top term and its frequency
    top_term = vectorizer.get_feature_names_out()[top_term_index]
    top_frequency = cluster_terms_freq[top_term_index]
    
    # Append the result as a dictionary to topic_terms_list
    topic_terms_list.append({
        'category': cluster_label,
        'term': top_term,
        'frequency': top_frequency
    })

# Convert the list of dictionaries to a DataFrame
topic_terms = pd.DataFrame(topic_terms_list)
# Display the results of the top terms in each cluster
print(topic_terms)
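The top-term logic in the loop above can be verified on a tiny dense matrix (all values made up): sum each cluster's rows, then take the term with the largest summed score.

```python
import numpy as np

# Toy "TF-IDF" matrix: 4 reviews x 3 terms (illustrative values)
toy_tfidf = np.array([[0.9, 0.1, 0.0],
                      [0.8, 0.0, 0.2],
                      [0.0, 0.7, 0.6],
                      [0.1, 0.8, 0.5]])
toy_terms = np.array(["crash", "ads", "slow"])
toy_labels = [0, 0, 1, 1]

for cluster in set(toy_labels):
    idx = [i for i, lab in enumerate(toy_labels) if lab == cluster]
    summed = toy_tfidf[idx].sum(axis=0)   # summed score per term in this cluster
    top = summed.argsort()[::-1][0]       # index of the top term
    print(cluster, toy_terms[top], round(summed[top], 2))
# -> 0 crash 1.7
#    1 ads 1.5
```

Cluster 0's reviews load on "crash" and cluster 1's on "ads", so each cluster is summarized by its dominant term, mirroring what `topic_terms` reports for the real reviews.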