Social Media and Emotional Wellness

Exploratory data analysis project

Data source: https://www.kaggle.com/datasets/emirhanai/social-media-usage-and-emotional-well-being

Are people who use social media happier? This fictional dataset gives insight into how much people aged 21-35 use social media across 6 popular platforms. It also records specific details such as the number of likes they receive.

EDA Business question:

  • What variables in the data are associated with good mood?

My Hypothesis:

  • People with more social media engagement are happier. By engagement I mean factors such as daily usage time, likes received and comments received.

Useful info: Click "Code" on the panel at the top right of this page for details on my SQL code.

Summary

  • Generally, people who spent around 2 hours or more per day on social media were happier
  • The more posts a person made and the more likes they received, the happier they were
  • 86% of happy people used Instagram

Contents

Cleaning

  1. Duplicates
  2. Data types/Invalid data

Transformation

Exploratory data analysis

  1. Grouping emotions
  2. Exploring numerical columns
  3. EDA Histograms
  4. EDA Box plots
  5. EDA Scatter plots

Cleaning

1. Duplicates

I used Excel's COUNTA() function to count the number of non-empty cells in each row, storing the result in a 'check' column. I then filtered for the rows that had 0 in the 'check' column, meaning they were completely empty, and removed them.

The UserID column should be unique, as each row corresponds to one user. I removed 4 duplicate rows with Excel's Remove Duplicates tool, matching on the combination of every column.
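
The two Excel steps above (dropping completely empty rows, then exact duplicates) can be sketched in Python. This is only an illustration, not the method used in the project: the function name `drop_empty_and_duplicate_rows` and the sample rows are made up for the example.

```python
# Sketch only: the rows below are invented examples, not the real dataset.
def drop_empty_and_duplicate_rows(rows):
    """Remove rows with no non-empty cells, then exact duplicate rows."""
    cleaned = []
    seen = set()
    for row in rows:
        # Rough equivalent of COUNTA(): count the non-empty cells in the row.
        non_empty = sum(1 for cell in row if cell not in ("", None))
        if non_empty == 0:
            continue  # completely empty row
        key = tuple(row)  # every column is part of the duplicate key
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


sample_rows = [
    ["1", "25", "Female", "Instagram"],
    ["", "", "", ""],
    ["2", "30", "Male", "Twitter"],
    ["1", "25", "Female", "Instagram"],
]
cleaned_rows = drop_empty_and_duplicate_rows(sample_rows)
print(cleaned_rows)
```

Matching on every column, as in the Excel step, only removes rows that are identical throughout. Two rows that share a UserID but differ elsewhere would both be kept and would need a separate check.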

2. Data types/Invalid data

In the row where UserID is 651, there was a numeric value for gender. I assumed that age and gender had been entered in each other's columns, so I swapped them.


Also, in the row with UserID 849, the gender column had the value 'Marie'. It's impossible to say whether this was intended to be 'Male' or whether it's the user's actual name, in which case the gender would probably be Female. For this reason, I decided to leave it blank.
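
The two manual fixes above follow a simple rule that can be sketched in Python. This is a hedged illustration rather than what was done in Excel: the function `fix_age_gender`, the set `VALID_GENDERS` and the sample records (including their IDs) are assumptions made for the example.

```python
# Assumed set of accepted labels; adjust to whatever the dataset actually uses.
VALID_GENDERS = {"Male", "Female", "Non-binary"}


def fix_age_gender(record):
    """Return a copy of the record with swapped age/gender repaired."""
    fixed = dict(record)
    # A numeric gender next to a non-numeric age suggests the two values
    # were entered in each other's columns, so swap them back.
    if str(fixed["gender"]).isdigit() and not str(fixed["age"]).isdigit():
        fixed["age"], fixed["gender"] = fixed["gender"], fixed["age"]
    # Anything that is still not a recognised label is blanked, not guessed.
    if fixed["gender"] not in VALID_GENDERS:
        fixed["gender"] = ""
    return fixed


# Hypothetical records, invented for illustration.
swapped = fix_age_gender({"user_id": "x1", "age": "Male", "gender": "27"})
unknown = fix_age_gender({"user_id": "x2", "age": "30", "gender": "Marie"})
print(swapped)
print(unknown)
```

Blanking an unrecognised value keeps the row available for every other column while avoiding a guess about the intended gender.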

Next I corrected the data types

DataFrame variable: social_media_data_1
SELECT CAST(user_id AS text) AS user_id, 	-- casting columns to right data types
	CAST(age AS int) AS age,
	CAST(gender AS text) AS gender,
	CAST(platform AS text) AS platform,
	CAST(daily_usage_time AS int) AS daily_usage_time,
	CAST(posts_per_day AS int) AS posts_per_day,
	CAST(likes_received_per_day AS int) AS likes_received_per_day,
	CAST(comments_received_per_day AS int) AS comments_received_per_day,
	CAST(messages_sent_per_day AS int) AS messages_sent_per_day,
	CAST(dominant_emotion AS text) AS dominant_emotion
FROM 'sm_raw_data.csv';  -- sm stands for 'social media'

Transformation

The numeric columns needed to be grouped for analysis. Histograms were used to choose suitable ranges for groups.

DataFrame variable: social_media_data_2
-- making groups in numerical columns
SELECT user_id,
	CASE WHEN age BETWEEN 21 AND 25 THEN '21 to 25'   -- lower bound now matches the label
		WHEN age BETWEEN 26 AND 30 THEN '26 to 30'
		WHEN age BETWEEN 31 AND 35 THEN '31 to 35'
		END AS age,
	gender,
	platform,
	CASE WHEN daily_usage_time <= 40 THEN '<= 40'	
		WHEN daily_usage_time BETWEEN 41 AND 100 THEN '41 to 100'
		WHEN daily_usage_time BETWEEN 101 AND 160 THEN '101 to 160'
		WHEN daily_usage_time BETWEEN 161 AND 220 THEN '161 to 220'
		WHEN daily_usage_time > 220 THEN '220+'
		END AS daily_usage_time,
	CASE WHEN posts_per_day <=1 THEN '<= 1'
		WHEN posts_per_day BETWEEN 2 AND 3 THEN '2 to 3'
		WHEN posts_per_day BETWEEN 4 AND 5 THEN '4 to 5'
		WHEN posts_per_day BETWEEN 6 AND 7 THEN '6 to 7'
		WHEN posts_per_day BETWEEN 8 AND 9 THEN '8 to 9'   -- 7 is already covered by '6 to 7'
	END AS posts_per_day,
	CASE WHEN likes_received_per_day <= 5 THEN '<= 5'
		WHEN likes_received_per_day BETWEEN 6 AND 30 THEN '6 to 30'
		WHEN likes_received_per_day BETWEEN 31 AND 55 THEN '31 to 55'
		WHEN likes_received_per_day BETWEEN 56 AND 80 THEN '56 to 80'
		WHEN likes_received_per_day BETWEEN 81 AND 105 THEN '81 to 105'
		WHEN likes_received_per_day BETWEEN 106 AND 130 THEN '106 to 130'
		WHEN likes_received_per_day > 130 THEN '130+'
		END AS likes_received_per_day,
	CASE WHEN comments_received_per_day <= 2 THEN '<= 2'
		WHEN comments_received_per_day BETWEEN 3 AND 12 THEN '3 to 12'
		WHEN comments_received_per_day BETWEEN 13 AND 22 THEN '13 to 22'
		WHEN comments_received_per_day BETWEEN 23 AND 32 THEN '23 to 32'
		WHEN comments_received_per_day BETWEEN 33 AND 42 THEN '33 to 42'   -- 32 is already covered by '23 to 32'
		WHEN comments_received_per_day > 42 THEN '42+'
	END AS comments_received_per_day,
	CASE WHEN messages_sent_per_day <= 10 THEN '<= 10'
		WHEN messages_sent_per_day BETWEEN 11 AND 20 THEN '11 to 20'
		WHEN messages_sent_per_day BETWEEN 21 AND 30 THEN '21 to 30'
		WHEN messages_sent_per_day BETWEEN 31 AND 40 THEN '31 to 40'
		WHEN messages_sent_per_day BETWEEN 41 AND 50 THEN '41 to 50'
		WHEN messages_sent_per_day > 50 THEN '50+'
	END AS messages_sent_per_day,
	dominant_emotion
FROM social_media_data_1;
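
The grouping logic in the query above can also be written as a list of upper bounds, which makes overlapping or missing ranges harder to introduce. The Python sketch below is an illustration under assumptions: `USAGE_BINS` mirrors the daily_usage_time groups, and `bin_label` is a hypothetical helper, not part of the project.

```python
import math

# Upper bounds and labels mirroring the daily_usage_time groups in the query.
USAGE_BINS = [
    (40, "<= 40"),
    (100, "41 to 100"),
    (160, "101 to 160"),
    (220, "161 to 220"),
    (math.inf, "220+"),  # catch-all for everything above 220
]


def bin_label(value, bins):
    """Return the label of the first bin whose upper bound is >= value."""
    for upper, label in bins:
        if value <= upper:
            return label
    return None  # only reached if the bins have no catch-all upper bound


labels = [bin_label(v, USAGE_BINS) for v in (40, 41, 220, 221)]
print(labels)
```

Like a SQL CASE expression, the helper returns the first branch that matches, so each value lands in exactly one group and the boundaries between groups are defined only once.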

Exploratory data analysis

You can follow along in the Excel workbook Exploratory data analysis.xlsx for this section.

1. Grouping emotions

In Excel, I added a column indicating whether each person's dominant emotion is negative, positive or neutral. "Anger", "Anxiety" and "Sadness" are treated as negative emotions, "Happiness" is a positive emotion, and "Boredom" and "Neutral" are both grouped as neutral.

Of the 99 people, only about 14% experienced a positive emotion. Please see the 'Emotions' tab of the EDA workbook file.
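
The emotion grouping described above, and the share of each group, can be sketched in Python. The mapping follows the groups named in this section. The function `group_shares` and the sample list are invented for illustration and do not reproduce the workbook's figures.

```python
from collections import Counter

# Mapping taken from the grouping described in this section.
EMOTION_GROUPS = {
    "Anger": "Negative",
    "Anxiety": "Negative",
    "Sadness": "Negative",
    "Happiness": "Positive",
    "Boredom": "Neutral",
    "Neutral": "Neutral",
}


def group_shares(emotions):
    """Return the percentage of records in each emotion group."""
    counts = Counter(EMOTION_GROUPS[emotion] for emotion in emotions)
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Made-up sample, not the real data.
sample_emotions = [
    "Happiness", "Anger", "Neutral", "Boredom",
    "Sadness", "Anxiety", "Neutral", "Happiness",
]
shares = group_shares(sample_emotions)
print(shares)
```

An emotion label missing from the mapping raises a KeyError, which is a deliberate choice here: an unexpected value in the column is surfaced instead of being silently dropped.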

2. Exploring numerical columns

In Excel, I used the Data Analysis ToolPak to look at the summary statistics of the numeric columns. See the 'descriptive statistics' tab of the EDA workbook file.

Initial notable insights:

  • The most common age was 27
  • The average time spent on social media was 88 minutes per day
  • The most common number of posts per day was 1
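
Summary statistics like those above (mean, mode and so on) can be reproduced outside Excel. The sketch below uses Python's standard `statistics` module. The helper `describe` and the sample ages are assumptions for illustration, not values from the dataset.

```python
import statistics


def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # most common value
        "min": min(values),
        "max": max(values),
    }


# Made-up ages, not the real column.
sample_ages = [27, 24, 27, 31, 22, 27, 35]
summary = describe(sample_ages)
print(summary)
```

The mode is the statistic behind statements such as "the most common age", while the mean is the one behind "average time spent".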