Skip to content

Does going to university in a different country affect your mental health? A Japanese international university surveyed its students in 2018 and published a study the following year that was approved by several ethical and regulatory boards.

The study found that international students have a higher risk of mental health difficulties than the general population, and that social connectedness (belonging to a social group) and acculturative stress (stress associated with joining a new culture) are predictive of depression.

Explore the students data using PostgreSQL to find out if you would come to a similar conclusion for international students and see if the length of stay is a contributing factor.

Here is a data description of the columns you may find helpful.

Field NameDescription
inter_domTypes of students (international or domestic)
japanese_cateJapanese language proficiency
english_cateEnglish language proficiency
academicCurrent academic level (undergraduate or graduate)
ageCurrent age of student
stayCurrent length of stay in years
todepTotal score of depression (PHQ-9 test)
toscTotal score of social connectedness (SCS test)
toasTotal score of acculturative stress (ASISS test)

EXPLORE THE DATA IN THE SOURCE TABLE

  • The goal of this process is to:
    1. Familiarize oneself with the data types, column headings, and so on
    2. Check the dataset for missing values and inconsistencies before performing: a. Data cleaning and transformation b. Exploratory Data Analysis (EDA) c. Data analysis and visualization
Spinner
DataFrameas
df
variable
SELECT *
FROM public.students;
Spinner
DataFrameas
df18
variable
-- Data Types

SELECT * FROM information_schema.columns WHERE table_name = 'students';
Spinner
DataFrameas
df15
variable
-- Using a Common Table Expression (CTE) to work with the data without creating a permanent object (table)

WITH students_info_staging AS (
  SELECT * FROM public.students
)
SELECT * FROM students_info_staging;

-- Data Cleaning Function & Techniques

-- To address missing values, to standardize the dataset, to deal with whitespaces and so on

-- 1. Identify and remove duplicates -- 2. Null values or blank values -- 3. Standardize the data -- 4. Remove unnecessary rows and columns

Spinner
DataFrameas
df16
variable
-- Checking duplicates
-- Query 1: Identifying duplicates in specific columns using a CTE

WITH students_info_staging AS (
    SELECT
        inter_dom,
        region,
        academic,
        stay,
        gender,
        ROW_NUMBER() OVER(PARTITION BY inter_dom, region, academic, stay, gender ORDER BY inter_dom) as rn
    FROM public.students
)
SELECT
    inter_dom,
    region,
    academic,
    stay,
    gender
FROM students_info_staging
WHERE rn = 1;
Spinner
DataFrameas
df8
variable
-- Query 2: Identify Duplicates Using SELECT * and a CTE

WITH students_info_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY inter_dom, region, academic, stay, gender ORDER BY inter_dom) as rn
    FROM public.students
)
SELECT *
FROM students_info_staging
WHERE rn > 1;

REMOVING DUPLICATES FROM THE SOURCE TABLE

  • Since I am working with a read-only file, I cannot remove duplicates from the source table.
  • The DELETE command is a write operation, which is not allowed.
  • Instead, I will use a query that returns only the unique rows.
  • This will effectively "remove" the duplicates from the final result set without changing the underlying data.
Spinner
DataFrameas
df17
variable
-- Query to Select Only Unique Rows

WITH students_info_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY inter_dom, region, academic, stay, gender ORDER BY inter_dom) as rn
    FROM public.students
)
SELECT
    *
FROM students_info_staging
WHERE rn = 1;

IDENTIFYING AND HANDLING NULL/BLANK VALUES

  • Since the underlying data is a read-only file, I identified and handled null or blank values without changing the source data.
  • I identified and excluded rows with nulls using a WHERE clause and CTE to perform this process.
Spinner
DataFrameas
df19
variable
-- Using a WHERE clause and CTE

WITH students_info_staging AS (
    SELECT
        inter_dom,
        region,
        academic,
        stay,
        gender
    FROM public.students
)
SELECT *
FROM students_info_staging
WHERE
    inter_dom IS NOT NULL AND inter_dom != '' AND
    region IS NOT NULL AND region != '' AND
    academic IS NOT NULL AND academic != '' AND
    stay IS NOT NULL AND stay > 0 AND 
    gender IS NOT NULL AND gender != '';

EXPLANATORY DATA ANALYSIS

  1. Count the frequency of each distribution
  2. Calculate summary statistics such as mean, median, and standard deviation.