Skip to content

Data Engineer Associate Exam - Virtual Reality Fitness

ActiVR provides a virtual reality device designed for exercise and fitness.

ActiVR offers a range of products, including VR devices and subscription-based fitness programs through their apps.

The sales team at ActiVR wants to analyze user data to enhance their marketing strategy and evaluate their products. For this, it is crucial that the data is clean, accurate, and available for reporting.

They need your assistance in preparing the data before launching a new promotional campaign.

Database Schema

The data schema for ActiVR's database is outlined as follows:

  • events: Contains records of events registered in different games.
  • games: Stores information about various games available on the platform.
  • devices: Holds data about the virtual reality devices used by the users.
  • users: Contains details about the users utilizing the ActiVR platform.

Task 1

ActiVR's sales team wants to use the information it has about users for targeted marketing.

However, they suspect that the data may need to be cleaned before.

The expected data format and types for the users table according to the sales team's requirements is shown in the table below.

Write an SQL query that returns the users table with the specified format, including identifying and cleaning all invalid values.

  • Your output should be a DataFrame with the name 'clean_data'. Do not modify the users table.
  • Note that the DataLab environment formats dates as YYYY-MM-DD-hh-ss-SSS.
Column NameDescription
user_idUnique integer (assigned by the database, cannot be altered). Missing values are not possible due to the database structure.
ageInteger representing the age of the customer. Missing values should be replaced with the average age.
registration_dateDate when the user made an account first. Missing values should be replaced with January 1st 2024, using the appropriate format.
emailEmail address of the user. Missing values should be replaced with Unknown.
workout_frequencyWorkout frequency as a lowercase string, one of: minimal, flexible, regular, maximal. Missing values should be replaced with flexible.
Spinner
DataFrameas
clean_data
variable
-- Step 1: Convert values to correct data types
    SELECT
        user_id,
        
        -- Ensure age is an integer and replace NULL values with the average age
        COALESCE(age, (SELECT AVG(age) FROM users WHERE age IS NOT NULL)) AS age,
        
        -- Ensure registration_date is in the correct format
        COALESCE(registration_date, '2024-01-01') AS registration_date,
        
        -- Ensure email is text (no changes to data type, just replacing NULL with 'Unknown')
        COALESCE(email, 'Unknown') AS email,
        
        -- Ensure workout_frequency is a valid string (minimal, flexible, regular, maximal)
        CASE
            WHEN LOWER(workout_frequency) IN ('minimal', 'flexible', 'regular', 'maximal') THEN LOWER(workout_frequency)
            ELSE 'flexible'  -- Default to 'flexible' if the value is invalid
        END AS workout_frequency
    FROM users;

Task 2

It seems like there are missing values in the events table for the column game_id for all events before the year 2021.

However, we know that before 2021 there were only games where the game_type is running. The game_id for these games can be found in the games table.

Write a query so that the events table has a game_id for all events including those before 2021.

  • Your output should be a DataFrame with the name 'events_with_game_id'. Do not modify the events table.
Spinner
DataFrameas
events_with_game_id
variable
-- Ensure all events have a game_id, including those before 2021, by filling the missing values based on the game_type 'running'
SELECT
    event_id,
    user_id,
    device_id,
    event_time,
    -- If the game_id is missing for events before 2021, fill with game_id from 'running' game type in the games table
    COALESCE(game_id, 
        CASE
            WHEN EXTRACT(YEAR FROM CAST(event_time AS DATE)) < 2021 AND game_id IS NULL THEN 
                (SELECT game_id FROM games WHERE game_type = 'running' LIMIT 1)
            ELSE game_id
        END) AS game_id
FROM events;

-- The output of this query should be considered as the DataFrame 'events_with_game_id'

Task 3

ActiVR's sales team plans to launch a promotion for upgrades to virtual reality devices.

They aim to target customers who have participated in events related to specific game types.

Write a SQL query to provide the user_id and event_time for users who have participated in events related to biking games. Your output should be a DataFrame with the name 'event_biking'.

Spinner
DataFrameas
event_biking
variable
-- Select user_id and event_time for users who participated in biking games
SELECT 
    e.user_id,
    e.event_time
FROM 
    events e
JOIN 
    games g ON e.game_id = g.game_id
WHERE 
    g.game_type = 'biking';

Task 4

After running their promotion, the sales team at ActiVR wants to investigate the results.

To do so, they require insights into the number of users who participated in events for each game_type.

Write a SQL query that returns the count of unique users for each game type game_type and game_id. Entries with missing game types should be ignored. The user count should be shown in a column user_count. Your output should be a DataFrame with the name 'users_game'.

Spinner
DataFrameas
users_game
variable
-- Count unique users for each game_type and game_id, excluding missing game types
SELECT 
    g.game_type,
    e.game_id,
    COUNT(DISTINCT e.user_id) AS user_count
FROM 
    events e
JOIN 
    games g ON e.game_id = g.game_id
WHERE 
    g.game_type IS NOT NULL
GROUP BY 
    g.game_type, e.game_id;