Analyzing Survey Data with SQL & Python
# Import plotly.express using the alias px
import plotly.express as pxThe survey data is contained in a CSV file named "What_does_it_take_to_generate_new_growth_Survey_data.csv".
Data dictionary
The dataset contains the following columns.
Growth_Firm: Is the company (firm) currently classified as a growth company under OECD definitions?question_2_row_1_transformed: The responses to question 2, part 1 (with some pre-applied transformation).question_2_row_2_transformed: The responses to question 2, part 2 (with some pre-applied transformation).question_3_row_1: The responses to question 3, part 1.- ...
question_7_row_1: The responses to question 7, part 1.
The details of each question are fully described in survey_questions.csv.
First, we check the dataset and see how it is formated and if it was succesfully imported.
SELECT * FROM read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ")
The dataset doesn't contain the actual questions that were asked. To find out what the questions are, we look up the column titles in the data dictionary contained in survey_questions.csv.
SELECT * FROM 'survey_questions.csv'Many of the questions in the survey dataset have categorical responses with 5 options from "Strongly disagree" to "Strongly agree".
The values are encoded as 1 for Strongly disagree through to 5 for Strongly agree. For visualizing the responses, it is better to have explicit labels rather than numbers.
We built up the SQL query to get the counts for each response type then draw a bar plot.
-- Import everything from agree_disagree.csv as lookup
SELECT *
FROM 'agree_disagree.csv'We're working towards getting the counts for each of the five responses, even if they aren't all present in the dataset. That means that we want zero counts to be allowed. To achieve this, we used a left join.
To create a distribution of the responses, we grouped the data using the response column
In order to draw an easy-to-interpret plot, we include a color scheme based on the level of agreement, ranging from -2 to 2. Here, -2 represents 'strongly disagree', and 2 represents 'strongly agree'
-- Copy and paste the previous code
-- Calculate the lookup code minus 3, naming the result as agreement
-- Order the result by lookup code
SELECT
lookup.response,
COUNT(survey.question_3_row_1) AS n,
lookup.code - 3 AS agreement
FROM 'agree_disagree.csv' AS lookup
LEFT JOIN read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ") AS survey
ON lookup.code = survey.question_3_row_1
GROUP BY lookup.response, lookup.code
ORDER BY lookup.code
To help visualize those results, we used a bar plot, using the agreement column to help visualize the results.
px.bar(
q3_1_counts,
x='response',
y='n',
color='agreement',
color_continuous_scale=px.colors.diverging.Armyrose_r,
labels={'response': '', 'n':'Number of responses'},
title=dictionary.loc[dictionary.column == 'question_3_row_1', 'title'].values[0]
)
The first question that we've analyzed is if the Employees are encouraged to be creative. The graphic shows that the most frequent answer being 'Agree' and the second one were that the managers 'Strongly agree' with that statement.
Now we try to analyse other questions and see if they supports the creativity of the employees
-- Get the counts for your new categorical question
SELECT
lookup.response,
COUNT(survey.question_3_row_6) AS n,
lookup.code - 3 AS agreement
FROM 'agree_disagree.csv' AS lookup
LEFT JOIN read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ") AS survey
ON lookup.code = survey.question_3_row_6
GROUP BY lookup.response, lookup.code
ORDER BY lookup.code