Skip to content

Analyzing Survey Data with SQL & Python

# Import plotly.express using the alias px
import plotly.express as px

The survey data is contained in a CSV file named "What_does_it_take_to_generate_new_growth_Survey_data.csv".

Data dictionary

The dataset contains the following columns.

  • Growth_Firm: Is the company (firm) currently classified as a growth company under OECD definitions?
  • question_2_row_1_transformed: The responses to question 2, part 1 (with some pre-applied transformation).
  • question_2_row_2_transformed: The responses to question 2, part 2 (with some pre-applied transformation).
  • question_3_row_1: The responses to question 3, part 1.
  • ...
  • question_7_row_1: The responses to question 7, part 1.

The details of each question are fully described in survey_questions.csv.

First, we check the dataset and see how it is formated and if it was succesfully imported.

Spinner
DataFrameas
survey
variable
SELECT * FROM read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ")

The dataset doesn't contain the actual questions that were asked. To find out what the questions are, we look up the column titles in the data dictionary contained in survey_questions.csv.

Spinner
DataFrameas
dictionary
variable
SELECT * FROM 'survey_questions.csv'

Many of the questions in the survey dataset have categorical responses with 5 options from "Strongly disagree" to "Strongly agree".

The values are encoded as 1 for Strongly disagree through to 5 for Strongly agree. For visualizing the responses, it is better to have explicit labels rather than numbers.

We built up the SQL query to get the counts for each response type then draw a bar plot.

Spinner
DataFrameas
lookup
variable
-- Import everything from agree_disagree.csv as lookup
SELECT *
	FROM 'agree_disagree.csv'

We're working towards getting the counts for each of the five responses, even if they aren't all present in the dataset. That means that we want zero counts to be allowed. To achieve this, we used a left join.

To create a distribution of the responses, we grouped the data using the response column

In order to draw an easy-to-interpret plot, we include a color scheme based on the level of agreement, ranging from -2 to 2. Here, -2 represents 'strongly disagree', and 2 represents 'strongly agree'

Spinner
DataFrameas
q3_1_counts
variable
-- Copy and paste the previous code
-- Calculate the lookup code minus 3, naming the result as agreement
-- Order the result by lookup code
SELECT
	lookup.response,
	COUNT(survey.question_3_row_1) AS n,
	lookup.code - 3 AS agreement
	FROM 'agree_disagree.csv' AS lookup
	LEFT JOIN read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_1
	GROUP BY lookup.response, lookup.code
	ORDER BY lookup.code

To help visualize those results, we used a bar plot, using the agreement column to help visualize the results.

px.bar(
    q3_1_counts,
    x='response',
    y='n',
    color='agreement',
    color_continuous_scale=px.colors.diverging.Armyrose_r,
    labels={'response': '', 'n':'Number of responses'},
    title=dictionary.loc[dictionary.column == 'question_3_row_1', 'title'].values[0]
)

The first question that we've analyzed is if the Employees are encouraged to be creative. The graphic shows that the most frequent answer being 'Agree' and the second one were that the managers 'Strongly agree' with that statement.

Now we try to analyse other questions and see if they supports the creativity of the employees

Spinner
DataFrameas
q3_13_counts
variable
-- Get the counts for your new categorical question
SELECT
	lookup.response,
	COUNT(survey.question_3_row_6) AS n,
	lookup.code - 3 AS agreement
	FROM 'agree_disagree.csv' AS lookup
	LEFT JOIN read_csv_auto('survey_data.csv', delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_6
	GROUP BY lookup.response, lookup.code
	ORDER BY lookup.code