Analyzing Survey Data on Company Perceptions of Innovative Behavior with SQL & Python
This analysis uses data from a survey about the growth of Finnish companies. The data reports top managers' perceptions of growth, innovativeness, and the capacity for renewal.
Where is the data from?
What will I learn?
- How to summarize and visualize questions with a numeric response using a histogram.
- How to determine whether there is a difference between two groups of numeric responses using a Mann-Whitney U test.
- How to summarize and visualize questions with a categorical response using a bar plot.
Task 0: Setup
For this analysis we need the plotly.express package for drawing histograms and bar plots.
We'll also need the mannwhitneyu function from the scipy.stats package to perform the Mann-Whitney U test.
# Import plotly.express using the alias px
import plotly.express as px
# From scipy.stats import the mannwhitneyu function
from scipy.stats import mannwhitneyu
Task 1: Import the Survey Dataset
The survey data is contained in a CSV file named "What_does_it_take_to_generate_new_growth_Survey_data.csv".
Data dictionary
The dataset contains the following columns.
- Growth_Firm: Is the company (firm) currently classified as a growth company under OECD definitions?
- question_2_row_1_transformed: The responses to question 2, part 1 (with some pre-applied transformation).
- question_2_row_2_transformed: The responses to question 2, part 2 (with some pre-applied transformation).
- question_3_row_1: The responses to question 3, part 1.
- ...
- question_7_row_1: The responses to question 7, part 1.
The details of each question are fully described in survey_questions.csv, and we'll cover the specific questions we look at as we come to them in the tasks here.
-- Select everything from survey_data.csv
SELECT *
FROM read_csv_auto('survey_data.csv', delim = ';', decimal_separator = ',', nullstr = ' ')
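For readers working outside a SQL cell, the same read options (semicolon delimiter, comma decimal separator, a single space treated as missing) can be sketched in pandas. This is a minimal sketch, not the method used above: the two-row sample and the name survey_sample are made up for illustration, and the Growth_Firm values shown are an assumption about how that column is encoded.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for survey_data.csv.
# The real file's values and the encoding of Growth_Firm are assumptions.
sample_csv = io.StringIO(
    "Growth_Firm;question_2_row_1_transformed\n"
    "1;12,5\n"
    "0; \n"
)

# Mirror the options passed to read_csv_auto in the SQL query:
#   delim = ';'              -> sep=";"
#   decimal_separator = ','  -> decimal=","
#   nullstr = ' '            -> na_values=[" "]
survey_sample = pd.read_csv(
    sample_csv,
    sep=";",
    decimal=",",
    na_values=[" "],
)

# The numeric column is parsed as floats, with the blank entry as NaN
print(survey_sample.dtypes)
```

To read the actual file, the path to the CSV would be passed in place of sample_csv.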
The dataset doesn't contain the actual questions that were asked. To find out what the questions are, we can look up the column titles in the data dictionary contained in survey_questions.csv.
-- Select everything from survey_questions.csv
SELECT *
FROM 'survey_questions.csv'
Task 2: Visualizing Numeric Responses
Question 2 asks:
If the firm develops the way you would like it to, how much revenue would the firm receive, and how many employees would it have five years ahead? Disregard possible inflation.
In this task we'll consider the first part, about employee count.
The responses are numeric, and so it's natural to visualize the distribution as a histogram.
# Draw a histogram of the survey data
# (survey is assumed to be the DataFrame returned by the Task 1 query)
# On the x-axis, plot question_2_row_1_transformed
px.histogram(
    survey,
    x="question_2_row_1_transformed",
    labels={"question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"}
)
An interesting question is whether companies currently classified as growth companies expect to add a different number of employees over the next five years than non-growth companies. We can draw a histogram for each group.
# Draw the same histogram as before
# On the x-axis, plot question_2_row_1_transformed
# Facet the plot in rows by growth firm status
px.histogram(
    survey,
    x="question_2_row_1_transformed",
    labels={"question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"},
    facet_row="Growth_Firm"
)
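The faceted histograms only give a visual comparison. The mannwhitneyu function imported in Task 0 offers one way to check whether the two groups differ. The following is a minimal sketch on made-up numbers, not survey results: the lists growth and non_growth are hypothetical stand-ins for the question_2_row_1_transformed values of each Growth_Firm group.

```python
from scipy.stats import mannwhitneyu

# Hypothetical responses for illustration only (not taken from the survey)
growth = [150.0, 200.0, 120.0, 300.0, 180.0]
non_growth = [100.0, 110.0, 105.0, 130.0, 95.0]

# Two-sided test: the null hypothesis is that neither group tends to
# give larger responses than the other
result = mannwhitneyu(growth, non_growth, alternative="two-sided")

# A small p-value suggests the two distributions differ
print(result.statistic, result.pvalue)
```

With the real data, the two samples would be the responses for each growth status, with missing values dropped before the test is run.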