ℹ️ Introduction to data science notebooks

You can skip this section if you are already familiar with data science notebooks.

Data science notebooks

A data science notebook is a document that contains text (what you're reading right now) and code chunks. What makes a notebook unique is that it is interactive: you can change or add code chunks, then run them by clicking the Run button above (▶, or Run All) or pressing Shift + Enter.

The result will be displayed directly in the notebook.

Try running the cell below:

100 * 1.75 * 16

Modify any of the numbers and rerun the chunk.

Data science notebooks & data analysis

Notebooks are great for interactive data analysis. Let's create a tibble using the read_csv() function from readr.

We will load the dataset "sales_data.csv" containing three months of sales data for the company.

Using the head() function, we display the first six rows of the data:

suppressPackageStartupMessages(library(tidyverse))
df <- readr::read_csv('data/sales_data.csv', show_col_types = FALSE)
head(df)
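If you would rather state the column types explicitly than let readr guess them, a minimal sketch could look like the one below. The three inline rows are made up for illustration and only mimic the layout of sales_data.csv; the ISO date format (YYYY-MM-DD) is an assumption about the file, not something confirmed here.

```r
library(readr)

# Hypothetical rows that mimic the layout of sales_data.csv (not real data)
csv_text <- "date,warehouse,client_type,product_line,quantity,unit_price,total,payment
2021-06-01,North,Retail,Engine,2,50.00,100.00,Cash
2021-06-01,Central,Wholesale,Braking system,10,20.00,200.00,Transfer
2021-06-02,West,Retail,Electrical system,1,30.00,30.00,Credit card"

# I() tells readr to treat the string as literal data instead of a file path
toy <- read_csv(
  I(csv_text),
  col_types = cols(
    date         = col_date(),
    warehouse    = col_character(),
    client_type  = col_character(),
    product_line = col_character(),
    quantity     = col_integer(),
    unit_price   = col_double(),
    total        = col_double(),
    payment      = col_character()
  )
)

str(toy$date)  # a Date vector rather than text
```

Passing col_types also silences the column specification message, so show_col_types = FALSE is not needed in that case.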

Data analysis example:

Find the total sales for each warehouse.

We can use group_by() to group the rows by the "warehouse" column. Then we use summarize() with sum() to add up the "total" column within each warehouse:

df %>% 
  group_by(warehouse) %>%
  summarize(total_sales = sum(total))
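To see what group_by() and summarize() do step by step, here is a small sketch on a made-up table (the values are invented, not taken from sales_data.csv). It also counts the rows per group with n() and sorts the result, which is often useful when comparing groups.

```r
library(dplyr)

# Made-up rows for illustration only; the real values live in sales_data.csv
toy <- tibble(
  warehouse = c("North", "North", "Central", "West", "Central"),
  total     = c(100, 50, 200, 30, 25)
)

by_warehouse <- toy %>%
  group_by(warehouse) %>%          # one group per warehouse
  summarize(
    total_sales = sum(total),      # add up "total" within each group
    n_orders    = n()              # count the rows in each group
  ) %>%
  arrange(desc(total_sales))       # largest total first

by_warehouse
```

With these toy values, Central comes first (200 + 25 = 225), followed by North (150) and West (30).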

Data science notebooks & visualizations

Visualizations help summarize data and reveal insights. A well-crafted chart often conveys information much better than a table.

Including plots in a notebook is straightforward. For example, let's look at the average number of items purchased by each client type.

We are using the ggplot2 package (loaded as part of tidyverse) for this example. We save our analysis in avg_units and use qplot() to draw the chart:

avg_units <- df %>% 
  group_by(client_type) %>%
  summarize(avg_items = mean(quantity))

qplot(avg_units$client_type, 
      avg_units$avg_items, 
      geom="col",
      xlab="Client type",
      ylab="Average units")
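Note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions emit a deprecation warning for the call above. A roughly equivalent chart built with ggplot() and geom_col() might look like the sketch below. The avg_units values here are invented so that the block runs on its own; in the notebook you would reuse the avg_units computed earlier.

```r
library(ggplot2)

# Invented values so the block is self-contained (not real results)
avg_units <- data.frame(
  client_type = c("Retail", "Wholesale"),
  avg_items   = c(5, 20)
)

# Map the columns to the axes, draw one bar per client type, then label the axes
p <- ggplot(avg_units, aes(x = client_type, y = avg_items)) +
  geom_col() +
  labs(x = "Client type", y = "Average units")

p
```

Passing the data frame and column names through aes() instead of avg_units$... keeps the plot tied to one data source, which makes it easier to add layers or facets later.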

Reporting on sales data

Now let's move on to the competition and challenge.

📖 Background

You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

You’ve recently learned data manipulation and plotting, and you offer to help your colleague analyze past sales data. Your colleague wants to capture sales by payment method. She also needs to know the average unit price for each product line.

💾 The data

The team assembled the following file:

The sales data has the following fields:
  • "date" - The date, from June to August 2021.
  • "warehouse" - The company operates three warehouses: North, Central, and West.
  • "client_type" - There are two types of customers: Retail and Wholesale.
  • "product_line" - Type of products purchased.
  • "quantity" - How many items were purchased.
  • "unit_price" - Price per item sold.
  • "total" - Total sale = quantity * unit_price.
  • "payment" - How the client paid: Cash, Credit card, Transfer.

head(df)
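Because "total" is defined as quantity * unit_price, a quick consistency check can catch data entry problems before any analysis. The sketch below runs on made-up rows, one of which is deliberately inconsistent; the one-cent tolerance is an assumption chosen to absorb rounding.

```r
library(dplyr)

# Made-up rows; the last one is deliberately inconsistent
toy <- tibble(
  quantity   = c(2, 10, 1),
  unit_price = c(50, 20, 30),
  total      = c(100, 200, 35)
)

# Keep rows where "total" differs from quantity * unit_price by more than a cent
mismatches <- toy %>%
  mutate(expected_total = quantity * unit_price) %>%
  filter(abs(total - expected_total) > 0.01)

nrow(mismatches)
```

On the real data, the same pipeline applied to df should return zero rows if the file is consistent with the field description.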

💪 Challenge

Create a report to answer your colleague's questions. Include:

  1. What are the total sales for each payment method?
  2. What is the average unit price for each product line?
  3. Create plots to visualize findings for questions 1 and 2.
  4. [Optional] Investigate further (e.g., average purchase value by client type, total purchase value by product line, etc.)
  5. Summarize your findings.
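One point worth deciding before you start on question 2: "average unit price" can be read as the simple mean of unit_price across orders, or as a quantity-weighted average in which every item sold counts once. The challenge most likely intends the simple mean, but that is an assumption; the sketch below computes both on made-up data (product lines "A" and "B" are placeholders) so you can see how they differ.

```r
library(dplyr)

# Placeholder product lines and invented numbers, for illustration only
toy <- tibble(
  product_line = c("A", "A", "B"),
  quantity     = c(1, 9, 4),
  unit_price   = c(10, 20, 5)
)

price_summary <- toy %>%
  group_by(product_line) %>%
  summarize(
    # simple mean: each order counts once
    avg_unit_price = mean(unit_price),
    # weighted mean: each item sold counts once
    weighted_unit_price = sum(quantity * unit_price) / sum(quantity)
  )

price_summary
```

For product line "A" the simple mean is 15, while the weighted average is 19 because nine of the ten items sold at the higher price. Whichever definition you use, state it in your report.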

✅ Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.Rmd.
  • Remove redundant cells, such as the introduction to data science notebooks, so the workbook is focused on your story.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!