Live Training: Green Businesses (Solutions) (copy)

2023-09-08 | Data Storytelling for Absolute Beginners: A Case Study with Green Businesses | Camilo Martinez

Let's look at DataCamp's Workspaces.

  • They are like Jupyter notebooks... but better
  • You can load datasets and work with them either by writing Python code directly or by using AI.

Let's work through a simple Exploratory Data Analysis exercise.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important step in any data analysis project. It helps us understand the data, identify patterns, and uncover insights. In this exercise, we will perform EDA on a dataset of green businesses.

Dataset

The dataset contains information about various green businesses, including their location, type of business, and environmental impact. We will explore the dataset to gain insights into the distribution of businesses, the most common types of businesses, and the relationship between business type and environmental impact.

Let's get started!

Exercises

Let's load the dataset with the help of AI:

"Please read the green-businesses dataset using python"

Now let's count the number of green businesses:

"Please count the number of entries in the green businesses dataset"

# Read the green-businesses dataset using python
import pandas as pd

# Assuming the dataset is in a CSV file named 'green-businesses.csv'
df = pd.read_csv('green-businesses.csv')

df.head()
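The workbook shows no code for the counting prompt above, so here is a minimal sketch of one way it could be answered. The small table is made-up stand-in data (its columns and values are assumptions, not the real dataset) so that the snippet runs on its own. In the workbook you would use the `df` loaded from 'green-businesses.csv' instead.

```python
import pandas as pd

# Made-up stand-in data so the sketch is self-contained.
# In the workbook, df would come from pd.read_csv('green-businesses.csv').
df = pd.DataFrame({
    "business": ["A", "B", "C"],
    "region": ["North", "South", "North"],
})

# len() on a DataFrame returns the number of rows, i.e. the number of entries
n_entries = len(df)

# shape gives (rows, columns) in one call
n_rows, n_cols = df.shape

print(f"Number of entries: {n_entries}")
```

Note that `len(df)` counts rows, so duplicated businesses would be counted more than once.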

Now we can try something more complex. It helps to know some of the concepts we reviewed in the presentation.

"Please create a line plot that shows the evolution of the number of entries grouped by year"

# Import the necessary libraries
import matplotlib.pyplot as plt

# Group the data by year and count the number of entries
df_grouped = df.groupby('year').size()

# Create a line plot
plt.plot(df_grouped.index, df_grouped.values)
plt.xlabel('Year')
plt.ylabel('Number of Entries')
plt.title('Evolution of Number of Entries by Year')
plt.show()
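To see what the grouping step hands to the plot, here is a small sketch on made-up data. The rows below are invented for illustration; only the 'year' column name is taken from the code above.

```python
import pandas as pd

# Made-up stand-in data; only the 'year' column name comes from the workbook code.
df = pd.DataFrame({"year": [2021, 2020, 2021, 2022, 2021]})

# size() counts the rows in each group. groupby sorts the group keys by default,
# so the resulting Series is already in chronological order for a line plot.
df_grouped = df.groupby("year").size()

print(df_grouped)
```

The index of `df_grouped` holds the years and its values hold the counts, which is why the plotting call uses `df_grouped.index` and `df_grouped.values`.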

Here is another example. It is useful to think about which meaningful groups we can create from the existing variables. In this case, we group by the region variable.

"Please create a horizontal bar plot of the number of entries in the dataset grouped by the region variable"

To create a horizontal bar plot of the number of entries in the dataset grouped by the region variable, you can use the following code:

# Group the data by region and count the number of entries
df_grouped = df.groupby('region').size()

# Create a horizontal bar plot
plt.barh(df_grouped.index, df_grouped.values)
plt.xlabel('Number of Entries')
plt.ylabel('Region')
plt.title('Number of Entries by Region')
plt.show()

Make sure to run the previous cells to import the necessary libraries and read the dataset.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important step in any data analysis project. It helps us understand the data, identify patterns, and uncover insights. In this exercise, we will perform some basic EDA on a dataset.

Step 1: Load the Data

First, let's load the dataset into a pandas DataFrame. We can use the read_csv() function from the pandas library to read a CSV file.

import pandas as pd

# Read the CSV file
df = pd.read_csv('data.csv')

Replace 'data.csv' with the path to your dataset.

Step 2: Explore the Data

Once we have loaded the data, we can start exploring it. Here are some common tasks:

  • View the first few rows of the dataset using the head() method.
  • Check the dimensions of the dataset using the shape attribute.
  • Check the data types of the columns using the dtypes attribute.
  • Check for missing values using the isnull() method.
  • Calculate summary statistics using the describe() method.

# View the first few rows
print(df.head())

# Check the dimensions
print(df.shape)

# Check the data types
print(df.dtypes)

# Check for missing values (count per column)
print(df.isnull().sum())

# Calculate summary statistics
print(df.describe())

Feel free to modify the code to suit your dataset.

Step 3: Visualize the Data

Visualization is a powerful tool for understanding data. Here are some common plots:

  • Histograms: to visualize the distribution of a numerical variable.
  • Bar plots: to compare the values of a categorical variable.
  • Scatter plots: to explore the relationship between two numerical variables.

You can use libraries like matplotlib or seaborn to create these plots. Here's an example:

import matplotlib.pyplot as plt

# Create a histogram of a numerical variable
plt.hist(df['age'])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()

Feel free to explore different types of plots and customize them as needed.
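The other two plot types from the list above can be sketched in the same way. This is only an illustration on made-up data: the column names 'category', 'x', and 'y' are hypothetical and should be replaced with columns from your own dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; 'category', 'x', and 'y' are hypothetical column names.
df = pd.DataFrame({
    "category": ["a", "b", "a"],
    "x": [1.0, 2.0, 3.0],
    "y": [2.0, 4.1, 5.9],
})

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: number of rows for each value of a categorical variable
counts = df["category"].value_counts()
ax_bar.bar(counts.index, counts.values)
ax_bar.set_xlabel("Category")
ax_bar.set_ylabel("Number of Entries")
ax_bar.set_title("Entries per Category")

# Scatter plot: relationship between two numerical variables
points = ax_scatter.scatter(df["x"], df["y"])
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("y versus x")

fig.tight_layout()
# In a script, finish with plt.show(); notebooks typically render the figure on their own.
```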

To create a horizontal bar plot of the number of entries in the dataset grouped by the region variable, you can use the following code:

import matplotlib.pyplot as plt

# Group the data by region and count the number of entries
df_grouped = df.groupby('region').size()

# Create a horizontal bar plot
plt.barh(df_grouped.index, df_grouped.values)
plt.xlabel('Number of Entries')
plt.ylabel('Region')
plt.title('Number of Entries by Region')
plt.show()

Make sure to run this code after running the previous cells to ensure that the df variable is defined.

Finally, don't hesitate to try more natural-sounding questions. This is where the power of language models shines:

"Please tell me the most common type of green business according to the result variable."

import pandas as pd
green_businesses = pd.read_csv('green-businesses.csv')
green_businesses
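The cell above only loads and displays the data. As a hedged sketch of how the question itself might be answered, the snippet below counts the values of the result column named in the prompt and picks the most frequent one. The rows are made-up stand-in values, not the real dataset. In the workbook you would apply the same two calls to the `green_businesses` DataFrame loaded above.

```python
import pandas as pd

# Made-up stand-in data; only the 'result' column name comes from the prompt.
green_businesses = pd.DataFrame({
    "result": ["Restaurant", "Retail", "Restaurant", "Office"],
})

# value_counts() tallies how often each value occurs, most frequent first
type_counts = green_businesses["result"].value_counts()

# idxmax() returns the label with the highest count
most_common_type = type_counts.idxmax()

print(f"Most common type: {most_common_type} ({type_counts.max()} entries)")
```

If two types are tied for first place, `idxmax()` reports only one of them, so it can be worth printing `type_counts.head()` as well.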