Live Training: Green Businesses (Solutions) (copy)

2023-09-08 | Data Storytelling for Absolute Beginners: A Case Study with Green Businesses | Camilo Martinez

Let's look into DataCamp's Workspaces.

They are like Jupyter notebooks, but better.
You can load datasets and interact with them directly with Python code or by using AI.

Let's look into a simple Exploratory Data Analysis exercise.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important step in any data analysis project. It helps us understand the data, identify patterns, and uncover insights. In this exercise, we will perform EDA on a dataset of green businesses.

Dataset

The dataset contains information about various green businesses, including their location, type of business, and environmental impact. We will explore the dataset to gain insights into the distribution of businesses, the most common types of businesses, and the relationship between business type and environmental impact.

Let's get started!

Exercises

Let's load a dataset with the help of AI:

"Please read the green-businesses dataset using Python"

# Read the green-businesses dataset using python
import pandas as pd

# Assuming the dataset is in a CSV file named 'green-businesses.csv'
df = pd.read_csv('green-businesses.csv')

df.head()

Now let's count the number of green businesses:

"Please count the number of entries in the green businesses dataset"
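
The counting prompt above has no code recorded for it in this notebook. A minimal sketch of what such a count might look like follows; it is not the AI's actual output. It uses a small made-up DataFrame named sample (hypothetical) in place of df so that it runs on its own, and borrows the column names region and year from the later examples.

```python
import pandas as pd

# Hypothetical stand-in for the green-businesses data; in the workspace,
# use the df loaded with pd.read_csv instead.
sample = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "year": [2019, 2020, 2020, 2021],
})

# len() counts the rows (entries); shape gives (rows, columns)
n_entries = len(sample)
n_rows, n_cols = sample.shape

print(f"Number of entries: {n_entries}")
```

In the workspace, len(df) gives the same count for the real dataset.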

Now we can try something more complex. It helps to know some of the concepts we reviewed in the presentation.

"Please create a line plot that shows the evolution of the number of entries grouped by year"

# Import the necessary libraries
import matplotlib.pyplot as plt

# Group the data by year and count the number of entries
df_grouped = df.groupby('year').size()

# Create a line plot
plt.plot(df_grouped.index, df_grouped.values)
plt.xlabel('Year')
plt.ylabel('Number of Entries')
plt.title('Evolution of Number of Entries by Year')
plt.show()

Here is another example. It helps to think about which meaningful groups we can create from the existing variables; in this case, the region variable.

"Please create a horizontal bar plot of the number of entries in the dataset grouped by the region variable"

To create a horizontal bar plot of the number of entries in the dataset grouped by the region variable, you can use the following code:

# Group the data by region and count the number of entries
df_grouped = df.groupby('region').size()

# Create a horizontal bar plot
plt.barh(df_grouped.index, df_grouped.values)
plt.xlabel('Number of Entries')
plt.ylabel('Region')
plt.title('Number of Entries by Region')
plt.show()

Make sure to run the previous cells to import the necessary libraries and read the dataset.
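
Sorting the counts before plotting often makes a bar chart easier to read. The sketch below shows one way to do that with value_counts(). The data is made up for illustration (the sample DataFrame and its region values are hypothetical), so substitute df in the workspace.

```python
import pandas as pd

# Hypothetical stand-in data; replace with df in the workspace
sample = pd.DataFrame(
    {"region": ["North", "South", "North", "East", "North", "South"]}
)

# value_counts() counts entries per region, largest count first by default
region_counts = sample["region"].value_counts()

# Ascending order puts the largest bar at the top of a horizontal bar plot,
# because barh draws the first item at the bottom
region_counts_asc = region_counts.sort_values()

print(region_counts_asc)
```

Passing region_counts_asc.index and region_counts_asc.values to plt.barh, as in the code above, would then draw the regions in order of size.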

Exploratory Data Analysis

Step 1: Load the Data

First, let's load the dataset into a pandas DataFrame. We can use the read_csv() function from the pandas library to read a CSV file.

import pandas as pd

# Read the CSV file
df = pd.read_csv('data.csv')

Replace 'data.csv' with the path to your dataset.

Step 2: Explore the Data

Once we have loaded the data, we can start exploring it. Here are some common tasks:

View the first few rows of the dataset using the head() method.
Check the dimensions of the dataset using the shape attribute.
Check the data types of the columns using the dtypes attribute.
Check for missing values using the isnull() method.
Calculate summary statistics using the describe() method.

# View the first few rows
print(df.head())

# Check the dimensions
print(df.shape)

# Check the data types
print(df.dtypes)

# Check for missing values
print(df.isnull().sum())

# Calculate summary statistics
print(df.describe())

Feel free to modify the code to suit your dataset.
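
As one possible modification, the missing-value check can report the share of missing values per column instead of the raw count. This is a sketch on made-up data (the sample DataFrame is hypothetical), not part of the original exercise.

```python
import pandas as pd

# Hypothetical stand-in data with a few missing cells
sample = pd.DataFrame({
    "region": ["North", None, "North", "East"],
    "year": [2019, 2020, None, 2021],
})

# isnull() marks missing cells as True; the mean of booleans is the
# fraction of rows that are missing in each column
missing_share = sample.isnull().mean()

# Keep only the columns that have missing values, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

print(missing_share)
```

A share is often easier to interpret than a count when datasets differ in size.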

Step 3: Visualize the Data

Visualization is a powerful tool for understanding data. Here are some common plots:

Histograms: to visualize the distribution of a numerical variable.
Bar plots: to compare the values of a categorical variable.
Scatter plots: to explore the relationship between two numerical variables.

You can use libraries like matplotlib or seaborn to create these plots. Here's an example:

import matplotlib.pyplot as plt

# Create a histogram of a numerical variable
# ('age' is a placeholder; replace it with a numerical column from your dataset)
plt.hist(df['age'])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()

Feel free to explore different types of plots and customize them as needed.

Finally, don't hesitate to try more natural-sounding questions. This is where the power of language models shines:

"Please tell me the most common type of green business according to the result variable."

import pandas as pd
green_businesses = pd.read_csv('green-businesses.csv')
green_businesses
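
The code above only loads and displays the data. A minimal sketch of how the question itself might be answered follows, assuming that result is a categorical column holding the type of business. That reading of the column, the sample DataFrame, and its values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the green_businesses DataFrame
green_businesses_sample = pd.DataFrame({
    "result": ["Restaurant", "Office", "Restaurant", "Hotel", "Restaurant"],
})

# Count how often each value of the result column appears
type_counts = green_businesses_sample["result"].value_counts()

# idxmax() returns the label with the highest count
most_common_type = type_counts.idxmax()
most_common_count = type_counts.max()

print(f"{most_common_type}: {most_common_count} entries")
```

With the real data, the same two calls on green_businesses['result'] would give the most common type, provided the column exists under that name.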

