Skip to content
New Workbook
Sign up
Exploratory Data Analysis in Python for Absolute Beginners

EDA in Python for Absolute Beginners

In this live training, we'll be doing Exploratory Data Analysis, or EDA, on a dataset that consists of hotel booking data. It includes many details about the bookings, including room specifications, the length of stay, the time between the booking and the stay, whether the booking was canceled, and how the booking was made. The data was gathered between July 2015 and August 2017. You can consult the appendices at the bottom of the notebook for citations and an overview of all variables.

To consult the solution, head over to the file browser and select notebook-solution.ipynb.

# Import the required packages
import pandas as pd 
import plotly.express as px

Import the data

# Import hotel_bookings_clean_v2.csv
df = pd.read_csv('hotel_bookings_clean_v2.csv')
df

Basic exploration

# Show dimensions - Mostra a dimensão da tabela (Linhas e colunas)
df.shape
# Are there missing values? - Df.isnull retorna a mesma tabela com true ou false para datos NAN, usando .sum agregamos todos os valores e somamos somente true que vale 1 e false vale 0
df.isnull().sum()
# Describe with summary statistics (valores minimos, máximos, médias e afins) basicamente é entender inicialmente a distribuição dos dados da tabela

df.describe()
# How many bookings were canceled? - Como a coluna é preenchida apenas com 0 e 1 podemos somar para ver os valores que o cancelamento é veradeiro 
n_cancelamentos = df['is_canceled'].sum()
porcent_cancelamentos = df['is_canceled'].mean()

print(f"{n_cancelamentos} reversas foram canceladas, que representa {porcent_cancelamentos*100:.2f}% de todas as reservas")

Are the cancellation rates different during different times of the year?

# Calculate and plot cancellations every month
cancellations = df\
.filter(['arrival_date_month', 'is_canceled'])\
.groupby(by = 'arrival_date_month', as_index=False)\
.sum()

# Create bar chart of cancellations per month
px.bar(cancellations, x = 'arrival_date_month', y='is_canceled')
# Calculate and plot total bookings every month
reservas_total = df\
.filter(['arrival_date_month', 'is_canceled'])\
.groupby(by = 'arrival_date_month', as_index=False)\
.count()\
.rename(columns = {'is_canceled':'total_bookings'})

# Create bar chart of total bookings per month
px.bar(reservas_total, x = 'arrival_date_month', y = 'total_bookings')
# Calculate cancellation rates every month
mesclado = pd.merge(cancellations,reservas_total, on = 'arrival_date_month')
mesclado['Taxa_cancelamento'] = mesclado['is_canceled']/mesclado['total_bookings']
mesclado
# Create bar chart of cancellation rate every month
px.bar(mesclado, x='arrival_date_month', y='Taxa_cancelamento')

Does the amount of nights influence the cancellation rate?

# Prepare the data
df_selection = df.assign(stays = lambda x: x["stays_in_week_nights"]+ x['stays_in_weekend_nights']).query('stays < 15')
df_selection