Explore a DataFrame
Use this template to get a solid understanding of the structure of your DataFrame and its values before jumping into a deeper analysis. This template leverages many of pandas' handy functions for the most fundamental exploratory data analysis steps, including inspecting column data types and distributions, creating exploratory visualizations, and counting unique and missing values.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load your dataset into a DataFrame
df = pd.read_csv("data/taxis.csv")
# Print the number of rows and columns
print(f"Number of rows and columns: {df.shape}")
# Print out the first five rows
df.head()

Understanding columns and values
The info() function prints a concise summary of the DataFrame. For each column, you can find its name, data type, and the number of non-null rows. This is useful to gauge if there are many missing values and to understand what data types you're dealing with. To get an exact count of missing values in each column, call the isna() function and aggregate it using the sum() function:
# info() prints its summary directly (and returns None), so no print() wrapper is needed
df.info()
print(df.isna().sum())
# Recompute the total from its components to check the recorded 'total' column
df['total_corretto'] = df['fare'] + df['tip'] + df['tolls']
df['total_corretto']

# Rows where the recorded total disagrees with the recomputed one
df['prova'] = df['total'] - df['total_corretto']
prova = df[df['prova'] != 0]
prova

# pd.set_option('display.max_columns', None)
print(df.describe())

# 1. Drop records with null pickup or dropoff
df_cleaned = df.dropna(subset=["pickup", "dropoff"])
# 2. Impute null 'tip' and 'tolls' with 0
df_cleaned["tip"] = df_cleaned["tip"].fillna(0)
df_cleaned["tolls"] = df_cleaned["tolls"].fillna(0)
df_cleaned['color'] = df_cleaned['color'].fillna('unknown')
df_cleaned['payment'] = df_cleaned['payment'].fillna('credit card')

# 3. Impute null 'distance' with the median
distance_median = df_cleaned["distance"].median()
df_cleaned["distance"] = df_cleaned["distance"].fillna(distance_median)

# Check: remaining null values
df_cleaned.isna().sum()

colonne_da_controllare = ['pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']
def analizza_nulli(df_cleaned, colonne_da_controllare):
    """
    Analyze the distribution of null values in the DataFrame for a specific set of columns.
    Prints:
    - How many rows have at least one null value
    - The percentage of the total
    - How many rows have all 4 values null
    """
    # Count the null values per row over the selected columns
    # (use the parameter, not the global df)
    df_cleaned["null_count"] = df_cleaned[colonne_da_controllare].isna().sum(axis=1)

    # Filter the rows with at least one null value
    righe_con_nulli = df_cleaned[df_cleaned["null_count"] > 0]

    # Totals
    totale_righe = df_cleaned.shape[0]
    totale_nulli = righe_con_nulli.shape[0]
    percentuale = round((totale_nulli / totale_righe) * 100, 2)

    # Rows with all 4 values null
    righe_con_4 = righe_con_nulli[righe_con_nulli["null_count"] == 4].shape[0]

    # Print the results
    print("📊 Null-value analysis results:")
    print(f"→ Total rows: {totale_righe}")
    print(f"→ Rows with at least one null value: {totale_nulli} ({percentuale}%)")
    print(f"→ Rows with 4 null values: {righe_con_4}")

    # Optionally, return the filtered DataFrames
    return righe_con_nulli, righe_con_nulli[righe_con_nulli["null_count"] == 4]
analizza_nulli(df_cleaned, colonne_da_controllare)
If there are missing values, you'll have to decide if and how missing values should be dealt with. If you want to learn more about removing and replacing values, check out chapter 2 of DataCamp's Data Manipulation with pandas course.
The describe() function generates helpful descriptive statistics for each numeric column. You can see the percentiles, mean, standard deviation, and minimum and maximum values in its output. Note that missing values are excluded here.
# Count the null values per row
df_cleaned['null_count'] = df_cleaned.isna().sum(axis=1)

# Drop the rows in which all 4 checked values are null
df_cleaned = df_cleaned[df_cleaned['null_count'] < 4]
print(df_cleaned.shape)
df_cleaned.isna().sum()

Use the unique() function to print out the distinct values of a column; value_counts() additionally counts how often each of them occurs:
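As a minimal sketch of the difference between the two (the small stand-in DataFrame below is hypothetical; with the real data you would call these on df_cleaned['payment']):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned taxi data
sample = pd.DataFrame(
    {"payment": ["credit card", "cash", "credit card", "cash", "credit card"]}
)

# unique() returns the distinct values as an array, in order of appearance
print(sample["payment"].unique())

# value_counts() returns a Series counting how often each value occurs,
# sorted from most to least frequent
print(sample["payment"].value_counts())
```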
df_cleaned['payment'].value_counts()
df_cleaned = df_cleaned.drop(columns="null_count")
df_cleaned.isna().sum()
Basic data visualizations
pandas' plot() function makes it easy to plot columns from your DataFrame. This section will go through a few basic data visualizations to better understand your data. If you need a refresher on visualizing DataFrames, chapter 4 of DataCamp's Data Manipulation with pandas course is a useful reference!
Boxplots can help you identify outliers:
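As a quick sketch of the boxplot step using pandas' own plot() (the tiny stand-in DataFrame and the output filename are assumptions; with the real data you would plot df_cleaned[["distance", "fare"]]):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import pandas as pd

# Hypothetical stand-in for df_cleaned with one deliberately extreme ride
sample = pd.DataFrame(
    {"distance": [0.5, 1.2, 2.0, 3.1, 30.0],
     "fare": [5.0, 7.5, 9.0, 12.0, 120.0]}
)

# One box per column; extreme values show up as isolated points beyond the whiskers
ax = sample.plot(kind="box")
ax.figure.savefig("boxplots.png")
```

Visual inspection flags candidates; the rule-based filter in the next cell then removes rides that are implausible on their face (for instance a zero-distance trip with a high fare).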
def rimuovi_outlier(df):
    # Rule-based outlier conditions (use the parameter, not the global df_cleaned)
    cond1 = (df['distance'] == 0) & (df['fare'] > 10)
    cond2 = (df['fare'] == 0) & (df['distance'] > 2)
    cond3 = df['total'] < df['fare']
    cond4 = (df['pickup_zone'] == df['dropoff_zone']) & (df['distance'] > 10)
    mask_outliers = cond1 | cond2 | cond3 | cond4

    outliers = df[mask_outliers]
    totale_outlier = outliers.shape[0]
    print(f"🧹 Total outliers removed: {totale_outlier}")
    # print("\n🔍 First outlier examples:")
    print(outliers[['distance', 'fare', 'pickup_zone', 'dropoff_zone']].head(10))
    return df[~mask_outliers], outliers

# Keep the filtered DataFrame instead of discarding the return values
df_cleaned, outliers = rimuovi_outlier(df_cleaned)
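After the outliers are removed, a scatter plot of fare against distance is a quick sanity check that the two grow together. A sketch with pandas' plot() (the stand-in data and filename are assumptions; with the real data pass df_cleaned):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import pandas as pd

# Hypothetical stand-in for the cleaned data
sample = pd.DataFrame(
    {"distance": [0.8, 1.5, 2.3, 4.0, 6.2],
     "fare": [5.0, 7.0, 9.5, 14.0, 21.0]}
)

# Fare should grow roughly linearly with distance; points far off the
# trend are candidates for a second look
ax = sample.plot(kind="scatter", x="distance", y="fare")
ax.figure.savefig("fare_vs_distance.png")
```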