
Explore a DataFrame

Use this template to get a solid understanding of the structure of your DataFrame and its values before jumping into a deeper analysis. This template leverages many of pandas' handy functions for the most fundamental exploratory data analysis steps, including inspecting column data types and distributions, creating exploratory visualizations, and counting unique and missing values.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load your dataset into a DataFrame
df = pd.read_csv("data/taxis.csv")

# Print the number of rows and columns
print(f"Number of rows and columns: {df.shape}")
# Print out the first five rows
df.head()

Understanding columns and values

The info() function prints a concise summary of the DataFrame. For each column, you can find its name, data type, and the number of non-null rows. This is useful to gauge if there are many missing values and to understand what data types you're dealing with. To get an exact count of missing values in each column, call the isna() function and aggregate it using the sum() function:

print(df.info())
print(df.isna().sum())

# Sanity check: recompute the total from its components and compare
df['total_corretto'] = df['fare'] + df['tip'] + df['tolls']
df['prova'] = df['total'] - df['total_corretto']
prova = df[df['prova'] != 0]  # rows where the stored total disagrees with the recomputed one
prova

# pd.set_option('display.max_columns', None)  # uncomment to display all columns
print(df.describe())
# 1. Remove records with a null pickup or dropoff
df_cleaned = df.dropna(subset=["pickup", "dropoff"]).copy()

# 2. Impute null 'tip' and 'tolls' with 0
df_cleaned["tip"] = df_cleaned["tip"].fillna(0)
df_cleaned["tolls"] = df_cleaned["tolls"].fillna(0)
df_cleaned['color'] = df_cleaned['color'].fillna('unknown')
df_cleaned['payment'] = df_cleaned['payment'].fillna('credit card')

# 3. Impute null 'distance' with the median
distance_median = df_cleaned["distance"].median()
df_cleaned["distance"] = df_cleaned["distance"].fillna(distance_median)


# Check: remaining null values
df_cleaned.isna().sum()
colonne_da_controllare = ['pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']

def analizza_nulli(df_cleaned, colonne_da_controllare):
    """
    Analyze the distribution of null values in the DataFrame for a specific set of columns.
    Prints:
    - How many rows have at least one null value
    - The percentage of the total
    - How many rows have all 4 values null
    """
    df_cleaned = df_cleaned.copy()

    # Count null values per row across the selected columns
    df_cleaned["null_count"] = df_cleaned[colonne_da_controllare].isna().sum(axis=1)

    # Filter the rows with at least one null value
    righe_con_nulli = df_cleaned[df_cleaned["null_count"] > 0]

    # Totals
    totale_righe = df_cleaned.shape[0]
    totale_nulli = righe_con_nulli.shape[0]
    percentuale = round((totale_nulli / totale_righe) * 100, 2)

    # Rows with all 4 values null
    righe_con_4 = righe_con_nulli[righe_con_nulli["null_count"] == 4].shape[0]

    # Print the results
    print("📊 Null-value analysis:")
    print(f"→ Total rows: {totale_righe}")
    print(f"→ Rows with at least one null value: {totale_nulli} ({percentuale}%)")
    print(f"→ Rows with all 4 values null: {righe_con_4}")

    # Return the filtered DataFrames (optional)
    return righe_con_nulli, righe_con_nulli[righe_con_nulli["null_count"] == 4]


analizza_nulli(df_cleaned, colonne_da_controllare)

If there are missing values, you'll have to decide if and how missing values should be dealt with. If you want to learn more about removing and replacing values, check out chapter 2 of DataCamp's Data Manipulation with pandas course.

The describe() function generates helpful descriptive statistics for each numeric column. You can see the percentile, mean, standard deviation, and minimum and maximum values in its output. Note that missing values are excluded here.
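As a minimal illustration of how missing values are excluded, consider a tiny hypothetical DataFrame with one null fare:

```python
import pandas as pd

# Toy data: one of the four fares is missing
toy = pd.DataFrame({"fare": [5.0, 7.5, None, 10.0]})
stats = toy.describe()

# count reflects only the 3 non-null fares, and the mean is computed over them
print(stats.loc["count", "fare"])  # 3.0
print(stats.loc["mean", "fare"])   # 7.5
```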

# Count null values per row
df_cleaned['null_count'] = df_cleaned.isna().sum(axis=1)

# Remove the rows that have exactly 4 null values
df_cleaned = df_cleaned[df_cleaned['null_count'] < 4]
print(df_cleaned.shape)
df_cleaned.isna().sum()
df_cleaned['payment'].value_counts()
df_cleaned = df_cleaned.drop(columns="null_count")
df_cleaned.isna().sum()

Basic data visualizations

pandas' plot() function makes it easy to plot columns from your DataFrame. This section will go through a few basic data visualizations to better understand your data. If you need a refresher on visualizing DataFrames, chapter 4 of DataCamp's Data Manipulation with pandas course is a useful reference!
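As a minimal sketch of the plot() accessor, here is a histogram of a hypothetical numeric 'distance' column (the toy values below are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"distance": [0.5, 1.2, 2.3, 3.1, 0.9, 7.5]})

# Histogram of trip distances via the Series' plot() accessor
ax = df["distance"].plot(kind="hist", bins=5, title="Trip distance distribution")
ax.set_xlabel("distance (miles)")
plt.savefig("distance_hist.png")
```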

Boxplots can help you identify outliers:
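As a sketch of that idea on a tiny hypothetical fares column: the 1.5 × IQR rule below is the same one boxplot whiskers use by default, so the flagged row is the point the boxplot would draw outside the whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"fare": [5.0, 6.5, 7.0, 8.0, 95.0]})  # 95.0 is an obvious outlier

# Draw the boxplot; the extreme fare appears as a point beyond the whisker
sns.boxplot(x=df["fare"])
plt.savefig("fare_boxplot.png")

# Flag points beyond 1.5 * IQR, the same rule the whiskers use
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```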

def rimuovi_outlier(df):
    # Suspicious combinations that point to bad records
    cond1 = (df['distance'] == 0) & (df['fare'] > 10)   # no distance but a high fare
    cond2 = (df['fare'] == 0) & (df['distance'] > 2)    # long trip but no fare
    cond3 = df['total'] < df['fare']                    # total lower than the base fare
    cond4 = (df['pickup_zone'] == df['dropoff_zone']) & (df['distance'] > 10)  # same zone, long distance

    mask_outliers = cond1 | cond2 | cond3 | cond4
    outliers = df[mask_outliers]
    totale_outlier = outliers.shape[0]

    print(f"🧹 Total outliers removed: {totale_outlier}")
    # print("\n🔍 First few outlier examples:")
    print(outliers[['distance', 'fare', 'pickup_zone', 'dropoff_zone']].head(10))

    return df[~mask_outliers], outliers

df_cleaned, outliers = rimuovi_outlier(df_cleaned)

#sns.lineplot(x ="fare",y= "distance", data = dubbio)