Skip to content
Data Manipulation with pandas
  • AI Chat
  • Code
  • Report
  • Data Manipulation with pandas

    Run the hidden code cell below to import the data used in this course.

    # Import the course packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Import the four datasets
    avocado = pd.read_csv("datasets/avocado.csv")
    homelessness = pd.read_csv("datasets/homelessness.csv")
    temperatures = pd.read_csv("datasets/temperatures.csv")
    walmart = pd.read_csv("datasets/walmart.csv")

    Take Notes

    Add notes about the concepts you've learned and code cells with code you want to keep.

    se puede calcular el promedio de una columna usando .mean() dogs['weight'].mean() lo mismo con .median() .mode() .max() .min() .var() .std() var siendo varianza o variance y std siendo la desviación estandar se pueden hacer sumas e incluso calcular los cuartiles con .quantile()

    the agg() method lets us use costumizable statistics def prct30(column): return column.quantile(0.3)

    print(netflix_df['duration'].agg(prct30)) como aqui que al definir una funcion que retorne el percentil 30 se usa agg para sacarlo de una columna definida de el dataframe

    The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

    .cumsum() permite tener el acumulado de toda una columna row a row tambien existe .cummax() .cummin() .cumprod()

    Add your notes here

    # Add your code snippets here

    Update the aggregation functions called by .agg(): include iqr and np.median in that order.

    # Import NumPy and create custom IQR function
    import numpy as np
    def iqr(column):
        return column.quantile(0.75) - column.quantile(0.25)
    
    # Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
    print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))
    Hidden output
    #the agg() method lets us use costumizable statistics
    def prct30(column):
        return column.quantile(0.3)
    
    print(netflix_df['duration'].agg(prct30))
    Hidden output

    Cumulative statistics Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

    A DataFrame called sales_1_1 has been created for you, which contains the sales data for department 1 of store 1. pandas is loaded as pd.

    Instructions 100 XP Instructions 100 XP Sort the rows of sales_1_1 by the date column in ascending order. Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales. Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales. Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.

    # Sort sales_1_1 by date
    sales_1_1 = sales_1_1.sort_values('date', ascending=True)
    print(sales_1_1)
    # Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
    sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum() 
    
    # Get the cumulative max of weekly_sales, add as cum_max_sales col
    sales_1_1['cum_max_sales']=sales_1_1['weekly_sales'].cummax()
    # See the columns you calculated
    print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

    .drop_duplicates toma todos los valores de alguna variable que ya se haya presentado y remueve todos menos el original. Se usa nombredf.drop_duplicates(subset = '*nombre de la columna')

    ya que lo anterior podria quitar duplicados de sujetos que aun asi son distintos, se usan en añadidura otra columna para filtras nombredf.drop_duplicates(subset = ['*nombre de la columna', 'nombre de la otra'])

    Dropping duplicates Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from sales.

    sales is available and pandas is imported as pd.

    Instructions 100 XP Instructions 100 XP Remove rows of sales with duplicate pairs of store and type and save as store_types and print the head. Remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head. Subset the rows that are holiday weeks using the is_holiday column, and drop the duplicate dates, saving as holiday_dates. Select the date column of holiday_dates, and print.

    # Drop duplicate store/type combinations
    store_types = sales.drop_duplicates(['store','type'])
    print(store_types.head())
    
    # Drop duplicate store/department combinations
    store_depts = sales.drop_duplicates(['store', 'department'])
    print(store_depts.head())
    
    # Subset the rows where is_holiday is True and drop duplicate dates
    holiday_dates = sales[sales['is_holiday']==True].drop_duplicates('date')
    
    # Print date col of holiday_dates
    print(holiday_dates['date'])

    Counting categorical variables Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

    Drop duplicate store/type combinations

    store_types = sales.drop_duplicates(subset=["store", "type"])

    Drop duplicate store/department combinations

    store_depts = sales.drop_duplicates(subset=["store", "department"]) The store_types and store_depts DataFrames you created in the last exercise are available, and pandas is imported as pd.

    Instructions 100 XP Count the number of stores of each store type in store_types. Count the proportion of stores of each store type in store_types. Count the number of different departments in store_depts, sorting the counts in descending order. Count the proportion of different departments in store_depts, sorting the proportions in descending order.