Skip to content
Data Manipulation with pandas
  • AI Chat
  • Code
  • Report
  • Data Manipulation with pandas

    Run the hidden code cell below to import the data used in this course.

    Take Notes

    Add notes about the concepts you've learned and code cells with code you want to keep.

    Add your notes here

    # Add your code snippets here
    
    # Code to make understand DataFrame
    .head() returns the first few rows (the “head” of the DataFrame).
    
    .info() shows information on each of the columns, such as the data type and number of missing values.
    
    .shape returns the number of rows and columns of the DataFrame.
    
    .describe() calculates a few summary statistics for each column.
    
    .values: A two-dimensional NumPy array of values.
    
    .columns: An index of columns: the column names.
    
    .index: An index for the rows: either row numbers or row names.

    Explore Datasets

    Use the DataFrames imported in the first cell to explore the data and practice your skills!

    • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
    • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
    • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
    • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
    # Sort rows by a specific column in ascending order
    df.sort_values('column_name', ascending=True, inplace=True)
    
    # Sort rows by multiple columns in descending order
    df.sort_values(['column_name1', 'column_name2'], ascending=[False, False], inplace=True)
    Hidden output
    # Subset a single column
    df['column_name']
    
    # Subset multiple columns
    df[['column_name1', 'column_name2']]
    Hidden output
    # Subset rows where a specific column meets a condition
    subset = df[df['column_name'] > 10]
    
    # Subset rows where multiple columns meet conditions
    subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value')]
    Hidden output
    # Subset rows where a specific column has certain categorical values
    subset = df[df['column_name'].isin(['value1', 'value2', 'value3'])]
    Hidden output
    df['new_column_name'] = values
    
    df['total'] = df['column1'] + df['column2']
    
    df['average'] = df['column1'] / df['column1'].mean()dfd
    Hidden output

    Use Case for data manipulation

    # Create indiv_per_10k col as homeless individuals per 10k state pop
    homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 
    
    # Subset rows for indiv_per_10k greater than 20
    high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
    
    # Sort high_homelessness by descending indiv_per_10k
    high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)
    
    # From high_homelessness_srt, select the state and indiv_per_10k cols
    result = high_homelessness_srt[["state", "indiv_per_10k"]]
    
    # See the result
    print(result)
    # Import the necessary libraries
    import pandas as pd
    
    # Create a DataFrame
    data = {'column_name': [5, 10, 15, 20, 25],
            'column_name1': [5, 10, 15, 20, 25],
            'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
    df = pd.DataFrame(data)
    
    # Subset rows where a specific column meets a condition
    subset = df[df['column_name'] > 10]
    
    # Subset rows where multiple columns meet conditions
    subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value')]
    
    # Subset rows where a specific column has certain categorical values
    subset = df[df['column_name'].isin(['value1', 'value2', 'value3'])]
    
    # Create a new column with calculated values
    df['new_column_name'] = values
    
    # Perform arithmetic operations on columns
    df['total'] = df['column1'] + df['column2']
    df['average'] = df['column1'] / df['column1'].mean()
    
    # Manipulate data using pandas functions
    homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
    high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
    high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
    result = high_homelessness_srt[["state", "indiv_per_10k"]]
    # Import the necessary libraries
    import pandas as pd
    import numpy as np
    import scipy.stats as stats
    
    # Create a DataFrame
    data = {'column_name': [5, 10, 15, 20, 25],
            'column_name1': [5, 10, 15, 20, 25],
            'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
    df = pd.DataFrame(data)
    
    # Calculate descriptive statistics
    mean = df['column_name'].mean()
    median = df['column_name'].median()
    mode = df['column_name'].mode().values[0]
    std_dev = df['column_name'].std()
    variance = df['column_name'].var()
    
    # Perform hypothesis testing
    t_stat, p_value = stats.ttest_1samp(df['column_name'], 0)
    
    # Perform correlation analysis
    correlation = df['column_name'].corr(df['column_name1'])
    
    # Perform linear regression
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['column_name'], df['column_name1'])