Data Manipulation with pandas
👋 Welcome to your new workspace! Here, you can experiment with the data you used in Data Manipulation with pandas and practice your newly learned skills with some challenges. You can find out more about DataCamp Workspace here.
On average, we expect users to take approximately 30 minutes to complete the content in this workspace. However, you are free to experiment and practice in it as long as you would like!
1. Get Started
Below is a code cell. It is used to execute Python code. The code below imports three packages you used in Data Manipulation with pandas: pandas, NumPy, and Matplotlib. The code also imports data you used in the course as DataFrames using the pandas read_csv() function.
🏃 To execute the code, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
# Print the first DataFrame
avocado
2. Write Code
After running the cell above, you have created four pandas DataFrames: avocado, homelessness, temperatures, and walmart.
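Before starting the challenges, it can help to confirm what each DataFrame actually holds. The following is a minimal sketch, not part of the course material: it reads a small made-up CSV from memory instead of datasets/avocado.csv. Only the column names (date, type, year, nb_sold) are taken from the code in this workspace; the values are invented.

```python
import io

import pandas as pd

# Made-up stand-in for datasets/avocado.csv; values are invented for illustration
csv_text = """date,type,year,nb_sold
2016-12-25,organic,2016,100.0
2017-01-01,conventional,2017,250.0
2017-01-01,organic,2017,120.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types as read by read_csv()
print(sample.head())  # first rows

# Without parse_dates, read_csv() leaves the date column as text
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])

# Passing parse_dates converts the column to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["date"])
```

The same three inspection calls (shape, dtypes, head()) can be run on avocado, homelessness, temperatures, and walmart.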
Add code to the code cells below to try one (or more) of the following challenges:
- Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
- What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
- Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
Be sure to check out the Answer Key at the end to see one way to solve each problem. Did you try something similar?
Reminder: To execute the code you add to a cell, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.
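The first two challenges both follow the same pattern: group or subset the rows, then aggregate. A minimal sketch of that pattern on invented data follows. The column names match those used in this workspace, the numbers are made up, and nlargest() is shown only as one possible alternative to sorting and then calling head().

```python
import pandas as pd

# Invented sales figures; only the column names mirror walmart.csv
sales = pd.DataFrame({
    "department": [1, 1, 2, 2, 3],
    "weekly_sales": [10.0, 30.0, 50.0, 20.0, 40.0],
})

# Highest weekly sales per department, largest first, top two only
top_sales = sales.groupby("department")["weekly_sales"].max().nlargest(2)
print(top_sales)

# Invented avocado rows; only the column names mirror avocado.csv
avocados = pd.DataFrame({
    "type": ["organic", "organic", "conventional", "organic"],
    "year": [2017, 2017, 2017, 2016],
    "nb_sold": [100.0, 150.0, 900.0, 80.0],
})

# Combine two boolean conditions with & (each wrapped in parentheses)
is_organic_2017 = (avocados["type"] == "organic") & (avocados["year"] == 2017)
total_organic_2017 = avocados.loc[is_organic_2017, "nb_sold"].sum()
print(total_organic_2017)
```

On the real data, nlargest(5) would return the top five departments in one step.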
walmart.head()
# 1. Print the highest weekly sales for each department
department_sales = walmart.groupby("department")[["weekly_sales"]].max()
department_sales_srt = department_sales.sort_values(by="weekly_sales", ascending=False)
# Show the top five departments rather than the type of the result
print(department_sales_srt.head())
# 2. What was the total `nb_sold` of organic avocados in 2017?
# Report the organic total for every year in the data (2017 included)
for year in avocado["year"].unique():
    organic_sales = avocado[(avocado["type"] == "organic") & (avocado["year"] == year)]["nb_sold"].sum()
    print(f"The total number of organic avocados sold in {year} is {round(organic_sales / 1000000, 2)} million")
# 3. Create a bar plot of the number of homeless people by region
homelessness_by_region = homelessness.groupby("region")["individuals"].sum().sort_values()
homelessness_by_region.plot(kind="barh")
# In a horizontal bar chart the counts run along the x-axis and the regions along the y-axis
plt.xlabel("Number")
plt.ylabel("Region")
plt.title("Homeless individuals by region")
plt.show()
# 4. Create a line plot of temperatures in Toronto and Rome
toronto = temperatures[temperatures.city == "Toronto"]
rome = temperatures[temperatures.city == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="red")
plt.legend(["Toronto", "Rome"])
plt.xlabel("Date")
plt.ylabel("Temperature (C)")
plt.title("Temperature (C) over time")
plt.show()
rome
3. Next Steps
Feeling confident about your skills? Continue on to Joining Data with pandas! This course will teach you how to combine multiple datasets, an essential skill on the road to becoming a data scientist!
4. Answer Key
Below are potential solutions to the challenges shown above. Try them out and see how they compare to how you approached the problem!
# 1. Print the highest weekly sales for each department
department_sales = walmart.groupby("department")[["weekly_sales"]].max()
best_departments = department_sales.sort_values(by="weekly_sales", ascending=False)
best_departments.head()
# 2. What was the total `nb_sold` of organic avocados in 2017?
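# The dates are plain strings here (read_csv() was called without parse_dates), so,
# assuming they start with the year, this label slice stops at the string "2018"
# and keeps only 2017 rows
avocado_2017 = avocado.set_index("date").sort_index().loc["2017":"2018"]
avocado_organic_2017 = avocado_2017.loc[avocado_2017["type"] == "organic"]
avocado_organic_2017["nb_sold"].sum()
# 3. Create a bar plot of the number of homeless people by region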
homelessness_by_region = (
    homelessness.groupby("region")["individuals"].sum().sort_values()
)
homelessness_by_region.plot(kind="barh")
plt.title("Total Number of Homeless People by Region")
plt.xlabel("Number")
plt.ylabel("Region")
plt.show()
# 4. Create a line plot of temperatures in Toronto and Rome
toronto = temperatures[temperatures.city == "Toronto"]
rome = temperatures[temperatures.city == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="red")
plt.title("Toronto and Rome Average Temperature (C)")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.legend(labels=["Toronto", "Rome"])
plt.show()
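In the solution above, the legend labels are matched to the lines only by the order of the two plot calls. One possible alternative, sketched here on a small invented frame rather than temperatures.csv, reshapes the data so that each city becomes a column and the legend is taken from the column names. The column names date, city, and avg_temp_c are those used above; the values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample standing in for temperatures.csv
sample = pd.DataFrame({
    "date": ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"],
    "city": ["Toronto", "Rome", "Toronto", "Rome"],
    "avg_temp_c": [-5.0, 8.0, -4.0, 9.0],
})
sample["date"] = pd.to_datetime(sample["date"])

# One row per date, one column per city
wide = sample.pivot_table(
    values="avg_temp_c", index="date", columns="city", aggfunc="mean"
)

# Select the columns explicitly so the colors line up with the cities;
# the legend entries are generated from the column names
ax = wide[["Toronto", "Rome"]].plot(kind="line", color=["blue", "red"])
ax.set_title("Toronto and Rome Average Temperature (C)")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (C)")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

In a notebook, the matplotlib.use("Agg") line can be dropped and plt.show() added at the end to display the figure.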