Intermediate Python
Run the hidden code cell below to import the data used in this course.
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Data Visualization
How to build a line chart (after import matplotlib.pyplot as plt):
plt.plot(x_values, y_values)  # list for the x-axis, then list for the y-axis
plt.show()  # necessary if you want to see your plot
To put the x-axis on a logarithmic scale, you can use: plt.xscale('log')
To build a scatter plot, use: plt.scatter(x, y)
To build a histogram: plt.hist(values, bins=number_of_bins)
How to label axes:
plt.xlabel('label of x-axis')
plt.ylabel('label of y-axis')
Title: plt.title('Title')
If you want specific tick marks on your y-axis, you can say: plt.yticks([0, 2, 4, 6, 8, 10]) # The axis is labeled with these numbers, and the view expands if needed so that the whole interval from 0 to 10 is visible
If you want to change the label shown at each tick, pass a second list of strings: plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B'])
scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, *, edgecolors=None, plotnonfinite=False, data=None, **kwargs) A scatter plot of y vs. x with varying marker size and/or color.
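The plotting calls above can be combined into one small sketch. The year and population values are made up for illustration, and the Agg backend is an assumption so that the code also runs without a display; in a notebook you would call plt.show() instead of plt.close().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Made-up illustrative values, not real data
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

plt.plot(year, pop)
plt.xlabel("Year")
plt.ylabel("Population")
plt.title("Illustrative population values")

# Tick positions plus one string label per position
locs, labels = plt.yticks([0, 2, 4, 6, 8, 10],
                          ["0", "2B", "4B", "6B", "8B", "10B"])
tick_labels = [label.get_text() for label in labels]
print(tick_labels)

plt.close()  # in a notebook, call plt.show() here to display the plot
```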
Dictionaries
Dictionaries connect keys to their values. To create a dictionary in which afghanistan has a population of 30.55 million:
world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}
To get the population for albania, just do: world["albania"]
To delete a key from the dictionary, use del: del(world["albania"])
To change the value of a key, such as making albania's population 2.8, you can simply: world["albania"] = 2.8
A list is an ordered collection of values, indexed by position (0, 1, 2, ...). A dictionary is a collection of values indexed by unique keys.
A list is useful if you want to collect values where the order matters, or want to select entire subsets.
A dictionary is useful when you want to quickly look up the value that belongs to a key.
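As a quick recap of the dictionary operations in these notes, here is a minimal sketch. The "sealand" entry and its value are only there to illustrate adding and deleting a key.

```python
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

albania_before = world["albania"]   # look up a value by its key
world["albania"] = 2.8              # update the value of an existing key
world["sealand"] = 0.000027         # add a new key:value pair (illustrative value)
had_sealand = "sealand" in world    # membership tests look at the keys
del world["sealand"]                # remove a key together with its value

print(albania_before, world["albania"], had_sealand)
print(list(world.keys()))
```

Looking up a key that does not exist raises a KeyError, so the `in` check is a handy guard.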
Dictionaryception
Dictionaries can contain key:value pairs where the values are again dictionaries.
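A nested dictionary could look like the sketch below. The population figures are illustrative values, not data from these notes.

```python
# Each country maps to its own small dictionary (illustrative values)
europe = {
    "spain": {"capital": "madrid", "population": 46.77},
    "france": {"capital": "paris", "population": 66.03},
}

# Chain square brackets to reach into the inner dictionary
france_capital = europe["france"]["capital"]
print(france_capital)

# Add a new country by assigning a whole sub-dictionary
europe["italy"] = {"capital": "rome", "population": 59.83}
print(sorted(europe.keys()))
```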
Pandas
Pandas is a high level manipulation tool (built on the numpy package) that can store data in a DataFrame. This allows the user to work with data of many different types in one table.
How to build a DataFrame
You can build a DataFrame from a dictionary:
brics_dict = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98]
}
The keys are the column labels (country, capital, area, population). The values are the data, column by column.
After importing the pandas package with "import pandas as pd", you can create a DataFrame from the dictionary using:
brics = pd.DataFrame(brics_dict)
This creates the DataFrame and automatically gives each row an integer index (0 to 4). To change the index from numbers to labels, you can assign to it:
brics.index = ["BR", "RU", "IN", "CH", "SA"]
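Put together, building the DataFrame and relabeling its rows might look like this sketch. The variable name brics_data is my own choice, and the capital is spelled "New Delhi" here.

```python
import pandas as pd

brics_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

brics = pd.DataFrame(brics_data)
default_index = list(brics.index)   # pandas numbers the rows 0..4 by default
print(default_index)

# Replace the default integer index with row labels
brics.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics)
```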
Creating a dictionary with the data was really time consuming and annoying :/ Instead of manually entering this data every time we want to create a DataFrame, we can IMPORT the data! If the brics data is stored in a CSV file (a file of comma-separated values), pandas can read it directly.
Importing data
To import data, pass the path to the CSV file as an argument. EX: brics = pd.read_csv("path/to/brics.csv")
By default, read_csv numbers the rows from 0 to 4 and treats the first column as ordinary data. We have to tell the read_csv function that the first column contains the row indexes. We can do this by setting the index_col argument: brics = pd.read_csv("path/to/brics.csv", index_col = 0)
Index and Select Data
To select a single column from a DataFrame, put the column label in square brackets after the DataFrame name. For example, to select the countries from brics: brics["country"] This returns the countries as a pandas Series, which is why the output ends with: Name: country, dtype: object
If you want the elements from a column but want to keep the DataFrame structure, use double brackets: brics[["country"]]
Square brackets have limited functionality
Pandas has loc and iloc. loc selects parts of your data based on labels. iloc selects parts of your data based on integer positions. Both are used with square brackets rather than called like functions.
Row Access Loc
To get the row for Russia: brics.loc["RU"] This returns a pandas Series that contains all of the row's information, printed with one value per line.
To get a DataFrame instead, put "RU" in another set of brackets: brics.loc[["RU"]]
To select multiple rows at the same time, list their labels: brics.loc[["RU", "IN", "CH"]]
If we want only some rows and some columns, we can add a second list after a comma. For example, to only get the countries and capital names we can do: brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
If we want all rows from the DataFrame but only a few columns, we use a colon for the rows: brics.loc[:, ["country", "capital"]]
Row and Column iloc
To select rows and columns by integer position, you can use: brics.iloc[[1,2,3], [1,2]] The left list selects the rows, the right list selects the columns.
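To see that loc and iloc can reach the same cells, here is a small sketch that rebuilds brics inline (so it does not depend on a CSV file) and compares a label-based selection with the matching position-based one.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Same rows and columns, once by label and once by integer position
by_label = brics.loc[["RU", "IN", "CH"], ["capital", "area"]]
by_position = brics.iloc[[1, 2, 3], [1, 2]]
print(by_label.equals(by_position))  # True

# Single brackets give a Series, double brackets keep a DataFrame
row_series = brics.loc["RU"]
row_frame = brics.loc[["RU"]]
print(type(row_series).__name__, type(row_frame).__name__)  # Series DataFrame
```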
# Add your code snippets here
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Create a loop that iterates through the brics DataFrame and prints "The population of {country} is {population} million!".
- Create a histogram of the life expectancies for countries in Africa in the gapminder DataFrame. Make sure your plot has a title, axis labels, and has an appropriate number of bins.
- Simulate 10 rolls of two six-sided dice. If the two dice add up to 7 or 11, print "A win!". If the two dice add up to 2, 3, or 12, print "A loss!". If the two dice add up to any other number, print "Roll again!".
Filtering Pandas DataFrames
If we want to compare the areas of all the countries in a pandas DataFrame, we work in three steps:
- Get Column: First we want a pandas Series. We can do this by: brics["area"] Alternatives: brics.loc[:, "area"] or brics.iloc[:, 2]
- Compare: brics["area"] > 8 returns a Series of booleans. We can store this boolean Series in the variable is_huge: is_huge = brics["area"] > 8
- Subset DF: brics[is_huge] gives us all the information about the countries with an area larger than 8
If we want to put it all in one line, we can do: brics[brics["area"] > 8]
Boolean Operators
To combine several conditions, we can use np.logical_and and np.logical_or:
import numpy as np
np.logical_and(brics["area"] > 8, brics["area"] < 10)
# This returns a boolean pandas Series. To get the
# countries, we just place this line inside brics[]
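The filtering steps above can be sketched end to end as follows. The DataFrame is rebuilt inline with only the columns needed, and the `&` form at the end is an equivalent alternative I added for comparison, not something from the course notes.

```python
import numpy as np
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Step by step: column -> boolean Series -> subset
is_huge = brics["area"] > 8
huge = brics[is_huge]
print(huge.index.tolist())    # ['BR', 'RU', 'CH']

# Two conditions combined with np.logical_and
between = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]
print(between.index.tolist())  # ['BR', 'CH']

# Equivalent alternative: the & operator with parentheses around each condition
between_alt = brics[(brics["area"] > 8) & (brics["area"] < 10)]
print(between.equals(between_alt))  # True
```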
Loops
for a in apple:
    print(a)  # the loop variable can have any name; the loop goes through each element of apple and prints it
To loop over a dictionary:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
# Iterate over europe
for key, value in europe.items():
    print("the capital of " + key + " is " + str(value))
Loop over NumPy Array
# Import numpy as np
import numpy as np
# For loop over np_height, a NumPy array containing the heights of Major League Baseball players
for x in np_height:
    print(str(x) + " inches")
# For loop over np_baseball, a 2D NumPy array that contains both the heights (first column) and weights (second column) of those players
for x in np.nditer(np_baseball):
    print(str(x))
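Why np.nditer is needed for the 2D case becomes clearer with a tiny example. The heights and weights below are made-up stand-ins for np_baseball, which is not defined in these notes.

```python
import numpy as np

# Made-up stand-in data: heights (first column) and weights (second column)
np_baseball = np.array([[74, 180],
                        [72, 215],
                        [75, 210]])

# A plain for loop over a 2D array yields one whole row (a 1D array) per step
rows = [row for row in np_baseball]
print(len(rows), rows[0])

# np.nditer visits every single element, row by row
values = [int(x) for x in np.nditer(np_baseball)]
print(values)  # [74, 180, 72, 215, 75, 210]
```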
Loop over DataFrame using iterrows()
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Adapt for loop
for lab, row in cars.iterrows():
    print(lab)
    print(row)
The row data that's generated by iterrows() on every iteration is a pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the pandas Series using square brackets:
for lab, row in brics.iterrows():
    print(row['country'])
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars)
# Adapt for loop
for lab, row in cars.iterrows():
    print(str(lab) + ": " + str(row['cars_per_cap']))
This produces:
US: 809
AUS: 731
JPN: 588
IN: 18
RU: 200
MOR: 70
EG: 45
Here, US: 809 is not a single element. US comes from the index (the lab variable), and 809 comes from the cars_per_cap column of the row.
Add a column
In the video, Hugo showed you how to add the length of the country names of the brics DataFrame in a new column:
for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])
You can do similar things on the cars DataFrame.
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, 'COUNTRY'] = row["country"].upper()
# Print cars
print(cars)
Add a column (2nd method)
Using iterrows() to iterate over every observation of a pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new pandas Series.
If you want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, you'll want to use apply().
Compare the iterrows() version with the apply() version to get the same result in the brics DataFrame:
for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])
brics["name_length"] = brics["country"].apply(len)
We can do a similar thing to call the upper() method on every name in the country column. However, upper() is a string method rather than a standalone function like len(), so we pass it to apply() as str.upper:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)
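A quick way to convince yourself that both approaches give the same column is to run them side by side. Since cars.csv is not available here, this sketch uses a tiny hand-made DataFrame; the country names are my assumption, while the cars_per_cap values are the ones printed earlier in these notes.

```python
import pandas as pd

# Hand-made stand-in for cars.csv (country names are assumed)
cars = pd.DataFrame(
    {
        "country": ["United States", "Japan", "Egypt"],
        "cars_per_cap": [809, 588, 45],
    },
    index=["US", "JPN", "EG"],
)

# Version 1: iterrows() with one .loc assignment per row
looped = cars.copy()
for lab, row in looped.iterrows():
    looped.loc[lab, "COUNTRY"] = row["country"].upper()

# Version 2: apply() calls str.upper on every value of the column
applied = cars.copy()
applied["COUNTRY"] = applied["country"].apply(str.upper)

same_result = looped["COUNTRY"].tolist() == applied["COUNTRY"].tolist()
print(same_result)  # True
print(applied["COUNTRY"].tolist())
```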
Random Float
In this exercise, you'll be using two functions from NumPy's random package (np.random):
seed(): sets the random seed, so that your results are reproducible between simulations. As an argument, it takes an integer of your choosing. If you call the function, no output will be generated.
rand(): if you don't specify any arguments, it generates a random float between zero and one.
# Import numpy as np
import numpy as np
# Set the seed
np.random.seed(123)
# Generate and print random float
print(np.random.rand())
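What "reproducible" means can be checked directly: resetting the same seed should give the same number again. A minimal sketch:

```python
import numpy as np

np.random.seed(123)
first = np.random.rand()

# Resetting the same seed restarts the same sequence of random numbers
np.random.seed(123)
second = np.random.rand()

# Without resetting, the next call moves on to the next number in the sequence
third = np.random.rand()

print(first == second)   # True
print(0 <= first < 1)    # True
```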
Random Walk
# NumPy is imported, seed is set
# Initialize random_walk
random_walk = [0]
for x in range(100):
    # Set step: last element in random_walk
    step = random_walk[-1]
    # Roll the dice
    dice = np.random.randint(1,7)
    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    # Append next step to random_walk
    random_walk.append(step)
# Print random_walk
print(random_walk)
New Method Max
The built-in function max() returns the largest of its arguments, so max(0, x) never returns less than 0. In the line below, max ensures step never goes below zero: when step is above 0, subtracting 1 works as usual, and when step is already 0, it stays at 0. max(0, step - 1)
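Two tiny cases show the effect of max(0, step - 1). The variable names are only for illustration.

```python
# Already at the bottom: 0 - 1 would be -1, but max keeps it at 0
step_at_floor = max(0, 0 - 1)
print(step_at_floor)    # 0

# Above the bottom: the subtraction goes through unchanged
step_above = max(0, 3 - 1)
print(step_above)       # 2
```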
Simulation of multiple walks (to get a histogram)
# NumPy is imported; seed is set
# Initialize all_walks (don't change this line)
all_walks = []
# Simulate random walk five times
for i in range(5):
    # Code from before
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    # Append random_walk to all_walks
    all_walks.append(random_walk)
# Print all_walks
print(all_walks)
Visualize All Walks
# This code plots all 5 simulated walks, using the transpose. Transposing a dataset means swapping its rows and columns so that the rows become columns and the columns become rows. When we called plt.plot(np_aw) directly, plt.plot drew one line per column, so there were 101 lines with 5 points each. After transposing, there are only 5 lines, one per walk.
# numpy and matplotlib imported, seed set.
# Initialize and populate all_walks
all_walks = []
for i in range(5):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)
# Convert all_walks to NumPy array: np_aw
np_aw = np.array(all_walks)
# Plot np_aw and show
plt.plot(np_aw)
plt.show()
# Clear the figure
plt.clf()
# Transpose np_aw: np_aw_t
np_aw_t = np.transpose(np_aw)
# Plot np_aw_t and show
plt.plot(np_aw_t)
plt.show()
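The effect of the transpose is easiest to see in the array shapes. This sketch repeats the simulation without plotting and only inspects the shapes; the seed value 123 is an arbitrary choice.

```python
import numpy as np

np.random.seed(123)  # arbitrary seed, only for reproducibility

all_walks = []
for i in range(5):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        random_walk.append(step)
    all_walks.append(random_walk)

np_aw = np.array(all_walks)
np_aw_t = np.transpose(np_aw)

# 5 walks of 101 positions each (the start plus 100 steps)
print(np_aw.shape)    # (5, 101): plotting this draws 101 lines of 5 points
print(np_aw_t.shape)  # (101, 5): plotting this draws 5 lines of 101 points
```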
Implement Clumsiness