Intermediate Python

Run the hidden code cell below to import the data used in this course.


Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Data Visualization

How to build a line chart:
plt.plot(x_values, y_values)  # first list goes on the x-axis, second on the y-axis
plt.show()  # necessary if you want to see your plot

To put the x-axis on a logarithmic scale, you can use: plt.xscale('log')

To build a scatter plot, use: plt.scatter(x, y)

To build a histogram: plt.hist(values, bins=number_of_bins)

How to label axes:
plt.xlabel('label of x-axis')
plt.ylabel('label of y-axis')

Title: plt.title('Title')

If you want the y-axis ticks at set intervals, you can say: plt.yticks([0, 2, 4, 6, 8, 10]) # This places the tick marks at these values. It does not by itself fix the axis range; plt.ylim(0, 10) does that.

If you want to change the label shown at each tick, pass a second list of strings: plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B'])
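The plotting calls above can be combined into one short script. This is only a sketch: `years` and `population` are illustrative names holding rough values, not data from the course, and the Agg backend line is an assumption so that the script runs without a display. In an interactive session you would drop that line and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Illustrative values only (not course data)
years = [1950, 1970, 1990, 2010]
population = [2.5, 3.7, 5.3, 6.9]  # in billions

plt.plot(years, population)
plt.xlabel("Year")
plt.ylabel("Population")
plt.title("World population (illustrative)")

# First list: tick positions. Second list: the text shown at each position.
plt.yticks([0, 2, 4, 6, 8, 10], ["0", "2B", "4B", "6B", "8B", "10B"])

# Read the tick positions back from the current axes
tick_positions = [float(t) for t in plt.gca().get_yticks()]
print(tick_positions)
```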

scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, *, edgecolors=None, plotnonfinite=False, data=None, **kwargs) A scatter plot of y vs. x with varying marker size and/or color.

Dictionaries

Dictionaries map keys to their values. To create a dictionary in which afghanistan has a population of 30.55 million:

world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}

To get the population for albania, just do: world["albania"]

To delete a key from the dictionary, use del: del(world["albania"])

To change the value of a key, such as making albania's population 2.8, you can simply: world["albania"] = 2.8
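The dictionary operations above can be tried together in a few lines. This is a minimal sketch; the "sealand" entry and its value are made up purely to show adding and deleting a key.

```python
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

# Look up a value by its key
albania_population = world["albania"]

# Assigning to an existing key overwrites its value
world["albania"] = 2.8

# Assigning to a key that does not exist yet adds a new pair
world["sealand"] = 0.000027  # hypothetical entry for illustration

# "in" checks whether a key is present
had_sealand = "sealand" in world

# del removes the key and its value
del world["sealand"]

print(world)
```

Since del is a statement, `del world["sealand"]` without parentheses is the more common spelling; `del(world["sealand"])` does the same thing.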

A list is an ordered collection of values, indexed by integer position. A dictionary is a collection of values indexed by unique keys.

A list is useful if you want to collect values where the order matters, or want to select entire subsets.

A dictionary is useful when you want to look up the value that belongs to a key fast.

Dictionaryception

Dictionaries can contain key:value pairs where the values are again dictionaries.
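A nested dictionary might look like the sketch below. The capitals match the europe dictionary used later in these notes; the population figures are placeholder values added for illustration.

```python
# Each country maps to its own small dictionary
europe = {
    "spain": {"capital": "madrid", "population": 46.77},   # population is illustrative
    "france": {"capital": "paris", "population": 66.03},   # population is illustrative
}

# Chain square brackets: outer key first, then inner key
france_capital = europe["france"]["capital"]

# Add a new country by assigning a whole sub-dictionary
europe["italy"] = {"capital": "rome", "population": 59.83}  # population is illustrative

print(france_capital)
print(europe["italy"]["capital"])
```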

Pandas

Pandas is a high-level data manipulation tool (built on the NumPy package) that stores data in a DataFrame. This allows the user to work with data of many different types in one table.

How to build a DataFrame

You can build a DataFrame from a dictionary:

dict = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98]
}

The keys are the column labels (country, capital, area, population). The values are the data, column by column.

After importing the pandas package as pd "import pandas as pd", you can create a dataframe from the dictionary using

brics = pd.DataFrame(dict)

This creates the DataFrame and automatically gives each row an integer index (0 to 4). To replace those numbers with meaningful labels: brics.index = ["BR", "RU", "IN", "CH", "SA"]
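Put together, building the DataFrame and relabeling its rows could look like this. The sketch uses the name `brics_data` instead of `dict`, which is my own choice: naming a variable `dict` hides Python's built-in dict type for the rest of the session.

```python
import pandas as pd

# Keys become column labels, values become the column data
brics_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

brics = pd.DataFrame(brics_data)

# Rows get integer labels by default
default_index = list(brics.index)

# Replace them with the labels used in these notes
brics.index = ["BR", "RU", "IN", "CH", "SA"]

print(default_index)
print(brics)
```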

Creating a dictionary with the data was really time consuming and annoying :/ Instead of manually entering this data every time we want to create a DataFrame, we can IMPORT the data! For example, the brics data could be stored in a CSV file, which is a plain-text file of comma-separated values.

Importing data

To import data, pass the path to the CSV file to pd.read_csv. EX: brics = pd.read_csv("path/to/brics.csv")

read_csv automatically gave the rows indexes from 0 to 4. We have to tell the read_csv function that the first column contains the row labels. We can do this by setting the index_col argument: brics = pd.read_csv("path/to/brics.csv", index_col = 0)
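The effect of index_col can be seen without a file on disk. In this sketch, io.StringIO stands in for the CSV file, and the CSV text itself is an assumed layout (an unnamed first column holding the row labels), not the actual course file.

```python
import io
import pandas as pd

# Assumed CSV layout: the first column has no header and holds the row labels
csv_text = """,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.10,143.5
IN,India,New Delhi,3.286,1252
"""

# Without index_col, pandas numbers the rows and keeps the labels as a data column
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.index))
print(list(with_index.index))
```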

Index and Select Data

To select a single column from a DataFrame, put the column label inside square brackets after the DataFrame name. For example, to select the countries from brics: brics["country"] This returns the countries as a pandas Series, which is why the output ends with: Name: country, dtype: object

If you want the elements from a column but want to keep the structure in a data frame, you use double brackets: brics[["country"]]
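The difference between single and double brackets shows up in the type of the result. A small sketch, using a two-column version of brics built inline so that it runs on its own:

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

as_series = brics["country"]    # single brackets: a one-dimensional Series
as_frame = brics[["country"]]   # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```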

Square brackets have limited functionality

Pandas has the loc and iloc indexers. loc selects parts of your data based on labels. iloc selects parts of your data based on integer positions.

Row Access Loc

To get the row for Russia: brics.loc["RU"] This returns a pandas Series containing all of the row's information, printed with one value per line.

To get a DataFrame instead, put "RU" in another set of brackets: brics.loc[["RU"]]

To select multiple rows at the same time, list their labels: brics.loc[["RU", "IN", "CH"]]

To select only some columns of those rows, add a second list after a comma. For example, to get only the country and capital names: brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

If we want all rows from the DataFrame but only a few columns, we use a colon as the row selector: brics.loc[:, ["country", "capital"]]

Row and Column iloc

To select rows and columns by integer position, you can use: brics.iloc[[1,2,3], [1,2]] The left list holds the row positions, the right list holds the column positions.
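To see how loc and iloc relate, the same selection can be written both ways. This sketch assumes the column order country, capital, area, population, so positions 1 and 2 are capital and area.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Label-based: row labels, then column labels
by_label = brics.loc[["RU", "IN", "CH"], ["capital", "area"]]

# Position-based: row positions 1-3, column positions 1-2
by_position = brics.iloc[[1, 2, 3], [1, 2]]

# Both selections pick out the same cells
same = by_label.equals(by_position)

# A colon selects every row
all_rows = brics.loc[:, ["country", "capital"]]

print(same)
print(all_rows.shape)
```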

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

Create a loop that iterates through the brics DataFrame and prints "The population of {country} is {population} million!".
Create a histogram of the life expectancies for countries in Africa in the gapminder DataFrame. Make sure your plot has a title, axis labels, and has an appropriate number of bins.
Simulate 10 rolls of two six-sided dice. If the two dice add up to 7 or 11, print "A win!". If the two dice add up to 2, 3, or 12, print "A loss!". If the two dice add up to any other number, print "Roll again!".
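One possible sketch for the dice exercise is shown below. It uses Python's standard random module rather than np.random, and the helper name `classify_roll` and the seed value are my own choices, so treat it as one way to approach the task rather than the expected solution.

```python
import random


def classify_roll(total):
    """Return the message for the sum of two dice."""
    if total in (7, 11):
        return "A win!"
    if total in (2, 3, 12):
        return "A loss!"
    return "Roll again!"


rng = random.Random(123)  # seeded so repeated runs give the same rolls

results = []
for _ in range(10):
    # randint includes both endpoints, so each die shows 1 through 6
    total = rng.randint(1, 6) + rng.randint(1, 6)
    message = classify_roll(total)
    results.append((total, message))
    print(message)
```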

Filtering Pandas DataFrames

If we want to compare the areas of all the countries in a pandas DataFrame, we take three steps:

Get Column: First we want a pandas Series. We can do this by: brics["area"] Alternatives: brics.loc[:, "area"] or brics.iloc[:, 2]
Compare: brics["area"] > 8 This returns a Series of booleans. We can store this boolean Series in the variable is_huge: is_huge = brics["area"] > 8
Subset DF: brics[is_huge] gives us all the information about the countries whose area is larger than 8

If we want to put it all in one line, we can do: brics[brics["area"] > 8]

Boolean Operators

To combine conditions, we can use np.logical_and and np.logical_or:

import numpy as np
np.logical_and(brics["area"] > 8, brics["area"] < 10)
## This returns a boolean pandas Series.
## To get the countries, we just place this line inside brics[]
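The filtering steps can be run end to end on a small copy of the brics data. The sketch also shows the `&` operator, which pandas accepts as an alternative to np.logical_and when each condition is wrapped in parentheses.

```python
import numpy as np
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Step by step: boolean Series first, then use it to subset
is_huge = brics["area"] > 8
huge = brics[is_huge]

# Two conditions combined with NumPy
between = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# The same filter written with the & operator
between_op = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(list(huge.index))
print(list(between.index))
```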

loops

for a in apple: print(a) #it does not matter what a is named, for the loop will go through each element of apple and print it out

To loop over a dictionary:

Definition of dictionary

europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }

Iterate over europe

for key, value in europe.items(): print("the capital of " + key + " is " + str(value))

Loop over NumPy Array

Import numpy as np

import numpy as np

For loop over np_height, a Numpy array containing the heights of Major League Baseball players

for x in np_height : print( str(x) + " inches")

For loop over np_baseball, 2D NumPy array that contains both the heights (first column) and weights (second column) of those players.

for x in np.nditer(np_baseball) : print(str(x))
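The difference between looping over a 2D array directly and looping with np.nditer can be checked on a tiny array. The heights and weights below are made-up values standing in for np_baseball.

```python
import numpy as np

# Illustrative stand-in for np_baseball: height (in), weight (lb) per player
np_baseball = np.array([[74, 180], [72, 215], [73, 210]])

# A plain for loop over a 2D array yields one row (a 1D array) per iteration
rows = [row.tolist() for row in np_baseball]

# np.nditer visits every single element, row by row for this array
visited = [int(x) for x in np.nditer(np_baseball)]

print(rows)
print(visited)
```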

Loop over DataFrame using iterrows()

Import cars data

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

Adapt for loop

for lab, row in cars.iterrows() :
    print(lab)
    print(row)

The row data that's generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

for lab, row in brics.iterrows() : print(row['country'])

Import cars data

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars)

Adapt for loop

for lab, row in cars.iterrows() :
    print(str(lab) + ": " + str(row['cars_per_cap']))

This produces:
US: 809
AUS: 731
JPN: 588
IN: 18
RU: 200
MOR: 70
EG: 45

Here, US: 809 combines two different things. US comes from the index (the lab variable), and 809 comes from the cars_per_cap column of the row.

Add a column In the video, Hugo showed you how to add the length of the country names of the brics DataFrame in a new column:

for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])

You can do similar things on the cars DataFrame.

Import cars data

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

Code for loop that adds COUNTRY column

for lab, row in cars.iterrows(): cars.loc[lab, 'COUNTRY'] = row["country"].upper()

Print cars

print(cars)

Add a column (2nd method) Using iterrows() to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new Pandas Series.

If you want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, you'll want to use apply().

Compare the iterrows() version with the apply() version to get the same result in the brics DataFrame:

for lab, row in brics.iterrows() : brics.loc[lab, "name_length"] = len(row["country"])

brics["name_length"] = brics["country"].apply(len)

We can do a similar thing to call the upper() method on every name in the country column. However, upper() is a string method rather than a standalone function, so we pass it as str.upper:

Import cars data

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

Use .apply(str.upper)

cars["COUNTRY"] = cars["country"].apply(str.upper)

print(cars)
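To check that the loop and apply() approaches agree, both can be run on copies of the same data. This sketch builds a small brics DataFrame inline so it does not need cars.csv; the last line uses the .str accessor, which is another way pandas offers to call string methods on a whole column.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Version 1: iterrows() builds a new Series for every row
loop_version = brics.copy()
for lab, row in loop_version.iterrows():
    loop_version.loc[lab, "name_length"] = len(row["country"])

# Version 2: apply() calls len on each value of the column
apply_version = brics.copy()
apply_version["name_length"] = apply_version["country"].apply(len)

# Both versions hold the same numbers
same_lengths = bool((loop_version["name_length"] == apply_version["name_length"]).all())

# Methods are passed through their type, e.g. str.upper
upper_apply = brics["country"].apply(str.upper)
upper_accessor = brics["country"].str.upper()

print(same_lengths)
print(list(upper_apply))
```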

Random Float

In this exercise, you'll be using two functions from NumPy's random package (np.random):

seed(): sets the random seed, so that your results are reproducible between simulations. As an argument, it takes an integer of your choosing. If you call the function, no output will be generated.
rand(): if you don't specify any arguments, it generates a random float between zero and one.

Import numpy as np

import numpy as np

Set the seed

np.random.seed(123)

Generate and print random float

print(np.random.rand())
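What "reproducible" means here can be checked directly: setting the same seed again restarts the sequence, so the next call to rand() returns the same float. A minimal sketch:

```python
import numpy as np

np.random.seed(123)
first = np.random.rand()

# Without re-seeding, the next draw continues the sequence
next_draw = np.random.rand()

# Re-seeding with the same integer restarts the sequence
np.random.seed(123)
second = np.random.rand()

print(first == second)  # the two seeded draws match
```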

Random Walk

NumPy is imported, seed is set

Initialize random_walk

random_walk = [0]

Complete the ___

for x in range(100) :
    # Set step: last element in random_walk
    step = random_walk[-1]

    # Roll the dice
    dice = np.random.randint(1,7)

    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    # append step to random_walk
    random_walk.append(step)

Print random_walk

print(random_walk)

New Function Max

The built-in function max() returns the largest of its arguments, which can be used to ensure that a number never goes below another number. In the line below, max ensures step never goes below zero; while step is above 0, it can still decrease by 1. max(0, step - 1)
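A few concrete values make the clamping behaviour of max(0, step - 1) visible. The helper name `step_down` is only for illustration.

```python
def step_down(step):
    """Go one step down, but never below zero."""
    return max(0, step - 1)


# From 3 we go to 2, from 1 to 0, and from 0 we stay at 0
positions = [step_down(s) for s in [3, 1, 0]]
print(positions)  # [2, 0, 0]
```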

Simulation of multiple walks (to get a histogram)

NumPy is imported; seed is set

Initialize all_walks (don't change this line)

all_walks = []

Simulate random walk five times

for i in range(5) :

    # Code from before
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)

        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)

    # Append random_walk to all_walks
    all_walks.append(random_walk)

Print all_walks

print(all_walks)

Visualize All Walks

This code plots all 5 simulated walks with the help of a transpose. Transposing a dataset means swapping its rows and columns so that the rows become columns and the columns become rows. plt.plot() draws one line per column of a 2D array. np_aw has 5 rows (walks) and 101 columns (positions), so plotting it directly drew 101 lines of 5 points each. After transposing, there were only 5 lines, one per walk!

numpy and matplotlib imported, seed set.

initialize and populate all_walks

all_walks = []
for i in range(5) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)

Convert all_walks to NumPy array: np_aw

np_aw = np.array(all_walks)

Plot np_aw and show

plt.plot(np_aw)
plt.show()

Clear the figure

plt.clf()

Transpose np_aw: np_aw_t

np_aw_t = np.transpose(np_aw)

Plot np_aw_t and show

plt.plot(np_aw_t)
plt.show()
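Why the transpose matters can be seen from the array shapes alone, without drawing anything. The sketch below repeats the simulation from these notes and prints the shape before and after transposing; the seed value 123 is carried over from the earlier exercise as an assumption.

```python
import numpy as np

np.random.seed(123)  # assumed seed, as in the earlier exercise

all_walks = []
for i in range(5):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        random_walk.append(step)
    all_walks.append(random_walk)

np_aw = np.array(all_walks)    # one row per walk
np_aw_t = np.transpose(np_aw)  # one column per walk

print(np_aw.shape)    # (5, 101)
print(np_aw_t.shape)  # (101, 5)
```

Each walk has 101 entries because it starts with the initial 0 and then records 100 steps.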

Implement Clumsiness
