Intermediate Python

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the two datasets
gapminder = pd.read_csv("datasets/gapminder.csv", index_col=0)
brics = pd.read_csv("datasets/brics.csv", index_col=0)

print("------------------------------------------------------------------------------------------------------")
print(brics)
print("------------------------------------------------------------------------------------------------------")
print(gapminder)
print("------------------------------------------------------------------------------------------------------")
print(gapminder.sort_values("country"))
print("------------------------------------------------------------------------------------------------------")

ar1 = np.array(brics['country'])
plt.xlabel('Continent')
plt.ylabel('Population')
plt.title("Population of countries by continent")
plt.scatter(np.array(gapminder['cont']), np.array(gapminder['population']))
plt.show()

Add your notes here

Numpy

Functions

logical_and()

e.g : np.logical_and(bmi > 21, bmi < 22)

logical_or()

e.g :

np.logical_or(bmi < 21, bmi > 22)

logical_not()

e.g :

np.logical_not(bmi > 21)
bmi[np.logical_not(bmi > 21)] to select the values that match (logical_not takes a single condition)
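
The three functions can be compared side by side in a minimal sketch. The `bmi` values below are made up for illustration and are not taken from the course data.

```python
import numpy as np

# Hypothetical BMI values, made up for this example
bmi = np.array([21.85, 20.97, 21.75, 24.75, 21.44])

between = np.logical_and(bmi > 21, bmi < 22)  # True where both conditions hold
outside = np.logical_or(bmi < 21, bmi > 22)   # True where at least one condition holds
not_above = np.logical_not(bmi > 21)          # True where the single condition is False

# A Boolean array can be used as an index to keep only the matching values
print(bmi[between])
print(bmi[outside])
print(bmi[not_above])
```

Each call returns a Boolean array of the same length as `bmi`, which is why it can be passed straight back into the square brackets.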

nditer(array) to iterate over all the elements of a 2D array

e.g :

for val in np.nditer(array_2D):
    print(val)
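
As a small runnable sketch, the loop below visits every element of a 2D array one by one. The array contents and the `values` list are made up for illustration.

```python
import numpy as np

# Hypothetical 2 x 2 array, made up for this example
array_2D = np.array([[1, 2], [3, 4]])

values = []
for val in np.nditer(array_2D):
    # val is a zero-dimensional array holding one element
    values.append(int(val))
    print(val)
```

A plain `for row in array_2D` would yield whole rows instead of single elements.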

Random numbers

np.random.seed(123)
coin = np.random.randint(0, 2)

This generates a random integer that is either 0 or 1 (the upper bound 2 is excluded).
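
A short sketch of how the seed and the bounds behave. The names `flips`, `dice` and `coin_again` are introduced here for illustration only.

```python
import numpy as np

np.random.seed(123)
coin = np.random.randint(0, 2)            # 0 or 1: the upper bound is excluded
flips = np.random.randint(0, 2, size=10)  # ten coin flips at once
dice = np.random.randint(1, 7)            # 1 to 6, like a six-sided die

# Setting the same seed again reproduces the same sequence
np.random.seed(123)
coin_again = np.random.randint(0, 2)
print(coin, coin_again, flips, dice)
```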

Matplotlib

import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()

How to plot a histogram

plt.hist(array, bins=10)  # bins=10 is the default
plt.show()
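
A minimal histogram sketch with a title and axis labels. The data is made up, and the `Agg` backend is an assumption so that the example also runs without a display; in a notebook, call `plt.show()` instead of closing the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values, made up for this example
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# plt.hist returns the bin counts and the bin edges
counts, edges, _ = plt.hist(values, bins=4)
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Histogram with 4 bins")
plt.close()  # in a notebook, use plt.show() here instead

print(counts)
print(edges)
```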

Dictionaries

How to create a dictionary?

my_dict = {key1:value1, key2:value2 ...}

Methods of dictionaries

my_dict.keys() to get the keys of a dictionary

key in my_dict to check whether a key is in a dictionary

my_dict[key] = value to add a key or update its value in the dictionary

del(my_dict[key]) to delete a key from the dictionary

It is also possible to chain brackets to select elements in a dictionary that contains dictionaries.

e.g :

europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
    'germany': {'capital': 'berlin', 'population': 80.62},
    'norway': {'capital': 'oslo', 'population': 5.084}
}
europe['spain']['population']

The output: 46.77

The for loop

Use the items() method to get both the key and the value

e.g:

for key, value in world.items():
    print(key + " -- " + str(value))
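
The dictionary operations from this section can be put together in one sketch. The `population` dictionary reuses the figures from the `europe` example; the rounding step is only there to demonstrate an update.

```python
# Populations taken from the europe example above
population = {"spain": 46.77, "france": 66.03, "germany": 80.62}

population["norway"] = 5.084                             # add a new key
population["norway"] = round(population["norway"], 1)    # update an existing key
del population["france"]                                 # delete a key

print("norway" in population)   # membership test on the keys
print("france" in population)

# items() yields (key, value) pairs
lines = []
for key, value in population.items():
    lines.append(key + " -- " + str(value))
    print(lines[-1])
```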

Pandas

What is pandas ?

  • Pandas is a high-level data manipulation tool
  • Pandas was created by Wes McKinney
  • It is built on the NumPy package

How to import data from a .csv file with pandas

import pandas as pd
data = pd.read_csv("path/to/file.csv", index_col = 0)

How to add row labels to a table

row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
data.index = row_labels  # data must be a DataFrame with one row per label
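
A minimal sketch of setting row labels on a DataFrame built from a dictionary. The three-row table is a hand-built stand-in for illustration, not a file from the course.

```python
import pandas as pd

# Small hand-built table for this example
data = pd.DataFrame({
    "country": ["Brazil", "Russia", "India"],
    "capital": ["Brasilia", "Moscow", "New Delhi"],
})

# The list of labels must have exactly one entry per row
row_labels = ["BR", "RU", "IN"]
data.index = row_labels

print(data)
```

Assigning a list of the wrong length raises a `ValueError`, which is a quick way to catch a mismatch between labels and rows.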

How to extract columns and rows

Using brackets

brics["column"]  # the output is a pandas Series
brics[["column"]]  # the output is a DataFrame
brics[["country", "capital"]]  # extends to two columns, and so on
brics[1:4]  # a slice selects rows (here the 2nd to the 4th), not columns
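
The difference between single and double brackets can be checked directly. The `brics` frame below is a hand-built stand-in for `datasets/brics.csv`, assumed to share its row labels and its `country` and `capital` columns.

```python
import pandas as pd

# Hand-built stand-in for the course's brics DataFrame (assumption)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

as_series = brics["country"]             # single brackets: a Series
as_frame = brics[["country"]]            # double brackets: a one-column DataFrame
two_cols = brics[["country", "capital"]]
some_rows = brics[1:4]                   # a slice selects rows by position

print(type(as_series).__name__, type(as_frame).__name__)
print(some_rows)
```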

Using loc and iloc

The loc function
brics.loc["RU"] # selects the row labeled RU
  • The row is returned as a pandas Series

  • It contains all the row's information

    brics.loc[["RU"]] # selects the row labeled RU, using double brackets

  • The row is returned as a DataFrame

  • It contains all the row's information

We can also select multiple rows

brics.loc[["RU", "IN", "CH"]] # selects Russia, India and China

We can also select the columns we want to use

brics.loc[["RU", "IN", "CH"], ["country", "capital"]] # same rows, only the country and capital columns
The iloc function
brics.iloc[1] # selects the row at position 1, which is the row labeled RU
  • The result is the same as with loc

  • iloc uses integer positions instead of labels

    brics.iloc[1:3] # selects the rows at positions 1 and 2 (the stop position 3 is excluded)
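
A minimal sketch showing that label-based and position-based selection can return the same rows. The `brics` frame is a hand-built stand-in for the course data, assumed to have these labels in this order.

```python
import pandas as pd

# Hand-built stand-in for the course's brics DataFrame (assumption)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

by_label = brics.loc["RU"]      # Series, selected by row label
by_position = brics.iloc[1]     # Series, selected by integer position

rows_by_label = brics.loc[["RU", "IN"]]
rows_by_position = brics.iloc[1:3]   # positions 1 and 2

# Rows and columns together: labels with loc, positions with iloc
cell_block = brics.loc[["RU", "IN"], ["country", "capital"]]
same_block = brics.iloc[[1, 2], [0, 1]]

print(by_label)
print(cell_block)
```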

The for loop

e.g :

for lab, row in data_frame.iterrows():
    print(lab)
    print(row)

To print only the capital, for example:

print(lab + ": " + row["capital"])

To add a new column, for example:

for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])

The apply() method

brics["name_length"] = brics["country"].apply(len)
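
The loop and `apply()` compute the same column, which the sketch below checks on a hand-built stand-in for `brics`. The column names `len_loop` and `len_apply` are introduced here only to keep the two results apart.

```python
import pandas as pd

# Hand-built stand-in for the course's brics DataFrame (assumption)
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Row by row: iterrows() yields the label and the row as a Series
for lab, row in brics.iterrows():
    brics.loc[lab, "len_loop"] = len(row["country"])

# Whole column at once: apply() calls len on every element
brics["len_apply"] = brics["country"].apply(len)

print(brics)
```

The `apply()` version avoids building a Series for every row, so it is usually the better choice on large tables.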

The head() method

brics.head() # returns the first rows of the DataFrame (5 by default)

The info() method

brics.info() # displays the column names, their data types and the number of non-null values

Shape attribute

brics.shape # a tuple with the number of rows followed by the number of columns

The describe() method

brics.describe() # computes summary statistics for numerical columns, such as mean, median (50%), std, min and max

The .values attribute

brics.values # contains the data values in a 2D NumPy array

The .columns attribute

brics.columns # contains the column names

The .index attribute

brics.index # contains the row labels (or the start, stop and step of a RangeIndex when no labels are set)

The .shape attribute

brics.shape #returns the number of rows and columns of the DataFrame.
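
The inspection methods and attributes above can be tried on a small table. The frame below is a hand-built stand-in for `brics`; the `area` figures are approximate values in millions of square kilometres, included only so that `describe()` has a numerical column.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the course's brics DataFrame (assumption)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],  # approximate, million km2
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

first_two = brics.head(2)    # first two rows
brics.info()                 # prints column names, dtypes and non-null counts
summary = brics.describe()   # statistics for the numerical column only

print(brics.shape)           # (rows, columns)
print(brics.values)          # 2D NumPy array of the data
print(brics.columns)         # column names
print(brics.index)           # row labels
print(summary)
```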

The .sort_values() method

brics.sort_values('column')  # sorts the rows by a column in ascending order
brics.sort_values('column', ascending=False)  # in descending order
brics.sort_values(['column1', 'column2'])  # sorts by multiple columns in ascending order
brics.sort_values(['column1', 'column2'], ascending=[True, False])  # ascending for column1, descending for column2
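
A small sketch of sorting on two columns with different directions. The `dogs` table is hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical dogs table, made up for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 24, 17],
})

# Lightest first; ties on weight are broken by tallest first
sorted_dogs = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)

print(sorted_dogs)
```

`sort_values()` returns a new DataFrame and leaves `dogs` unchanged.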

To select a column

brics["column"]

To select multiple columns

brics[["column1", "column2"]]

How to subset rows

dogs["height_cm"] > 50  # gives a Boolean Series
dogs[dogs["height_cm"] > 50]  # gives the rows of all the dogs taller than 50 cm
dogs[dogs["breed"] == "labrador"]  # selects the rows where the breed is labrador
dogs[dogs["date_of_birth"] < "2015-01-01"]  # subsetting based on dates in the format yyyy-mm-dd
dogs[(dogs["breed"] == "labrador") & (dogs["height_cm"] > 50)]  # subsetting based on multiple conditions
dogs[dogs["color"].isin(["black", "brown"])]  # subsetting using the .isin() method
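
The subsetting patterns above, run on a hypothetical `dogs` table. All names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical dogs table, made up for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["labrador", "poodle", "chow chow", "schnauzer", "labrador"],
    "color": ["brown", "black", "brown", "gray", "black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25",
                      "2011-12-11", "2017-01-20"],
})

tall = dogs[dogs["height_cm"] > 50]
# ISO-formatted date strings compare correctly as plain text
older = dogs[dogs["date_of_birth"] < "2015-01-01"]
# Each condition needs its own parentheses when combined with &
tall_labs = dogs[(dogs["breed"] == "labrador") & (dogs["height_cm"] > 50)]
dark = dogs[dogs["color"].isin(["black", "brown"])]

print(tall)
print(tall_labs)
```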

How to add a new column

dogs["height_m"] = dogs["height_cm"] / 100

Aggregating DataFrames

  • .median(), .mode()
  • .min() , .max()
  • .var() , .std()
  • .sum()
  • .quantile()

e.g. :

dogs["height_cm"].mean()

The .agg() Method

def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)
dogs[["weight_kg", "height_cm"]].agg(pct30)  # for multiple columns

def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])  # pass a list of functions
dogs["weight_kg"].cumsum()  # cumulative sum of a column
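
A runnable sketch of `.agg()` with custom functions, using a hypothetical `dogs` table whose values are made up for illustration.

```python
import pandas as pd

# Hypothetical dogs table, made up for this example
dogs = pd.DataFrame({
    "weight_kg": [24, 24, 24, 17, 29],
    "height_cm": [56, 43, 46, 49, 59],
})

def pct30(column):
    """30th percentile of a column."""
    return column.quantile(0.3)

def pct40(column):
    """40th percentile of a column."""
    return column.quantile(0.4)

one_col = dogs["weight_kg"].agg(pct30)                   # a single number
many_cols = dogs[["weight_kg", "height_cm"]].agg(pct30)  # one value per column
many_funcs = dogs["weight_kg"].agg([pct30, pct40])       # one value per function

print(one_col)
print(many_cols)
print(many_funcs)
```

When a list of functions is passed, the result is labeled with the function names.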

Cumulative statistics

  • .cummax()
  • .cummin()
  • .cumprod()

Dropping duplicate names

vet_visits.drop_duplicates(subset="name")

Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

How to count

unique_dogs["breed"].value_counts()
# or, equivalently, since sort=True is the default
unique_dogs["breed"].value_counts(sort=True)

Proportions

unique_dogs["breed"].value_counts(normalize=True)  # returns proportions (fractions that sum to 1) instead of counts
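
Dropping duplicates before counting avoids counting the same dog once per visit. The `vet_visits` table below is hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical visits table: Bella appears twice, and two different dogs are called Max
vet_visits = pd.DataFrame({
    "name": ["Bella", "Bella", "Max", "Max", "Lucy"],
    "breed": ["Labrador", "Labrador", "Labrador", "Chow Chow", "Poodle"],
})

by_name = vet_visits.drop_duplicates(subset="name")                 # keeps one Max only
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])  # keeps both Max dogs

counts = unique_dogs["breed"].value_counts()
proportions = unique_dogs["breed"].value_counts(normalize=True)

print(counts)
print(proportions)
```

Using the name and breed pair as the subset keeps dogs that share a name but are clearly different animals.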

Summaries by group

dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()

Grouped Summaries

dogs.groupby("color")["weight_kg"].mean()

Multiple grouped summaries

dogs.groupby("color")["weight_kg"].agg([min, max, sum, np.mean])

Grouping by multiple variables

dogs.groupby(["color", "breed"])["weight_kg"].mean()

Many groups, many summaries

dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
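
A sketch comparing the manual, one-subset-per-color approach with `groupby()`. The `dogs` table is hypothetical and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical dogs table, made up for this example
dogs = pd.DataFrame({
    "color": ["Black", "Brown", "Black", "Brown", "White"],
    "breed": ["Poodle", "Labrador", "Labrador", "Labrador", "Poodle"],
    "weight_kg": [24, 26, 30, 22, 10],
    "height_cm": [43, 56, 59, 55, 30],
})

# Manual version: one subset per color
black_mean = dogs[dogs["color"] == "Black"]["weight_kg"].mean()

# Grouped version: every color in a single call
by_color = dogs.groupby("color")["weight_kg"].mean()

# Two grouping variables produce a two-level index
by_color_breed = dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

print(by_color)
print(by_color_breed)
```

The grouped version also picks up any color that was not anticipated in advance, which the manual version would miss.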

Group by to pivot table

dogs.groupby("color")["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color")

Different statistics

import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Multiple statistics

dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])

Pivot on two variables

dogs.groupby(["color", "breed"])["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color", columns="breed")

Filling missing values in pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

Summing with pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
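
A sketch of the pivot table options on a hypothetical `dogs` table with made-up values. By default `pivot_table()` takes the mean, so the first result matches the grouped mean.

```python
import pandas as pd

# Hypothetical dogs table, made up for this example
dogs = pd.DataFrame({
    "color": ["Black", "Brown", "Black", "Brown", "White"],
    "breed": ["Poodle", "Labrador", "Labrador", "Labrador", "Poodle"],
    "weight_kg": [24, 26, 30, 22, 10],
})

by_group = dogs.groupby("color")["weight_kg"].mean()
by_pivot = dogs.pivot_table(values="weight_kg", index="color")  # mean by default

# Colors in the rows, breeds in the columns
wide = dogs.pivot_table(
    values="weight_kg",
    index="color",
    columns="breed",
    fill_value=0,   # replaces missing color and breed combinations
    margins=True,   # adds an "All" row and column with the overall means
)

print(by_pivot)
print(wide)
```

Note that with `fill_value=0`, a 0 in the table means "no dogs in this combination" rather than a real weight.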

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Create a loop that iterates through the brics DataFrame and prints "The population of {country} is {population} million!".
  • Create a histogram of the life expectancies for countries in Africa in the gapminder DataFrame. Make sure your plot has a title, axis labels, and has an appropriate number of bins.
  • Simulate 10 rolls of two six-sided dice. If the two dice add up to 7 or 11, print "A win!". If the two dice add up to 2, 3, or 12, print "A loss!". If the two dice add up to any other number, print "Roll again!".