Intermediate Python

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the two datasets
gapminder = pd.read_csv("datasets/gapminder.csv", index_col=0)
brics = pd.read_csv("datasets/brics.csv", index_col=0)

print("------------------------------------------------------------------------------------------------------")
print(brics)
print("------------------------------------------------------------------------------------------------------")
print(gapminder)
print("------------------------------------------------------------------------------------------------------")
print(gapminder.sort_values("country"))
print("------------------------------------------------------------------------------------------------------")

ar1 = np.array(brics['country'])
plt.xlabel('Continents')
plt.ylabel('Population')
plt.title("Population of countries by continent")
plt.scatter(np.array(gapminder['cont']), np.array(gapminder['population']))
plt.show()


Numpy

Functions

logical_and()

e.g :

np.logical_and(bmi > 21, bmi < 22)

logical_or()

e.g :

np.logical_or(bmi < 21, bmi > 22)

logical_not()

e.g :

np.logical_not(bmi > 21)

logical_not() takes a single condition and inverts it.

bmi[np.logical_not(bmi > 21)] to select the values for which the condition is False
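A small sketch of the three functions used as masks. The bmi values are made up for illustration and are not taken from the course data.

```python
import numpy as np

# Made-up BMI values, for illustration only
bmi = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

both = np.logical_and(bmi > 21, bmi < 22)    # True where 21 < bmi < 22
either = np.logical_or(bmi < 21, bmi > 24)   # True where bmi is below 21 or above 24
not_high = np.logical_not(bmi > 21)          # True where bmi <= 21

# A boolean array in square brackets keeps only the matching values
print(bmi[both])
print(bmi[either])
print(bmi[not_high])
```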

np.nditer(array) to iterate over every element of a 2D array

e.g :

for val in np.nditer(array_2D):
    print(val)
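To see why nditer() is useful, this sketch compares it with a plain for loop on a small made-up array: the plain loop yields whole rows, while nditer() yields single elements.

```python
import numpy as np

# Small made-up 2D array
array_2D = np.array([[1, 2, 3],
                     [4, 5, 6]])

# A plain for loop over a 2D array yields one row at a time
rows = [row.tolist() for row in array_2D]

# np.nditer visits every element, row by row
elements = [int(val) for val in np.nditer(array_2D)]

print(rows)      # [[1, 2, 3], [4, 5, 6]]
print(elements)  # [1, 2, 3, 4, 5, 6]
```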

Random numbers

np.random.seed(123)

coin = np.random.randint(0, 2)

randint(0, 2) returns a random integer that is either 0 or 1; the upper bound 2 is excluded.
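A short sketch of how the seed makes the draws reproducible. The dice draw is an extra illustration that is not part of the notes above.

```python
import numpy as np

np.random.seed(123)              # same seed -> same sequence of draws

coin = np.random.randint(0, 2)   # 0 or 1; the upper bound 2 is excluded
dice = np.random.randint(1, 7)   # 1 to 6; the upper bound 7 is excluded

# Resetting the seed restarts the sequence, so the first draw repeats
np.random.seed(123)
coin_again = np.random.randint(0, 2)

print(coin, dice, coin_again)
```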

MATPLOTLIB

import matplotlib.pyplot as plt

plt.plot(x, y)

plt.show()

How to plot a histogram

plt.hist(array, bins=10)  # bins is 10 by default

plt.show()
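A minimal histogram sketch with labels and a title. The data is simulated dice rolls (made up for illustration), and bins=6 is just one reasonable choice for six possible outcomes.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
values = np.random.randint(1, 7, size=100)   # made-up data: 100 simulated dice rolls

# plt.hist returns the counts per bin and the bin edges (plus the drawn patches)
counts, edges, _ = plt.hist(values, bins=6)

plt.xlabel("Roll")
plt.ylabel("Count")
plt.title("Distribution of 100 simulated dice rolls")
plt.show()
```

With 6 bins there are 7 bin edges, and the counts add up to the number of values.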

Dictionaries :

How to create a dictionary ?

my_dict = {key1:value1, key2:value2 ...}

Methods of dictionaries

my_dict.keys() to get the keys of a dictionary

key in my_dict to check whether a key is in a dictionary

my_dict[key] = value to add a key or update the value of an existing key

del(my_dict[key]) to delete a key and its value from the dictionary

It is also possible to chain brackets to select elements in a dictionary that contains dictionaries.

e.g :

europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }

europe['spain']['population']

Output: 46.77
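The operations listed above, tried on a shortened version of the europe dictionary. The italy entry and the updated population are illustrative values added for this sketch.

```python
europe = {
    "spain": {"capital": "madrid", "population": 46.77},
    "france": {"capital": "paris", "population": 66.03},
}

# Assigning to a new key adds it; assigning to an existing key updates it
europe["italy"] = {"capital": "rome", "population": 59.83}   # illustrative entry
europe["spain"]["population"] = 47.0                         # illustrative value

has_italy = "italy" in europe      # True
del europe["france"]               # removes the key and its value
has_france = "france" in europe    # False

print(list(europe.keys()))         # ['spain', 'italy']
print(europe["italy"]["capital"])  # rome
```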

The for loop

Use the method items() to get the key and the value

e.g:

for key, value in world.items() :
	print(key + " -- " + str(value))
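The same loop made runnable. The contents of world are assumed here (three countries with populations in millions), since the notes do not define it.

```python
# Assumed contents for the world dictionary (populations in millions)
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

lines = []
for key, value in world.items():
    # str() is needed because a float cannot be concatenated to a string
    line = key + " -- " + str(value)
    lines.append(line)
    print(line)
```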

Pandas

What is pandas ?

Pandas is a high-level data manipulation tool
Pandas was created by Wes McKinney
It is built on the NumPy package

How to import data from a .csv file with pandas

import pandas as pd

data = pd.read_csv("path/to/file.csv", index_col = 0)

How to add row labels to a table

row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

data.index = row_labels # data is a DataFrame; the list needs one label per row
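A sketch of the full path from a dictionary to a labelled DataFrame. The names cars_dict and cars and the column values are assumptions made for this example.

```python
import pandas as pd

# Assumed example data: one list per column, all of the same length
cars_dict = {
    "country": ["United States", "Australia", "Japan", "India",
                "Russia", "Morocco", "Egypt"],
    "drives_right": [True, False, False, False, True, True, True],
}

cars = pd.DataFrame(cars_dict)   # rows get the default labels 0, 1, 2, ...

row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels          # one label per row, in row order

print(cars)
```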

How to extract columns by using brackets

Using brackets

brics["column"] # the output is a pandas Series

brics[["column"]] # the output is a DataFrame

brics[["country", "capital"]]  # extends to two or more columns

brics[1:4]  # a slice selects rows, here the rows at positions 1 to 3
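A sketch of what each bracket form returns. The brics DataFrame below is a hand-built stand-in with illustrative values; it is not read from datasets/brics.csv.

```python
import pandas as pd

# Hand-built stand-in for brics (illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

col_series = brics["country"]             # single brackets: a Series
col_frame = brics[["country"]]            # double brackets: a DataFrame
two_cols = brics[["country", "capital"]]  # a DataFrame with two columns
rows_1_to_3 = brics[1:4]                  # a slice selects rows, end excluded

print(type(col_series).__name__, type(col_frame).__name__)
print(rows_1_to_3)
```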

Using loc and iloc

The loc function

brics.loc["RU"] # select the row labelled RU

The row is returned as a pandas Series
It contains all of the row's information

brics.loc[["RU"]] # select the row labelled RU, with double brackets
The row is returned as a DataFrame
It contains all of the row's information

We can also select multiple rows

brics.loc[["RU", "IN", "CH"]] # select Russia, India and China

We can also select the columns we want to use

brics.loc[["RU", "IN", "CH"], ["country", "capital"]] # select the country and capital columns for Russia, India and China

The iloc function

brics.iloc[1]    # select the second row (the row labelled RU)

The result is the same as with loc
iloc uses integer positions instead of labels

brics.iloc[1:3] # select the rows at positions 1 and 2 (the end of the slice is excluded)
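A side-by-side sketch of label-based and position-based selection. The brics DataFrame is again a hand-built stand-in with illustrative values.

```python
import pandas as pd

# Hand-built stand-in for brics (illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

row_by_label = brics.loc["RU"]    # label-based, returns a Series
row_by_pos = brics.iloc[1]        # position-based, same row

# Rows and columns together: labels with loc, positions with iloc
sub_loc = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
sub_iloc = brics.iloc[1:4, 0:2]

# A label slice with loc includes its end; a position slice with iloc does not
label_slice = brics.loc["RU":"CH"]
position_slice = brics.iloc[1:3]

print(sub_loc)
```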

The for loop

e.g :

for lab, row in brics.iterrows():
    print(lab)
    print(row)

To print only the capital, for example:

for lab, row in brics.iterrows():
    print(lab + ": " + row["capital"])

To add a new column, for example:

for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])

The apply() method

brics["name_length"] = brics["country"].apply(len)
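A sketch comparing the row-by-row loop with apply(). The two column names are chosen for this example so the results can sit next to each other, and brics is a hand-built stand-in.

```python
import pandas as pd

# Hand-built stand-in for brics (illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Row by row: iterrows() yields the row label and the row as a Series
for lab, row in brics.iterrows():
    brics.loc[lab, "name_length_loop"] = len(row["country"])

# Whole column at once: apply() calls len on every value of the column
brics["name_length_apply"] = brics["country"].apply(len)

print(brics)
```

Both columns hold the same lengths. The loop version may be stored as floats, because the column is created one cell at a time.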

The head() method

brics.head() # returns the first five rows of the DataFrame by default

The info() method

brics.info() # displays the number of rows, the column names, the data types they contain and the count of non-missing values

Shape attribute

brics.shape # a tuple with the number of rows followed by the number of columns

The describe() method

brics.describe() # computes summary statistics for the numerical columns: count, mean, std, min, quartiles and max

The .values attribute

brics.values # contains the data values in a 2D NumPy array

The .columns attribute

brics.columns # contains the column names

The .index attribute

brics.index # contains the row labels (or the start, stop and step of a RangeIndex when no labels were set)

The .shape attribute

brics.shape #returns the number of rows and columns of the DataFrame.

The .sort_values() method

brics.sort_values('column') # sort the rows by a column in ascending order (the default)

brics.sort_values('column', ascending=False) # sort in descending order

brics.sort_values(['column1', 'column2']) # sort by column1, then by column2 to break ties

brics.sort_values(['column1', 'column2'], ascending=[True, False])
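A sketch of single-column and multi-column sorting. It uses a small made-up dogs DataFrame, because ties are needed to show what the second column does.

```python
import pandas as pd

# Made-up data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 43, 46, 49, 59],
    "weight_kg": [24, 24, 24, 17, 29],
})

lightest_first = dogs.sort_values("weight_kg")
heaviest_first = dogs.sort_values("weight_kg", ascending=False)

# Lightest first; among dogs of equal weight, tallest first
by_weight_then_height = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)

print(by_weight_then_height)
```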

To select a column

brics["column"]

To select multiple columns

brics[["column1", "column2"]]

How to subset rows

dogs["height_cm"] > 50    # gives a boolean Series

dogs[dogs["height_cm"] > 50]  # gives the rows for all the dogs taller than 50 cm

dogs[dogs["breed"] == "labrador"] # select the rows where the breed is labrador

dogs[dogs["date_of_birth"] < "2015-01-01"] # subsetting based on dates in the format yyyy-mm-dd

dogs[(dogs["breed"] == "labrador") & (dogs["height_cm"] > 50)] # subsetting based on multiple conditions; each condition needs its own parentheses

dogs[dogs["color"].isin(["black", "brown"])] # subsetting using the .isin() method
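The subsetting patterns above on a small made-up dogs DataFrame. All rows are invented for illustration, and the dates are kept as yyyy-mm-dd strings, which compare correctly as text.

```python
import pandas as pd

# Made-up data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25",
                      "2011-12-11", "2017-01-20"],
})

is_tall = dogs["height_cm"] > 50            # boolean Series, one value per row
tall = dogs[is_tall]                        # rows where the mask is True
born_before_2015 = dogs[dogs["date_of_birth"] < "2015-01-01"]

# Combine conditions with &, wrapping each one in parentheses
brown_labs = dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]

black_or_brown = dogs[dogs["color"].isin(["Black", "Brown"])]

print(tall["name"].tolist())
```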

How to add a new column

dogs["height_m"] = dogs["height_cm"] / 100

Aggregating DataFrames

.mean(), .median(), .mode()
.min() , .max()
.var() , .std()
.sum()
.quantile()

e.g. :

dogs["height_cm"].mean()

The .agg() Method

def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)

dogs[["weight_kg", "height_cm"]].agg(pct30) # for multiple columns

def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])  # pass a list of functions

Cumulative statistics

dogs["weight_kg"].cumsum()   # cumulative sum of a column

.cummax()
.cummin()
.cumprod()
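A runnable sketch of .agg() with custom functions, followed by two cumulative statistics. The dogs values are made up for illustration.

```python
import pandas as pd

# Made-up data, for illustration only
dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

def pct30(column):
    """Return the 30th percentile of a column."""
    return column.quantile(0.3)

def pct40(column):
    """Return the 40th percentile of a column."""
    return column.quantile(0.4)

one_column = dogs["weight_kg"].agg(pct30)                   # a single number
two_columns = dogs[["weight_kg", "height_cm"]].agg(pct30)   # one value per column
two_functions = dogs["weight_kg"].agg([pct30, pct40])       # one value per function

running_total = dogs["weight_kg"].cumsum()   # sum of all rows so far
running_max = dogs["weight_kg"].cummax()     # largest value seen so far

print(two_functions)
print(running_total.tolist())
```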

Dropping duplicate names

vet_visits.drop_duplicates(subset="name")

Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

How to count

unique_dogs["breed"].value_counts()   #or
unique_dogs["breed"].value_counts(sort=True)

Proportions

unique_dogs["breed"].value_counts(normalize=True) # proportions of the total instead of counts
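A sketch of why duplicates are dropped before counting. The vet_visits rows are made up: the same dog appears on several visits, and two different dogs share the name Max.

```python
import pandas as pd

# Made-up data: one row per vet visit
vet_visits = pd.DataFrame({
    "name": ["Bella", "Max", "Max", "Bella", "Max"],
    "breed": ["Labrador", "Labrador", "Chow Chow", "Labrador", "Labrador"],
})

# Dropping on name alone would merge the two different dogs called Max
unique_names = vet_visits.drop_duplicates(subset="name")

# Dropping on the name and breed pair keeps one row per dog
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

counts = unique_dogs["breed"].value_counts(sort=True)
proportions = unique_dogs["breed"].value_counts(normalize=True)

print(counts)
print(proportions)
```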

Summaries by group

dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()

Grouped Summaries

dogs.groupby("color")["weight_kg"].mean()

Multiple grouped summaries

dogs.groupby("color")["weight_kg"].agg([min, max, sum, np.mean])

Grouping by multiple variables

dogs.groupby(["color", "breed"])["weight_kg"].mean()

Many groups, many summaries

dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
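The grouped summaries above on a small made-up dogs DataFrame. This sketch passes the statistics to .agg() as strings ("min", "mean" and so on), which pandas also accepts in place of the function objects.

```python
import pandas as pd

# Made-up data, for illustration only
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# One summary per color
mean_by_color = dogs.groupby("color")["weight_kg"].mean()

# Several summaries per color, named by strings
summary = dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum", "mean"])

# One summary per (color, breed) pair; the result has a two-level index
mean_by_color_breed = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# Several columns summarised per (color, breed) pair
many = dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

print(summary)
```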

Group by to pivot table

dogs.groupby("color")["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color")

Different statistics

import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Multiple statistics

dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])

Pivot on two variables

dogs.groupby(["color", "breed"])["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color", columns="breed")

Filling missing values in pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

Summing with pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                 fill_value=0, margins=True)
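A sketch of the pivot table options on a small made-up dogs DataFrame. The statistic is passed as the string "median" here, an alternative to passing np.median.

```python
import pandas as pd

# Made-up data, for illustration only
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# The default statistic is the mean, so this matches groupby("color").mean()
mean_by_color = dogs.pivot_table(values="weight_kg", index="color")

median_by_color = dogs.pivot_table(values="weight_kg", index="color",
                                   aggfunc="median")

# Colors as rows, breeds as columns; empty cells become 0,
# and margins=True adds an "All" row and column with the overall means
wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                        fill_value=0, margins=True)

print(wide)
```

The "All" margins are computed from the original rows, so the filled-in zeros do not pull the means down.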


Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

Create a loop that iterates through the brics DataFrame and prints "The population of {country} is {population} million!".
Create a histogram of the life expectancies for countries in Africa in the gapminder DataFrame. Make sure your plot has a title, axis labels, and has an appropriate number of bins.
Simulate 10 rolls of two six-sided dice. If the two dice add up to 7 or 11, print "A win!". If the two dice add up to 2, 3, or 12, print "A loss!". If the two dice add up to any other number, print "Roll again!".
