Intermediate Python
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the two datasets
gapminder = pd.read_csv("datasets/gapminder.csv", index_col=0)
brics = pd.read_csv("datasets/brics.csv", index_col=0)
print("------------------------------------------------------------------------------------------------------")
print(brics)
print("------------------------------------------------------------------------------------------------------")
print(gapminder)
print("------------------------------------------------------------------------------------------------------")
print(gapminder.sort_values("country"))
print("------------------------------------------------------------------------------------------------------")
ar1 = np.array(brics['country'])
plt.xlabel('Continents')
plt.ylabel('Population')
plt.title("Populations of countries by continent")
plt.scatter(np.array(gapminder['cont']), np.array(gapminder['population']))
plt.show()
Add your notes here
Numpy
Functions
logical_and()
e.g :
np.logical_and(bmi > 21, bmi < 22)
logical_or()
e.g :
np.logical_or(bmi < 21, bmi > 22)
logical_not()
e.g :
np.logical_not(bmi > 21)
bmi[np.logical_not(bmi > 21)] to select the values that match
Note : logical_not() takes a single condition, unlike logical_and() and logical_or().
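A small sketch of the three logical functions used as boolean masks. The bmi values are made up for illustration and are not part of the course data.

```python
import numpy as np

# Hypothetical BMI values, made up for illustration
bmi = np.array([21.85, 20.97, 21.75, 24.75, 21.44])

# Element-wise "and": True where 21 < bmi < 22
in_range = np.logical_and(bmi > 21, bmi < 22)

# Element-wise "or": True where bmi is below 21 or above 24
extreme = np.logical_or(bmi < 21, bmi > 24)

# Element-wise negation of a single condition
not_above_21 = np.logical_not(bmi > 21)

# Each boolean array can be used to select the matching values
print(bmi[in_range])
print(bmi[extreme])
print(bmi[not_above_21])
```

Each function returns a boolean array of the same length as bmi, which is why it can be placed directly inside the square brackets.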
nditer(array) to iterate over every element of a 2D array
e.g :
for val in np.nditer(array_2D):
    print(val)
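To see why nditer() is needed, here is a minimal comparison with a plain for loop. The array contents are arbitrary example values.

```python
import numpy as np

# Arbitrary example array with 2 rows and 3 columns
array_2D = np.array([[1, 2, 3],
                     [4, 5, 6]])

# A plain for loop over a 2D array yields whole rows
rows = [row for row in array_2D]

# np.nditer yields every single element, row by row
elements = [int(val) for val in np.nditer(array_2D)]

print(len(rows))   # number of rows
print(elements)    # all six elements in order
```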
Random numbers
np.random.seed(123)
coin = np.random.randint(0, 2)
It generates a random integer that is either 0 or 1 (the upper bound, 2, is excluded). Setting the seed makes the result reproducible.
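A short sketch of what the seed does: drawing ten coin flips twice with the same seed should give the same sequence. The size argument is used here only to draw several values at once.

```python
import numpy as np

# First run: ten coin flips (0 or 1) with a fixed seed
np.random.seed(123)
first = np.random.randint(0, 2, size=10)

# Resetting the same seed replays the same sequence
np.random.seed(123)
second = np.random.randint(0, 2, size=10)

print(first)
print(np.array_equal(first, second))
```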
MATPLOTLIB
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
How to plot a histogram
plt.hist(array, bins=10) # bins=10 is the default
plt.show()
Dictionaries :
How to create a dictionary ?
my_dict = {key1: value1, key2: value2, ...} (avoid the name dict, which would shadow the built-in type)
Methods and operations on dictionaries
my_dict.keys() to get the keys of a dictionary
key in my_dict to check whether a key is in a dictionary
my_dict[key] = value to add a key or update the value of an existing key
del(my_dict[key]) to delete a key and its value from the dictionary
It is also possible to chain brackets to select elements in a dictionary that contains dictionaries.
e.g :
europe = {
    'spain': { 'capital':'madrid', 'population':46.77 },
    'france': { 'capital':'paris', 'population':66.03 },
    'germany': { 'capital':'berlin', 'population':80.62 },
    'norway': { 'capital':'oslo', 'population':5.084 }
}
europe['spain']['population']
The output : 46.77
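The dictionary operations listed above can be combined on a nested dictionary. This is only a sketch: the italy entry and the updated population figure are illustrative values, not data from the notes.

```python
europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
}

# Add a new key whose value is itself a dictionary (illustrative values)
europe['italy'] = {'capital': 'rome', 'population': 59.83}

# The "in" operator checks keys, not values
has_italy = 'italy' in europe

# Update a nested value by chaining brackets (illustrative value)
europe['spain']['population'] = 47.0

# Remove a key together with its value
del europe['france']

print(has_italy)
print(sorted(europe.keys()))
print(europe['italy']['capital'])
```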
The for loop
Use the method items() to get the key and the value
e.g:
for key, value in world.items():
    print(key + " -- " + str(value))
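A runnable version of the loop, assuming world maps country names to populations in millions. The three entries are example values.

```python
# Example dictionary: country name -> population in millions (illustrative)
world = {'afghanistan': 30.55, 'albania': 2.77, 'algeria': 39.21}

lines = []
# items() yields one (key, value) pair per entry
for key, value in world.items():
    lines.append(key + " -- " + str(value))

print("\n".join(lines))
```

Without items(), looping over world directly would yield only the keys.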
Pandas
What is pandas ?
- Pandas is a high-level data manipulation tool
- Pandas was created by Wes McKinney
- It is built on the NumPy package
How to import data from a .csv file with pandas
import pandas as pd
data = pd.read_csv("path/to/file.csv", index_col = 0)
How to add row labels to a table
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
data.index = row_labels # needs exactly one label per row of the DataFrame
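A self-contained sketch of both steps. To avoid depending on a file, the CSV text is read from an in-memory string; the two rows and the replacement labels are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a .csv file (illustrative content)
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

# index_col=0 uses the first column as row labels
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
original_labels = list(data.index)

# Replace the row labels; the list needs one label per row
data.index = ["BRA", "RUS"]

print(original_labels)
print(data)
```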
How to extract columns by using brackets
Using brackets
brics["column"] # The output is a pandas Series
brics[["column"]] # The output is a DataFrame
brics[["country", "capital"]] # Extends to two columns, and so on
brics[1:4] # A slice selects rows, here the 2nd to the 4th
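The difference between single brackets, double brackets and a slice can be checked on a small stand-in for the brics DataFrame. The values below are approximate and only there to make the example runnable.

```python
import pandas as pd

# Small stand-in for the brics DataFrame (approximate, illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

as_series = brics["country"]               # single brackets -> Series
as_frame = brics[["country"]]              # double brackets -> DataFrame
two_cols = brics[["country", "capital"]]   # several columns -> DataFrame
some_rows = brics[1:4]                     # slice -> rows at positions 1, 2, 3

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(some_rows.index))
```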
Using loc and iloc
The loc function
brics.loc["RU"] # selects the row labelled RU
- The row is returned as a pandas Series
- It contains all of the row's information
brics.loc[["RU"]] # selects the row labelled RU, using double brackets
- The row is returned as a DataFrame
- It contains all of the row's information
We can also select multiple rows
brics.loc[["RU", "IN", "CH"]] # selects the rows for Russia, India and China (a label such as "EG" that is not in the index raises a KeyError)
We can also select the columns we want to use
brics.loc[["RU", "IN", "CH"], ["country", "capital"]] # same rows, only the country and capital columns
The iloc function
brics.iloc[1] # selects the second row (label RU) by its position
- The result is the same as with loc
- iloc uses integer positions instead of labels
brics.iloc[1:3] # selects the rows at positions 1 and 2, and so on
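A compact comparison of loc (labels) and iloc (positions), again on a small stand-in for brics with approximate values.

```python
import pandas as pd

# Small stand-in for the brics DataFrame (approximate, illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

row_series = brics.loc["RU"]                                 # one row as a Series
row_frame = brics.loc[["RU"]]                                # one row as a DataFrame
subset = brics.loc[["RU", "IN"], ["country", "capital"]]     # rows and columns by label
by_position = brics.iloc[1]                                  # same row as loc["RU"]
position_slice = brics.iloc[1:3]                             # positions 1 and 2

print(row_series["capital"])
print(subset)
print(list(position_slice.index))
```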
The for loop
e.g :
for lab, row in brics.iterrows():
    print(lab)
    print(row)
To print only the capital, for example :
    print(lab + ": " + row["capital"])
To add a new column, for example :
for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])
The apply() method
brics["name_length"] = brics["country"].apply(len)
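The loop and apply() should give the same column. The sketch below checks this on a small stand-in for brics (illustrative values); apply() is usually the shorter and faster option.

```python
import pandas as pd

# Small stand-in for the brics DataFrame (illustrative values)
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Version 1: explicit loop over the rows
looped = brics.copy()
for lab, row in looped.iterrows():
    looped.loc[lab, "name_length"] = len(row["country"])

# Version 2: apply len to every value of the column
applied = brics.copy()
applied["name_length"] = applied["country"].apply(len)

# The loop version may store the lengths as floats, the values are the same
print(looped["name_length"].tolist())
print(applied["name_length"].tolist())
```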
The head() method
brics.head() # returns the first five rows of the DataFrame
The info() method
brics.info() # displays the column names, their data types and the number of non-missing values
Shape attribute
brics.shape # a tuple holding the number of rows followed by the number of columns
The describe() method
brics.describe() # computes summary statistics for numerical columns: count, mean, std, min, quartiles (the 50% row is the median) and max
The .values attribute
brics.values # contains the data values in a 2D numpy array
The .columns attribute
brics.columns # contains the column names
The .index attribute
brics.index # contains the row labels (or the start, stop and step of a RangeIndex when no labels are set)
The .shape attribute
brics.shape #returns the number of rows and columns of the DataFrame.
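The inspection tools above can be tried together on a small stand-in for brics. The values are approximate and only serve the example.

```python
import pandas as pd

# Small stand-in for the brics DataFrame (approximate, illustrative values)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

first_two = brics.head(2)             # first two rows instead of the default five
n_rows, n_cols = brics.shape          # (rows, columns)
column_names = list(brics.columns)
row_labels = list(brics.index)
raw = brics.values                    # 2D NumPy array of the data
summary = brics.describe()            # only the numeric column is summarised

print(n_rows, n_cols)
print(column_names)
print(row_labels)
print(summary)
```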
The .sort_values() method
brics.sort_values('column') # sorts the rows by a column in ascending order
brics.sort_values('column', ascending=False) # sorts in descending order
brics.sort_values(['column1', 'column2']) # sorts by column1, then by column2 to break ties
brics.sort_values(['column1', 'column2'], ascending=[True, False]) # ascending for column1, descending for column2
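A sketch of sorting on one and on two columns. The dogs DataFrame here is a made-up sample, not the course dataset.

```python
import pandas as pd

# Made-up sample of dogs for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# Lightest dogs first
by_weight = dogs.sort_values("weight_kg")

# Heaviest dogs first
by_weight_desc = dogs.sort_values("weight_kg", ascending=False)

# Ascending weight, ties broken by descending height
by_both = dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

print(by_both)
```

sort_values() returns a new DataFrame; the original dogs is left unchanged.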
To select a column
brics["column"]
To select multiple columns
brics[["column1", "column2"]]
How to subset rows
dogs["height_cm"] > 50 # gives a Series of booleans
dogs[dogs["height_cm"] > 50] # gives the rows of all the dogs taller than 50 cm
dogs[dogs["breed"] == "labrador"] # selects the rows where the breed is labrador
dogs[dogs["date_of_birth"] < "2015-01-01"] # subsetting based on a date in the format yyyy-mm-dd
dogs[(dogs["breed"] == "labrador") & (dogs["height_cm"] > 50)] # subsetting based on multiple conditions, each wrapped in parentheses
dogs[dogs["color"].isin(["Black", "Brown"])] # subsetting using the .isin() method
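The subsetting patterns above, run on a made-up sample of dogs. Names, dates and spellings such as "Labrador" are assumptions for this example only.

```python
import pandas as pd

# Made-up sample of dogs for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                          "2017-01-20", "2015-04-20", "2018-02-27"],
    }
)

tall = dogs[dogs["height_cm"] > 50]                                    # one condition
older = dogs[dogs["date_of_birth"] < "2015-01-01"]                     # ISO dates compare as text
tall_labs = dogs[(dogs["breed"] == "Labrador") & (dogs["height_cm"] > 50)]
dark = dogs[dogs["color"].isin(["Black", "Brown"])]                    # several allowed values

print(tall["name"].tolist())
print(older["name"].tolist())
print(tall_labs["name"].tolist())
print(dark["name"].tolist())
```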
How to add a new column
dogs["height_m"] = dogs["height_cm"] / 100
Aggregating DataFrames
- .mean(), .median(), .mode()
- .min() , .max()
- .var() , .std()
- .sum()
- .quantile()
e.g. :
dogs["height_cm"].mean()
The .agg() Method
def pct30(column):
    return column.quantile(0.3)
dogs["weight_kg"].agg(pct30)
dogs[["weight_kg", "height_cm"]].agg(pct30) # for multiple columns
def pct40(column):
    return column.quantile(0.4)
dogs["weight_kg"].agg([pct30, pct40]) # pass a list of functions
dogs["weight_kg"].cumsum() # cumulative sum of a column
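A runnable sketch of .agg() with custom functions. The weights and heights are made up; pct30 and pct40 are the helper functions from the notes.

```python
import pandas as pd

# Made-up sample of dogs for illustration
dogs = pd.DataFrame(
    {
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

def pct30(column):
    # 30th percentile of a column
    return column.quantile(0.3)

def pct40(column):
    # 40th percentile of a column
    return column.quantile(0.4)

single = dogs["weight_kg"].agg(pct30)                  # one number
multi = dogs[["weight_kg", "height_cm"]].agg(pct30)    # one number per column
both = dogs["weight_kg"].agg([pct30, pct40])           # one number per function

print(single)
print(multi)
print(both)
```

When a list of functions is passed, the result is labelled with the function names.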
Cumulative statistics
- .cummax()
- .cummin()
- .cumprod()
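The cumulative methods return one value per row, each computed from the first row up to the current one. A tiny example on arbitrary numbers:

```python
import pandas as pd

# Arbitrary example values
values = pd.Series([3, 1, 4, 2])

running_sum = values.cumsum()     # running total
running_max = values.cummax()     # largest value seen so far
running_min = values.cummin()     # smallest value seen so far
running_prod = values.cumprod()   # running product

print(running_sum.tolist())
print(running_max.tolist())
print(running_min.tolist())
print(running_prod.tolist())
```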
Dropping duplicate names
vet_visits.drop_duplicates(subset="name")
Dropping duplicate pairs
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
How to count
unique_dogs["breed"].value_counts()
# or, equivalently, since sort=True is the default
unique_dogs["breed"].value_counts(sort=True)
Proportions
unique_dogs["breed"].value_counts(normalize=True) # returns proportions of the total instead of counts
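A sketch of why duplicates are dropped before counting. The vet_visits rows are made up: the same dog appears several times, and two different dogs share the name Max.

```python
import pandas as pd

# Made-up vet visits: one row per visit
vet_visits = pd.DataFrame(
    {
        "name": ["Bella", "Bella", "Max", "Max", "Lucy", "Max"],
        "breed": ["Labrador", "Labrador", "Labrador", "Chow Chow", "Labrador", "Labrador"],
    }
)

# Dropping on name alone merges the two different dogs called Max
unique_names = vet_visits.drop_duplicates(subset="name")

# Dropping on the (name, breed) pair keeps them apart
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

counts = unique_dogs["breed"].value_counts()
proportions = unique_dogs["breed"].value_counts(normalize=True)

print(len(unique_names), len(unique_dogs))
print(counts)
print(proportions)
```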
Summaries by group
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()
Grouped Summaries
dogs.groupby("color")["weight_kg"].mean()
Multiple grouped summaries
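dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum", "mean"]) # function names as strings avoid the warnings recent pandas versions raise for min, max, sum and np.mean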
Grouping by multiple variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()
Many groups, many summaries
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
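The grouped summaries above replace the repeated manual subsetting. A sketch on a made-up sample of dogs shows that both approaches agree:

```python
import pandas as pd

# Made-up sample of dogs for illustration
dogs = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# Manual summary for one color
manual_black = dogs[dogs["color"] == "Black"]["weight_kg"].mean()

# The same summary for every color at once
by_color = dogs.groupby("color")["weight_kg"].mean()

# Several statistics per group
stats = dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum", "mean"])

# Two grouping variables, then two summarised columns
by_both = dogs.groupby(["color", "breed"])["weight_kg"].mean()
many = dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

print(by_color)
print(stats)
print(many)
```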
Group by to pivot table
dogs.groupby("color")["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color") # same result; the mean is the default statistic
Different statistics
dogs.pivot_table(values="weight_kg", index="color", aggfunc="median") # np.median also works but raises a warning in recent pandas versions
Multiple statistics
dogs.pivot_table(values="weight_kg", index="color", aggfunc=["mean", "median"])
Pivot on two variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color", columns="breed")
Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
Summing with pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True) # margins=True adds an "All" row and column holding the overall statistic (here the mean)
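A sketch of the pivot table options on a made-up sample of dogs. It also shows that the "All" margins hold means of the original data, not sums of the filled cells.

```python
import pandas as pd

# Made-up sample of dogs for illustration
dogs = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# Mean weight per color (mean is the default statistic)
by_color = dogs.pivot_table(values="weight_kg", index="color")

# Median instead of mean
by_color_median = dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

# Colors as rows, breeds as columns, empty cells filled with 0, plus "All" margins
wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                        fill_value=0, margins=True)

print(by_color)
print(wide)
```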
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Create a loop that iterates through the brics DataFrame and prints "The population of {country} is {population} million!".
- Create a histogram of the life expectancies for countries in Africa in the gapminder DataFrame. Make sure your plot has a title, axis labels, and has an appropriate number of bins.
- Simulate 10 rolls of two six-sided dice. If the two dice add up to 7 or 11, print "A win!". If the two dice add up to 2, 3, or 12, print "A loss!". If the two dice add up to any other number, print "Roll again!".
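One possible sketch for the dice exercise, using np.random.randint() as in the notes above. The seed value and the outcomes list are choices made for this example, not requirements of the exercise.

```python
import numpy as np

np.random.seed(123)  # arbitrary seed so the run is reproducible

outcomes = []
for _ in range(10):
    # Two six-sided dice: integers from 1 to 6 (7 is excluded)
    dice = np.random.randint(1, 7, size=2)
    total = int(dice.sum())

    if total in (7, 11):
        outcome = "A win!"
    elif total in (2, 3, 12):
        outcome = "A loss!"
    else:
        outcome = "Roll again!"

    outcomes.append(outcome)
    print(outcome)
```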