
Basic plots with Matplotlib

  • How to visualize data
  • How to store data in new data structures
  • Control structures, which you will need to customize the flow of your scripts and algorithms

Data visualization

This first chapter is about data visualization, which is a very important part of data analysis. First of all, you will use it to explore your dataset. The better you understand your data, the better you'll be able to extract insights. And once you've found those insights, again, you'll need visualization to be able to share your valuable insights with other people.

Beautiful plot

By making beautiful plots, we allow the data to tell their own story.

Matplotlib

There are many visualization packages in Python, but the mother of them all is matplotlib. You will need its subpackage pyplot. By convention, this subpackage is imported as plt. To draw a line plot from two lists, x and y, we call plt.plot and pass the two lists as arguments. The first argument corresponds to the horizontal axis, and the second one to the vertical axis. We need to call plt.show() to actually display the plot. This is because you might want to add some extra ingredients to your plot before actually displaying it, such as titles and label customizations. Just remember this: the plot function tells Python what to plot and how to plot it; show actually displays the plot.

With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot. When you have a time scale along the horizontal axis, the line plot is your friend.

  • import matplotlib.pyplot as plt
  • plt.plot(x,y)
  • plt.show()
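The three calls above can be put together in a minimal sketch. The lists year and value hold made-up illustrative numbers, and the matplotlib.use("Agg") line is an assumption added only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt

# Illustrative values only, not real measurements
year = [2000, 2005, 2010, 2015, 2020]
value = [1.0, 1.8, 2.9, 3.5, 4.2]

# First argument -> horizontal axis, second argument -> vertical axis
lines = plt.plot(year, value)

# plot() only describes the figure; show() is the call that displays it
plt.show()
```

Because plot() and show() are separate steps, any customization can be placed between the two calls.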

Scatter plot

A scatter plot simply plots all the individual data points; Python doesn't connect the dots with a line. For many applications, the scatter plot is a better choice than the line plot.

When you're trying to assess if there's a correlation between two variables, for example, the scatter plot is the better choice.

  • import matplotlib.pyplot as plt
  • plt.scatter(x,y)
  • plt.show()

Sometimes, a correlation becomes clear only when you display the x values on a logarithmic scale. To do so, add the line plt.xscale('log') before calling plt.show().
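A short sketch of a scatter plot with a logarithmic horizontal axis might look as follows. The x and y values are hypothetical and were chosen only because x spans several orders of magnitude; the Agg backend line is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt

# Hypothetical values: x spans several orders of magnitude
x = [500, 1000, 5000, 20000, 80000]
y = [50, 55, 65, 74, 81]

# Draw the individual points without connecting them
plt.scatter(x, y)

# Put the horizontal axis on a logarithmic scale
plt.xscale('log')

plt.show()
```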

Histogram

The histogram is a type of visualization that's very useful to explore your data. It can help you to get an idea about the distribution of your variables. To build a histogram for some values, you first place them on a number line. Next, you divide the line into equal chunks, called bins, and count how many data points fall in each bin. Finally, you draw a bar for each bin. The height of the bar corresponds to the number of data points that fall in this bin. The result is a histogram, which gives us a nice overview of how the values are distributed. Of course, we can use matplotlib to build histograms as well.

Matplotlib

  • You should start by importing the pyplot package that's inside matplotlib --> import matplotlib.pyplot as plt
  • Next, you can use the hist function --> plt.hist()
  • There's a bunch of arguments you can specify, but the first two, x and bins, are the most important ones.
  • x should be a list of values you want to build a histogram for.
  • You can use the second argument, bins, to tell Python how many bins the data should be divided into. If you don't specify the bins argument, it will be 10 by default. The number of bins is pretty important. Too few bins will oversimplify reality and won't show you the details. Too many bins will overcomplicate reality and won't show the bigger picture. To control the number of bins, set the bins argument --> bins=n
  • If you finally call the show function, you get a histogram --> plt.show()
  • plt.clf() cleans it up again so you can start afresh.
  • Histograms are really useful to give a bigger picture.
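The steps in this list can be sketched as below. The list values is made up for illustration, and the Agg backend line is an assumption so the script runs without a display. plt.hist() returns the bin counts and the bin edges, which makes it easy to check how the data was divided.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt

# Illustrative values only
values = [1.2, 1.9, 2.3, 2.8, 3.1, 3.3, 3.6, 4.4, 4.7, 5.0, 5.5, 6.8]

# Divide the range of the data into 3 equal-width bins and count the points per bin
counts, edges, patches = plt.hist(values, bins=3)
print(counts)  # number of data points in each bin
print(edges)   # the 4 boundaries that delimit the 3 bins

plt.show()

# Clear the current figure so the next plot starts from scratch
plt.clf()
```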

Population pyramid

As an example, have a look at this so-called population pyramid. The age distribution is shown, for both males and females, in the European Union. Notice that the histograms are flipped 90 degrees; the bins are horizontal now. The bars are longest for the ages 40 to 44, where there are 20 million males and 20 million females. They are the so-called baby boomers. These are the figures for the year 2010. What do you think will have changed in 2050? Let's have a look. The distribution is flatter, and the baby boom generation has gotten older. In the blink of an eye, you can see how demographics will be changing over time. That's the true power of histograms at work here!

Customization

Let's figure out how to customize our plots. Creating a plot is one thing. Making the correct plot, that makes the message very clear -- that's the real challenge.

Data visualization

For each visualization, you have many options. First of all, there are the different plot types. And for each plot, you can do an infinite number of customizations. You can change colors, shapes, labels, axes, and so on. The choice depends on: one, the data, and two, the story you want to tell with this data.

Basic plot

Let's start to build a simple line plot.

Axis labels

The first thing you always need to do is label your axes. Let's do this by adding the:

  • xlabel function --> xlabel()
  • ylabel function --> ylabel()
  • Make sure to call these functions before calling the show function, otherwise your customizations will not be displayed.

Title

We're also going to add a title to our plot, with the title function --> title()

Ticks

You can control the axis ticks with the xticks and yticks functions.

  • Ex. plt.yticks([0,1,2], ["one","two","three"]) --> In this example, the ticks corresponding to the numbers 0, 1 and 2 will be replaced by one, two and three, respectively.
  • Let's do a similar thing for the x-axis of your world development chart, with the plt.xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k.
  • Ex. plt.xticks(tick_val, tick_lab) --> Here tick_val is the list [1000, 10000, 100000] and tick_lab is the list ['1k', '10k', '100k'].
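The label, title and tick customizations can be combined in one sketch. The data points, the axis label strings and the title string are assumptions made up for illustration; only the tick values and tick labels come from the notes above. The Agg backend line is also an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt

# Hypothetical data points
gdp_cap = [1500, 9000, 42000]
life_exp = [58, 72, 81]

plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# Label the axes and add a title (the strings are illustrative)
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.title('World development')

# Replace the tick values 1000, 10000 and 100000 with readable labels
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']
plt.xticks(tick_val, tick_lab)

# All customizations must come before show()
plt.show()
```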

Sizes

Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let's change this. Wouldn't it be nice if the size of the dots corresponds to the population?

To accomplish this, there is a list pop loaded in your workspace. It contains population numbers for each country expressed in millions. You can see that this list is added to the scatter method, as the argument s, for size.

Colors

The next step is making the plot more colorful! To do this, a list col has been created for you. It's a list with a color for each corresponding country, depending on the continent the country is part of.

How did we make the list col you ask? The Gapminder data contains a list continent with the continent each country belongs to. A dictionary is constructed that maps continents onto colors:

continent_colors = { 'Asia':'red', 'Europe':'green', 'Africa':'blue', 'Americas':'yellow', 'Oceania':'black' }

  • plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha=0.8)

  • Change the opacity of the bubbles by setting the alpha argument to 0.8 inside plt.scatter(). Alpha can be set from zero to one, where zero is totally transparent, and one is not at all transparent.
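As a rough sketch of how the size, color and opacity arguments fit together, the snippet below builds col from a continent list with a dictionary lookup. The three data rows are hypothetical, the name color_map is chosen here for illustration, and the Agg backend line is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical rows: one entry per country
continent = ['Asia', 'Europe', 'Africa']
gdp_cap = [6000, 30000, 2000]
life_exp = [70, 80, 58]
pop = [1300, 60, 150]  # population in millions

# Map each continent onto a color, then look up the color of every country
color_map = {'Asia': 'red', 'Europe': 'green', 'Africa': 'blue',
             'Americas': 'yellow', 'Oceania': 'black'}
col = [color_map[c] for c in continent]

# s sets the bubble size, c the color, alpha the opacity (0 = transparent, 1 = opaque)
bubbles = plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)

plt.show()
```

Converting pop to a NumPy array is what allows the multiplication by 2 to scale every element; multiplying a plain list by 2 would repeat the list instead.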

Additional Customizations

  • If you have another look at the script, under # Additional Customizations, you'll see that there are two plt.text() functions now --> Ex. plt.text(1550, 71, 'India')
  • Add plt.grid(True) after the plt.text() calls so that gridlines are drawn on the plot.
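A minimal sketch of these two customizations is shown below. The plt.text(1550, 71, 'India') call comes from the notes; the second point and its label are hypothetical, as is the Agg backend line, which is there only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt

# Two points to annotate; the second one is hypothetical
plt.scatter([1550, 5700], [71, 80])

# text(x, y, s) writes the string s at the data coordinates (x, y)
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'Second label')

# Draw gridlines behind the data
plt.grid(True)

plt.show()
```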

Add historical data

Before vs. after

Now that's how you turn an average line plot into a visual that has a clear story to tell!

Dictionaries, Part 1

A new Python type: the dictionary.

List

  • Ex --> you work for the World Bank and want to keep track of the population in each country. You can put the populations in a list. You start with Afghanistan, 30.55 million, Albania, 2.77 million, Algeria, around 40 million, and so on. To keep track of which population belongs to which country, you can create a second list, with the names of the countries in the same order as the populations.
  • To find the population of a country, you first look up its position in the list of names with index(), a list method, and then use that position to subset the list of populations.
  • We built two lists, and used the index to connect corresponding elements in both lists. It worked, but it's a pretty terrible approach: it's not convenient and not intuitive.
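The two-list approach can be sketched as follows. The list names countries and pop are chosen here for illustration, and Algeria's value is rounded to 40.0 to match "around 40 million" above.

```python
# Two parallel lists: position i in one list matches position i in the other
countries = ['afghanistan', 'albania', 'algeria']
pop = [30.55, 2.77, 40.0]  # populations in millions

# Step 1: find the position of the country in the list of names
ind_alb = countries.index('albania')

# Step 2: use that position to fetch the matching population
print(pop[ind_alb])
```

The lookup needs two steps and silently breaks if the two lists ever get out of order, which is what makes this approach inconvenient.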

Dictionary

  • To create the dictionary, you need curly brackets --> my_dict = {"key1":"value1", "key2":"value2"}
  • Next, inside the curly brackets, you have a bunch of what are called key:value pairs.
  • In our case, the keys are the country names, and the values are the corresponding populations --> world = {"country1":"population1", "country2":"population2"}
  • If you now want to find the population for country1, you simply type world, and then the string country1 inside square brackets --> world["country1"]
  • In other words, you pass the key in square brackets, and you get the corresponding value.
  • Check out which keys are in world by calling the keys() method on world --> world.keys()
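With concrete values in place of the placeholders, the dictionary version of the lookup might look like this. The populations are the ones quoted earlier in these notes, with Algeria rounded to 40.0.

```python
# Keys are country names, values are populations in millions
world = {'afghanistan': 30.55, 'albania': 2.77, 'algeria': 40.0}

# Pass the key in square brackets to get the corresponding value
print(world['albania'])

# keys() lists every key stored in the dictionary
print(world.keys())
```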

Dictionaries, Part 2

  • We created the dictionary "world", which basically is a set of key value pairs.
  • For this lookup to work properly, the keys in a dictionary should be unique.
  • Also, these unique keys in a dictionary should be so-called immutable objects. Basically, the content of immutable objects cannot be changed after they're created.
  • Strings, booleans, integers and floats are immutable objects, but the list for example is mutable, because you can change its contents after it's created.
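The restriction on keys can be checked directly. In the sketch below, the dictionary contents are arbitrary examples; the second part shows that Python raises a TypeError when a list is used as a key.

```python
# Strings, integers, floats and booleans are immutable, so they are valid keys
valid = {'a string': 'a', 2: 'b', 3.5: 'c', True: 'd'}
print(len(valid))

# A list is mutable, so Python refuses to use it as a key
try:
    invalid = {['just', 'a', 'list']: 'value'}
except TypeError as err:
    error_message = str(err)
    print(error_message)
```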

Dictionary Manipulation

  • Ex. Principality of Sealand
  • Sealand is an unrecognized micronation, on an offshore platform located in the North Sea. At the moment, it has 27 inhabitants.
  • To add this information, simply write the key sealand in square brackets and assign 27 expressed in millions, that is 0.000027, to it with the equals sign --> world["sealand"] = 0.000027
  • If you check out "world" again, indeed, sealand is in there.
  • To check this with code, you can also write "sealand in world", which gives you True if the key sealand is in there --> print("sealand" in world) --> True
  • You can also change values, for example, to update the population of sealand to 28. Because each key in a dictionary is unique, Python knows that you're not trying to create a new pair, but want to update the pair that's already in there --> world["sealand"] = 0.000028
  • To remove it again, you can use del, again pointing to sealand inside square brackets. If you print world again, Sealand is no longer in there --> del(world["sealand"])
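The add, check, update and delete steps can be run in sequence as a small sketch. The starting contents of world reuse the populations quoted earlier in these notes, and has_sealand is a helper name introduced here for illustration.

```python
# Populations in millions
world = {'afghanistan': 30.55, 'albania': 2.77, 'algeria': 40.0}

# Add a new key:value pair (27 inhabitants, expressed in millions)
world['sealand'] = 0.000027

# Check whether the key is present
has_sealand = 'sealand' in world
print(has_sealand)

# Assigning to an existing key updates the pair instead of creating a new one
world['sealand'] = 0.000028

# Remove the pair again
del world['sealand']
print(world)
```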

Dictionariception

  • Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries.
  • As an example, have a look at the script where another version of europe - the dictionary you've been working with all along - is coded. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.
  • It's perfectly possible to chain square brackets to select elements. To fetch the population for Spain from europe, for example, you need: europe['spain']['population']
  • Use chained square brackets to select and print out the capital of France --> print(europe['france']['capital'])
  • Create a dictionary, named data, with the keys 'capital' and 'population'. Set them to 'rome' and 59.83, respectively --> data = {'capital':'rome', 'population':59.83}
  • Add a new key-value pair to europe; the key is 'italy' and the value is data, the dictionary you just built --> europe['italy'] = data, and then print it --> print(europe)

List vs. Dictionary

  • Using lists and dictionaries is pretty similar. You can select, update and remove elements with square brackets.
  • There are some big differences though. The list is a sequence of values that are indexed by a range of numbers.
  • The dictionary, on the other hand, is indexed by unique keys, that can be any immutable type.
  • When to use a list --> If you have a collection of values where the order matters, and you want to easily select entire subsets of data, you'll want to go with a list.
  • When to use a dictionary --> If you need some sort of look up table, where looking for data should be fast and where you can specify unique keys, a dictionary is the preferred option.
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
print(europe['france']['capital'])
data = {'capital':'rome', 'population':59.83}
europe['italy'] = data
print(europe)

Pandas, Part 1

As a data scientist, you'll often be working with tons of data. The form of this data can vary greatly, but pretty often, you can boil it down to a tabular structure, that is, in the form of a table like in a spreadsheet.

Datasets in Python

  • To start working on this data in Python, you'll need some kind of rectangular data structure. The 2D NumPy array is an option, but not necessarily the best one. Your datasets will typically comprise different data types, and NumPy arrays are not great at handling these, so we need a tool that's better suited for the job.
  • To easily and efficiently handle this data, there's the Pandas package. Pandas is a high-level data manipulation tool, built on the NumPy package. In pandas, we store the tabular data in an object called a DataFrame.
  • How can we create this DataFrame in the first place? Well, there are different ways:

DataFrame from Dictionary

  • First of all, you can build it manually, starting from a dictionary. Using the distinctive curly brackets, we create key value pairs. The keys are the column labels, and the values are the corresponding columns, in list form. After importing the pandas package as pd, you can create a DataFrame from the dictionary using pd.DataFrame.
  • Using a dictionary approach is fine, but what if you're working with tons of data, which is typically the case as a data scientist? Well, you won't build the DataFrame manually. Instead, you import data from an external file that contains all this data.

DataFrame from CSV file

  • CSV is short for comma separated values.
  • Let's try to import this data into Python using the Pandas read_csv function. You pass the path to the CSV file as an argument.
  • Sometimes, the row labels are seen as a column in their own right. To solve this, we'll have to tell the read_csv function that the first column contains the row indexes. You do this by setting the index_col argument.
  • The read_csv function features many more arguments that allow you to customize your data import. Make sure to check out its documentation.
#Dictionary to DataFrame
#Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

#The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

#In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

#Three lists are defined in the script:

#names, containing the country names for which data is available.
#dr, a list with booleans that tells whether people drive left or right in the corresponding country.
#cpc, the number of motor vehicles per 1000 people in the corresponding country.
#Each dictionary key is a column label and each value is a list which contains the column elements.

# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)
#Dictionary to DataFrame (2)
#The Python code that solves the previous exercise is included in the script. Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

#To solve this a list row_labels has been created. You can use it to specify the row labels of the cars DataFrame. You do this by setting the index attribute of cars, that you can access as cars.index.

import pandas as pd

# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)

CSV to DataFrame (1)

Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.

# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv('cars.csv')

# Print out cars
print(cars)

CSV to DataFrame (2)

Your read_csv() call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.

Remember index_col, an argument of read_csv(), that you can use to specify which column in the CSV file should be used as a row label? Well, that's exactly what you need here!

Python code that solves the previous exercise is already included; can you make the appropriate changes to fix the data import?

# Import pandas as pd
import pandas as pd

# Fix import by including index_col
# (Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels.)
cars = pd.read_csv('cars.csv', index_col=0)

# Print out cars
print(cars)
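Since cars.csv is only available inside the exercise, the effect of index_col can be reproduced with an in-memory file. The sketch below assumes the CSV has an unnamed first column holding the row labels; the exact layout of the real cars.csv is not shown in these notes, so csv_text is a hypothetical reconstruction using the first three rows of the cars data.

```python
import io
import pandas as pd

# Hypothetical file contents: the first column has no header and holds the row labels
csv_text = """,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JPN,Japan,False,588
"""

# Without index_col, the row labels are imported as an extra, unnamed column
cars_default = pd.read_csv(io.StringIO(csv_text))
print(cars_default)

# With index_col=0, the first column becomes the row labels
cars_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars_indexed)
```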

Pandas, Part 2

Make sure that the rows and columns of your DataFrame are given appropriate labels. This is important to make accessing columns, rows and single elements in your DataFrame easy.

Index and select data

  1. How to use square brackets.
  2. Advanced data access methods, loc and iloc, that make Pandas extra powerful.

Column Access [ ]

  • You type df, and then the column label inside square brackets --> df[column_label]
  • But there's something strange here. The last line says Name: country, dtype: object. We're clearly not dealing with a regular DataFrame here.
  • We're dealing with a Pandas Series here. In a simplified sense, you can think of the Series as a 1-dimensional array that can be labeled, just like the DataFrame.
  • If you want to select the column_label column but keep the data in a DataFrame, you'll need double square brackets --> df[[column_label]]
  • You can easily extend this call to select two columns. If you look at it from a different angle, you're actually putting a list with column labels inside another set of square brackets, and end up with a 'sub DataFrame', containing only the country and capital columns. You can also use the same square brackets to select rows from a DataFrame.
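The difference between single and double square brackets can be checked with the cars data from the exercises in these notes, used here instead of the country and capital example. The variable names as_series, as_frame and two_columns are introduced for illustration.

```python
import pandas as pd

# The cars data from the exercises, with row labels set at construction
cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# Single brackets: the column comes back as a Series
as_series = cars['country']
print(type(as_series))

# Double brackets: the column stays inside a DataFrame
as_frame = cars[['country']]
print(type(as_frame))

# A longer list of labels selects several columns at once
two_columns = cars[['country', 'drives_right']]
print(two_columns)
```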

Row Access [ ]

The way to do it is by specifying a slice. To get the second, third and fourth rows of df, we use the slice 1:4. Remember that the end of the slice is exclusive and that the index starts at zero.
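As a quick check of the slice 1:4, the sketch below applies it to the cars data from the exercises in these notes. The name middle_rows is introduced for illustration.

```python
import pandas as pd

# The cars data from the exercises, with row labels set at construction
cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# Positions 1, 2 and 3: the second, third and fourth rows (position 4 is excluded)
middle_rows = cars[1:4]
print(middle_rows)
```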

Discussion [ ]

Square brackets work, but they offer only limited functionality. Ideally, we'd want something similar to 2D NumPy arrays. There, you also used square brackets: the index or slice before the comma referred to the rows, and the index or slice after the comma referred to the columns. If we want to do a similar thing with Pandas, we have to extend our toolbox with loc and iloc. loc selects parts of your data based on labels; iloc is position based.

Row Access loc

  • You put the label of the row of interest in square brackets after loc. Again, we get a Pandas Series, containing all the row's information, rather inconveniently shown on different lines.
  • To get a DataFrame, we have to put the "label of the row" string inside another pair of brackets.
  • We can also select multiple rows at the same time by adding more row labels to the list. So far we only selected entire rows, which is something you could also do with the basic square brackets. The difference here is that you can extend your selection with a comma (,) and a specification of the columns of interest.

Row & Column loc

  • We add a comma, and a list of column labels we want to keep. The intersection gets returned. Of course, you can also use loc to select all rows but only a specific number of columns.
  • Simply replace the first list that specifies the row labels with a colon, a slice going from beginning to end. This time, the intersection spans all rows, but only two columns.
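The loc selections described above can be sketched with the cars data from the exercises in these notes. The row labels 'RU' and 'IN' and the result variable names are chosen here for illustration.

```python
import pandas as pd

# The cars data from the exercises, with row labels set at construction
cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# One row label: the row comes back as a Series
row_series = cars.loc['RU']

# The label inside a list: the row stays inside a DataFrame
row_frame = cars.loc[['RU']]

# Row labels before the comma, column labels after it: the intersection is returned
subset = cars.loc[['RU', 'IN'], ['country', 'cars_per_cap']]
print(subset)

# A colon in place of the row list keeps all rows
all_rows = cars.loc[:, ['country', 'cars_per_cap']]
print(all_rows)
```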

Recap

  • Simple square brackets work fine if you want to get columns;
  • to get rows, you can use slicing.
  • The loc function is more versatile: you can select rows, columns, but also rows and columns at the same time.
  • When you use loc, subsetting becomes remarkably similar to how you subsetted 2D NumPy arrays. The only difference is that you use row labels with loc, not the positions of the elements.
  • If you want to subset Pandas DataFrames based on their position, or index, you'll need the iloc function.

Row Access iloc

  • With loc, you put the "label of the row" string in double square brackets to get a DataFrame. With iloc, you use the row's position, for example 1, instead of the label.
  • To get several rows, you pass a list of positions, for example [1, 2, 3].

Row & Column iloc

  • To also keep only the country and capital columns, as we did with loc, we put the positions 0 and 1 in a list after the comma, referring to the country and capital columns.
  • Finally, you can keep all rows and only the country and capital columns in a similar fashion.
  • loc and iloc are pretty similar; the only difference is how you refer to columns and rows.
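To illustrate that iloc mirrors loc, the sketch below repeats a label-based selection with positions, using the cars data from the exercises in these notes. In this DataFrame, 'RU' and 'IN' sit at row positions 4 and 3, and the country and cars_per_cap columns sit at column positions 0 and 2. The variable names are introduced for illustration.

```python
import pandas as pd

# The cars data from the exercises, with row labels set at construction
cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# Position-based selection of one row as a DataFrame
row_frame = cars.iloc[[4]]

# Row positions before the comma, column positions after it
by_position = cars.iloc[[4, 3], [0, 2]]

# The same selection written with labels
by_label = cars.loc[['RU', 'IN'], ['country', 'cars_per_cap']]
print(by_position.equals(by_label))

# A colon keeps all rows; only the columns are restricted
all_rows = cars.iloc[:, [0, 2]]
print(all_rows)
```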

Square Brackets (1)

In the video, you saw that you can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

In the sample code, the same cars data is imported from a CSV file as a Pandas DataFrame. To select only the cars_per_cap column from cars, you can use:

cars['cars_per_cap']
cars[['cars_per_cap']]

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out country column as Pandas Series
print(cars['country'])

# Print out country column as Pandas DataFrame
print(cars[['country']])

# Print out DataFrame with country and drives_right columns
print(cars[['country','drives_right']])

Square Brackets (2)

  • Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

  • cars[0:5]

  • The result is another DataFrame containing only the rows you specified.

  • Pay attention: You can only select rows using square brackets if you specify a slice, like 0:4. Also, you're using the integer indexes of the rows here, not the row labels!

Import cars data

  • import pandas as pd
  • cars = pd.read_csv('cars.csv', index_col = 0)

Print out first 3 observations --> print(cars[0:3])

Print out fourth, fifth and sixth observation --> print(cars[3:6])