
# 5 Tips To Write Idiomatic Pandas Code

This tutorial covers five ways in which you can easily write "pandorable", or more idiomatic, Pandas code.

Pandas is the de facto toolbox for Python data scientists to ease data analysis: you can use it, for example, to collect, explore, and format your data before you start analyzing. Pandas makes these steps a breeze via its numerous I/O and handy data manipulation functions. What's more, as you will learn later in this tutorial, it also has some visualization capabilities (through the use of Matplotlib).

Even though the Pandas data structures are very powerful and flexible, they also come with some complexities that can make your analysis harder rather than easier, especially for beginners. In this tutorial, you'll focus on learning how you can save yourself a headache and write more idiomatic Pandas code. Besides learning how you can load, explore, and clean the data, you'll learn more about the following five topics:

• Indexing
• Method chaining
• Memory optimization
• GroupBy
• Visualization

Of course, this tutorial is by no means exhaustive: the Pandas package is very rich and there are, without a doubt, other ways in which you might improve your Pandas code so that it becomes more idiomatic.

Let’s not wait any longer and get started!

## Load And Explore The Data

As always, your data analysis starts with loading and exploring your data. This section is just a warm-up in which you’ll use the read_csv() function to load in world university rankings, after which you can already quickly explore your data with the functions head() and describe(). If you’re interested in a full introduction to Pandas, consider taking DataCamp’s three-part Pandas course, starting with the Pandas Foundations.

If you already know all of this, you can just skip this section and see how you can perform indexing with Pandas in a more idiomatic way.

### Importing Packages

To start, let’s import the usual Python packages for data science. As you might have guessed, Pandas is among these.

```python
# If you're working with a notebook, don't forget to use the Matplotlib magic!
%matplotlib inline

# Import pandas under the alias pd
import pandas as pd

# Import seaborn under the alias sns
import seaborn as sns

# Set the Seaborn theme if desired
sns.set_style('darkgrid')
```

Moreover, you can also use the watermark package: it is a nice tool that makes your notebook more reproducible. Reproducible work is very important when collaborating with colleagues. Your future self will thank you for that as well! You can read more about reproducible data science here.

### The Data

In order to start this tutorial, you will also need to fetch some real-world data. For this tutorial, you will be working with a dataset from Kaggle. If you are unfamiliar with Kaggle, it is one of the biggest data science and machine learning communities. Moreover, it is a good place to learn more about data science once you know the basics.

The dataset that you’ll be using is the World University Rankings.

As mentioned above, Pandas has many convenient methods to read data from different data sources (you can learn more about them here). Since the data is in CSV format, you will be using the .read_csv method. But before starting, you need to download the data from Kaggle (or get it from the code repository, under the data folder, if you don't want to create an account).

Once you have fetched the data, you should have a folder data that contains the different csv files.

Next, you will be working with the Times Higher Education World University Rankings (Times for short) and the Academic Ranking of World Universities (Shanghai for short) data. There is also a third ranking system in the data folder, CWUR, but it is much less known, so you can safely ignore it for now.

```python
# Import Times Higher Education World University Rankings data
times_df = pd.read_csv('data/timesData.csv', thousands=',')

# Import Academic Ranking of World Universities data
shanghai_df = pd.read_csv('data/shanghaiData.csv')
```

### Quickly Inspecting The Data

As you have learned in DataCamp's Exploratory Data Analysis tutorial, Pandas offers some methods to quickly inspect DataFrames, namely .head() to inspect the first n rows (n being 5 by default) and .describe() to get a quick statistical summary.

Refresh these functions by executing the following lines of code. The data has been loaded in already for you under the variables times_df and shanghai_df:

```python
# Return the first rows of `times_df`
times_df.head()

# Describe `times_df`
times_df.describe()

# Return the first rows of `shanghai_df`
shanghai_df.head()

# Describe `shanghai_df`
shanghai_df.describe()
```

Now that you have loaded in your data, you can start thinking about how you can manipulate the data with Pandas in an idiomatic way!

## 1. Indexing

First off, writing more idiomatic Pandas code means leveraging the power of indexing. Indexing means that you select subsets from your DataFrame.

When you're just starting out with Pandas, you might need some time to get used to how indexing works. In fact, if you have worked with (Python) lists before, you might already be familiar with the use of square brackets [] in combination with the colon : to select elements. This method also works on DataFrames.

In addition, there are two other, more idiomatic ways to select a subset of a DataFrame. These two methods are loc and iloc:

• loc is label-based. This means that if you write loc[2], you are looking for the values of your DataFrame that have an index labeled 2.
• iloc is position-based. This means that if you write iloc[2], you are looking for the values of your DataFrame that are at index 2.
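
To make the label-versus-position distinction concrete, here is a minimal sketch on a toy DataFrame (hypothetical data, not the rankings dataset) whose index labels deliberately don't match their positions:

```python
import pandas as pd

# A toy DataFrame whose index labels do not match their positions
df = pd.DataFrame({'score': [10, 20, 30]}, index=[2, 0, 1])

# loc is label-based: it selects the row whose index *label* is 2,
# which here is the first row
by_label = df.loc[2, 'score']

# iloc is position-based: it selects the row at *position* 2,
# which here is the last row (labeled 1)
by_position = df.iloc[2]['score']

print(by_label, by_position)
```

On a default integer index, labels and positions coincide and both calls return the same row, which is why the difference is easy to miss at first.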

Try this out in the DataCamp Light chunk below by typing times_df.loc[2] and times_df.iloc[2] in the IPython console to see the difference. Next, try solving the exercise below with the help of loc, iloc and the traditional square brackets!

```python
# Retrieve the total score of the first row
print(times_df.loc[0, 'total_score'])

# Retrieve rows 0 and 1
print(times_df[0:2])

# Retrieve the values at rows and columns 1-3
print(times_df.iloc[1:4, 1:4])

# Retrieve the column `total_score`
print(times_df['total_score'])
```

Of course, your exploration can go a lot further than just simply subsetting or selecting from rows and columns of your data. Things can get very interesting when you combine the loc and iloc with, for example, boolean arrays.

You see that this goes somewhat beyond selecting data on the basis of columns or indexing: it's somewhat like querying your data!

```python
# Are the last entries after 2006?
print(shanghai_df.loc[:-10, 'year'] > 2006)

# Was the alumni count higher than 90 for the first ten universities?
print(shanghai_df.loc[0:11, 'alumni'] > 90)
```

loc and iloc are great for indexing your data, but be careful when you start chaining two sets of square brackets after each other! With such chained indexing, Pandas could return either a copy or a view, and you wouldn't know which one you're still working with! So think twice when you're using two sets of square brackets in combination with loc or iloc.
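
The following sketch, on a toy DataFrame with hypothetical data, shows the idiomatic alternative: one .loc call with a row selector and a column label, instead of two chained sets of square brackets:

```python
import pandas as pd

# Toy data: two rows, integer columns
df = pd.DataFrame({'year': [2005, 2006], 'alumni': [80, 95]})

# Chained indexing such as df[df.year > 2005]['alumni'] = 100 may operate
# on a copy, so the assignment can be silently lost (this is what the
# SettingWithCopyWarning is about).
# The idiomatic fix is a single .loc call with both selectors at once:
df.loc[df.year > 2005, 'alumni'] = 100

print(df.alumni.tolist())
```

A single .loc assignment is guaranteed to operate on the original DataFrame, so the update always sticks.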

Note also that, to query your data, Pandas also provides you with a query() function which you can use to quickly inspect your data (available since version 0.13). For example, you can compose a query to see which universities have a total_score that is slightly below 50:

```python
# Query `shanghai_df` for universities with a total score between 40 and 50
average_schools = shanghai_df.query('total_score > 40 and total_score < 50')

# Print the result
print(average_schools)
```

Of course, there are many more possibilities when it comes to querying your data! Why not try out to come up with some queries and test them in the IPython console of the above DataCamp Light chunk?

If you’re short on inspiration, try queries that return the universities with a first national rank and universities with a first world rank. Additionally, you can query the alumni numbers of the universities.

Are you lost on how to compose the queries? Check out some of the examples below:

```python
shanghai_df.query("national_rank == 1 and world_rank == 1")
shanghai_df.query("alumni < 20")
```

## 2. Method Chaining

Method chaining is typical of Pandas code, but when you're just starting out with this Python data manipulation package, it might not be obvious how it works exactly. It basically comes down to calling methods on an object one after another, with each method returning a new object for the next method to operate on.
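
Here is a minimal chaining sketch on a toy DataFrame (hypothetical data and column names), before moving on to pipe():

```python
import pandas as pd

# Toy data with messy whitespace in the names
df = pd.DataFrame({'name': [' Alice ', ' Bob '], 'score': [40, 60]})

# Each call returns a new DataFrame, so the calls can be chained;
# wrapping the chain in parentheses lets you put one step per line
result = (df
          .assign(name=df.name.str.strip())  # clean up the whitespace
          .query('score > 50')               # keep only the high scores
          .reset_index(drop=True))           # tidy up the index

print(result)
```

Reading the chain top to bottom gives you the data pipeline in the order it actually runs, with no throwaway intermediate variables.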

In this section, you will go deeper into method chaining by creating data pipelines with pipe().

As you have already noticed, the datasets are quite different. Here are some of the differences:

• The Times dataset has 14 columns whereas the Shanghai one has 11
• The Times dataset has 2603 rows whereas the Shanghai one has 4897
• One major missing column from the Shanghai dataset is the country of the University. You will get this information from the Times dataset later if needed.

Thus, in order to use both DataFrames, you will need to select only the common columns. Hopefully, these common columns have similar content.

```python
# Determine the common columns of `shanghai_df` and `times_df`
common_columns = set(shanghai_df.columns) & set(times_df.columns)

# Return `common_columns`
print(common_columns)
```

The common columns are thus:

• total_score: Score used to determine the rank
• university_name: University name
• world_rank: Rank of the university. As mentioned on the Kaggle description page, this column contains range ranks (i.e. between two values) and equal ranks (there is an = sign in front)
• year: Year for which the ranking is done

In order to concatenate both datasets, you need to do some cleaning first. Apply the complete pipeline below to clean the data:

```python
# Clean up the `world_rank`
def clean_world_rank(input_df):
    df = input_df.copy()
    df.world_rank = df.world_rank.str.split('-').str[0].str.split('=').str[0]
    return df

# Assign the common years of `shanghai_df` and `times_df` to `common_years`
common_years = set(shanghai_df.year) & set(times_df.year)

# Print `common_years`
print(common_years)

# Filter years
def filter_year(input_df, years):
    df = input_df.copy()
    return df.query('year in {}'.format(list(years)))

# Clean `times_df` and `shanghai_df`
cleaned_times_df = (times_df.loc[:, common_columns]
                            .pipe(filter_year, common_years)
                            .pipe(clean_world_rank)
                            .assign(name='times'))
cleaned_shanghai_df = (shanghai_df.loc[:, common_columns]
                                  .pipe(filter_year, common_years)
                                  .pipe(clean_world_rank)
                                  .assign(name='shanghai'))
```

As you have probably noticed, the .pipe method is used a lot in this tutorial. If you come from the R world, it should look somewhat familiar.

It is a handy Pandas method (available since version 0.16.2) that allows you to chain operations and thus eliminates the need for intermediate DataFrames. Your code becomes more readable.

In fact, without this method, instead of writing df.pipe(f).pipe(g).pipe(h), you would write h(g(f(df))). The nested version becomes harder to follow as the number of functions grows.
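
The equivalence is easy to verify on a toy DataFrame (hypothetical data and functions, just to illustrate the pattern):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# Three small transformations that each take and return a DataFrame
def f(d): return d.assign(x=d.x + 1)
def g(d): return d.assign(x=d.x * 2)
def h(d): return d.assign(x=d.x - 3)

nested = h(g(f(df)))                 # reads inside-out
piped = df.pipe(f).pipe(g).pipe(h)   # reads left-to-right, same result

print(piped.x.tolist())
```

Both spellings compute the same thing; pipe() just lets the code read in the order the transformations actually happen.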

Now that both DataFrames are cleaned, we can concatenate them into a single DataFrame. Also, don’t forget to clean the missing data:

```python
# Compose `ranking_df` with `cleaned_times_df` and `cleaned_shanghai_df`
ranking_df = pd.concat([cleaned_times_df, cleaned_shanghai_df])

# Calculate the percentage of missing data
missing_data = 100 * pd.isnull(ranking_df.total_score).sum() / len(ranking_df)

# Drop the `total_score` column of `ranking_df`
ranking_df = ranking_df.drop('total_score', axis=1)
```

Since you have around 38% of data missing from the total_score column, it’s better to drop this column with the .drop method.
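
As an aside, a compact idiom for the missing-data percentage is .isna().mean(), which gives the fraction of missing values directly. A minimal sketch on a toy column (hypothetical data):

```python
import pandas as pd
import numpy as np

# Toy column: 3 of the 5 values are missing
df = pd.DataFrame({'total_score': [50.0, np.nan, 42.0, np.nan, np.nan]})

# The mean of the boolean mask is the fraction of missing values,
# so multiplying by 100 yields the percentage in one step
missing_pct = 100 * df.total_score.isna().mean()

print(missing_pct)
```

This avoids the explicit .sum() / len(...) arithmetic while computing the same quantity.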

## 3. Memory Optimization

A third way of making your Pandas code more "idiomatic", which in this case largely means more "performant", is to optimize its memory usage. Especially when you start building data pipelines with method chaining, you'll see that memory can play a big part in how fast your pipelines run. You'll see more on all of this in this section.

This is probably overkill for this small dataset, but it is often good practice to cast some columns to specific types to optimize memory usage. To start, let's find out how much memory is allocated for ranking_df using the .info method.

```python
# Print the memory usage of `ranking_df`
ranking_df.info()

# Print the deep memory usage of `ranking_df`
ranking_df.info(memory_usage="deep")
```

The result of the first line of code tells you that the data takes up 144 KB. However, upon inspecting the arguments of the .info method, you'll find an optional memory_usage argument, which is set to None by default. What happens when you set memory_usage to "deep"?

The new memory footprint is about 6 times (5.6, to be more precise) larger than the initial estimate.

What happened here?

The difference in the memory estimate stems from the fact that, without the "deep" flag turned on, Pandas won't estimate the memory consumption of columns with the object dtype. This (Python) type of data takes up more space than the NumPy-optimized dtypes. This is why it is recommended to cast object columns to more appropriate types (say, category when dealing with categorical data).
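
You can see the effect of such a cast on a toy column (hypothetical data): a low-cardinality object column stores a full Python string per row, while the category dtype stores each distinct label once plus small integer codes.

```python
import pandas as pd

# A low-cardinality object column: many repeats of only two labels
names = pd.Series(['times', 'shanghai'] * 1000)

# Deep memory usage before and after casting to `category`
object_bytes = names.memory_usage(deep=True)
category_bytes = names.astype('category').memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The fewer distinct values a column has relative to its length, the bigger the saving from the category dtype.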

Let’s do it!

```python
def memory_change(input_df, column, dtype):
    df = input_df.copy()
    old = round(df[column].memory_usage(deep=True) / 1024, 2)  # In KB
    new = round(df[column].astype(dtype).memory_usage(deep=True) / 1024, 2)  # In KB
    change = round(100 * (old - new) / old, 2)
    report = ("The initial memory footprint for {column} is: {old}KB.\n"
              "The casted {column} now takes: {new}KB.\n"
              "A change of {change} %.").format(**locals())
    return report

print(memory_change(ranking_df, 'world_rank', 'int16'))
print(memory_change(ranking_df, 'university_name', 'category'))
print(memory_change(ranking_df, 'name', 'category'))
```

Now that you know that changing the world_rank to int16, for example, will use your memory more efficiently, it's time to actually apply these changes. You can use the astype() method to do this. End by double-checking what your memory usage looks like now.

That’s much better. You can optimize even further by casting the year column down to int32.
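
If you'd rather not pick the target dtype by hand, pd.to_numeric with the downcast argument chooses the smallest integer dtype that fits the values. A minimal sketch on a toy year column (hypothetical data):

```python
import pandas as pd

# A toy year column; the default integer dtype is wider than needed
years = pd.Series([2005, 2006, 2012, 2015])

# downcast='integer' picks the smallest integer dtype that can hold
# all the values (here int16, since the years exceed the int8 range)
downcast_years = pd.to_numeric(years, downcast='integer')

print(downcast_years.dtype)
```

This is handy in a cleaning pipeline because the same line works regardless of the value range in the column.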

## 4. GroupBy

Now that we have a well-formed data set (i.e. it is in a tidy state) with an optimized memory footprint, data analysis can start.

Now there are some questions that you might be interested in answering:

• What are the top 5 universities given by each ranking system over the years?
• How different are these rankings for each year?

Let’s find out by tackling the first question by making use of the more idiomatic Pandas pointers that you have already seen in the previous sections!

Before you start, notice that 'Massachusetts Institute of Technology (MIT)' and 'Massachusetts Institute of Technology' are two different records for the same university. Thus, you'll change the former to the latter.

Note that in cases like these, the loc function, which you saw earlier in this tutorial, comes in particularly handy!
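
A minimal sketch of such a loc-based replacement, on a toy DataFrame with just the names in question:

```python
import pandas as pd

# Toy data containing the duplicate record for MIT
df = pd.DataFrame({'university_name': [
    'Massachusetts Institute of Technology (MIT)',
    'Harvard University',
]})

# A boolean mask selects the rows to change; .loc assigns the new
# value to the chosen column for exactly those rows
mask = df.university_name == 'Massachusetts Institute of Technology (MIT)'
df.loc[mask, 'university_name'] = 'Massachusetts Institute of Technology'

print(df.university_name.tolist())
```

Because the mask and the column label go into a single .loc call, the assignment is applied to the original DataFrame rather than to a possible copy.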

To find the 5 (more generally n) top universities over the years, for each ranking system, here is how to do it in pseudo-code:

• For each year (in the year column) and for each ranking system (in the name column):
  • Select the subset of the data for the given year and the given ranking system
  • Select the 5 top universities and store them in a list
  • Store the result in a dictionary with (year, name) as the key and the list of universities (in descending order) as the value

Let’s apply this.

```python
# Load in `itertools`
import itertools

# Initialize `ranking`
ranking = {}

for year, name in itertools.product(common_years, ["times", "shanghai"]):
    s = (ranking_df.loc[lambda df: ((df.year == year) & (df.name == name)
                                    & (df.world_rank.isin(range(1, 6)))), :]
                   .sort_values('world_rank', ascending=False)
                   .university_name)
    ranking[(year, name)] = list(s)

# Print `ranking`
print(ranking)
```

Now that you have this ranking dictionary, let's find out how much (in percentage) both ranking methods differ over the years: consider the two 100% set-similar if the selected top-5 universities are the same, even when they aren't ranked in the same order.
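
That set similarity can be written as a tiny helper, sketched here on hypothetical top-5 lists (the function name and sample universities are made up for illustration):

```python
# Set similarity as described above: 100% when the same five universities
# appear in both lists, regardless of their order
def set_similarity(top_a, top_b):
    return 100 * len(set(top_a) & set(top_b)) / 5

a = ['MIT', 'Harvard', 'Stanford', 'Cambridge', 'Caltech']
b = ['Harvard', 'MIT', 'Caltech', 'Stanford', 'Oxford']

# Four of the five universities overlap, so the similarity is 80%
print(set_similarity(a, b))
```

Converting the lists to sets is what makes the comparison order-insensitive: a reversed copy of a list is still 100% similar to the original.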

As you have noticed, this was tedious. Is there a better, more idiomatic Pandas way?

Of course, there is! Firstly, filter the ranking DataFrame to only keep the 5 top universities for each year and ranking method. Also check out the result with the help of the head() function:

```python
# Construct a DataFrame with the top 5 universities
top_5_df = ranking_df.loc[lambda df: df.world_rank.isin(range(1, 6)), :]

# Print the first rows of `top_5_df`
top_5_df.head()
```

Now that you have the correct DataFrame to work with, let’s use the .groupby method to compare both ranking methods using the set similarity defined above.
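As a minimal sketch of that idea: collect each ranking's universities per year into a set, then apply a Jaccard-style similarity across the two columns. The DataFrame contents and the `set_similarity` helper below are illustrative stand-ins for the article's data and the similarity function defined earlier, not the article's exact code.

```python
import pandas as pd

# Hypothetical stand-in for the combined `ranking_df` built earlier
ranking_df = pd.DataFrame({
    'year': [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    'name': ['times', 'times', 'shanghai', 'shanghai'] * 2,
    'university_name': ['MIT', 'Harvard', 'MIT', 'Stanford',
                        'MIT', 'Harvard', 'MIT', 'Harvard'],
})

def set_similarity(s1, s2):
    # Jaccard similarity: size of intersection over size of union
    return len(s1 & s2) / len(s1 | s2)

# Group by year and ranking, collect each group's universities into a set,
# then pivot so each year has one 'times' set and one 'shanghai' set
sets_by_year = (ranking_df.groupby(['year', 'name'])['university_name']
                          .apply(set)
                          .unstack('name'))

# Compare the two rankings year by year
similarity = sets_by_year.apply(
    lambda row: set_similarity(row['times'], row['shanghai']), axis=1)
print(similarity)
```

Here `groupby` plus `unstack` does the per-year bookkeeping for you, so the comparison itself is a one-line `apply`.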

You get the same results as above with less hassle.

## 5. Visualization

Now that you have gone through some initial analysis of your data sets, it’s also time to explore your data visually. More idiomatic Pandas code also means that you make use of Pandas’ plotting integration with the Matplotlib package. Because of this, you’ll make great plots in no time!

For this section, you’ll go back to the times_df and shanghai_df datasets to make some basic visualizations with Matplotlib and Seaborn.

```python
# Plot a scatterplot with `total_score` and `alumni`
shanghai_df.plot.scatter('total_score', 'alumni', c='year', colormap='viridis')

plt.show()
```

The above plot immediately makes clear that there are some 0 values in the alumni column. This is definitely something to take into account when you’re exploring your data: with the help of data profiling, you’ll be able to spot these 0 values and handle them. If you’d like to know more about this, check out DataCamp’s Data Profiling Tutorial.
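A quick way to quantify what the plot suggests is to count the zeros directly. The frame below is a small made-up stand-in for `shanghai_df`: the column names match the article, but the values are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for `shanghai_df` (values are made up)
shanghai_df = pd.DataFrame({
    'university_name': ['A', 'B', 'C', 'D'],
    'alumni': [45.0, 0.0, 0.0, 12.5],
})

# Count how many rows report a 0 alumni score: the boolean mask
# sums to the number of True entries
zero_count = (shanghai_df['alumni'] == 0).sum()
print(zero_count)  # → 2
```

Comparing this count against the column's `describe()` output is a cheap first profiling step before you decide how to treat the zeros.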

Or, in case some values don’t have a place in your data set, you can simply replace them with NaN and then drop the rows with NaN values in the columns that would otherwise make plotting difficult. Note that this is just a quick fix: you’ll probably need a more elaborate data profiling step to really improve your data quality, because you can still see 0 values for the num_students column in the plot below:

```python
# Replace `-` entries with NaN values
times_df['total_score'] = times_df['total_score'].replace("-", "NaN").astype('float')

# Drop all rows with NaN values for `num_students`
times_df = times_df.dropna(subset=['num_students'], how='all')

# Cast the remaining rows with `num_students` as int
times_df['num_students'] = times_df['num_students'].astype('int')

# Plot a scatterplot with `total_score` and `num_students`
times_df.plot.scatter('total_score', 'num_students', c='year', colormap='viridis')

plt.show()
```

Of course, you don’t need to limit yourself to Matplotlib to visualize your data. There are also other libraries out there that allow you to quickly visualize your data, such as Seaborn.

The Seaborn plotting tool is mainly used to create statistical plots that are visually appealing (as stated on the official website). In fact, prior to version 2.0 of Matplotlib, it was quite a hassle (though possible) to create beautiful plots with Matplotlib alone.

The combination of Pandas with Seaborn also allows you to quickly iterate over your data by means of visualizations.

Let’s explore some examples!

Take a look at the following plot: you count the number of times each country appears in your data, keep the first ten observations, and make sure the counts are sorted from highest to lowest value when you plot the barplot:

```python
# Abbreviate country names of United States and United Kingdom
times_df['country'] = times_df['country'].replace("United States of America", "USA").replace("United Kingdom", "UK")

# Count the frequency of countries
count = times_df['country'].value_counts()[:10]

# Convert the top 10 countries to a DataFrame
df = count.to_frame()

# Reset the index
df.reset_index(level=0, inplace=True)

# Rename the columns
df.columns = ['country', 'count']

# Plot a barplot with `country` and `count`
sns.barplot(x='country', y='count', data=df)
sns.despine()

plt.show()
```

Then, you can go on and plot these countries and their mean scores:

```python
# Barplot with `country` and `total_score`
sns.barplot(x='country', y='total_score', data=times_df_filtered)
sns.despine()

plt.show()
```

Your plots can also get pretty complex. Look at the following pairplot, which is designed to show you the pairwise relationships in a dataset more clearly:

```python
import numpy as np

np.seterr(invalid='ignore')

sns.pairplot(times_df, hue='country')

plt.show()
```

Likewise, you can check out this combination of FacetGrid and regplot, which gives you some insight into the evolution of the total_score over the years for the top 5 countries:

```python
# Filter out rows with a NaN entry for `total_score`
times_df_filtered = times_df_filtered.dropna(subset=['total_score'], how='all')

g = sns.FacetGrid(times_df_filtered, col='country', hue='country')
g.map(sns.regplot, 'year', 'total_score').set(xlim=(2010, 2015), ylim=(0, 100))
g.fig.subplots_adjust(wspace=.2)
```

Lastly, a correlation plot might also come in handy when you’re exploring your data. You can use Seaborn’s heatmap() function to get this done:

```python
import matplotlib.pyplot as plt

sns.set(style="white")

# Compute the correlation matrix
corr = times_df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, linewidths=.5, ax=ax)

plt.show()
```

You can keep going on and on with these visualizations, so maybe it’s best to stop for now :).

## To Go Beyond

That’s it for this time! I hope you have enjoyed learning some intermediate techniques for working with Pandas and, more specifically, how to write more idiomatic Pandas code.

If you have enjoyed this blog post and want to dig deeper into Pandas performance, I highly recommend the following additional material: