Python Dictionaries and the Data Science Toolbox
As a data scientist working in Python, you’ll constantly need to store data temporarily in an appropriate Python data structure to process it. A special data structure that Python provides natively is the dictionary. Its name already gives away how data is stored: a value that can be accessed by a key(word) you have at hand.
If you look up the word “python” in a paper dictionary, let’s say the Oxford Dictionary of English, you will usually start by trying to browse to the part of the dictionary that contains the words starting with “p”, then “y”, “t” and so on until you hit the full word. The dictionary entry will tell you that “python” is a large non-venomous snake that constricts its prey, or a high-level programming language (!).
A paper dictionary is such a well-respected aid because its words are ordered alphabetically, so with a bit of practice you can find any word in there within a minute. A Python dictionary works in a similar way: stored dictionary items can be retrieved very fast by their key. Unlike a Python list, which has to be searched item by item to find a given value, a dictionary keeps track of where to find the piece of information that belongs to each key.
In today’s tutorial, you’ll learn more about the following topics:
- how to create a dictionary by making use of curly brackets and colons,
- how to load data in your dictionary with the help of the `urllib` and `random` libraries,
- how to filter the dictionary with the help of a for-loop and special iterators to loop over the keys and values of your dictionary,
- how to perform operations on your dictionary to get or remove values from your dictionary and how you can use dictionary comprehension to subset values from your dictionary,
- how to sort a dictionary with the `re` library and how `OrderedDict` and lambda functions can come in handy when you’re doing this, and
- how Python dictionaries compare to lists, NumPy arrays and Pandas DataFrames.
Let’s start!
How to Create a Python Dictionary
Suppose you are making an inventory of the fruit that you have left in your fruit basket by storing the count of each type of fruit in a dictionary. There are several ways to construct a dictionary, but for this tutorial, we will keep it simple. For a complete overview, check out the Python documentation on dictionaries.
The most important features by which you can recognize a dictionary are the curly brackets `{ }` and, for each item in the dictionary, the separation of the key and value by a colon `:`.
As you can try for yourself, a variable `fruit` built this way is a valid dictionary, and you can access an item from the dictionary by putting the key between square brackets `[ ]`. Alternatively, you can use the `.get()` method, which does the same thing for an existing key but returns `None` instead of raising an error when the key is missing.
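As a minimal sketch of the fruit basket described above (the fruit names and counts are made up for illustration), creating the dictionary and reading from it could look like this:

```python
# Hypothetical inventory: fruit names are the keys, counts are the values.
fruit = {"apple": 5, "pear": 3, "banana": 4, "pineapple": 1, "cherry": 20}

# Access a value by putting the key between square brackets.
apples = fruit["apple"]

# .get() returns the same value for an existing key ...
pears = fruit.get("pear")

# ... but returns None (or a default you choose) for a missing key,
# where square brackets would raise a KeyError.
mangos = fruit.get("mango")
kiwis = fruit.get("kiwi", 0)

print(apples, pears, mangos, kiwis)
```

Square brackets are the stricter choice when a missing key should count as a bug; `.get()` is convenient when a missing key is an expected situation.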
Loading Data in your Python Dictionary
Now you are going to put some real data in a dictionary, more specifically, a nested dictionary (meaning a dictionary that has a dictionary as its value rather than for instance a string or integer).
This way, tables or matrices can easily be stored in a dictionary.
The data used are the reviews of Donna Tartt’s The Goldfinch in the Amazon book review set from the UC Irvine Machine Learning Repository. These reviews have been stored in a simple tab-separated file, which is nothing more than a plain text file with columns. The table contains four columns: review score, url, review title and review text.
There are several ways imaginable to put this into a dictionary, but in this case, you take the url as the dictionary keys and put the other columns in the nested values dictionary.
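The following is a rough sketch of that layout, not the tutorial's original script. To keep it runnable offline, it reads a few made-up rows from an in-memory string instead of downloading the real file, and the nested key names (`score`, `review_title`, `review_text`) are assumptions chosen for illustration.

```python
import csv
import io

# Made-up stand-in for the tab-separated review file (score, url, title, text).
# The real data set is downloaded from the web; these rows are placeholders.
sample_tsv = (
    "5.0\thttp://example.com/review/1\tLoved it\tA <b>captivating</b> story.\n"
    "1.0\thttp://example.com/review/2\tToo long\tFrustrating and slow.\n"
    "3.0\thttp://example.com/review/3\tMixed\tGood start, weak ending.\n"
)

reviews = {}
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
for score, url, title, text in reader:
    # The url is the key; the remaining columns go into the nested dictionary.
    reviews[url] = {
        "score": float(score),
        "review_title": title,
        "review_text": text,
    }

print(reviews["http://example.com/review/2"]["review_title"])
```

Each value of `reviews` is itself a dictionary, so a single cell of the original table is reached with two lookups: first the url, then the column name.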
In this case, you were lucky enough to get a data set that has no missing values. This is of course not always the case; real data sets “from the wild” are often a big mess (wrong formatting, encoding errors, missing data, etc.) when you start using them. To keep it simple, the tutorial did not provide anything in the above script to cope with missing values here, but it is something you will usually have to take into account.
You can, however, easily verify whether all keys are present in the dictionary by comparing the number of lines from the file to the number of dictionary keys. In this case, this tells you it is safe to proceed to data processing.
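A minimal sketch of that sanity check, again on made-up lines rather than the real file, could look like this:

```python
# Made-up lines standing in for the raw file; each line is one review.
lines = [
    "5.0\thttp://example.com/review/1\tLoved it\tA captivating story.",
    "1.0\thttp://example.com/review/2\tToo long\tFrustrating and slow.",
]

reviews = {}
for line in lines:
    score, url, title, text = line.split("\t")
    reviews[url] = {
        "score": float(score),
        "review_title": title,
        "review_text": text,
    }

# If two lines shared a url, the later one would silently overwrite the
# earlier one, and the two counts would differ.
all_present = len(lines) == len(reviews)
print(all_present)
```

Note that this check only catches lost or duplicated keys; it says nothing about empty or malformed columns inside a line.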
How to Filter a Dictionary in Python
Now that the Amazon reviews are stored in a dictionary, it is time to try some operations on it. Let’s say you are interested in the bad reviews and want to see what people actually wrote by selecting only the reviews that have a score of 1.0.
The review scores are stored in the dictionary values, which means you will have to loop over the dictionary. Unfortunately (not really though), a plain for-loop over the dictionary object only gives you its keys. Python dictionary items have both a key and a value, and dictionaries provide a special method to loop over both at once. Instead of `for item in dictionary`, you use `for key, value in dictionary.items()`, with the two variables, key and value, instead of the single variable. Likewise, there are separate methods for looping over only the keys (`.keys()`) or only the values (`.values()`).
You store the keys of the reviews with a low score in a list named `lowscores`, so later on you can just reuse the list to retrieve them from the dictionary.
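As an illustrative sketch (the three reviews and their ids are made up, and the nested key names are assumptions), the filtering step could be written as:

```python
# Made-up miniature version of the nested reviews dictionary.
reviews = {
    "id1": {"score": 5.0, "review_title": "Loved it"},
    "id2": {"score": 1.0, "review_title": "Too long"},
    "id3": {"score": 1.0, "review_title": "Gave up"},
}

# .items() yields (key, value) pairs, so both are available in the loop.
lowscores = []
for key, value in reviews.items():
    if value["score"] == 1.0:
        lowscores.append(key)

# .keys() and .values() loop over only one half of each pair.
all_keys = list(reviews.keys())
all_scores = [value["score"] for value in reviews.values()]

print(lowscores)
```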
Python Dictionary Operations
If the dictionary containing the full dataset is large, it might be wiser to use the `lowscores` list you just compiled to create an entirely new dictionary. The advantage is that for further analysis, you do not need to keep the large dictionary in memory and can just proceed with the relevant subset of the original data.
First, you use the keys stored in `lowscores` to create the new dictionary. There are two options for this: one just retrieves the relevant items from the original dictionary with the `.get()` method, leaving the original intact; the other uses `.pop()`, which removes them permanently from the original dictionary.
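The difference between the two options can be sketched on a small made-up dictionary (the ids and titles are placeholders):

```python
# Made-up miniature version of the reviews dictionary and the key list.
reviews = {
    "id1": {"score": 5.0, "review_title": "Loved it"},
    "id2": {"score": 1.0, "review_title": "Too long"},
    "id3": {"score": 1.0, "review_title": "Gave up"},
}
lowscores = ["id2", "id3"]

# Option 1: .get() looks the item up and leaves the original untouched.
kept = {}
for k in lowscores:
    kept[k] = reviews.get(k)
size_after_get = len(reviews)

# Option 2: .pop() returns the item and removes it from the original.
moved = {}
for k in lowscores:
    moved[k] = reviews.pop(k)
size_after_pop = len(reviews)

print(size_after_get, size_after_pop)
```

Both options produce the same subset; only the state of the original dictionary afterwards differs.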
The code for subsetting could look as follows: `subset = {k: reviews.get(k) for k in lowscores}`. This notation might look unfamiliar, because the loop is written in a single line of code. This style is called a “dictionary comprehension”, but it’s actually a for-loop in disguise, looping over the items from `lowscores`, retrieving the values from `reviews` and using these to fill a new dictionary. It is very similar to a list comprehension but, evidently, outputs a dictionary instead of a list.
It is not recommended, however, to use comprehensions if you aren’t familiar yet with this programming style; the written-out for-loop is much easier to read and understand. Still, as you will often have to read other people’s code, you should be able to at least recognize it. You can read more about dictionary comprehensions in the Python documentation.
You could compare the traditional for-loop style with the dictionary comprehension and verify that they indeed produce the exact same result:
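A minimal sketch of that comparison, on made-up data, might look like this:

```python
# Made-up miniature version of the reviews dictionary and the key list.
reviews = {
    "id1": {"score": 5.0, "review_title": "Loved it"},
    "id2": {"score": 1.0, "review_title": "Too long"},
    "id3": {"score": 1.0, "review_title": "Gave up"},
}
lowscores = ["id2", "id3"]

# Traditional for-loop.
subset_loop = {}
for k in lowscores:
    subset_loop[k] = reviews.get(k)

# Dictionary comprehension: the same loop on a single line.
subset_comp = {k: reviews.get(k) for k in lowscores}

# Two dictionaries are equal when they hold the same key-value pairs.
same = subset_loop == subset_comp
print(same)
```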
Suppose you now want to rearrange your dictionary in order to have the review scores as dictionary keys, instead of the ids. You could use a for-loop for this, specifying both the keys and values, and build a new nested dictionary. You will have to retrieve the ‘score’ from the original nested dictionary to use it as the new key.
To simplify the code a bit, you create the new nested dictionary as the object `newvalues` on a separate line before filling the `scoredict` with the ids as keys and the `newvalues` dictionary as its values:
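One possible way to write this rearrangement is sketched below. The data is made up, the nested key names are assumptions, and the resulting shape (score, then id, then the remaining fields) is one reading of the description above rather than the tutorial's original code.

```python
# Made-up miniature version of the reviews dictionary.
reviews = {
    "id1": {"score": 5.0, "review_title": "Loved it", "review_text": "Captivating."},
    "id2": {"score": 1.0, "review_title": "Too long", "review_text": "Frustrating."},
    "id3": {"score": 1.0, "review_title": "Gave up", "review_text": "Uncomfortable."},
}

scoredict = {}
for review_id, value in reviews.items():
    # Everything except the score moves into the new nested dictionary.
    newvalues = {
        "review_title": value["review_title"],
        "review_text": value["review_text"],
    }
    # Create the inner dictionary for this score the first time it is seen.
    if value["score"] not in scoredict:
        scoredict[value["score"]] = {}
    scoredict[value["score"]][review_id] = newvalues

print(sorted(scoredict[1.0]))
```

With this layout, `scoredict[1.0]` gives all one-star reviews at once, without looping over the full data set again.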
How to Sort Dictionaries in Python
As you made the effort to load a real dataset into a dictionary, you can now perform some basic analysis on it. If you are interested in the words that go hand in hand with a negative sentiment about the novel, you could do a low-level form of sentiment analysis by making a frequency list of the words in the negative reviews (scored 1.0).
You need to process the review text a little bit by removing the HTML tags and converting uppercase words to lowercase. For the first, you use a regular expression that removes all tags: `re.sub("<.*?>", "", text)`, where `text` stands for the review text. Regular expressions are a very useful tool when dealing with text data. They can be quite complex to write and definitely deserve a tutorial of their own for (aspiring) data scientists.
In this example, however, you just need to grasp that anything that starts with `<`, is followed by any number (including 0) of arbitrary characters and is closed by `>` is substituted with nothing: `""` (empty quotes).
Python strings have a built-in method to remove capitals from words: simply chain `.lower()` to a string. This way, you avoid that words which are capitalized only because they occur at the beginning of a sentence are counted as separate words. There are, of course, cases in which the capital letter signals a different word, but detecting these requires more advanced text processing (called Named Entity Recognition), which is way beyond the scope of Python dictionaries.
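The two cleaning steps can be sketched on a made-up review snippet as follows:

```python
import re

# Made-up review text with some HTML tags and capitalized words.
raw = "An <b>Uncomfortable</b> read. <i>Frustrating!</i>"

# Remove everything from "<" up to the nearest ">" (non-greedy match).
no_tags = re.sub("<.*?>", "", raw)

# Convert the remaining text to lowercase.
clean = no_tags.lower()

print(clean)
```

The `?` after `.*` makes the match non-greedy; without it, the pattern would swallow everything from the first `<` to the last `>` in the string, including the words in between.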
Next, you build the frequency dictionary using a `defaultdict` instead of a normal dictionary. A `defaultdict` initializes any missing key with a default value the first time you access it, so you can just increase the frequency count by 1.
If you were not using `defaultdict`, Python would raise a KeyError when you try to increase the count for the first time (so from 0 to 1) because the key does not yet exist. This could be overcome by first checking whether a key exists in the dictionary before increasing its value, but this solution is far from elegant compared to `defaultdict`.
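A small sketch of both approaches, counting words in two made-up review texts and splitting on whitespace only (a simplifying assumption):

```python
from collections import defaultdict

# Made-up, already cleaned review texts.
texts = ["too long and too slow", "slow and frustrating"]

# defaultdict(int) gives every missing key the value int() == 0.
freq = defaultdict(int)
for text in texts:
    for word in text.split():
        freq[word] += 1

# The same count with a plain dict needs an explicit existence check.
plain = {}
for text in texts:
    for word in text.split():
        if word not in plain:
            plain[word] = 0
        plain[word] += 1

print(dict(freq))
```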
Once the frequency dictionary is ready, you still need to sort the keys by value in descending order to see promptly which words are highly frequent. Normal dictionaries (including `defaultdict`) have no built-in way to sort themselves, and before Python 3.7 they did not even guarantee to keep their insertion order, so you use another class, namely `OrderedDict`. It stores a dictionary in the order the items were added. In this case, you need to sort the items first, before storing them again in the new `OrderedDict`.
The `sorted` function is called here with three arguments. The first one is the object that you want to sort, your frequency dictionary. Remember, however, that accessing the key-value pairs in a dictionary is only possible through the `.items()` method. If you forget this, Python will not even complain, but will only give you the keys. In other words: if you are looping over a dictionary and your code behaves in a weird way, check whether you added the `.items()` method before you start screaming.
The second argument specifies what part of each item should be used to sort: `key=lambda item: item[1]`. Again, you will have to dig a bit deeper into the Python language to grasp what this is about. Despite its name, `key` here has nothing to do with dictionary keys: it names a function that `sorted` applies to every item to obtain the value to sort by.
But what is the `lambda` doing there?
Well, a lambda function is an anonymous function, meaning it is a small function without a name that is defined right where it is used. This is not the place nor the time to discuss this in full, but it is a compact way to apply a single expression to each of a whole range of objects. In this case, it simply returns the dictionary value (`item[1]`, with `item[0]` being the key) as the value to sort by.
The third and final argument, `reverse`, specifies whether sorting should be ascending (the default) or descending. In this case, you want to see the most frequent words at the top and need to specify explicitly that `reverse=True`.
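Putting the three arguments together, a minimal sketch on a made-up frequency dictionary could look like this:

```python
from collections import OrderedDict

# Made-up word frequencies.
freq = {"the": 12, "frustrating": 3, "and": 9, "uncomfortable": 2}

# Sort the (word, count) pairs by count, highest first,
# then store them in that order.
ordered = OrderedDict(
    sorted(freq.items(), key=lambda item: item[1], reverse=True)
)

top_words = list(ordered.keys())
print(top_words)
```

`sorted` returns a list of `(word, count)` tuples; wrapping it in `OrderedDict` keeps the dictionary-style lookup while preserving the sorted order.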
If you were to look immediately at the top of the sorted items now, you would be disappointed by the words that dominate this frequency list. These would be merely “function words” such as “the”, “and”, “a”, etc. English (and many other languages of course) is full of these words, but they are mainly used to glue language together and they are quite meaningless in isolation.
In text analytics, so-called stop lists are used to remove these highly frequent words from the analysis. Here you apply (again) a more rudimentary approach by skipping the 10% most frequent words and looking only at the words ranked below them. You will see that the top of this list provides more interesting, negatively loaded words such as “uncomfortable” and “frustrating”, but also positive ones like “captivating” and “wonderfully”.
You can experiment with the slicing yourself to see in which parts of the data you can find interesting words.
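As a rough sketch of such slicing (the ranked list and its counts are made up, and the 10% cut-off is simply the figure mentioned above):

```python
# Made-up (word, count) pairs, already sorted from most to least frequent.
ranked = [
    ("the", 50), ("and", 40), ("a", 35), ("book", 12), ("long", 9),
    ("boring", 7), ("uncomfortable", 4), ("frustrating", 4),
    ("captivating", 2), ("wonderfully", 1),
]

# Number of items that make up the top 10% of the list.
cutoff = int(len(ranked) * 0.1)

# Skip the most frequent words and keep everything ranked below them.
remaining = ranked[cutoff:]

print(remaining[:3])
```

Changing the start and end of the slice, for instance `ranked[cutoff:cutoff + 5]`, lets you inspect different bands of the frequency list.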
Dictionaries Versus Python Lists, NumPy Arrays and Pandas DataFrames
Dictionaries are an essential data structure innate to Python, allowing you to put data in Python objects to process it further. They are, next to lists and tuples, one of the basic but most powerful and flexible data structures that Python has to offer. Lately, however, much of the dictionary functionality can be, and indeed is, replaced by Pandas, a Python data analysis library that allows you to keep more of the data processing and analysis within Python, rather than forcing you, as a data scientist, to use specialised statistical programming languages (most notably R) on the side.
If there are off-the-shelf libraries readily available, why still bother to grasp what dictionaries can do?
Well, it is always good to learn to walk before attempting to run.
Libraries like Pandas definitely allow data scientists to work faster and more efficiently because they no longer need to bother about the lower-level details of how the data is stored. Pandas, however, also uses dictionaries (next to other advanced data structures such as the NumPy array) to store its data. As a result, it is a good idea to know how a dictionary works before leaving the hard work, namely storing the data in the appropriate data structures, to Pandas.
Even when using Pandas, it is sometimes recommended to still use Python dictionaries when the situation calls for it, for instance when values simply need to be mapped and you don’t need Pandas functionality for anything else. Using a Pandas object is in such cases simply inefficient and overkill.
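A simple mapping of this kind needs nothing more than a plain dictionary. In the sketch below, the sentiment labels are hypothetical and chosen only for illustration:

```python
# Hypothetical mapping from review score to a sentiment label.
labels = {1.0: "negative", 3.0: "neutral", 5.0: "positive"}

# Made-up list of review scores.
scores = [5.0, 1.0, 3.0, 1.0]

# Look each score up in the dictionary.
sentiments = [labels[score] for score in scores]

print(sentiments)
```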
Finally, Pandas contains functions to convert a dictionary to a Pandas DataFrame and vice versa, and DataFrames can contain dictionaries. Both are indeed very useful parts of the modern data scientist’s toolbox.
What’s Next?
Congratulations! You have reached the end of our Python dictionary tutorial!
Complete your learning by taking DataCamp’s free Intro to Python for Data Science course to learn more about the Python basics that you need to know to do data science, and the Intermediate Python for Data Science course to learn more about control flow. If you’re ready to move on to Pandas, don’t miss out on our Pandas Foundations course.
If you’re looking for more tutorials, check out the Python list tutorial, Pandas tutorial or the NumPy tutorial.