An estimated 2.5 quintillion bytes of data are produced daily [Source: CloudTweaks]. Try not to feel discouraged if you are having a hard time with that number; you are not alone. To put things into perspective, one billion is 10^9 and one quintillion is 10^18. In other words, a quintillion is a billion billions, and the amount of data we generate is growing at an exponential rate.
Organizations and individuals leverage data to determine the causes of problems and identify actionable steps. However, with an ever-growing amount of data, it becomes increasingly challenging to make sense of it all. Our innate nature is to search for patterns and find structure. This is how we store information and learn new insights. When incoming data is not presented in a visually pleasing way, seeing these patterns and finding structure can be difficult or, in some cases, impossible.
In this article, we will look at how we can solve the above problem with data visualization. We will also cover what data visualization is, why it is important, tips to develop your data visualization skills, common graphs, and tools you can use to make the process of visualizing your data easier for you.
What is Data Visualization?
Data visualization may be described as graphically representing data. It is the act of translating data into a visual context, which can be done using charts, plots, animations, infographics, etc. The idea behind it is to make it easier for us (humans) to identify trends, outliers, and patterns in data. Data professionals regularly leverage data visualizations to summarize key insights from data and relay that information back to the appropriate stakeholders.
For example, a data professional may create data visualizations for management personnel, who then leverage those visualizations to project the organizational structure. Another example is when data scientists use visualizations to uncover the underlying structure of the data to provide them with a better understanding of their data.
Given the defined purpose of data visualization stated above, there are two vital insights we can take away from what data visualization is:
1) A way to make data accessible: The best way to make something accessible is to keep things simple. The word ‘simple’ here needs to be put into context: what is straightforward for a ten-year-old to understand may not be the same for a Ph.D. holder. Thus, data visualization is a technique used to make data accessible to whomever it may concern.
2) A way to communicate: This takeaway is an extension of the first. To communicate effectively, everyone in the room must be speaking the same language. Whether you are working on a task independently or with a team, visualizations should surface insights that are relevant to the people who will view them. A CEO may prefer to view insights that provide actionable steps, and a machine learning team may prefer to view insights into how their models perform.
In short, data visualization is a technique used to make it easier to recognize patterns or trends in data. But why is this such a big deal for data scientists? We will answer this in the next section.
Why is data visualization so important for data scientists?
According to Wikipedia, data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data. However, the challenge data scientists face is that it is not always possible to connect the dots when faced with raw data; this is where data visualization is extremely valuable. Let’s delve deeper into why data visualization is a veritable treasure to data scientists.
Discoverability & Learning
Data scientists are required to have a good understanding of the data they use. Data visualization makes it possible for data scientists to bring out the most informative aspects of their data, making it quick and easy for them (and others) to grasp what is happening in the data. For example, identifying trends in a dataset represented in a table format is much more complicated than seeing it visually. We will talk more about this in the common graphs used for data visualization section.
Stories have been used to share information for centuries. The reason stories are so compelling is that they create an emotional connection with the audience. This connection allows the audience to gain a deeper understanding of other people’s experiences, making the information more memorable. Data visualization is a form of storytelling: it helps data scientists to make sense of their data and share it with others in an easily understandable format.
Insights can undoubtedly be derived from databases (in some cases), but this endeavor requires immense focus and is a significant business expense. Using data visualization in such instances is much more efficient. For example, if the goal is to identify all of the companies that suffered a loss in 2021, it would be easier to use red to highlight all the companies with profits less than zero, and green to highlight those greater than zero. The companies that broke even remain neutral. This technique is much more efficient than manually parsing the database to mark the individual companies.
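The profit-and-loss highlighting described above can be sketched with Matplotlib. This is a minimal illustration rather than the author's code: the company names and profit figures are made up, and `profit_colors` is a hypothetical helper introduced only for this example.

```python
import matplotlib.pyplot as plt


def profit_colors(profits):
    """Map each profit to a color: red for a loss, green for a gain, grey for break-even."""
    colors = []
    for profit in profits:
        if profit < 0:
            colors.append("red")
        elif profit > 0:
            colors.append("green")
        else:
            colors.append("grey")
    return colors


# Hypothetical 2021 profits (in millions); these are not real figures.
companies = ["A", "B", "C", "D", "E"]
profits = [12, -5, 0, 7, -9]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(companies, profits, color=profit_colors(profits))
ax.axhline(0, color="black", linewidth=0.8)  # Reference line at break-even.
ax.set_title("Profit by company in 2021 (made-up data)")
ax.set_xlabel("Company")
ax.set_ylabel("Profit (millions)")
plt.show()
```

The loss-making companies stand out as red bars below the zero line, with no need to read the individual values.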
While raw data may be informative, it is rarely visually pleasing or engaging. Data visualization is a technique used to take informative yet dull data and transform it into beautiful, insightful knowledge. Making data more visually appealing plays into our nature of being primarily optical beings. In other words, we process information much faster when it stimulates our visual senses.
Data Visualization Tools
Data professionals, such as data scientists and data analysts, would typically leverage data visualization tools as this helps them to work more efficiently and communicate their findings more effectively.
The tools can be broken down into two categories: 1) code-free and 2) code-based. Let’s take a look at some popular tools in each category.
Not everyone in your organization is going to be tech-savvy. However, lacking the ability to program should not deter you from deriving insights from data. You can lack programming ability but still be data literate – someone who can read, write, communicate, and reason with data to make better data-driven decisions.
Thus, code-free tools serve as an accessible solution for people who may not have programming knowledge (although people with programming skills may still opt to use them). More formally: code-free tools are graphical user interfaces that come with the capabilities of running native scripts to process and augment the data.
Some example code-free tools include:
Power BI is a highly popular Microsoft solution for data visualization and business intelligence. It is among the world's most popular business intelligence tools used for reporting, self-service analytics, and predictive analytics. This platform service makes it easy for you to quickly clean, analyze, and begin finding insights into your organization's data.
Tableau is also one of the world’s most popular business intelligence tools. Its simple drag-and-drop functionality makes it easily accessible for anyone to begin finding insights into their organization's data using interactive data visualizations.
Datacamp’s Tableau Fundamentals skill track is a great way to get started with Tableau.
Visualization with Code
If you are more tech-savvy, you may prefer to use a programming language to visualize your data. The increase in data production has boosted the popularity of Python and R due to their various packages that support data processing.
Let’s take a look at some of these packages.
Python is a high-level, interpreted, general-purpose programming language [source: Wikipedia]. It offers several great graphing packages for data visualization, such as Matplotlib, Seaborn, and Plotly, all of which are used in the examples later in this article.
The Data Visualization with Python skills track is a great sequence of courses to supercharge your data science skills using Python's most popular and robust data visualization libraries.
R is a programming language for statistical computing and graphics [source: Wikipedia]. It is a great tool for data analysis, as you can create almost any type of graph using its various packages. Popular R data visualization packages include:
Common Graphs Used for Data Visualization
We have established that data visualization is an effective way to present and communicate with data. In this section, we will cover how to create some common types of charts and graphs, which are an effective starting point for most data visualization tasks, using Python and the Matplotlib package. We will also share some use cases for each graph.
Note: See this Datacamp Workspace to access the full code.
A frequency table is a great way to represent the number of times an event or value occurs. We typically use them to find descriptive statistics for our data. For example, we may wish to understand the effect of one feature on the final decision.
import pandas as pd

"""
source: https://heartbeat.comet.ml/exploratory-data-analysis-eda-for-categorical-data-870b37a79b65
"""

def frequency_table(data: pd.DataFrame, col: str, column: str):
    # Two-way counts of `col` against `column`, including "All" totals.
    freq_table = pd.crosstab(index=data[col], columns=data[column], margins=True)
    # Divide each column by its total to get relative frequencies.
    rel_table = round(freq_table / freq_table.loc["All"], 2)
    return freq_table, rel_table

# car_data is assumed to be a DataFrame loaded beforehand; it is not defined in this snippet.
buying_freq, buying_rel = frequency_table(car_data, "class", "buying")

print("Two-way frequency table")
print(buying_freq)
print("---" * 15)
print("Two-way relative frequency table")
print(buying_rel)
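Because `car_data` is not defined in the snippet above, here is a small self-contained sketch of the same idea. The `toy_cars` DataFrame and its values are invented for illustration; only `pd.crosstab` and its `margins` argument come from pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for car_data; the values are made up.
toy_cars = pd.DataFrame({
    "class": ["acc", "unacc", "acc", "unacc", "unacc", "good"],
    "buying": ["low", "high", "low", "high", "low", "low"],
})

# Two-way frequency table with row and column totals (labelled "All").
freq_table = pd.crosstab(index=toy_cars["class"], columns=toy_cars["buying"], margins=True)

# Divide each column by its total to get relative frequencies per column.
rel_table = (freq_table / freq_table.loc["All"]).round(2)

print(freq_table)
print(rel_table)
```

Each column of the relative table sums to 1 (ignoring the "All" row), which makes it easy to compare how the classes are distributed within each buying level.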
The bar graph is among the most simple yet effective data visualizations around. It is typically deployed to compare the differences between categories. For example, we can use a bar graph to visualize the number of fraudulent cases to non-fraudulent cases. Another use case for a bar graph may be to visualize the frequencies of each star rating for a movie.
Here is how we would create a bar graph in Python:
"""
Starter code from tutorials point
see: https://bit.ly/3x9Z6HU
"""
import matplotlib.pyplot as plt

# Dataset creation.
programming_languages = ['C', 'C++', 'Java', 'Python', 'PHP', "Other", "None"]
employees_frequency = [23, 17, 35, 29, 12, 5, 38]

# Bar graph creation.
fig, ax = plt.subplots(figsize=(10, 5))
plt.bar(programming_languages, employees_frequency)
plt.title("The number of employees competent in a programming language")
plt.xlabel("Programming languages")
plt.ylabel("Frequency")
plt.show()
Pie charts are another simple and efficient visualization tool. They are typically used to visualize and compare parts of a whole. For example, a good use case for a pie chart would be to represent the market share for smartphones. Let’s implement this in Python.
"""
Example to demonstrate how a pie chart can be used to represent the market
share for smartphones.
Note: These are not real figures. They were created for demonstration purposes.
"""
import numpy as np
from matplotlib import pyplot as plt

# Dataset creation.
smartphones = ["Apple", "Samsung", "Huawei", "Google", "Other"]
market_share = [50, 30, 5, 12, 2]

# Pie chart creation
fig, ax = plt.subplots(figsize=(10, 6))
plt.pie(market_share, labels=smartphones, autopct='%1.2f%%')
plt.title("Smartphone Marketshare for April 2021 - April 2022", fontsize=14)
plt.show()
Line Graphs and Area Charts
Line graphs are great for visualizing trends or progress in data over a period of time. For example, we can visualize the number of sneaker sales for the month of July with a line graph.
import matplotlib.pyplot as plt

# Data creation.
sneakers_sold = [10, 12, 8, 7, 7, 10]
dates = ["Jul '1", "Jul '7", "Jul '14", "Jul '21", "Jul '28", "Jul '31"]

# Line graph creation
fig, ax = plt.subplots(figsize=(10, 6))
plt.plot(dates, sneakers_sold)
plt.title("Sneakers sold in Jul")
plt.ylim(0, 15)  # Change the range of y-axis.
plt.xlabel("Dates")
plt.ylabel("Number of sales")
plt.show()
An area chart is an extension of the line graph, but they differ in that the area below the line is filled with a color or pattern.
Here is the exact same data above plotted in an area chart:
# Area chart creation
fig, ax = plt.subplots(figsize=(10, 6))
plt.fill_between(dates, sneakers_sold)
plt.title("Sneakers sold in Jul")
plt.ylim(0, 15)  # Change the range of y-axis.
plt.xlabel("Dates")
plt.ylabel("Number of sales")
plt.show()
It is also quite common to see stacked area charts that illustrate the changes of multiple variables over time. For instance, we could visualize the brands of the sneakers sold in the month of July rather than the total sales with a stacked area chart.
# Data creation.
sneakers_sold = [[3, 4, 2, 4, 3, 1], [3, 2, 6, 1, 3, 5], [4, 6, 0, 2, 1, 4]]
dates = ["Jul '1", "Jul '7", "Jul '14", "Jul '21", "Jul '28", "Jul '31"]

# Multiple area chart creation
fig, ax = plt.subplots(figsize=(10, 6))
plt.stackplot(dates, sneakers_sold, labels=["Nike", "Adidas", "Puma"])
plt.title("Sneakers sold in Jul")
plt.ylim(0, 15)  # Change the range of y-axis.
plt.xlabel("Dates")
plt.ylabel("Number of sales")
plt.legend()
plt.show()
Each plot is showing the exact same data but in a different way.
Histograms are used to represent the distribution of a numerical variable.
import numpy as np
import matplotlib.pyplot as plt

data = np.random.sample(size=100)  # Graph will change with each run

fig, ax = plt.subplots(figsize=(10, 6))
plt.hist(data, bins=6)
plt.title("The distribution of data")
plt.xlabel("Data")
plt.ylabel("Frequency")
plt.show()
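To see what the histogram is actually counting, the binning step can be reproduced without plotting. This is a minimal sketch, assuming NumPy's `np.histogram` with the same number of bins as above; the seeded generator and the seed value are arbitrary choices made here so that the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # Seeded so the result is repeatable.
data = rng.random(100)  # 100 samples drawn uniformly from [0, 1).

# np.histogram returns the count in each bin and the edges of the bins.
counts, bin_edges = np.histogram(data, bins=6)

print(counts)
print(bin_edges)
```

Six bins need seven edges, and the counts always add up to the number of samples; the height of each bar in the plot is one of these counts.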
Scatter plots are used to visualize the relationship between two different variables. It is also quite common to add a line of best fit to reveal the overall direction of the data. An example use case for a scatter plot may be to represent how the temperature impacts the number of ice cream sales.
import numpy as np
import matplotlib.pyplot as plt

# Data creation.
temperature = np.array([30, 21, 19, 25, 28, 28])  # Degrees Celsius
ice_cream_sales = np.array([482, 393, 370, 402, 412, 450])

# Calculate the line of best fit
# Build the design matrix [temperature, 1] and solve the normal equations.
X_reshape = temperature.reshape(temperature.shape[0], 1)
X_reshape = np.append(X_reshape, np.ones((temperature.shape[0], 1)), axis=1)
y_reshaped = ice_cream_sales.reshape(ice_cream_sales.shape[0], 1)

theta = np.linalg.inv(X_reshape.T.dot(X_reshape)).dot(X_reshape.T).dot(y_reshaped)
best_fit = X_reshape.dot(theta)

# Create and plot scatter chart
fig, ax = plt.subplots(figsize=(10, 6))
plt.scatter(temperature, ice_cream_sales)
plt.plot(temperature, best_fit, color="red")
plt.title("The impact of weather on ice cream sales")
plt.xlabel("Temperature (Celsius)")
plt.ylabel("Ice cream sales")
plt.show()
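As a cross-check on the line of best fit, the same least-squares line can be obtained with NumPy's `np.polyfit`. This is an alternative sketch rather than the article's method; it reuses the temperature and sales figures from the example above and compares the result with the normal-equation solution.

```python
import numpy as np

temperature = np.array([30, 21, 19, 25, 28, 28])  # Degrees Celsius
ice_cream_sales = np.array([482, 393, 370, 402, 412, 450])

# A degree-1 polynomial fit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(temperature, ice_cream_sales, deg=1)
best_fit = slope * temperature + intercept

# Normal-equation solution for comparison: theta = (X^T X)^-1 X^T y.
X = np.column_stack([temperature, np.ones(len(temperature))])
theta = np.linalg.inv(X.T @ X) @ X.T @ ice_cream_sales

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(np.allclose(theta, [slope, intercept]))
```

Both approaches solve the same least-squares problem, so `np.polyfit` is a convenient shortcut when only a straight-line fit is needed.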
Heatmaps use a color-coding scheme to depict the intensity of a value across two dimensions. One use case of a heatmap could be to illustrate the weather forecast (e.g. the areas in red show where there will be heavy rain). You can also use a heatmap to represent web traffic and almost any data that is three-dimensional.
To demonstrate how to create a heatmap in Python, we are going to use another library called Seaborn – a high-level data visualization library based on Matplotlib.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(8, 10)  # Graph will change with each run

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(data)
plt.title("Random Uniform Data")
plt.show()
Treemaps are used to represent hierarchical data with nested rectangles. They are great for visualizing part-to-whole relationships among a large number of categories such as in sales data.
To help us build our treemap in Python, we are going to leverage another library called Plotly, which is used to make interactive graphs.
"""
Source: https://plotly.com/python/treemaps/
"""
import plotly.express as px
import numpy as np

df = px.data.gapminder().query("year == 2007")

fig = px.treemap(
    df,
    path=[px.Constant("world"), 'continent', 'country'],
    values='pop',
    color='lifeExp',
    hover_data=['iso_alpha'],
    color_continuous_scale='RdBu',
    color_continuous_midpoint=np.average(df['lifeExp'], weights=df['pop']),
)
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
3 Tips for Effective Data Visualization
Data visualization is an art. Developing your skills will take time and practice. Here are three tips to get you moving in the right direction:
Tip #1: Ask a Specific Question
The first step to creating truly insightful data visualizations is to have a specific question you want to answer with your data. If you ever see a visualization whose key message is distorted, it is highly likely that this step was skipped or that the creator was trying to answer multiple questions at one time. To avoid overkill, ensure you have clearly articulated the specific questions to be answered with your visualizations (e.g. Is there a relationship between features A and B?).
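A specific question, such as whether feature A is related to feature B, can often be checked numerically before any chart is drawn. The sketch below uses made-up arrays named `feature_a` and `feature_b` (both hypothetical) and NumPy's `np.corrcoef` to compute the Pearson correlation coefficient.

```python
import numpy as np

# Made-up measurements for two hypothetical features.
feature_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
feature_b = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(feature_a, feature_b)[0, 1]

print(f"Pearson correlation between A and B: {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship that a scatter plot would show well, while a value near 0 suggests there may be little linear relationship to show.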
Tip #2: Select the Appropriate Visualization
An effective visualization should achieve two goals: 1) it should clearly reveal the answer to the question it was posed and 2) it should be easy for the audience to quickly grasp what is being presented. A key component for meeting such criteria is to ensure you are selecting the appropriate chart to represent what you wish to reveal.
Tip #3: Highlight the Most Important Information
Humans are extremely good at recognizing patterns, but the same cannot be said for our ability to recall. For example, have you ever seen someone who looks familiar but you have absolutely no idea what their name is? Such scenarios happen to us all as a result of our design. It is much easier for us to remember someone's face because this correlates with our ability to identify patterns, whereas their name requires us to recall information.
“Humans have a tendency to see patterns everywhere. That’s important when making decisions and judgments and acquiring knowledge; we tend to be uneasy with chaos and chance.”
– Gilovich, 1991.
With all things said, an effective visualization leans on our natural tendency to recognize patterns. The use of colors, shapes, size, etc. is an extremely effective technique for emphasizing the most important information you want to display. Data visualization is an art; doing it well involves asking a specific question, selecting the appropriate visualization to use, and highlighting the most important information. Do not be too discouraged if your first attempt is not the greatest; it takes time and practice to improve your skills, and DataCamp's wide range of data visualization courses is an excellent resource to help you become a master data visualizer.