Data literacy and knowing your data help you build trust with data producers and consumers alike
For optimal data storytelling, the right tool for the right job is just as important as the right chart for the right insight
Data literacy helps organizations break down analytic silos and upskill anyone to make data-driven decisions
True business value from data comes from understanding how to answer questions before it's delivered. So, once the data is available, what techniques are used to visualize the data to help tell a story? This talk will cover how data literacy is used by business users to ask richer and more complex questions from planning the data across the organization. I will also talk about the power of government data democracy, analytics upskilling, and self-service best practices to help anyone to make data-driven decisions. I will wrap up by sharing how data literacy and Data For Good can be adopted within corporate cultures to help improve communication and empower people throughout the organization.
The true business value of data visualization
Why is data visualization important? Visualized data is everywhere. Gone are the days when you had to read through paragraphs and numbers or sift through Excel spreadsheets to find insights. Visualizing data breaks down the complexity behind the data sources and makes it easier and faster to consume. Think about how the pandemic has made interpreting data part of your daily routine. I check COVID case numbers like the weather now, and a quick Google Trends keyword search on “Covid Dashboards” shows an exponential curve in relevance in just a few days globally. Searches went from 0 to 100, because people are hungry to digest information quickly so they can make informed decisions.
Visualizing data reduces the ramp-up time to understand the context behind the data and invites discussion to understand the what and the why. Today, we have many different technology tools available to paint a picture with data to help uncover anomalies or patterns. The phrase, “A picture is worth a thousand words”, invites consumers of data to ask questions about the data, while also encouraging more thoughtful discussion. This is commonly known as a train of thought analysis.
At Bloomberg, I start with a business question like, how many reads occurred in the last four weeks from all the news stories that were published? Within a few clicks, I would see those results and then start asking questions like, is the readership trend going up or down, which I could see in the combo chart below. I can then ask questions about the geographic distribution of the readers using this map chart, which uses color intensity to represent higher readership for each country. Then, I can see the specific stories with the most reads in this horizontal bar chart.
Our team builds self-service data visualization solutions that allow anyone who has access to answer these questions in three clicks or less.
Art + science
What do a bowl of fruit and a database table have in common? I find creating a data visualization is both an art and a science. In my case, this started with my dad, who was an artist, and painted watercolors and pastels, inspired creativity, and introduced me to the master craftsmen of the world of art. On a side note, if you're ever in the Philadelphia area, please go visit the Barnes Museum for inspiration. I'm mad at myself for not going sooner. After I went with my daughter a few weeks ago for a school assignment, I found the entire experience amazing. The layout of each piece of art becomes an immersive experience where symmetry can be found everywhere. The coordination of colors is like eye candy for the soul. I was quite impressed and it was a wonderful way to showcase artists differently.
Art is about inspiring free thought through visual images using the elements of art. Let's break those down together.
First, we have shape, which is the area defined by edges.
Next, we have form, which is the perceived volume or dimensionality of a work of art.
Then we have value, which is the lightness or darkness.
Then we have line, which can be straight or curved and spans the distance between two points.
Color is the element produced when light strikes an object and reflects back to the eye. Some properties of color include hue (such as red, yellow, green, or blue), intensity, and value, which is brightness.
Then we have space, which is the area an artist provides for a particular purpose, like the background or foreground.
Finally, we have texture, which is the way a three-dimensional work of art actually feels when it's touched, or the visual feel of a two-dimensional work.
On the other side, we have science, which is about empirical evidence, where information is acquired by observation or experimentation. Data is recorded as part of the scientific method, and used for investigating and acquiring knowledge. When we bring these two concepts together, we have a data visualization that's both aesthetically pleasing to the eye, like a work of art, and provides insights from the underlying statistical evidence captured from the data.
The intention is for the audience to quickly digest and interpret the information without any guidance. I highly encourage you to create visualizations that incorporate both of those elements.
In this visual heat map example on the right, you can see the use of color: the days of the week that have more activity over a multiple-year span are shown in darker shades of red. I can interpret this information without going through the complexity of how this data was produced or what technologies were used to create it.
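As a rough illustration of what sits behind a heat map like this one, here is a minimal Python sketch. The event dates are made up and the five-level shade scale is an assumption; neither comes from the chart shown in the talk. It counts activity by year and day of week, then scales each count to a shade level, where a higher level means a darker red.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical activity dates; not the data behind the chart in the talk.
events = [
    date(2020, 3, 2), date(2020, 3, 2), date(2020, 3, 9),  # Mondays
    date(2020, 3, 4),                                      # Wednesday
    date(2021, 3, 1),                                      # Monday
    date(2021, 3, 5), date(2021, 3, 5),                    # Fridays
]

# Count events per (year, weekday) cell: the grid a heat map colors in.
cells = Counter((d.year, WEEKDAYS[d.weekday()]) for d in events)


def shade(count, max_count, levels=5):
    """Map a count to a shade level from 0 (lightest) to levels - 1 (darkest)."""
    if max_count == 0:
        return 0
    return round((levels - 1) * count / max_count)


max_count = max(cells.values())
shades = {cell: shade(n, max_count) for cell, n in cells.items()}

for (year, weekday), level in sorted(shades.items()):
    print(year, weekday, "count:", cells[(year, weekday)], "shade:", level)
```

The reader of the chart only ever sees the shade, which is why the visual can be interpreted without knowing how the counts were produced.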
Techniques to ask the right questions
So, how do we ask the right question about the data before we even see it? I summarize the techniques we use as follows:
KYD: The first is Know Your Data. Understand the business requirements and what the users are doing with the data, including the people, processes, and technology used. You want to do a little research ahead of time and, to be honest, you want to understand a person's workflow. So, how are they currently creating their existing reports? For example, do they need quarter-end financial reports to be accurate because of regulatory impacts to the company?
VOC: Next, we have the Voice of the Customer. This concept has existed for decades and has been taught at many business schools around the world. VOC is about interviewing your business users and actively listening to their needs. Write down specific points on what business questions they are trying to answer from the data, and schedule working sessions where you can actively engage in a dialog with them. Make sure you focus on the current pain points. If you can deliver a dashboard that reduces their time from three days to three clicks, you become that Analytics Hero.
ABA: Last, we have ABA, which is Always Be Agile. At Bloomberg, we use the agile methodology, which has become commonplace in the tech industry for application development. One of the reasons Agile is successful is that it creates an interactive communication line between the business and engineering teams to iteratively deliver value. The process involves stories, each built around a common theme, where a development team completes tasks in sprints of 2 to 3 weeks. In that process, understanding the what and the why behind each story becomes the focus. All of these techniques help improve your data literacy, regardless of your role within the organization.
Data literacy is formally defined as the ability to read, work with, analyze, or argue with data, and increasing your data literacy helps ask the right questions before, during, and after you work with the data.
So, how did we get to this point where technology has evolved to allow anyone to solve problems with data so quickly? In my book, I researched and collected key people, process, and technology milestones over the last few decades that have broken down the barriers to working with data. The evolution of data analysis includes thought leaders like Naomi Robbins, who runs the New York City Data Visualization Meetup, and Dr. Ralph Kimball, who was one of the founding fathers of data warehousing.
The following diagram visually represents the underlying source data, in rows and columns, that I collected for my research, using a dendrogram, also known as a sunburst chart. What I like about this visualization is that it makes it easier to present the entire population of data without scrolling through the information, as you would in Excel or Google Sheets. I was inspired to create this visual by one of my favorite authors, Alberto Cairo, who says this about data visualizations: “Good graphics are not just displays to extract information from, but devices to explore information with.”
As we go behind the curtains to see what makes a great data visualization, let's talk a little bit about data types.
Knowing your data
Understanding data types
Data types are the details of how data is stored, or its intended use when it was created. Data types are a well-known concept in many different programming languages. Data types create consistency for each data value when you build your charts and visualizations. In my book, I discuss data types and data attributes in much more detail, but I want to share this summary where I classify the data in the following hierarchy.
Continuous, which is numeric values like integer, float, or time.
Next we have categorical, which are descriptive values, like a stock ticker, first name, or last name.
Discrete, which defines boundaries around the possible data values, like numbers on a roulette wheel.
The takeaway from this slide is you don't have to be a data scientist or an engineer to understand how data is stored. But knowing that data types exist will help you create better data visualizations much faster.
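To make the three categories concrete, here is a minimal Python sketch that declares a category for each field and checks values against it. The field names, the `SCHEMA` layout, and the `invalid_fields` helper are all hypothetical, and the roulette check assumes a European wheel numbered 0 to 36.

```python
# Allowed values for the discrete example: a European roulette wheel (0 to 36).
ROULETTE_NUMBERS = set(range(37))

# Hypothetical schema: field name -> (category, validity check).
SCHEMA = {
    "closing_price": ("continuous", lambda v: isinstance(v, (int, float))),
    "ticker": ("categorical", lambda v: isinstance(v, str)),
    "roulette_number": ("discrete", lambda v: v in ROULETTE_NUMBERS),
}


def invalid_fields(row):
    """Return the names of fields whose value does not fit its declared category."""
    return [name for name, (_, is_valid) in SCHEMA.items()
            if not is_valid(row[name])]


good_row = {"closing_price": 131.5, "ticker": "IBM", "roulette_number": 17}
bad_row = {"closing_price": "n/a", "ticker": "IBM", "roulette_number": 42}

print(invalid_fields(good_row))
print(invalid_fields(bad_row))
```

Catching a text value in a numeric field this early is what keeps a later chart from silently dropping or mis-sorting it.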
Understanding data flow
Next, we have Data Flow, which is a subset of data lineage and part of the organization's overall data governance strategy. In my example here, we have a simple visual representation of how data is processed, which becomes important as data moves across multiple stages from each source or target before it becomes available to you to work with. Knowing data flow will help you understand the three V's of data: volume, velocity, and variety.
Volume is the quantity of data and how it's stored, measured in megabytes, gigabytes, terabytes, or even petabytes.
Velocity is the speed at which data is generated. It covers both how data is produced and how it is consumed. For example, batch processing is how data feeds are sent between systems, where blocks of records or bundles of files are sent and received. Modern velocity is real-time, where streams of data are in a constant state of movement.
Variety is the different formats data can be stored in, including text, image, database tables, and files. Variety creates both challenges and opportunities for analysis because of different technologies used and techniques you need to work with it like NLP, which stands for Natural Language Processing.
Part of our data literacy journey is the ability to argue with data, and I find that, to be effective, you should be transparent about the data flow to build trust with your consumers.
Building data models
During this talk, we've been peeling back the layers behind data visualizations down to its core, which is known as a data model. The data model defines the relationships between one or more tailored data sources within the analytic solution.
The first step in building a data model is to understand what questions you're trying to answer from the data. Then, you can decide which model will best fit the answer to the questions at hand. Common questions that apply to almost any data model include who, what, when, how, where, and why. Depending on the subject area you are working with, answering some of those questions becomes very easy. For example, if we're working with sales data from a shopping cart app:
Who would be the customer
What would be the product
When would be the transaction date and time of the purchase
Where would be the geographic location of the transaction
How could be the channel, such as in person at a physical store or on a mobile device
Why is usually the most important question. Was the why a promotion or a marketing campaign that happened for a specific product?
That's why KYD (knowing your data) becomes important: it helps you find those missing pieces of the puzzle and create a complete picture from the analysis of your data.
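The six questions can be sketched as fields on a single record. This is a minimal Python illustration, not the model from the talk: the `Sale` class, its field names, and the two sample rows are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Sale:
    customer: str            # who
    product: str             # what
    purchased_at: datetime   # when
    location: str            # where
    channel: str             # how: "store" or "mobile" in this sketch
    campaign: Optional[str]  # why: None when no promotion is recorded


# Made-up transactions for illustration only.
sales = [
    Sale("C001", "Headphones", datetime(2021, 11, 26, 9, 15),
         "New York", "mobile", "Black Friday"),
    Sale("C002", "Keyboard", datetime(2021, 11, 27, 14, 40),
         "Philadelphia", "store", None),
]

# Rows with no recorded "why" are the missing puzzle pieces to research further.
missing_why = [sale for sale in sales if sale.campaign is None]
print([sale.customer for sale in missing_why])
```

Making the why an explicit, optional field shows at a glance how many transactions still need that context.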
Creating a data model requires you to identify which fields will provide the highest quantity and quality for analytics. In many cases, you'll have to conform the data so that every value in a field has a consistent data type, which makes your analysis much faster and easier.
My tip of the day is to not use every single field that you have available in your data source. That way, you can focus on the specific fields that you need to answer your business questions. When creating your visualizations, the anatomy of a chart boils down to dividing the fields you plan on using for analysis into dimensions and measures.
Dimensions are values with descriptive attributes that are commonly used to identify a person, place, or thing.
A measure is any field that can be aggregated using statistical calculations, like Sum, Min, Max, or Average. Remember that quality metrics come from quality-conformed data.
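The split between dimensions and measures can be sketched in a few lines of Python. The rows below are made up, and `aggregate` is a hypothetical helper: it groups rows by one dimension and applies the Sum, Min, Max, and Average calculations to one measure.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: "country" is a dimension, "reads" is a measure.
rows = [
    {"country": "US", "reads": 120},
    {"country": "US", "reads": 80},
    {"country": "UK", "reads": 50},
    {"country": "UK", "reads": 70},
    {"country": "JP", "reads": 30},
]


def aggregate(rows, dimension, measure):
    """Group rows by a dimension and summarize a measure for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {"sum": sum(vals), "min": min(vals),
              "max": max(vals), "avg": mean(vals)}
        for key, vals in groups.items()
    }


summary = aggregate(rows, "country", "reads")
for country, stats in summary.items():
    print(country, stats)
```

Each key in `summary` would become a bar, slice, or point in a chart, and the chosen statistic would set its size.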
Consistent values, as you see in the table on the left, are much easier to visualize than the disjointed text on the right, which becomes a distraction when you're creating your data visualizations. This concept goes hand in hand with the prior slide on data types. When your data types are well defined, it's much easier to create your dimensions, measures, and new charts and visualizations.
Choosing the best visualization
I treat creating data visualizations like a craft, so my thinking and thought process evolve with what I learn over time. But I found this chart chooser helps me define some rules of the road, and we can walk through it to help you tell the right story with your data. Some of the rules of the road that I always use are:
Less is more, let the data tell the story: This means holding off on finding the right chart until after you understand the data fields and values you are trying to visualize.
Next, reuse your code: You don't have to reinvent the wheel every time you create a new chart, app, or dashboard. If you model your data correctly, adding a year-over-year trend chart could be as simple as a copied piece of code with some minor adjustments.
Next, avoid clutter and Christmas tree charts: When you're starting out visualizing data, all these tools have a great drag-and-drop ability. The challenge is that you get really excited and over-enthusiastic and, as happened to me, that clouds your judgment. So, avoid the urge to put too many charts in one place with different color variations all at the same time, which can present conflicting information.
Lastly, there are experts in the field of data visualization. I highly encourage you to read up on best practices from authors and practitioners like Stephen Few, Alberto Cairo, and Edward Tufte, who are masters of data visualization.
So let's go through a couple of examples that I found in this diagram to help you better choose the right chart for the right source data.
Pie charts are very common in dashboards, but they can be overused or even abused. Remember, the anatomy of a chart is dimensions and measures. A pie chart is a single dimension with one measure. Under the hood, a pie chart takes the value of each component, divides it by the total of all the values, and then multiplies it by 360 degrees.
That is how each data element is visually represented as a slice. So by design, a pie chart is intended to represent all the values that exist in a single column. That visual works well when you only have a few values to select from. But when you have hundreds or thousands of distinct values, the chart becomes difficult to interpret and basically unreadable. If you have lots of values and you then limit the information, you're only showing a sampling, which could cause confusion for your audience. So, bottom line, use a pie chart when the data is well defined and only a few values are available to choose from.
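The slice arithmetic described above can be written out directly. This is a small Python sketch with made-up portfolio values; `pie_angles` is a hypothetical helper name.

```python
def pie_angles(values):
    """Convert each value to its slice angle: value / total * 360 degrees."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("a pie chart needs a positive total")
    return {label: value / total * 360 for label, value in values.items()}


# Made-up values for a dimension with only a few distinct members.
portfolio = {"Equities": 50, "Bonds": 30, "Cash": 20}
angles = pie_angles(portfolio)

for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

With three values every slice is easy to read. With thousands of distinct values most angles shrink to a fraction of a degree, which is why the chart becomes unreadable.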
Line charts are popular when you want to display changes over time and compare them visually. A line chart commonly has 1 to 2 dimensions along with a single measure. So, if I'm going to include date and time in my dashboard, I'll start off with a line chart like this example, where I'm displaying the closing price of a stock on a specific date.
The X-axis uses date values with a four-digit year, followed by a number representing the month, 1 through 12, and then the day value.
The Y-axis has the value of the daily closing price for each specific date.
So, if your date range includes weekends, which is most common, and, unlike transactional data, the information that you're bringing in has weekend dates with no values, I usually create a derived field ahead of time in my model. That way we avoid that roller coaster pattern where the measure is going up and down without providing any insights from the data. So, model your data to fit the messaging and narrative that you want to display in your charts and visualizations.
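The talk does not spell out what the derived field looks like, so the following Python sketch is only one possible reading, with made-up prices. It derives a trading-day flag for each date and plots only the flagged rows, so empty weekends do not pull the line down.

```python
from datetime import date, timedelta

# Hypothetical daily closing prices; weekend dates arrive with no value.
start = date(2021, 3, 1)  # a Monday
raw_prices = [101.0, 102.5, 101.8, 103.2, 104.0, None, None, 104.6]
series = [(start + timedelta(days=i), price)
          for i, price in enumerate(raw_prices)]


def is_trading_day(day, price):
    """Derived field: True on weekdays that actually have a closing price."""
    return day.weekday() < 5 and price is not None


# Keep only the rows a line chart should connect.
plot_points = [(day, price) for day, price in series
               if is_trading_day(day, price)]

for day, price in plot_points:
    print(day.isoformat(), price)
```

Other treatments, such as carrying the last closing price forward, are equally plausible. The point is that the choice is made in the data model rather than in the chart.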
Lastly, I find that bar charts end up being my go-to option when we’re working with data. They're versatile: you can adjust their properties easily, rotating the axis text and laying the bars out horizontally to make them easier to read. A bar chart can give a full representation of the entire set of values, especially when you're working with tools like Tableau and Qlik Sense, because they use scroll bars.
But in my example here, I sorted the measure in descending order. That adds an extra step in the development life cycle to make sure that you only show the most important information, like the top five.
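Sorting a measure in descending order and keeping the top five takes only a couple of lines. This Python sketch uses made-up story names and read counts.

```python
# Made-up read counts per story (dimension: story, measure: reads).
reads_by_story = {
    "Story A": 1200, "Story B": 450, "Story C": 980, "Story D": 300,
    "Story E": 1510, "Story F": 75, "Story G": 640,
}

# Sort by the measure, largest first, then keep only the top five bars.
ranked = sorted(reads_by_story.items(), key=lambda item: item[1], reverse=True)
top_five = ranked[:5]

for story, reads in top_five:
    print(f"{story}: {reads}")
```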
Be sure to think like an end-user and tell a well-defined story with your data, by using one or more charts together. You want your audience to consume the information quickly without additional guidance. So, you may have to remove charts and restrict the number of dimensions you display at one time.
Visualizing your data
Now we get to the fun part where I walk through a finished data visualization solution. I enjoy mentoring others at Bloomberg, and we offer volunteer hours for tech talks and projects at local universities. This app was built with an open-source public dataset to help computer science students learn how to build predictive models. The source data was rental listings in the New York City area over a specific period of time, like 90 days. The class assignment was to see if the detailed narrative descriptions within each real estate listing could be used to predict future pricing and reduce the vacancy time.
My contribution to this Data For Good project was to show how a data discovery platform can be used to quickly profile and visualize the data before building any predictive algorithm. I recommend building your descriptive analytics like this one before making any predictions.
One of the insights uncovered from this data was presented in this block chart, which displays the distribution of each keyword or phrase that I derived from the sentences of the narrative text, using some basic NLP techniques. Each keyword or phrase is a categorical data type, which I named “feature”. Each value like “doorman”, “dishwasher”, “laundry”, and “building” is used as a dimension in the chart. The size of each block is driven by the measure, number of properties, which is a distinct count of real estate listings. The larger the block size, the more frequently the feature appears in this population of data. I also included a visual cue to help the audience, using a color gradient for the average price per listing: the higher the price, the darker the red, and lower prices fade into a dark blue.
An insight uncovered in this data was that features like exclusive and dining room have a higher average price, which is emphasized by the darker shades you see in orange and red.
One reason why the average price is higher, though, is the lower number of listings that include the feature. You can determine that because the size of those blocks is much smaller in relation to the other block sizes. So this style of visualization combines both color and shape to help a consumer answer multiple questions at once and gain insights using one single chart.
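The two measures behind the block chart, a distinct count of listings and an average price per feature, can be sketched in Python. Plain keyword matching stands in here for the basic NLP techniques, which the talk does not detail. The listings, the `FEATURES` list, and the `summarize` helper are all made up for illustration.

```python
from collections import defaultdict

# Made-up rental listings; not the dataset used in the class project.
listings = [
    {"id": 1, "price": 3000,
     "description": "Doorman building with dishwasher and laundry."},
    {"id": 2, "price": 2500,
     "description": "Laundry in building. Dishwasher included."},
    {"id": 3, "price": 5200,
     "description": "Exclusive listing with dining room and doorman."},
    {"id": 4, "price": 2100,
     "description": "Cozy studio, laundry nearby."},
]

# Hypothetical feature keywords to look for in the narrative text.
FEATURES = ["doorman", "dishwasher", "laundry", "dining room", "exclusive"]


def summarize(listings, features):
    """Per feature: distinct listing count (block size) and average price (color)."""
    ids = defaultdict(set)
    prices = defaultdict(list)
    for listing in listings:
        text = listing["description"].lower()
        for feature in features:
            # Count each listing at most once per feature.
            if feature in text and listing["id"] not in ids[feature]:
                ids[feature].add(listing["id"])
                prices[feature].append(listing["price"])
    return {
        feature: {"listings": len(ids[feature]),
                  "avg_price": sum(prices[feature]) / len(prices[feature])}
        for feature in ids
    }


summary = summarize(listings, FEATURES)
for feature, stats in sorted(summary.items()):
    print(feature, stats)
```

In this toy data, "exclusive" has the highest average price but appears in a single listing. That is the same small-block, dark-color pattern described above.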
To democratize data is to share best practices with the intention of improving data literacy across the organization. One of the simple grassroots approaches came from a suggestion from my mentor Rich Hunter. I became a self-appointed Data Literacy Brand Ambassador, where I am leading by example with all things data and analytics. This involves creating an Analytic Center of Excellence (COE) for the department, where I hold weekly office hours open to anyone in the firm. We host data hackathons, share code, host Tech Talks, and help others within the organization upskill. All of this has turned out to be personally and professionally rewarding as I meet new people across the firm and share stories of how complex problems can be solved using data.
Data for good
Before I wrap up, I just wanted to share some examples of how data can be used for good, helping others. I believe in working at companies with a social conscience, where giving back is part of the corporate culture. One of the challenges I find in helping non-profits is that they always have to fight to get any resources, including the technology, tools, and skills needed to make data-driven decisions. To help reduce those resource gaps, DataCamp offers free access to its platform for instructors and students. At Bloomberg, all full-time employees can participate in volunteer hours, where up to $5000 can be donated to philanthropies. So I get paid to help others that I mentor, like I do with the American Corporate Partners organization, ACP. They help military vets transition into jobs in corporate America, and that's been a very rewarding experience. Another example is Bloomberg's Data For Good Exchange Immersion Fellowship Program, which is called d4gx. It recruits PhD students who are studying data science and pairs them with non-profit organizations and municipalities around the world.
This shows how technologists, non-profits, and researchers can pool their knowledge and bring positive change to their communities. An excellent example is an app that was built for the City of Bogota, Colombia, where city officials, Bloomberg colleagues, and Xavier Gonzales, a data scientist fellow from Columbia University, came together to provide transparency to public works projects. Their app was built on open source code, and it shows citizens of the local community how their tax dollars are being distributed by project. The goal was to bring transparency to local government using data, so citizens can see details like the cost of a road repair, the creation of a public park, or the construction of a kindergarten. The data is categorized and aggregated by project size and type, so you can drill down into details on an interactive map with pictures of each project.
My favorite feature is the send feedback button, so citizens are empowered to participate directly with the app and the data by asking questions about the project. Now people become that controller where they can point out any bias or preferential treatment like questioning if too many projects are going to a single contractor.
So to wrap things up, I just want to share a few things:
Data literacy improves communication between both consumers and producers of data across the firm.
We can use data visualizations to provide context, insight, and understanding from data to a large audience of users without being a data scientist or an engineer.
Data models are the secret sauce to fast visualizations. Create your data models to answer questions, so you can focus on the why.
KYD, or Know Your Data, requires you to do a little research about the underlying source. You can uncover business rules that are not transparent in the data itself. KYD will build trust when you're analyzing and arguing with your data.
Lastly, sharing is caring. We want to break down those analytic silos by empowering others within the organization to read, work with, analyze, and argue with data.
So thank you very much for joining me today. I look forward to having more conversations about analytics and all things data. So I think we're ready for some questions.
Questions and Answers
Question: You mentioned not having too many colors in a visualization. Is there a best practice to have only a certain amount of colors? Is there a maximum of 6 or 10 colors for example?
Answer: That's a good question. It will vary, depending on the data subject that you're working with. But I do go to the power of three sometimes. I've seen a lot of value in theming your applications and matching colors for the same dimensions across different charts and apps. Three is ideal, but if you have to go higher, then, you know, five or seven can work. I think the key takeaway is that if you have to display lots of colors, make sure they don't contradict something in another chart, where the legend color in one chart is the same color used for a different dimension value in another. That happens in geography often, right? You're looking at a map of countries, and then you look over at another dimension that has the same color used for, say, Belarus or whatever other countries you have. The user would start to get confused: is that data associated with that country? Even though the other chart is not related to it.
Question: Is it better to create one multiple-dimension chart or multiple one-dimension charts?
Answer: I am definitely on the side of less is more. So I usually go with one-dimensional charts. I find that multiple-dimension charts or complementary charts can confuse the user. But even in my last example, right? I am using color and dimensions to tell that story. So one suggestion would be to use color as an opportunity to add another insight without adding an extra dimension to the chart. That way you don't have to display too many values all at the same time.
Question: Can over-labeling add clutter or add confusion? I think this is something that I've definitely seen also in my personal life when I look at different data visualizations, data storytelling techniques, and books. Or is it best, where possible, to provide all labels and ensure they are built on quickly?
Answer: Yeah, there's a lot of give and take on that. You know, once users have seen a chart for the first time, there doesn't need to be all this boilerplate language to explain everything, because the next time they go in, they already know the information in their data. That's the KYD, right? As for all the extra background or history of how you got there, I try to always provide information about the data or some kind of link to that information. So maybe that's a nice tradeoff. We always have pages or, you know, links in our apps that give much more detail around the methodology of how we built the app, the data dictionaries, or details around, you know, how to read the charts. I've been leaning towards using YouTube-style videos on occasion to train people, because I find those helpful to have available for first-time users. But there's no right or wrong answer. Sometimes, you know, for legal responsibilities, you have to put in regulatory stuff. You know, you have to put the boilerplate language in there.
Question: What are your default tools or techniques for exploratory data analysis?
Answer: Oh wow. So, I mean, my book uses Jupyter Notebook, with Python code and SQL. That's the advantage of using Jupyter: it ties well into the modern ecosystems and libraries that are available for data scientists. So, if you're going down the path of building predictive models, then staying within that technology stack makes sense. However, we do have lots of technology tools available to us at Bloomberg, like business intelligence and analytics visualization tools. Tableau and Qlik Sense are certainly at the top of the list, along with Power BI. There are plenty of examples out there that tell you which ones are better or worse, but I find a lot of them, from a visualization perspective, offer a lot of out-of-the-box features.
Question: I'm a full-stack analyst working with various leaders to enable data-driven decision-making. One of the main prerequisites is trust in the reliability of the data; leaders may refuse to even start a conversation if they do not trust how reliable the data is. Do you have any recommendations for getting buy-in from influential business leaders? Is this part of “know your data”?
Answer: It's a wonderful question, and there are plenty of examples where I've had to learn through trial and error, right. I've led with a visualization when I hadn't built trust ahead of time. I followed the lead of people who are smarter than me, and a lot of times they've recommended progressive disclosure. With that approach, you don't need to build a final solution. You actually build the solution in tandem with your business stakeholders at every stage of the game so that trust is built over time. You know, we don’t always have the luxury of that. But I do make sure that I don't try to slap a dashboard on a dataset without really making sure that our business stakeholders and our sponsors trust the data, so that they aren't confused by the tool, right? We don't want them blaming the tool when the data's wrong. You know, we want to understand the data flow and the lineage before going for the visualization.
Question: How do you create an elevator pitch that doesn't force data democratization like on, let's say, data governance? How do you not force corporate change, while also balancing trying to increase organizational data fluency overall?
Answer: Those are very good questions. You know, forcing change from the top down always has consequences. I find that grassroots efforts and slow progressions, i.e., making sure you don't go a bridge too far, work well. Make incremental changes to see which ones are working. At the end of the day, the low-cost stuff, like the Center of Excellence I set up, had value. We just held weekly office hours. I mean, the cost of that is just time, but, you know, the investment is returning fivefold, right? There are now many more people working with data, and they can have a wide spectrum of skills, right? They don't have to be experts, engineers, or data scientists just because they want to work with data. That is analytic upskilling. And they can use Excel. You can use any technology available just to work with data. Organizationally, it does make sense to have a data governance strategy, to have leadership in place for it, and to bring in people from across the organization, as a council, for example. I've seen that if you have executive sponsorship, it will certainly help guide data governance, and of course, you want data literacy as part of that data governance strategy.
Question: When providing context and data, let's say, comparing it to last year, to a benchmark or average, or to forecast your expectation, are there any thoughts you have, maybe, on how to provide context concisely as a single data point?
Answer: Um, there's no one size fits all on that one. But, you know, a lot of times, it goes back to that progressive disclosure. If you build the solution with the stakeholders, then as you progress, you don't define that threshold or benchmark on your own. It actually becomes a feature that they co-develop with you. I find people will start defining these thresholds without even looking at the data or understanding the data behind it, and then everything starts looking bad. So, look at the data and actually analyze it: do descriptive analytics on top of your data first. Then, we say, looking at the data, let's define a threshold that's going to be a 30-day average. If we exceed the 30-day average, we can set up alerts, for example. You should get collaboration and a consensus on that threshold as you go along.
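The 30-day average threshold mentioned in this answer could look something like the following Python sketch. The daily values are made up, and `alerts` is a hypothetical helper that compares each day against the average of the 30 days before it.

```python
from statistics import mean


def alerts(daily_values, window=30):
    """Return indexes of days whose value exceeds the average of the prior window."""
    flagged = []
    for i in range(window, len(daily_values)):
        baseline = mean(daily_values[i - window:i])
        if daily_values[i] > baseline:
            flagged.append(i)
    return flagged


# Made-up history: 30 steady days, then a dip, a spike, and a normal day.
history = [100] * 30 + [95, 130, 100]
flagged_days = alerts(history)
print(flagged_days)
```

Because the baseline comes from the data itself, stakeholders can review the window size and the comparison together instead of guessing a fixed number up front.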
Question: What do you think is the best skill when creating a data story?
Answer: I think practice makes perfect. It's an investment in time and yourself. When you start working with the data, immerse yourself in it, and think like a business user or your audience, right? Knowing your audience will then help you build that story, and then practice it, right? Boil down the narrative to the couple of things you want to focus on. Because if you overwhelm the audience, it's like drinking from a firehose; it's just too much.