Using Python to Power Spreadsheets in Data Science
Excel, my old friend.
I’m the type of person who manipulates data in spreadsheets just because I can. Last week, my team lead sent out a list of employees who hadn’t updated their project checklists, and I immediately opened it up, started making tables and getting summary statistics by employee, then pinged him some summary data, broken down by employee. It probably drove him a little nuts but I can’t help it; I love data analysis and I love Excel. I’ve been using Excel since 2001 and until just a few months ago it was my tool of choice for data organization, analysis, and visualization. It’s great for organizing data, performing calculations and analysis, and even works as a simple database. I've used it for everything from keeping track of my personal budget, expenses, and workouts to multi-page personnel strength analysis reports in the Army. Excel, for the most part, worked well there and it was the only option. Due to security reasons, I didn't have access to VBA, Python, or any other type of automation. The data sets were also small enough for it to handle.
Almost a year ago today I started working in Cerner's Abilities Lab as a System Engineer in software performance testing. The short version of what I do is software performance testing. We test new and updated software before it goes out to our clients and we test everything: Windows and Unix CPU and memory, Java Virtual Machine performance, response times on both backend services and the User Interface, Database performance, the impact on our current software, network performance, and pretty much anything else you can think of. No test is limited to a single machine, most of the time it's a combination of Windows and Unix devices (physical and virtual), end-user systems and servers. For one of the projects I was involved in we had more than 20 systems that data had to be collected from.
Depending on how long a project runs for, there can be upwards of 50 tests that have to be reviewed, analyzed, and reported on. All of that data has to be summarized into a final report of test results that gets posted to an internal Jira project page to be consumed by the software development team. We do have some in-house custom tools that can aggregate most of the data, but not all of it, and the reports generated in those tools are not able to be customized to the extent they need to be for the particular type of testing I do. My tool of choice for writing and delivering these reports was Excel, and why not? I come from a strong background in Excel and it was my belief at the time that I could make it do anything.
Excel, we have a problem...
I will not get into the tedious process of getting data from our in-house tools to Excel but it was a multi-hour process for me and I'm very proficient with Excel. Data had to be copied over, cleaned, sorted, and filtered. No matter how I wrote my formulas they would almost always have to be adjusted, then tables and graphs would have to be adjusted depending on the number of tests, nodes, and various response times that were recorded during the project. After that it was a process of find and replace to get the test names right, making sure the number formats and highlights were correct in every row on every page, and general data cleaning. I tried writing templates, macros, and VBA scripting. Nothing I did was adaptable enough to handle a varying number of tests, devices, and performance metrics without some serious cleanup for every test. I was able to come up with a template that did speed the process up, but it would still take half a day to get things right.
I get a little laugh thinking of some of the ridiculously complicated formulas and VBA scripts that went into that template. It got a lot worse when we updated Java versions and our in-house tools couldn't process the logs correctly. We switched to an outside tool called GC Easy that does a great job parsing logs out to a JSON file that can be loaded into Excel. The JSON files can be loaded with PowerQuery, but it's a slow, tedious process and when you're trying parse 50+ JSON files with more than a million rows each, you wind up breaking Excel pretty often. Add to that the fact that I needed a way to automate the process and I just could not find a good way to do that. Even when I could get the data to load in Excel, the visualizations (stacked area graphs representing Heap Size over time) would freeze and cause more crashes. Even if I could solve that problem, the JSON files were just the start.
Manually copying, cleaning, and analyzing data is going to result in errors. We are all human and we miss things, we get distracted, we misclick. My process was not working and it would never work for everyone unless the entire team went through extensive training in Excel. Maybe it's my military background, maybe it's just a personality quirk, but I'm always looking for a way to improve processes and make things more efficient, so I started trying to figure out how to streamline the whole process.
Our in-house tool, at the time, had no API and no working export function. I had no way to pull data directly from it without copy/paste. My next thought was go right to the source: log files from each device in the test. That meant searching a test directory and the many sub-directories within it, pulling data from multiple file types in different formats, doing all the calculations, and writing it to Excel in the format I needed it to be in, complete with visualizations and conditional formatting.
The file types I needed to parse were
.json (various structures),
.xml (various structures),
.html, and some data had to be pulled from an Oracle Database. I had to take into account a varying directory structure, varying numbers and structures of files,
.xml files without closing tags, and file sizes ranging over 10GB. The process also had to be easy to follow, automated to the greatest extent possible, and handle millions of rows of data. Excel was absolutely not going to meet my needs for all that, so I turned to some popular BI tools: PowerBI and Tableau. I don't want this to be a product review of BI tools so let me just say those categorically did not meet my needs.
So what is a tool that has does everything Excel can do, can write reports in Excel, and can meet all of the requirements I listed?
Meet my new friends, Python and Pandas
If you are at all familiar with Excel, I think you’ll absolutely love the pandas for Python. pandas brings the functionality of Excel together with the power of the Python language. Excel columns become pandas Series, tables become DataFrames, and complex formulas become Python functions. pandas can do everything Excel can do:
Excel does a pretty good job reading flat files, and with PowerQuery it has a limited capacity to query databases and read certain
.json files. If you read the documentation it looks like the perfect tool for reading nearly everything but in reality that it just not the case. The interface is clunky and takes a lot of getting used to, there is no good way to automate the process, and data cleaning is almost as much fun as visiting the dentist. pandas, on the other hand, makes reading data a breeze.
This Python data manipulation library can natively read data from a multitude of sources. Check out the documentation page for a full listing. The really great thing is, if pandas doesn’t read it natively, there’s a good chance that you can find a Python package that will read it or write a few lines of code that will solve the problem. Check out DataCamp's Importing Data in Python course series, among other courses, to cover this topic in-depth.
I was able to use pandas to read every file type I discussed earlier, including those
.xml files without closing tags. With some creative natural language processing I can also grab all the test metadata from the raw
.txt files, something I wasn't able to do with Excel because they are not formatted for data. The version of the script I use now not only reads every file type I needed to read, but with a few extra lines of Python code I was able to automatically get the Java Garbage Collection
.json logs from GC Easy by hitting the API; something I was never going to be able to do with Excel. How does this compare to Excel? Using pandas I can search an entire test directory and get all the data back in nice, neat DataFrames in only a couple of minutes. It even handles those log files that exceed 10GB with ease. That may not seem like a large file size, but good luck reading anything that size with Excel.
I've even started switching from reading those raw flat files to pulling data from a multi-terabyte Oracle Database. I can't compare that process to Excel; other than a brief experiment with it I have avoided database queries in Excel. From what I remember it wasn't an experience I'd like to repeat. Connecting through Python, however, is a breeze with SQLAlchemy. Connecting is easy and the query speed runs circles around Excel. Check out Introduction to Relational Databases in Python for a great introduction to the topic.
Visualizations aside, I firmly believe that data should be in tables that are clean, neat, formatted, and labeled. Pandas has some great options for cleaning data and presenting nicely formatted tables. The Styling page on the official Pandas website is my favorite guide for formatting DataFrames. With a few short lines of code you can set decimal places, format numbers, add in row and conditional formatting, and and the process is repeatable across an infinite number of documents. The ability to automate certain parts of data cleaning, regardless of the size of data, has saved me countless hours.
To be fair, you can certainly clean data in Excel. It has a large number of built-in tools for that very purpose, and that's exactly what I had to do every single time I (eventually) got all the data I needed loaded into it. It was a largely manual process involving entirely too many mouse clicks and, once again, is very error-prone and time consuming. Even if you're extremely proficient with all the tools Excel has to offer, mistakes are bound to happen. There are ways to speed it up with formulas, macros, and VBA scripts (I'll get to that in a minute), but it's a far from perfect solution.
I'm going to be very blunt here: pandas beats Excel hands-down when it comes to analyzing data. There isn't a fair way to compare the two. You can write formulas and use some built-in functions in Excel to analyze data; I did that very thing for years and I made it work. I taught some classes on data analysis in Excel, advocated for its use on several occasions, and took a statistics class where Excel was the tool of choice. I am not saying that it can't be used for analysis. However, comparing it to pandas is like comparing a pencil and paper to a calculator; they're not in the same league.
In August of last year, Kaggle reported that Python had overtaken R in Data Science, and in December, Quartz called pandas the 'most important tool in Data Science'. I don't think I need to say much more on the topic.
Macros and VBA:
In 2012 I was still on Active Duty and had to go in for a routine dentist appointment on Fort Bliss in El Paso, Texas. When I sat down in the chair the dentist mentioned that my wisdom teeth needed to be removed, and since he had the time, he performed the procedure on the spot with only a local anesthetic. It was a terrible experience that left me in pain for several days. If I had to choose between doing that again or working with Excel Macros or Visual Basic, I think I'd rather get more teeth pulled.
There are plenty of people who are highly skilled in those areas and I have seen some amazing things done with clever VBA scripts and I applaud them. I am clearly not a fan of it and I don't recommend it to anyone, especially when you can use Python instead. It's such a popular alternative to VBA that Microsoft is going incorporate Python as a scripting language in Excel.
But I still need to use Excel:
As much as I'd like to do everything in Python, it's just not practical in my situation. For now, the results of my analysis still have to be delivered in Excel. Fortunately, there are packages like OpenPyXl and XlsxWriter that were made for that very purpose. I use XlsxWriter on a daily basis to convert DataFrames into very professional reports in Excel that don't have any formulas, macros, or VBA Scripts. It even does visualizations.
Speaking of visualizing data...
No More Boring Graphs
I have never been a data visualization enthusiast. I always preferred to see data in rows and columns with highlights on areas of interest, maybe a bar graph here and there. Excel does a decent job by providing plenty of options for visualizing data, but the extent of my knowledge is making stacked area graphs, bar graphs, and line graphs. For each one of those, I select the data, hit “Insert Chart” and find the one I need. From there, I might do a little filtering and add a title but that was the extent of it.
For transparency, there are a ton of options for customizing graphs in Excel but the process can be very difficult and there is a steep learning curve. I still don't like making graphs in Excel, but with Python it's a different story.
I am now a firm believer in data visualization and put a lot of effort into coming up with great visualizations for analysis. When I started taking courses here on DataCamp and got introduced to packages like Matplotlib and Bokeh, I started to understand the real power and purpose of visualizing data. With Matplotlib, visualizations can easily be made from within pandas or by importing
pyplot. The list of possible visualizations within
pyplot is extensive to say the least, and virtually all aspects of a plot can be customized. Introduction to Data Visualization With Python shows how to make some excellent visualizations with Matplotlib.
Bokeh, supported by Anaconda, incorporates D3.js and Python to make fully interactive graphs that can be exported as standalone html documents, embedded into webpages, or ran on a Bokeh Server. While it's has a steeper learning curve than Matplotlib, I think you'll agree that the visualizations it produces are well worth the effort; just take a look at the Gallery and see for yourself. I’m not half as skilled with Bokeh that I plan to become, but it has been one of my favorite visualization packages since the first lesson of Interactive Data Visualization With Bokeh. If Bokeh doesn't fit your needs, there's Dash from the good folks at Plotly.
Save Time With Scheduling and Task Automation
In the summer of 2014, we had a problem in the organization I worked for at the time. The Army Reserve is driven by a series of numbers collectively called “Readiness”. It’s the percentage of personnel assigned to a given organization that could be called up for a deployment. The number is determined by several categories of data including medical, dental, education, physical fitness, and a whole slew of other numbers. I can’t give out the number here but ours wasn’t anywhere near what it needed to be.
My solution was a multi-page report in Excel that consisted of several tables and charts of information compiled from various personnel databases. This information included summary data on the organization, information on each individual in every measurement category, and projections of where that number would be over the next three months based on the current actions being taken that week. It was actually a huge success and our ‘Readiness’ did increase significantly and stayed that way for a very long time.
While it did work, it was a mess. I’d come early on Monday morning and manually pull the information from all the different sources, then clean the data and copy it into that week’s report. I would then spend at least an hour checking the report for mistakes and doing more data cleaning. After that it would have to be emailed to the administrative staff so they could verify the data and add in any updates that weren’t in the databases yet. On average, it took three days to get this report sent out. The cost in man hours was excessive and unnecessary.
With a little Python scripting I could have automatically pulled all the data from the various sources, compiled everything into
pandas DataFrames, written the whole report out to Excel, and emailed it automatically to the administrative staff. The best part, this could have been scheduled to run at one minute after midnight on Monday morning and saved countless hours of work.
On Linux systems, it’s too easy to schedule Python scripts to run at certain times by adding them to crontab. In my current job we work with so many systems that crontab is an essential part of the job and scheduling tasks there is an essential skill. So if you did need to pull data every night at midnight and load it into a report, crontab would be the way to go on a Linux system.
I am a Linux fan but the lion’s share of my work is done on Windows, and it’s my preferred operating system. With it, you have various options for scheduling tasks through Windows Task Scheduler, Schtask.exe, or Powershell. With any of those options, you can schedule your scripts to run during the least busy hours when the most system resources are free or set the them to run just after you know a data source has been updated. It’s an option I can only wish I’d had a couple of years ago.
My Favorite Tool: Jupyter Notebooks
Having used Excel for as long as I have, learning Python was like living my life with only a pair of pliers and one day discovering an entire toolbox full of shiny new tools. Each one of them is great and has its use but if there’s one tool that I use more than all the others, it’s Jupyter Notebooks. Everything I’ve written in Python has been written in a Jupyter Notebook, either for direct consumption or for later use as a standalone Python script. Jupyter is a great tool for writing functions, testing code, doing exploratory data analysis, and even presenting a final product. It can even be used to write blog posts (like this one) or on Github with Pages. If you browse the tutorials on DataCamp you can see that many of them were written the same way.
Why mention it during a discussion about Excel?
Jupyter Notebooks can be used in lieu of Excel as the product of your analysis. You can write your code, show
pandas DataFrames, visualizations, and conclusions all on one page that can be converted to HTML with the click of a button. One of my personal goals is to break away from Excel completely and use Jupyter Notebooks instead for sending out completed reports. This becomes an even more viable option now that JupyterLab has been released. Between that and widgets you can turn a Jupyter Notebook into an interactive data analysis application.
I still use Excel daily, it’s a great tool that is entrenched in many organizations, including the one I work for. Almost everyone is familiar with it and can use it to consume information and do simple data analysis. What I don't think a lot of people know is that doing the same analysis with Pandas and Python is not only more efficient, it's easier.
My goal is to break completely away from Excel by the end of the year, and I will be able to do that using the tools I mentioned in this article. You won't become an expert overnight, but the same can be said for Excel. I hope that reading this has inspired you to break away from it and try out some of the new tools I mentioned. It will make you more efficient, more productive, and ultimately speed up your workflow.