Reducing Rework as a SQL Data Analyst
In theory, I’d love to be ‘breaking new ground’ and making progress toward my objectives with every minute of my workday. I know that isn’t practical, and maybe not even advisable if I really want to maximize productivity. But few things are as obviously unproductive as re-doing something that has already been done well. And, unfortunately, data work is full of situations like the following:
A product manager or marketing analyst re-writes a query the BI team has already produced because the published report isn’t specific to her product or market.
A new, interesting question leads to an ad hoc report in a spreadsheet that has to be painstakingly updated with copied and pasted data daily, weekly or monthly.
Someone asks me a question I KNOW I answered 6 weeks ago...or was it 8 weeks ago...and where would I have put that query anyway… By the time I find the work I already produced, I might as well re-write the stupid query.
Unless they set up your workflow to avoid it, an analyst can spend an extraordinary amount of time traveling over ground they (or someone on their team) has already explored. Here are a few hacks to help you avoid making re-work the bulk of your day.
Curate: Build a well-documented library of your best work. There are dozens of systems and tools to help you do this, ranging from GitHub to a spreadsheet on your desktop. Whatever system you choose, be sure that:
- You capture tags or categories that are searchable. A chronological list of all the useful queries you’ve written is better than nothing, but tags that let you find all the queries related to a particular question or topic make the library infinitely more useful.
- The system is part of your workflow. If you have to go search for and open the document to save a query, chances are you won’t do it until you’ve re-written that query at least a half dozen times.
One thing you’ll need to decide is whether you want to save every query or just the ones you deem most useful. Sometimes it’s difficult to know whether you’ll need a query again in the future, so saving every one can remove the need to predict the future. But you’ll also need to sort through more queries in the library to find the right one.
Share: Having access to your own work is a great start. Having access to your entire team’s work is a huge force multiplier. This is particularly true for beginning analysts, or those in functions like product management and marketing analysts, who spend less of their total time writing code. Often a product manager can take a query written by a data analyst or the BI team and tweak it to answer a similar question directly related to her/his product.
Curation is doubly important when you’re sharing. There are many more queries to sort through, and the searchers lack the context the authors generally have. Meta data like who wrote the query and when it was written or the last to run it can help narrow the field. A system that allows for ‘upvotes’ can also help identify the most useful queries and mark outdated or junk queries for deletion.
Publish: Put your results in collaborative spaces, and link them back to the original code. Whether it’s on a BI platform or in a shared spreadsheet, give others access to the work you’ve done, and a way to replicate or edit it if they need to. And make sure you’re plugged into the work others have published in a similar matter. BI dashboards are the most refined examples of this, but a well-organized system of spreadsheets can provide useful but more marginal reports and insulate the dashboards from overcrowding.
Automate: The more data you publish, the more time you can spend updating it. Updating might not technically be rework, but it can be a tremendous time drain. Spending a few moments properly organizing the data and setting up automation can release you from that burden.
GitHub (Curate & Share) Starter Pricing: Free for up to 3 collaborators
Git’s primary strengths are in sharing and in version control. Multiple editors can make changes without overwriting past versions, so you can always track back if you need to, and there is less risk of mistakenly choosing an old version of the code--because the newer version is clearly linked to it. GitHub allows users to use Git in a visual interface (rather than the command line) and brings your projects to the web.
The challenge with GitHub is that it is not generally seamless with your workflow. You’ll need to transfer queries or edits to the file in the repository, which adds an extra step...which requires team-wide discipline in the system.
SeekWell (Curate, Share, Publish, Automate) Pricing: Free for Curate, Share, Publish, $49/mo for Automate
We built SeekWell to allow teams to collaborate around a shared SQL Repository and through an integration with Google Sheets. SeekWell combines a shared SQL Repository and an integration with Google Sheets. Your team can write SQL and get data back within the app, or send the data directly to Sheets. The data can then be set to update on a schedule, as often as every 5 minutes.
Meanwhile, every query is automatically saved in the repository and can be searched by anyone on the team based on the content of the query, the author, when it was run and (if properly tagged) the question being answered.
SeekWell’s main advantage is its tight integration with analysts’ workflow. There is no need to switch between programs to write, save, search or publish data in a shared spreadsheet.
Snipplr (Share) Pricing: Free
Snipplr is a web-based repo that allows you to save and share code with your team. It doesn’t have the same version control of Git, but it’s pretty simple to use.
And of course, you can always ‘roll your own.’ You can save and share queries in a shared spreadsheet. To publish and automate data into Google Sheets, check out how to develop your own apps script here. For sending data into Excel, you’ll want to set up an ODBC connection.
Maximizing your time spent answering interesting new questions necessarily means limiting the time you spend addressing questions that have already been answered by others and manually updating reports. Curating your best code, sharing your work with your team, publishing your results linked to the source code and automating update processes can minimize the time you spend on rework and direct more of your time toward breaking new ground.
If you would like to learn more about the basics of SQL, take DataCamp's free Intro to SQL for Data Science course.