When I recently completed DataCamp's Data Scientist with Python career track, I felt armed to face some real data challenges, but it's not quite that simple. Completing courses on DataCamp or at a university is just the first step on a long road to becoming a data scientist. The skills that you've learned must be practiced, cultivated, and turned into a working portfolio that can showcase what you've learned and prove to prospective employers that you can be an asset to their team.
But how do you get there?
Having some tools in your toolbox and a pretty good idea of how to use them is one thing, but getting started can be a daunting task.
So how do you get started?
This article isn’t about the code you'd be writing; it’s about the process you'd be going through. I'll try to answer that question from my experience in the US Army.
Civil Affairs and Civil Information Management
Before I started on the road to becoming a Data Scientist, I spent twenty years in the US Military, the last five of which were with the US Army Reserve's Civil Affairs branch. Civil Affairs, in brief, is the branch of the Army responsible for stabilizing a region after a disaster or conflict. They get deployed into situations ranging from the Ebola outbreak in West Africa from 2014 to 2016 to disaster recovery after Hurricane Katrina. They are also heavily used during conflicts like the Iraq War. Their mission is to bring a given region back to what was "normal life" before the disaster struck.
Now, you might wonder: how does this apply to data scientists?
Civil Affairs relies on information. That is, data. To bring a region back to normal they have to know what normal was, and if normal is enough to make a region stable. The populace of a region may think it's normal to lack an education system, security, or even fresh water. Making decisions about what a region needs to be stable requires enormous amounts of data that must be gathered, analyzed, and put in the hands of the decision-makers.
Decision-makers in Civil Affairs are what you'd call 'upper management' in corporate America. They are high-ranking military and government officials and leaders from agencies like the Red Cross and World Health Organization, who make the final determination on where to commit money, effort, and time. Their decisions are based heavily on good information collected through a process known as Civil Information Management (CIM).
My time in Civil Affairs was spent leading a CIM team, teaching the CIM process to other units within Civil Affairs, and improving the techniques we used for it. I was widely considered a Subject Matter Expert on CIM, and I can tell you that it's not just for the military. The data we used was not classified. It had to be consumable by the military, local government, FEMA, and civilian organizations like the Red Cross and WHO.
Now, the CIM process we used to gather, analyze, and disseminate that information can be applied to data science, and from what I’ve seen so far on DataCamp, it is widely used in all but name.
Data science goes beyond the command line or a Jupyter Notebook. When you're working with a data set there is almost always an end goal and a customer. If you’re working through a data set out of curiosity or to practice skills, the customer is you. Otherwise, the customer could be your blog readers, other associates at work, your supervisor, or a paying client.
If you accept that there is always a customer, then you must accept that there always needs to be a product, and that product must answer the all-important “question”.
Before you can get into the six-step CIM Process, you have to know what “The Question” is. In the simplest terms, it’s the reason you’re doing the analysis.
For example, in Intro to Python for Data Science, the question is “What is the median height of Goalkeepers in FIFA?”. In Kaggle’s Titanic Competition, the question is “What type of person was most likely to survive the sinking of the Titanic?”.
There is always a question, and in a lot of cases there are several questions that need to be answered. Now that you know what “The Question” is, you can start the process.
The CIM Process
Collection
Without data there would be nothing to analyze and no real reason for data scientists. The data sets used on DataCamp, Kaggle, DrivenData, and Data.gov all had to be collected. A nearly unlimited amount of data is available on the Internet, and much of the collection can be automated through web scraping or other techniques. When I was still in Civil Affairs, I relied heavily on the CIA World Factbook, Census data, and Google Maps.
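As a sketch of what automated collection can look like, here is a minimal scraper built entirely on Python's standard library. The HTML fragment, the place names, and the numbers in it are made up for illustration; a real collection script would fetch the page with urllib.request or a library like requests instead of using an inline string.

```python
from html.parser import HTMLParser

# A hypothetical HTML fragment standing in for a scraped page.
PAGE = """
<table>
  <tr><td>Knoxville</td><td>186239</td></tr>
  <tr><td>Gatlinburg</td><td>3944</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell into rows of a table."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row
        elif tag == "td":
            self._in_cell = True   # remember we're inside a cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(PAGE)
print(parser.rows)  # [['Knoxville', '186239'], ['Gatlinburg', '3944']]
```

For anything beyond a toy page you'd reach for a dedicated parser like Beautiful Soup, but the idea is the same: turn markup into rows you can work with.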
It’s not always that easy though.
In November 2016, the town of Gatlinburg, TN was hit by a devastating wildfire that destroyed more than 2,500 homes and businesses. At the time I was assigned to a Civil Affairs unit in nearby Knoxville, TN and we were asked to help assess the damage. The area around Gatlinburg is rural, mountainous terrain with homes scattered through the mountains, and it’s almost impossible to keep maps up to date. The only way to collect data on which buildings were still standing was to manually collect the data. Working alongside local volunteers and government agencies, we split into teams and drove or walked those rural roads to find homes. For every building we came across we’d mark it on the map and note how damaged it was. That data was then sent in to be collated by local agencies.
Another great example of manual data collection is the Scandens Beak Depth Heredity data set, which was used in Statistical Thinking in Python Part 2. That data was obtained from 40+ years of personal observations!
I’m sure you won’t have to go to that extent to get some data to work with. You could start with the data sets provided at the beginning of every DataCamp course, and if there’s nothing there you like, you can browse the resources listed above or simply try a Google search for the data you’re looking for.
It's a great feeling when you find a data set that has all the information you're looking for, wrapped up in a neat package. Unfortunately, that's more of an exception than a rule. Most of the time you'll find your data in pieces that need to be assembled.
Collation
Collation is where data starts being organized from its raw form into a set that can be worked with.
While some of the initial data in Civil Affairs comes from Internet sources, that's a small minority. The vast majority comes from people on the ground. Civil Affairs Teams will go into an area and take pictures, interview the local populace, and assess local buildings like airports, schools, police stations, and power plants. All that data comes back in written reports that need to be collated into workable information. I wasn’t involved in collating the data gathered from the wildfires I talked about earlier, but every one of those hand-marked maps had to be collated into a single source showing the coordinates, condition, and property description of every location surveyed.
As a budding Data Scientist, collation is where I think most of us will start to handle data.
Take the Scandens Beak Depth Heredity data, for example. Each daily report had to be collated into a weekly or monthly report, then an annual report, and eventually a single data set representing 40 years of observations.
Good collation can save a lot of work down the road, as it presents an opportunity to clean the data as you go and put it into proper categories that will be easier to work with. Techniques like merge, join, and append earn their keep in collation.
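Here's a minimal sketch of collation in pandas, using made-up survey tables and column names: two teams' field reports are appended into one table with concat, then joined to a building-value lookup with merge.

```python
import pandas as pd

# Hypothetical field reports from two survey teams, plus a value lookup.
team_a = pd.DataFrame({"building_id": [1, 2], "condition": ["destroyed", "intact"]})
team_b = pd.DataFrame({"building_id": [3], "condition": ["damaged"]})
values = pd.DataFrame({"building_id": [1, 2, 3], "value_usd": [120000, 95000, 150000]})

# Append the two teams' reports into one table...
surveys = pd.concat([team_a, team_b], ignore_index=True)

# ...then join on the shared key to enrich each record.
collated = surveys.merge(values, on="building_id", how="left")
print(collated)
```

A left join keeps every surveyed building even if a value is missing from the lookup, which is usually what you want when collating field reports.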
Processing
Let’s face it, even if we find a collected, collated data set with all the information we’re looking for, it’s probably going to be ugly and need some cleaning. Processing is where we take “data” and turn it into a usable form that’s ready for analysis. In Civil Affairs I spent a lot of time collecting and collating data, but processing is where I really got to know the data and began formulating my plan for analysis.
On the CIM Team I did a lot of what we called ‘area assessments’. That is simply an initial analysis of a given area, done before we're on the ground there, that gives an idea of what issues we might need to prioritize. After collecting and collating all the relevant data on the area, it’s processed into consumable information, even before the analysis. The result is typically an Excel workbook or database containing clean, categorized information that can be searched or filtered to answer basic questions that don't require an analysis.
In data science, this is where we’re tidying and exploring the data and performing exploratory data analysis. You can save yourself a lot of trouble in the analysis phase by taking care to properly categorize and label the data and to make sure each column is the right data type.
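A small, hypothetical example of that kind of processing in pandas: numbers that arrived as text are converted to a numeric dtype, and inconsistent free-text labels are normalized into a single category.

```python
import pandas as pd

# A hypothetical raw assessment export: numbers stored as text
# and the same label written three different ways.
raw = pd.DataFrame({
    "doctors": ["12", "7", "31"],
    "sector": ["healthcare", "Healthcare", "HEALTHCARE"],
})

clean = raw.assign(
    doctors=pd.to_numeric(raw["doctors"]),                 # text -> integers
    sector=raw["sector"].str.lower().astype("category"),   # one consistent label
)
print(clean.dtypes)
```

Small fixes like these are cheap here and expensive later, when a stray string in a numeric column can quietly break an aggregation.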
Analysis
Taking my area assessment from earlier, it’s great to have a data set that I can query and get answers from. I could find out how many doctors are in the area, or how many gallons of fresh water per person per year the area has. This is important information, but it is not an analysis. The analysis would answer questions like “How does the area compare to the surrounding region in the areas of Government, Economics, Infrastructure, Security, and Healthcare?” and “At the current rate of usage, how long until the area runs out of fresh water?”.
Analysis is the meat of what most of us probably came to DataCamp to do; it’s what I have the most fun with, and it’s where we get to answer that all-important question. It is also a topic covered extensively in nearly every DataCamp course, so I won’t go into too much detail. The key part of analysis is working with the data and drawing conclusions from it.
Let’s go back to the Gatlinburg wildfire. Once all the data is collected, collated, and processed into a database listing the coordinates of every building, its current condition, and a description of the damage, we can determine that 60% of all the buildings in the affected area are damaged beyond repair. We have answered the question “How much of the surrounding area is impacted by the fire?”.
We can go a step further and find the value of every damaged building and determine the reconstruction cost of the entire area, broken down by neighborhood and street. We could even determine which neighborhoods were hit hardest so we know where to prioritize the reconstruction efforts.
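That kind of breakdown is a one-liner with a groupby. Here's a sketch on a tiny made-up survey table (one row per building; the neighborhood names and damage flags are invented for illustration):

```python
import pandas as pd

# Hypothetical processed survey data: one row per building.
buildings = pd.DataFrame({
    "neighborhood": ["Chalet Village", "Chalet Village",
                     "Downtown", "Downtown", "Downtown"],
    "damaged": [True, True, False, True, False],
})

# Share of damaged buildings overall, then per neighborhood.
overall = buildings["damaged"].mean()
by_hood = (buildings.groupby("neighborhood")["damaged"]
           .mean()
           .sort_values(ascending=False))

print(f"{overall:.0%} of surveyed buildings damaged")
print(by_hood)
```

Taking the mean of a boolean column gives the fraction of True values, so the same pattern answers both "how much of the area is impacted?" and "which neighborhoods were hit hardest?".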
Production
The analysis is great, but it does no good if it isn’t turned into a product, and that is what the Production step of the CIM Process is all about. Every step so far has built up to this one: it’s where you build that showcase item, the thing you’re going to share with the customer.
It’s the culmination of everything you’ve done with the data. The product could be a Jupyter Notebook, a spreadsheet with tables and graphs, a web page, a PDF file, or source code. Other options are an in-person presentation, a video walkthrough of the analysis, or a slide show built with Sway or PowerPoint. Whatever you choose, it must fit the customer’s needs. There are some additional requirements. The product must:
- Be clear, concise, and consumable by the customer.
- Tell the story and make sense to the customer.
- Be tailored to the customer’s technical and educational level.
Dissemination
Dissemination is the final step in the CIM Process, and on the surface it can sound pretty simple: just get the product to the customer. However, if you stop and think about it, this is probably the most important step in the whole process. When I was with the CIM Team we mostly produced PDFs, PowerPoint slides, or Excel reports based on our analysis. Why those methods? All of them are accessible, easy-to-use, portable formats that nearly everyone knows how to use.
With the skillset I have now, I could have been producing interactive web pages containing reports, Bokeh graphs with sliders and filters, and interactive maps. That would be a complex, dynamic product that nearly everyone could use because it would all be wrapped in a web page. Regardless of how great the final product is, it does no good if it's not disseminated to the customer in a way that is:
- Accessible: since you are dealing with information, it has to be accessible by the customer(s). Burying the product somewhere it can't be accessed by the intended audience makes it difficult to consume and leads to it being quickly forgotten.
- Easy to use: your chosen delivery method has to be easy to use for the vast majority of the customers you are delivering to. For example, delivering a Python script to a group of fellow Data Scientists might be fine, but sending that same script to someone who doesn't know how to run "data_analysis.py" might not go over very well.
- Portable: let's face it, we are all consuming more and more of our information on mobile devices. It is a rare occasion that we have time to sit down at a desk and consume lengthy amounts of information. Data products need to be portable, be that through Office applications like PowerPoint and Excel, PDFs, or via the Internet.
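As one small sketch of a product that meets all three criteria: pandas can render a summary table as a self-contained HTML fragment that opens in any browser on any device. The table contents here are made up for illustration.

```python
import pandas as pd

# Hypothetical summary table from the analysis step.
summary = pd.DataFrame({
    "neighborhood": ["Chalet Village", "Downtown"],
    "pct_damaged": [100, 33],
})

# Render the table as plain HTML; a browser is all the customer needs.
html = summary.to_html(index=False)

with open("damage_report.html", "w") as f:
    f.write(html)
```

No Python, no special software, nothing to install on the customer's end; they open a file and see the result.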
There is an abundance of dissemination methods available to everyone reading this article, and I cannot even begin to name them all. How you choose to disseminate your product requires some decision-making. Always ask yourself, "How can I get this product to the customer in the way that best fits their needs for accessibility, ease of use, and portability?"
That being said, if you're building a portfolio, Github, Anaconda Cloud, personal web pages and blogs, and social media sites are great places to start. Don't be afraid to put your work out there and share it with other like-minded individuals.
If you are involved in data, you are most likely already using some of the steps I talked about. While you may not personally be involved in all six steps, you now have a better understanding of them and of one process to turn raw data into a finished product. It may seem strange to be discussing military procedures for Data Science, but those six steps have served me well both in the Military and Corporate America and I hope they also aid you on your journey to becoming a Data Scientist.