Data Scientist vs Data Engineer
February 23rd, 2017 in General
The discussion about the data science roles is not new (remember the Data Science Industry infographic that DataCamp brought out in 2015): companies' increased focus on acquiring data science talent seemed to go hand in hand with the creation of a whole new set of data science roles and titles. And two years after the first post on this, this is still going on!
Recently, a lot has been written about the differences between the various data science roles, and more specifically about the difference between data scientists and data engineers. Maybe the surge in interest comes from the fact that there has indeed been a change in perspective over the years: whereas a couple of years ago the focus was more on retrieving valuable insights from data, the importance of data management has slowly started to sink in across the industry. Because in the end, the principle of "Garbage In, Garbage Out" still holds: you can build the best models, but if your data is of poor quality, your results will be weak.
The role of the data engineer has gradually moved into the spotlight.
Today's blog post will lay out the most important differences between data scientists and data engineers, focusing on responsibilities, tools, languages & software, educational background, salaries & hiring, job outlook and resources that you can use to get started with either data science or engineering!
If you prefer to see the visual presentation and references, make sure to check out the corresponding infographic "Data Engineering versus Data Science".
Data Engineers' Responsibilities
The data engineer is someone who develops, constructs, tests and maintains architectures, such as databases and large-scale processing systems. The data scientist, on the other hand, is someone who cleans, massages, and organizes (big) data.
You might find the choice of the verb "massage" particularly exotic, but it only reflects the difference between data engineers and data scientists even more.
Generally speaking, the effort that each party needs to make to get the data into a usable format is considerably different.
Data engineers deal with raw data that contains human, machine or instrument errors. The data might not be validated and might contain suspect records; it will be unformatted and can contain codes that are system-specific.
The data engineers will need to recommend and sometimes implement ways to improve data reliability, efficiency, and quality. To do so, they will need to employ a variety of languages and tools to marry systems together or try to hunt down opportunities to acquire new data from other systems so that the system-specific codes, for example, can become information in further processing by data scientists.
Very closely related to these two is the fact that data engineers will need to ensure that the architecture that is in place supports the requirements of the data scientists and the stakeholders, the business.
Lastly, to deliver the data to the data science team, the data engineering team will need to develop data set processes for data modeling, mining, and production.
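To make the cleaning work described above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the record fields (id, status, amount), the STATUS_CODES lookup table and the clean_records function are all hypothetical and do not come from any particular system. It shows suspect records being set aside and system-specific codes being turned into readable information.

```python
# Hypothetical lookup table: system-specific codes -> readable labels.
STATUS_CODES = {"A1": "active", "C9": "cancelled"}


def clean_records(raw_records):
    """Split raw records into cleaned rows and suspect rows."""
    cleaned, suspect = [], []
    for record in raw_records:
        # Normalize the code: raw exports often differ in case and spacing.
        code = str(record.get("status", "")).strip().upper()
        # Amounts may arrive as strings, be missing, or be unparsable.
        try:
            amount = float(record.get("amount"))
        except (TypeError, ValueError):
            amount = None
        # Unknown codes and missing or negative amounts are treated as suspect.
        if code not in STATUS_CODES or amount is None or amount < 0:
            suspect.append(record)
            continue
        cleaned.append(
            {"id": record.get("id"), "status": STATUS_CODES[code], "amount": amount}
        )
    return cleaned, suspect


# Made-up raw data, purely for illustration.
raw = [
    {"id": 1, "status": " a1 ", "amount": "19.90"},
    {"id": 2, "status": "C9", "amount": "n/a"},
    {"id": 3, "status": "ZZ", "amount": "5"},
    {"id": 4, "status": "c9", "amount": -3},
]
cleaned, suspect = clean_records(raw)
print(len(cleaned), "clean record(s),", len(suspect), "suspect record(s)")
```

Keeping the suspect records instead of silently dropping them is a design choice in this sketch: it leaves room to investigate where the errors come from, which fits the goal of improving data reliability and quality.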
Data Scientists' Responsibilities
Data scientists will usually get data that has already passed a first round of cleaning and manipulation, which they can feed to sophisticated analytics programs and machine learning and statistical methods to prepare data for use in predictive and prescriptive modeling. Of course, to build models, they need to research industry and business questions, and they will need to leverage large volumes of data from internal and external sources to answer business needs. This also sometimes involves exploring and examining data to find hidden patterns.
Once data scientists have done the analyses, they will need to present a clear story to the key stakeholders and when the results get accepted, they will need to make sure that the work is automated so that the insights can be delivered to the business stakeholders on a daily, monthly or yearly basis.
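As a rough illustration of the predictive modeling step mentioned above, the following Python sketch fits a straight line to a handful of data points with ordinary least squares and uses it for a forecast. The data (ad_spend, revenue) and the function names fit_line and predict are made up for this example; in practice you would more likely reach for a library such as Scikit-Learn or Statsmodels.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up, already cleaned data, purely for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0]
revenue = [3.0, 5.0, 7.0, 9.0]

model = fit_line(ad_spend, revenue)
forecast = predict(model, 5.0)
print("slope and intercept:", model)
print("forecast for a spend of 5.0:", forecast)
```

Once a model like this has been accepted, the same fit and predict steps are the kind of work that would be scheduled to run automatically so that the results reach the business on a regular basis.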
It is clear that both parties need to work together to wrangle the data and provide insights for business-critical decisions. There is a clear overlap in skillsets, but the two are gradually becoming more distinct in the industry: while the data engineer will work with database systems, data APIs and tools for ETL purposes, and will be involved in data modeling and setting up data warehouse solutions, the data scientist needs to know about stats, math and machine learning to build predictive models.
The data scientist needs to be aware of distributed computing, as he or she will need to gain access to the data that has been processed by the data engineering team, but he or she will also need to be able to report to the business stakeholders: a focus on storytelling and visualization is essential.
What this means in terms of focus on the steps of the data science workflow, you can see in the image below:
Of course, this difference in skillsets translates into differences in languages, tools, and software that both use. The following overview includes both commercial and open source alternatives.
Even though the tools that both parties use depend heavily on how the role is conceived in the company context, you will often see data engineers working with tools such as SAP, Oracle, Cassandra, MySQL, Redis, Riak, PostgreSQL, MongoDB, neo4j, Hive, and Sqoop.
Data scientists will make use of languages such as SPSS, R, Python, SAS, Stata and Julia to build models. The most popular tools here are, without a doubt, Python and R. When you're working with Python and R for data science, you will most often resort to packages such as ggplot2 to make amazing data visualizations in R or the Python data manipulation library Pandas. Of course, there are many more packages out there that will come in handy when you're working on data science projects, such as Scikit-Learn, NumPy, Matplotlib, Statsmodels, etc.
In the industry, you'll also find that the commercial tools SAS and SPSS do well, but other tools such as Tableau, Rapidminer, Matlab, Excel, and Gephi will also find their way into the data scientist's toolbox.
You see again that one of the main distinctions between data engineers and data scientists, the emphasis on data visualization and storytelling, is reflected in the tools that are mentioned.
Tools, languages, and software that both parties have in common, as you might have already guessed, are Scala, Java, and C#.
These are languages that aren't necessarily equally popular with both data scientists and engineers: you could argue that Scala is more popular with data engineers because the integration with Spark is especially handy for setting up large ETL flows.
The same goes a bit for the Java language: at the moment, its popularity is on the rise with data scientists, but overall, it's not widely used on a daily basis by professionals. But, all in all, you'll see these languages popping up on job openings of both roles. The same can also be said about tools that both parties could have in common, such as Hadoop, Storm, and Spark.
Of course, the comparison of tools, languages, and software needs to be seen in the specific context in which you're working and how you interpret the data science roles in question. Data science and data engineering can lie close together in some cases, where the distinction between the data science and data engineering teams is so small that the two teams are sometimes merged.
Whether this is a great idea or not is enough material for another discussion which is not in the scope of today's blog.
Besides all of this, data scientists and data engineers might also have something in common: their Computer Science backgrounds. This study area is widely popular for both professions. Of course, you'll also see that data scientists have often studied econometrics, mathematics, statistics and operations research. They often have a little bit more business acumen than data engineers. You often see that data engineers also come from engineering backgrounds, and more often than not, they have had some prior education in computer engineering.
However, all of this doesn't mean at all that you won't find data engineers that have gathered knowledge in operations and business acumen from prior studies.
You have to realize that, in general, the data science industry is made up of professionals that come from all different types of backgrounds: it is not uncommon that physicists, biologists, or meteorologists find their way to data science. Others have made a career switch to data science and come from web development, database administration, etc.
When it comes to salaries, the median market salary for data scientists is $135,000 per year. The minimum is $43,000, and the maximum is $364,000. For data engineers, the median market salary is a bit lower at $124,000, and their minimum and maximum paychecks are also considerably lower: the minimum is $34,000 per year, the maximum $341,000.
Where the difference in paychecks exactly comes from is not entirely clear, but it might have something to do with the number of open positions: according to data from indeed.com, there are about 85,000 job openings for data engineers, while there are about 110,000 jobs for data scientists on the market.
Companies that want to hire data engineers at the moment are PlayStation, The New York Times, Bloomberg and Verizon, but in the past, companies such as Spotify, Facebook, and Amazon have also hired data engineers. Data scientists, on the other hand, are currently wanted in companies such as Dropbox, Microsoft, Deloitte, and Walmart.
As described before, new roles and titles are sometimes needed to reflect changing needs, but at other times they are created as a way for companies to differentiate themselves from other recruiting companies.
In addition to the rise in interest for data management issues, companies are looking for cheaper, flexible and scalable solutions to store and manage their data. They want to move their data to the Cloud and to do this, they need to build "data lakes" as a complement to the data warehouses that they already have in place or as a replacement for the Operational Data Store (ODS).
Data flows will need to be redirected and replaced in the coming years and, as a result, both the focus on data engineers and the number of job postings for them have gradually increased over the years.
The data scientist role has been in demand since the start of the hype, but nowadays, companies are looking to compose data science teams instead of hiring unicorn data scientists that possess communication skills, creativity, cleverness, curiosity, technical expertise, etc. For recruiters, it's hard to find people who embody all the qualities that companies are looking for, and the demand clearly exceeds the supply.
You could argue that the "data scientist bubble" has burst. Or maybe it will still burst in the future.
One thing will still hold throughout all of this: the demand for experts that have a passion for data science topics will always be there. The job outlook for these experts is highly positive: according to McKinsey, the US could face a shortage of 140,000 to 190,000 people with deep analytic skills and 1.5 million managers and analysts with the know-how to use the analysis of (big) data to make effective decisions by 2018.
You see, there are more than enough reasons to get started with data. :) And this is exactly something that won't be a huge problem.
But, of course, also for those who want to get started with data science, there's the Exploratory Data Analysis, Introduction to R for Data Science, Machine Learning Toolbox and Introduction to Python for Data Science courses.
Start learning interactively today!