Who is the data scientist?
A data scientist researches, extracts, and reports meaningful insights present in data. Most of the time, a data scientist must be able to effectively communicate those insights to other non-technical members of the team or stakeholders.
There are three main fields a data scientist should be proficient in:
- Programming: Data science programming languages for writing scripts, plus some knowledge of APIs and cloud computing.
- Theory: Algebra, calculus, statistics, probabilities, optimization, and artificial intelligence (AI) techniques.
- Communication: Ability to explain to non-technical people the achieved results.
Those are the three main pillars of data science. But in some contexts, one of them could be more important than the rest. In this article, we are going to break down the responsibilities of data scientists across many industries. We’ll see what the data scientist’s job looks like, how you can become a data scientist, and even how to land your first job as a data scientist.
Let’s get started.
What does a data scientist do?
The data scientist’s mission can be summarized in the following sentence: Deliver value from processing and analyzing data.
Depending on the industry, that value can be predicting the future, detecting trends, classifying pictures, extracting information from text, etc. To do so, data scientists use computers and programming languages to process all the information, make complex calculations, create useful data visualizations, and obtain results.
The operations that a data scientist performs to achieve results are guided by theoretical and practical expertise. Knowing the math behind the tools and understanding statistics and probabilities can make the difference between an awesome and a mediocre outcome.
But having results that only a few technical experts can understand and interpret is not enough. To convert those results into value, the data scientist should have communication and presentation skills. They should be able to explain to non-technical people the meaning of their results and the implications derived from them.
This can be summarized in the following diagram:
In the rest of this section, we are going to focus on the first rectangle: the data science pipeline. But first, let’s answer a common question: What is the difference between a data scientist and a data analyst?
Differences between data scientists and data analysts
The role of a data analyst is very similar to the role of a data scientist. The previous diagram could also be used to explain the work of a data analyst. Data analysts focus on getting insights from the data. They find relationships and trends that are meaningful to the business and, of course, communicate them to non-technical stakeholders.
Although the two roles are very similar, data scientists also pursue other goals, such as predicting future events or solving difficult classification tasks. Another important difference is that data scientists usually have to build entire products. That means they have to put their solutions online and make them accessible to a group of users.
The line is sometimes fuzzy, mainly because some companies use the terms data analyst and data scientist interchangeably, which makes it hard to tell the two roles apart.
Now let’s dive into the different fields a data scientist should master.
Fields a data scientist should master
We mentioned above the three main areas in which a data scientist should aim to be an expert. In this section, we are going to be more specific and break down each of those areas with the different skills they’re composed of.
- Scripting language: Python and R are the most popular programming languages for data science out there. You need to be proficient in at least one of those.
- Data science frameworks and tools: You need to know the specific tools used for data science. Depending on where you work, you might use tools like Power BI or Hadoop. You should also master relational and non-relational databases and how to query them.
- APIs and cloud computing: To productize your solutions you will need to know the basics of API building. It’s great if you can master some data science-oriented cloud solutions.
- Algebra: To understand mathematical data representation and transformation.
- Calculus: Mainly to understand the optimization algorithms that train machine learning (ML) models.
- Statistics: To analyze, visualize, and learn from data. Most importantly, statistics guides the whole data science pipeline so that we avoid mistakes that would make the results misleading.
- Probabilities: Fundamental to understanding how many ML algorithms work. It is also important when dealing with data transformations and sampling.
- Optimization: Most ML algorithms are trained by solving an optimization problem.
- AI and ML: All the previous fields applied to AI and ML common problems and solutions. Data scientists apply these to solve their prediction and classification tasks.
- Data visualization: Besides being an excellent way to find insights into the data, good data visualizations are the best way to communicate those insights.
- Vocabulary: The ability to explain complex processes and algorithms in simple language is vital to communicate results.
- Experiments: You should consistently test your results, both to be sure they are valid and to use those tests to convince others.
This is not an exhaustive list, but these are the main skills you should master. Don’t worry, you don’t have to be an expert in every one of these skills (although it would be great!). Some of them are more or less useful depending on the industry. We will talk about use cases of different industries in the next section.
But first, let’s break down the data science pipeline. It is the process a data scientist follows to go from raw data to results.
The data science pipeline
We’ve already seen a general diagram about the work of data scientists. Let’s break down the first part of that diagram with another diagram to describe the data science pipeline.
Sometimes the data scientist also has to collect the data, which can be a very hard task. It could involve scraping the web, or even manually collecting and correcting the raw data.
After getting the raw data, we need to make a lot of transformations, fill missing values, get rid of the noise, etc. This is known as data cleaning. In this stage, we also perform feature engineering. This is the process in which we transform, remove, and create data features. This way, we create a better representation of the data for our problem. For example, data can include the age of some people, but we may be interested just in knowing if each specific person is a child, an adult, or a senior. In the middle of this process, we can get a deep understanding of the data using data visualizations.
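To make the cleaning and feature engineering steps more concrete, here is a minimal sketch in plain Python. The sample ages, the choice of the median to fill missing values, and the cut-offs of 18 and 65 are all assumptions made for illustration; a real project would pick them based on the problem.

```python
from statistics import median

# Hypothetical raw ages; None marks a missing value.
raw_ages = [4, 35, None, 71, 16, None, 52]


def fill_missing(values):
    """Data cleaning: replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fallback = median(observed)
    return [fallback if v is None else v for v in values]


def age_group(age, adult_from=18, senior_from=65):
    """Feature engineering: turn a numeric age into a coarse category.

    The cut-offs are assumptions for this sketch, not a standard.
    """
    if age < adult_from:
        return "child"
    if age < senior_from:
        return "adult"
    return "senior"


clean_ages = fill_missing(raw_ages)
groups = [age_group(age) for age in clean_ages]
print(list(zip(clean_ages, groups)))
```

In practice you would probably do this with a library such as pandas, but the idea is the same: first repair the raw values, then derive the feature that matters for the problem.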
Once we get the transformed data, we are ready to choose the models we are going to use to solve the problem. This process and the previous ones complement each other. We perform feature engineering according to the model we think will be better. After choosing the models, we train them and finally validate our results. If we are not happy with those results, then we return to the data transformation process and repeat all the steps.
The pipeline repeats until we get satisfactory results. You know what should come next: Communicate those results!
After getting an overview of the work of a data scientist, what skills they should master, and the data science pipeline, let’s talk about data scientist job opportunities.
Where can a data scientist work and how much do they make?
In this section, we explore data science job opportunities. We’ll look at examples from different industries and give some information about the salaries and responsibilities of data scientists today.
But first, let’s analyze why every company needs data scientists.
The necessity for data scientists
Most of the time, data science is used to guide decision making: which ads to show to which users on YouTube, which products to sell at a given time of year, or which players a team should hire.
But there’s more! Once you have some data science experience, you start seeing it everywhere: the descriptions of a show’s episodes on Netflix, face detection with your mobile camera, results in the Google search engine, and much, much more.
There is an increasing awareness among companies that data science can have a big impact on their success. For example, Starbucks started to use data science to personalize the experience of customers, manage the logistics, and guide its marketing campaigns. A fashion company like Burberry is also using data science to recommend offers to its clients by comparing the preferences of many customers. So there are a lot of data science jobs to do out there, and that means there is a high demand for data scientists too.
Now we’ll show some examples of industries where a data scientist can work.
Examples of industries where a data scientist can work
Let’s go deeper and see real-life examples of data science applications in different industries.
| Industry | Example |
|---|---|
| Healthcare | IBM has developed a platform to assist with common healthcare tasks, with an emphasis on the automatic analysis of images and X-rays. |
| Sports | In 2015, MLB introduced Statcast, a data-based analysis of player movements, pitch speeds, and batted ball exit velocity. |
| Government | The US Department of Defense leverages data science and analytics to build battlefield advantage and enhance senior leader decision support. |
| Marketing | Nike uses predictive analytics to improve customer acquisition and retention by identifying the right customers to target. |
| Education | Nottingham Trent University tracked students’ interactions with the university and discovered that student engagement was key to their progress. |
Salaries and responsibilities by industry
We know a data scientist is valuable. But let’s be practical and direct: How much does a data scientist make?
According to Glassdoor, data scientist ranked as the second-best job in America in 2021 based on salary, job satisfaction, and the number of job openings. A data scientist can expect a median annual base salary of around \$114,000, significantly higher than the national median income of \$51,000.
In Europe, salaries vary more from country to country. In Spain the average salary of a data scientist is about €35,000 per year, while in Germany data scientists earn about €60,000 per year on average, according to Glassdoor. Salary also varies depending on the industry and the employer.
Let’s move to another topic: the differences in the work of a data scientist across industries.
| Industry | What the work looks like |
|---|---|
| Healthcare | Usually involves a lot of image processing, from ultrasounds to colposcopies. Results should be explainable: we can’t just say “the patient is sick,” we need to explain why our algorithm gives that result. All the work in this field is meant as assistive technology, and the final decision is always made by professionals. |
| Sports | Here the data usually comes in tabular form, and it is mainly numeric. That’s an advantage! This field requires expertise in statistics and exploratory data analysis to find the correspondence between numbers and actual player performance and value. |
| Government | I have personally worked in this area. The applications here can range from logistics to budget management. For example, data science can guide the decision of which people should receive economic aid from the government. Analyzing surveys, knowing how to interpret economic data like poverty indices, and being able to explain results to government officials are some of the most important skills in this industry. As with healthcare, results should be highly explainable, since most of the time they influence very sensitive decisions. |
| Marketing | Collecting data can be a huge problem in this field. For example, when you want to analyze the competition, you won’t get that data easily. Likewise, most of the time clients don’t give their data voluntarily. You have to find smart ways to get data from potential clients and the competition. Data by itself is not always representative; you also have to take time into account. For example, if all your data was collected in the winter, an analysis of that data may be of little use in summer. This industry is closely related to business analytics. |
| Education | As in the sports industry, one of the interests here is to find connections between statistics and students’ performance. But we can also use data science to compare different teaching methodologies and to determine the best ways educational centers can help students. The interpretation of the results will depend on the culture and characteristics of people in the specific region under consideration. This also happens in other industries like Government and Marketing. |
Will technology replace data scientists?
This question often comes up after hearing about AutoML and what it can do. This relatively new field of study within artificial intelligence lets us create machine learning models automatically from data. That may sound a bit scary for data scientists, but let’s take a closer look.
The dawn of AutoML and what it means for data scientists
Automated machine learning (AutoML) is a new research field within artificial intelligence. The goal of AutoML is to create the best model automatically. That means building the model best suited to the problem at hand with no code at all.
There are many AutoML frameworks out there like auto-sklearn and autokeras. But all those frameworks focus on finding the best models and fine-tuning them. As we discussed earlier, the work of a data scientist goes far beyond that.
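To give a feel for the "find the best model and fine-tune it" step that these frameworks automate, here is a toy sketch of a hyperparameter search. It is not how auto-sklearn or autokeras work internally, and it does not use their APIs. It simply tries a few values of k for a hand-written k-nearest-neighbors classifier and keeps the one with the best validation accuracy. The data points, the candidate values, and the helper names are made up for illustration.

```python
def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbors)
    return 1 if votes_for_one * 2 > k else 0


def accuracy(train, valid, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


# Hypothetical 1-D data: label 0 for small values and 1 for large ones,
# with two deliberately noisy training labels (3.5 and 5.5).
train = [(1.0, 0), (2.0, 0), (3.0, 0), (3.5, 1), (4.0, 0),
         (5.0, 1), (5.5, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
valid = [(3.4, 0), (2.2, 0), (5.6, 1), (6.8, 1), (3.9, 0), (1.1, 0)]

# The "search": evaluate every candidate and keep the best one.
candidates = [1, 3, 5]
scores = {k: accuracy(train, valid, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Notice what the search does not do: it does not decide which data to collect, how to clean it, or whether the validation set reflects the real problem. Those decisions still belong to the data scientist.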
Analyzing the data, transforming it, doing feature engineering, feature extraction, and feature selection are tasks in which AutoML has not had an impact yet—and we shouldn’t expect that to change in the near future.
Also, remember that an extremely important part of the data scientist’s work is to interpret and communicate results. This kind of work is even harder to replace with an algorithm.
But when it comes to finding the best models and training them, we have to recognize the great potential of AutoML. Right now, AutoML frameworks don’t entirely replace data scientists in that area either. But we can expect these frameworks to keep improving in the coming years. So, how exactly will a data scientist’s work be impacted by AutoML?
The switch to data-centric AI and the role of data scientists in it
Even without AutoML, having good data matters more every day. For years, the AI approach has been to collect a lot of data and to train a sufficiently powerful model that could handle the noise and inconsistencies of that data.
But removing that noise and those inconsistencies has proved to dramatically improve the performance of the models. And with good data, we can get excellent results with very simple models.
If you add the progress AutoML has achieved so far, and all the improvements coming soon, it seems that data science should now focus on getting better data every day.
That’s what is called data-centric AI, and it is getting more and more attention. For example, check out this article by Forbes about Andrew Ng’s call to switch to a data-centric approach in AI. And who is going to lead that switch? Data scientists, of course. They need to keep sharpening their data manipulation skills. That will be the key to getting the most out of the models, no matter whether those models are automatically built or not.
So I don’t think the data scientist’s work is in danger. Far from it! It will probably change, with the work focusing more on improving the quality of data, which requires a lot of insight and experience.
How can I become a data scientist?
Here we discuss how to get started, what path you may follow, and what you should do to get your first data science job.
How much math does a data scientist need to know?
The short answer: the more the better.
But if you don’t have a solid math background, don’t worry. You will be able to learn on the way.
Also, with all the great tools and libraries out there, we can do a lot of things without knowing math at all. But don’t get me wrong, when you learn math, you become a better data scientist. Learning math is something that should always be on your learning path. But it is not a requirement to get started.
Now that we know what we need to get started with data science, let’s discuss a learning path.
A proposed data scientist learning path
This learning path is designed for people who start from zero, that is, people who don’t have coding or math skills. If you are not starting from zero, you can begin the path at the point that is best for you.
- Learn to code (Python or R): These are the main data science languages. Python is a general-purpose language while R is focused on data science. Python is more popular and is my recommendation because of its great community, its simplicity, and its documentation. Try to take data science-oriented courses like these on DataCamp: Introduction to Python for Data Science and Introduction to R.
- Begin solving simple data science tasks: What you know about programming is enough to start solving some simple problems. For example, these DataCamp guided projects are a great starting point. You can find a lot of different projects designed for different languages and skills. Don’t try to understand the theory behind the models and the methodologies yet, just get familiar with pipelines and tools.
- Begin understanding the data science pipeline and the theory behind some models: Now it is time to go a little further. Study why we need to split datasets before training our models, how basic models like linear regression work, how to measure the performance of your models, etc. Again, DataCamp provides excellent introductory materials. More advanced courses are available for free, like the machine learning course by Andrew Ng on Coursera (it is free if you decide not to get the certification). This free data science course on Udacity is also a good option for this stage. You can find many more with a quick search. But the most important recommendation is to learn while attempting to solve harder problems. You’ll be able to face a lot of new problems by yourself. This is also the time to get started with algebra and calculus.
- Gradually learn more models and the math behind them: Go at your own pace. When you feel you have mastered some concepts and models, and you have solved a couple of problems, move on to the next concepts and models. Learn the required math at your own pace too. Always prioritize doing projects and solving problems. This is a non-exhaustive list of models you should learn about: linear regression, logistic regression, decision trees, random forests, gradient boosting, k-nearest neighbors (kNN), support vector machines, naive Bayes classifiers, and neural networks.
- Start publishing your solutions: Once you have solved many problems using some of the techniques mentioned, you need to get started with one important part of ML: getting models to production. Now it is not about math. You need to create APIs and web apps. Remember your mission as a data scientist is to deliver value, and there is no value in a solution that lives only on your machine. I recommend using simple frameworks like Streamlit to get started. Keep growing in this area, since this skill is in high demand.
- Continue solving problems and try to increase the complexity: You now know the basics of each task performed by a data scientist. You need to learn more every day. It’s time to focus on how to transform the data to get better results, and learn more techniques to combat common problems like overfitting. Build a strong statistics and probabilities background that will let you analyze your methodologies and detect problems with your data. But the most important thing: keep solving problems, and start thinking about new ideas and projects to work on with what you have learned so far. Get more projects to production and learn about how to maintain them. Make sure you can understand and explain the results.
- At this point you are ready for your first job!
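The third step of the path mentions splitting the dataset, fitting a basic model like a linear regressor, and measuring its performance. Here is a minimal sketch of that loop in plain Python, so you can see every piece. The synthetic data (y = 2x + 1 plus noise), the 80/20 split, and the fixed random seed are assumptions chosen for illustration.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared difference between predictions and true values."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return sum(errors) / len(errors)


# Synthetic data: a known line plus Gaussian noise (seeded for repeatability).
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(20)]

# Shuffle, then hold out 20% of the data that the model never sees in training.
rng.shuffle(data)
cut = len(data) * 4 // 5
train, test = data[:cut], data[cut:]

slope, intercept = fit_line(train)
test_mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.3f}")
```

The reason for the split is that the error on the held-out points estimates how the model behaves on data it has not seen, which is what you care about once the model is in use.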
What does it take to land your first data science job?
As we have described, a data scientist is a key asset for any company. Their work is very important and an error in the analysis or methodology might be disastrous.
That’s why most data science positions require previous experience. But how do you get that experience in the first place?
If you follow the proposed path, you’ll eventually have some projects out there that anyone can use. That is not work experience, but you’ll have something to show employers: a solution that has passed through all the phases of the data science pipeline and is live and ready to use. That could get you a job even without experience. That’s one of the reasons you should always build something while learning.
An important recommendation is to try to identify the field inside data science that you like the most (business analytics, geospatial analysis, image processing, natural language processing, etc.). That way you’ll have more projects related to that field. Build your portfolio and show it to employers!
If that does not seem to be enough, try to get a scholarship. Your projects will be of great help for that too! Then you will finally have the required experience.
Don’t worry, the opportunities will come sooner or later. Data scientists are still in high demand. Believe me, I come from a country with very little data science development and culture in industry, and I was still able to get jobs.
In this article, we showed what a data scientist does. We broke down the responsibilities of data scientists and the many industries in which they can work. Then we talked about how technology may affect their work. And if you want to become a data scientist, you now have a detailed learning path.
If you took good math and programming courses in school, that’s great. You may find the path easier to follow. And if you were taught how to communicate with others, that’s good too. But remember that data science is a wide, multidisciplinary field, and there is always a lot to learn beyond school. Every industry and scenario has its own characteristics and problems. Make sure to dedicate time to acquiring all the skills you will need.
No matter where you come from, today is an excellent day to begin your data science career.