Recapping parts 1 and 2
Welcome, everyone, to the third and last webinar in our series on how to scale data science. Just to recap — and this might be a little repetitive for those of you who attended the first two webinars, but I want to make sure everyone is on the same page — we started this series by talking about what data science has made possible. The idea is that data science has done two things. One, it has made the impossible possible: things like self-driving cars and AI systems that play Go.
But the hidden revolution of data science is actually in making the possible widespread, and that is essentially what we're going to be talking about: how can we enable organizations to scale this hidden revolution of data science? It's about letting more people use data and drive decisions with data. We talked about the IPTOP framework and set it up. As Adel pointed out, there are essentially five levers: at the base we have Infrastructure and People, and at the top we have Tools, Organization, and Processes.
In the last webinar, we went more into detail on Infrastructure and Tools, which in many ways, I see as the technical levers of scaling. In today's session, we're going to talk about how to organize teams. So, we're going to talk about People, Organization, and Processes. These three can be seen as the people levers in scaling data science.
What does scaling people really mean? If you remember from the previous webinars, we talked about how data flows through an organization. To give you a quick recap, data always starts at the left, where it says raw data.
That is the raw data collected from various sources. Most organizations have data coming in all shapes and forms. You have some data in Excel sheets, some in databases, some coming from Google Analytics. The objective of any data organization is to transform this into insights, which is what the flow describes. You collect the raw data, put it in a central warehouse, process it, and then people in the organization access it and turn it into insights, which could be dashboards, machine learning models, or reports.
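To make the flow above concrete, here is a minimal sketch in Python. The source names, fields, and records are invented for the example: data from two hypothetical sources gets normalized into one central warehouse table, and a small "insight" is computed from it.

```python
# Illustrative sketch of the raw data -> warehouse -> insight flow.
# All source names and fields below are made up for the example.

def normalize(source_name, record):
    """Map a source-specific record into a common warehouse schema."""
    if source_name == "excel":
        return {"customer": record["Customer Name"], "revenue": record["Rev"]}
    if source_name == "analytics":
        return {"customer": record["user"], "revenue": record["value"]}
    raise ValueError(f"unknown source: {source_name}")

def load_warehouse(sources):
    """Collect records from every source into one central table."""
    warehouse = []
    for name, records in sources.items():
        warehouse.extend(normalize(name, r) for r in records)
    return warehouse

def total_revenue(warehouse):
    """One tiny 'insight': total revenue across all sources."""
    return sum(row["revenue"] for row in warehouse)

sources = {
    "excel": [{"Customer Name": "Acme", "Rev": 120}],
    "analytics": [{"user": "Beta Corp", "value": 80}],
}
warehouse = load_warehouse(sources)
print(total_revenue(warehouse))  # 200
```

The point is not the code itself but the shape: every source needs a normalization step before anyone downstream can produce insights from a single place.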
The data science workflow
The reason I'm sharing this view of the data flow is because for this flow to work, you need people who can handle each piece in this layer. Depending on what pieces they handle, you have different kinds of roles. So, the whole question is, how can you systematically think through scaling the skills of people in your organization, so that you can successfully do this flow of raw data collection and insights? If you think about it, every step here is critical. For example, if you don't have people with the right skills to handle the data engineering portions, then the data scientists can’t do much, because they don't have access to the data. It's really important to anchor the organization in terms of saying, OK, here's how we're thinking about infrastructure and flow. Now, what skills should the people hold?
There are many ways of thinking about this, and the way that DataCamp thinks about this is basically anchoring it on the whole flow, right? This is a zoomed out version of the data science workflow. So, you can think of it as everything past the data warehousing stage. If you look at the last part, it’s about communicating and deploying output. Again, there are different ways people communicate output. You can have people building dashboards, doing experiments, analyzing experiments, writing reports, building machine learning models, and deploying them into production. This flow together is what, as an organization, you need to be able to achieve.
Identifying Data Personas: What's the goal here? The goal is to start from a few data experts in your company and end up with organization-wide data fluency. I'm being careful in calling this organization-wide data fluency, because not everyone needs to become an expert. That is why the optimal way to scale skills is to be objective and ask the question: what is this person going to be doing? So that's step one: identify the data personas. Is this person going to be helping us get data into the warehouse, building dashboards, or running machine learning experiments? It's really important to understand and identify the data personas.
Mapping Skills by Role and Measuring Competencies: Once you identify the personas, the next step is to map the skills, by role. For example, if you have someone who is essentially going to be doing dashboarding, then one of the skills they need is to know how to use a dashboarding tool. It could be Tableau, Power BI, Shiny, or Dash. Regardless of what they use, they need to have those skills. Once you map the skills by role, the next question is measuring competencies. Not everyone in the role has all the skills mastered for the role that they are in.
Personalized Learning: Finally, once the competencies are measured, it's important to set up a personalized learning path, so that a person who's expected to have a certain skill level can understand the gaps and move forward. This is a highly simplified flow, but I think it captures 80-90% of what an organization needs to do to scale skills and people. So let's dive into each of these in a little more detail.
Identify data personas
When scaling data science, your organization needs to think about its data personas in a more detailed manner. At a very broad level, you could think of people as belonging to four different buckets in your organization.
Data Consumer: A data consumer is someone who is not in a technical role, but needs to understand and interact with data. For this kind of role, as long as they have Excel spreadsheet skills, a little bit of Power BI, or Tableau skills, that might be enough for data consumers.
Data Analyst: Domain-specific analysts who support business decision making. A pretty large portion of the data organization in most companies. Could be business analysts, marketing analysts, finance analysts, or HR analysts. They are all essentially supporting departments, and they need slightly more skills than the consumer, because they create the things that data consumers consume.
Data Scientist: Can do more than what the data analysts do. Usually brings in some training in statistics, advanced statistics, machine learning, and has a wider array of skills to tackle problems. This is usually the bucket where people get into a lot of coding. So you have R, Python, or Scala, depending on the need.
Leader/Manager: Similar to data consumers, but they need to be able to make decisions based on the analysis that is presented. They need to be able to know how to look at dashboards, maybe even tweak dashboards to answer questions that they're interested in.
So this is, at a pretty high level, what the roles look like. There are other ways to think about this, and I'm going to give you a few more examples, because I think it's key to see how different companies think about it. If you dive one step deeper, for example with data scientists, not all data scientists are created equal. The reason is that data scientist is a catch-all term in some ways. In fact, if you compare the roles of data scientists across companies, chances are they're very different. It's important to zoom in a little deeper and really understand what exactly they are going to be doing.
Detailed data personas: Airbnb
Airbnb tends to look at data scientists as belonging to three different buckets.
Data Scientists (Analytics): These are usually people who define and monitor metrics. They create data narratives, build tools, build dashboards, closer to the data analysts in some ways. At Airbnb, it's a data science analytics track.
Data Scientist (Algorithms): These are people who build and interpret algorithms. They bring in skills in machine learning and focus on productizing. They need to have an engineering mindset. They’re not on the business side of things, but they create a lot of infrastructure that eventually gets used by other data people in the organization.
Data Scientists (Inference): These are people who would analyze experiments. They would help with statistical analysis of experiments and other insights that need to be driven by that. These are typically people with a background in stats, economics, or social science. Their whole goal is to improve decision-making and measurement impact.
I really like this structure, because it's a pretty clean way to divide the responsibilities. I think the other reason it's pretty clean is because of the skills required for each of these personas. While there is overlap, there is enough difference. For example, someone who is an inference data scientist, as long as they're doing inference work, they don't need to learn machine learning because that’s not something they would be using in their day-to-day role.
Detailed data personas: Netflix
Netflix classifies the data personas into three broad buckets.
Analysts: Broadly speaking, analysts are the ones who usually fit into the business. Their goal is to sort of be as close as possible to the business domain, and depending on what kind of analysis they do, Netflix again has more divisions. Data analysts, quantitative analysts, or business analysts.
Scientists: People who do a lot of analysis and research into what could potentially solve interesting problems for Netflix. Depending on the kind of research and analysis they do, Netflix classifies them further: if you do machine-learning-based research, you would be a machine learning scientist, while people who do hardcore research developing new algorithms would be research scientists.
Engineers: It's very explicit that these are folks who focus on productionizing systems. For example, an analytics engineer's role is to productionize much of the analytics that the analysts in the organization deliver. Similarly, a data engineer productionizes much of the data infrastructure.
This is a pretty nice way of looking at things because it's very clear that engineers are working on productionizing things, scientists are working on research, analysts are spending time on the day-to-day questions that come from the various business departments.
One of the first exercises any organization needs to carry out is to think about its data personas. Maybe you don't have as many personas as Netflix, which is a very data-driven organization — that's a huge competitive advantage, but you may not have all these roles. So maybe in your organization you might say, OK, I have these roles: business analyst, data engineer, data scientist — and again, you can split data scientist into inference or algorithms. Create your personas and really understand what each person is all about.
Map skills by role
Map out skills by role: McKinsey
Once you identify the personas, the next step is to map the skills required for each role. So how do you map skills? Here's a pretty neat chart from McKinsey, where in one of their reports they look at a broad set of skills such as programming tools, data intuition, statistics, data wrangling, and machine learning. Then they look at four fundamental roles: data analyst, machine learning engineer, data engineer, and data scientist.
They've created a matrix of the level of expertise and the level of importance of each of these skills for a particular role. You can see that if you're a machine learning engineer, data visualization and communication is somewhat important; if you're an analyst or a data scientist, it's extremely important, which resonates with what you observe in the real world. Again, these skills could be very different in your organization, but the nice thing about data is that there is a fairly standard set of skills that people need. While your classification might look a little different, at the core it's usually based on the flow of data.
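As a rough illustration of such a matrix — the roles, skills, and importance levels below are invented for the example, not McKinsey's actual data — a skills-by-role mapping can be as simple as:

```python
# Illustrative skills-by-role matrix in the spirit of the McKinsey chart.
# Roles, skills, and importance levels are made up for the example.

IMPORTANCE = {"low": 1, "medium": 2, "high": 3}

skill_matrix = {
    "data_analyst":   {"data_visualization": "high",   "statistics": "medium", "machine_learning": "low"},
    "ml_engineer":    {"data_visualization": "medium", "statistics": "medium", "machine_learning": "high"},
    "data_scientist": {"data_visualization": "high",   "statistics": "high",   "machine_learning": "high"},
}

def required_skills(role, min_importance="medium"):
    """List the skills at or above a given importance level for a role."""
    threshold = IMPORTANCE[min_importance]
    return sorted(
        skill for skill, level in skill_matrix[role].items()
        if IMPORTANCE[level] >= threshold
    )

print(required_skills("ml_engineer", "high"))  # ['machine_learning']
```

Even a toy structure like this forces the useful conversation: for each persona, which skills actually matter, and how much.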
Map out skills by role: Airbnb
This is another example from Airbnb. Their roles are based on what a person is going to be doing. Are they going to be doing visualizations, experimentation, data products, analysis, ETL, or infrastructure? Based on that, there is an understanding of what mix of skills they need to bring in. For example, someone with experimentation needs to have 50% data science skills, and 50% data tools, because that's how they'll be able to handle the role.
Map out skills by role with DataCamp
DataCamp allows you to map skills by role in a clean way as well. Here, we have the whole concept of topics, and then we have the concept of various skill tracks and career tracks. Then we have a mapping of what topics need to be mastered for each of these skills. That's another way to think through how you want to map your skills by role.
Once you map your skills by role, the next critical step is to measure competencies. Usually, the people you hire are pretty good and have most of the skills mapped to their role. Nevertheless, there's always going to be room for improvement, right? To do this in an optimal way, you need to really measure the competencies of people in the roles they're in, and identify what strengths and gaps exist so that you can create a learning program.
How do you measure competency? When measuring competencies there are two levels. You need to assess skills, and then you need to identify skill gaps.
There are various ways to do this. One way is with DataCamp Signal, a skill assessment product. A skill assessment takes you through a series of questions on a particular skill. It's an adaptive test that measures your skill level as a score and shows where you stand in that particular skill.
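To illustrate the idea of an adaptive test — this is a deliberately naive staircase procedure, not how Signal or any real psychometric model actually works — a sketch might look like:

```python
# Naive sketch of an adaptive assessment: difficulty moves up after a
# correct answer and down after a wrong one, and the final score reflects
# the difficulty level the test settles on. Real products use far more
# sophisticated models than this.

def adaptive_score(answer_fn, n_questions=10, levels=5):
    """Run a staircase test; answer_fn(level) -> True if answered correctly."""
    level = levels // 2 + 1          # start in the middle
    history = []
    for _ in range(n_questions):
        correct = answer_fn(level)
        history.append((level, correct))
        if correct:
            level = min(levels, level + 1)
        else:
            level = max(1, level - 1)
    # Score: average difficulty reached, scaled to 0-100.
    return round(100 * sum(lvl for lvl, _ in history) / (levels * n_questions))

# A hypothetical learner who can handle questions up to level 3:
print(adaptive_score(lambda level: level <= 3))  # 70
```

The key property the sketch shows is adaptivity: the questions converge on the boundary of what the person can do, rather than asking everyone the same fixed set.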
This is just one example of how to measure skills. There are other skill assessment products out there, and you could even build your own internal assessment tools to get a good handle on it. It's important to be objective about this and say: we want to track skill levels so that we know what to improve. If you don't measure something, you can't really improve it. That's why it's critical to adopt a tool that can help you measure skill levels.
Identify Skill Gaps
Once you assess skill levels, the next step is to identify skill gaps. If you're using Signal, it identifies skill gaps as well. Based on your assessed skill level, Signal compares it with our courses and recommends what you need to do to level up and fill your gaps. For example, this person was a novice at the tidyverse, which is essentially a set of R packages, so the assessment pointed out the courses they need to take to close those gaps. Measurement is not just about saying I'm a beginner, intermediate, or expert on a topic. It's about identifying: what am I good at, what am I not good at, and how do I get better at what I'm not good at?
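Gap identification boils down to comparing assessed levels against the levels a role requires. A minimal sketch, with invented skills and numbers:

```python
# Minimal skill-gap identification: compare a person's assessed levels
# against the levels their role requires. Skills and numbers are
# illustrative, not DataCamp's actual scoring.

def skill_gaps(required, assessed):
    """Return {skill: shortfall} for every skill below its required level."""
    return {
        skill: needed - assessed.get(skill, 0)
        for skill, needed in required.items()
        if assessed.get(skill, 0) < needed
    }

required = {"tidyverse": 120, "statistics": 100, "visualization": 90}
assessed = {"tidyverse": 45, "statistics": 110, "visualization": 70}

print(skill_gaps(required, assessed))  # {'tidyverse': 75, 'visualization': 20}
```

Note that strengths (statistics here) simply drop out; what remains is exactly the input a learning program needs.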
Once you identify what skills gaps there are, the next step is to be able to identify what you need to do to bridge this gap. I think it's really important to have a personalized learning plan because not everyone in the same role has the same set of strengths, or same set of gaps. The more personalized you can make it, the better the skills that people have will fit into their roles.
Creating a Personalized Learning Path
How do you create a personalized learning path? There are two pieces to it: one is providing access to personalized learning, and the second is a culture of continuous learning. Once again, I'm going to take an example from DataCamp. This is an example of a learning path, or track — a custom one, for someone whose role in the organization is analytics engineer. After the Signal assessment, it says: for this particular role, this is your personalized learning path, and these are the courses you should take to level up your skill set and improve.
Learning is pretty much a loop. You first start with the skills you need to have, then you measure your strengths, identify the gaps, and learn to fill those gaps. Then you go back to assessing and measuring, and this is the feedback loop that's going to help scale people in an organization.
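The gap-to-learning-path step of that loop can be sketched in code as well; the course catalog and the gap-to-course mapping here are entirely hypothetical:

```python
# Sketch of the assess -> identify gaps -> learn -> reassess loop's
# recommendation step. Course names and the mapping are invented.

COURSES = {  # hypothetical catalog: skill -> ordered course list
    "tidyverse": ["Intro to the Tidyverse", "Data Manipulation with dplyr"],
    "visualization": ["Intro to ggplot2"],
}

def learning_path(gaps):
    """Turn skill gaps into an ordered course list, biggest gap first."""
    path = []
    for skill in sorted(gaps, key=gaps.get, reverse=True):
        path.extend(COURSES.get(skill, []))
    return path

gaps = {"tidyverse": 75, "visualization": 20}
print(learning_path(gaps))
```

Ordering by gap size is one simple policy; a real system would also weigh prerequisites and the person's role.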
Supporting Continuous Learning
So far, we've talked about how it's important to identify roles and skills for a role, and to make sure that people have the skills to do their job. Really good organizations go above and beyond this, and create a healthy environment for continuous learning. Doing this has a lot of benefits. What I mean by continuous learning is essentially for people to expand out from their roles and learn whatever they need to broaden their roles, or even switch roles in a company. If you're a data analyst and want to become a data scientist, I think an organization supporting continuous learning would provide incentives for people to say, I have the ability to learn whatever I need to, so for me to become a data scientist, I need to learn statistics and machine learning, and I have all the resources to learn that, and that's the path that I'm going to take. While it's important to have people access the skills they need, I think it's even more important to make sure that they have access to a much broader learning curriculum so that they can expand their horizons.
This can bring tremendous results, because suddenly you will see that people who are not signed up for a particular role suddenly realize there are tools that they can use, and they can really change and turbocharge what they do by learning something that is completely different.
This is an example of Airbnb's Data University, whose goal is to empower every employee to make data-informed decisions. Before Covid, they would have regular meetings where people attended classes; now it's probably done over Zoom. Either way, it's about having a community of people learning together to upskill themselves. Another example is Amazon — some of you might have seen this news. Amazon has a Machine Learning University, and recently they made it available to all developers, releasing a lot of machine learning content.
So this is about scaling people, and just to quickly summarize, I think the idea is that you need to recognize your personas and data roles. You need to map skills to those roles, and then you need to be able to assess skills and gaps. Finally, you need to be able to personalize learning, and support continuous learning. If you do all these pieces, the goal is to have a 100% data fluent organization, and as pointed out right at the beginning, that is the goal for DataCamp. Our goal is to be that lever that will help organizations get to 100% data fluency through a combination of learning assessment, practice, and application.
People are a very important lever, but the way people are organized can make a huge difference in the value they can deliver. So how should a data team be organized? I'm going to do this in two parts. We'll first talk about a few common organizational structures I have seen out in the data world, and then about how to think through the right data organization strategy for your organization.
So, on the left-hand side of the slide, you see the centralized model of organizing data teams. The whole idea here is that you have a central data science org that functions as a center of excellence. Then you have all the business units which ask questions to data science teams, and the data science teams respond back with results. So, this is the classic model that's followed in several organizations.
The other model, which is on the far right, is the decentralized model. In the decentralized model, you really don't have a central data team. Instead, you have data people strewn around all the business units. For example, you have a data person or a group of data people for the finance team, for the product teams and marketing teams, engineering teams. So what are the pros and cons of these?
Pros of the centralized model:
- Your data science org can function as a center of excellence.
- Because it's a department of its own, you have people with similar skills working together.
- It's easier to move resources around.
- The data science teams are led by people with very strong technical and domain knowledge.

Cons of the centralized model:
- It limits coordination between data science and stakeholders, as there tends to be a throw-it-over-the-wall mentality.
- It does not cultivate a joint problem-solving mindset.
- There's a huge risk of misalignment between the business units and data science.
- The biggest risk, in my opinion, is that it can isolate data science as a support function. People start to think of data science as a support function and not as a joint problem-solving entity.

Pros of the decentralized model:
- Data science has a natural seat at the table. The data people on the product team have deep insight into the problems the product team is facing.
- Not only can they solve problems that come their way, they can also spot problems where data can help that others may not think about. It is more biased towards action.

Cons of the decentralized model:
- Because data people are spread across teams, it's harder to coordinate on things that are central. If the finance team is looking at experiments and the marketing team is looking at experiments, there's a very high chance the data people on the two teams are working on very similar tasks without knowing it.
- It prevents the development of best practices and the centralization of effort, which is what the centralized model provides.
This brings me to the discussion of how, as a company, you should think about which model is going to be useful. There are a number of things to consider here. One, of course, is the size of the organization: are you a small, medium-sized, or big company? Second, how far along in your data journey are you? The reason I say this is that there are companies that are big but still early in their data journey. A good way of looking at things is to consider these two attributes and base your decision on that. For example, if you're a small company, I would argue it makes sense to have your data people embedded within the departments; the whole concept of centralized versus decentralized becomes irrelevant because the company is small. On the other hand, if you're a big company that is starting out on your data journey, it might make sense to start with a centralized team, because you need the people thinking about data to be together so they can map out what data science in the organization is going to look like. As you grow bigger, the key is knowing how to decide where to be on the continuum.
One way to frame the decision is to ask: how much productization are your data people doing? Are people mostly doing day-to-day analyses and domain-specific work, or are they building systems and products that everybody else uses? If there's a lot of day-to-day insight work, then it makes sense to have a hybrid model with a small core team and a much larger embedded team: the core team can still build shared tooling, while the embedded people handle the day-to-day requirements. But if, as an org, you're doing a lot more productization and platform work, then it makes sense to have a bigger core team. Once again, the balance is all about knowing what needs to be accomplished, and it's really important to consider all these aspects before you decide on the optimal alignment.
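The heuristic above can be written down as a toy decision function. The thresholds and labels are illustrative only — the real choice is a judgment call, not a formula:

```python
# Toy version of the org-model heuristic discussed above.
# Thresholds and labels are illustrative, not a prescription.

def suggest_org_model(company_size, data_maturity, productization_share):
    """
    company_size: 'small' | 'medium' | 'large'
    data_maturity: 'early' | 'established'
    productization_share: fraction of data work spent on products/platforms (0-1)
    """
    if company_size == "small":
        return "embedded"            # centralized vs decentralized barely matters
    if data_maturity == "early":
        return "centralized"         # build the data culture and roadmap first
    if productization_share >= 0.5:
        return "hybrid, larger core team"
    return "hybrid, small core team"

print(suggest_org_model("large", "early", 0.2))        # centralized
print(suggest_org_model("large", "established", 0.7))  # hybrid, larger core team
```

Writing it out this way at least makes the inputs explicit: size, maturity, and the day-to-day versus productization mix.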
The third lever is about scaling processes. Process is usually one of the last things people like to think about. I hate processes, partly because they force me to really constrain how I do things. Having said that, I also recognize that process is, by far, one of the most important elements in an organization. Process is what allows a wide group of people with different skills and mindsets to work efficiently on the same problem. So, what are the various elements of process that I would optimize when scaling a data science team?
The first step in scaling processes is looking at these four levers:
- The project life cycle
- Standardized project structures
- Embracing notebooks
- Sharing knowledge
The Project Life Cycle
The project life cycle defines how a project is going to be taken end to end. Having a project life cycle makes it clear to everybody on a project — the data team as well as the non-data stakeholders — exactly how the project is going to be delivered, and it builds a common understanding.
For example, let's say the marketing, sales, or finance department has a question on churn: they need to understand how often our customers churn, and what the trend in those churn numbers is. Having a project life cycle means there is a clearly mapped out series of steps for that.
For example, at DataCamp, any time a question arises, a request gets made — we use Jira as the tool to capture all those requests. Our data team follows a hybrid model, with people on the central team and people spread across the departments, coordinating together. All these issues get triaged based on priority. Once triaged, they move into the development phase. Finally, once everything is done, models get deployed, dashboards get created, and knowledge gets shared. The life cycle makes it extremely clear to everybody what series of steps is required.
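The request-and-triage part of that life cycle can be sketched as a simple priority queue. The tickets and priority numbers are made up; in practice this lives in a tool like Jira:

```python
# Sketch of the request -> triage -> development flow as a priority queue.
# Tickets and priorities are invented for the example.
import heapq
import itertools

counter = itertools.count()  # tie-breaker so equal priorities stay FIFO
queue = []

def file_request(priority, title):
    """Lower number = higher priority, as in most trackers."""
    heapq.heappush(queue, (priority, next(counter), title))

def next_to_develop():
    """Pop the highest-priority request into the development phase."""
    return heapq.heappop(queue)[2]

file_request(2, "Quarterly churn trend dashboard")
file_request(1, "Churn numbers for finance, due Friday")
file_request(3, "Exploratory churn segmentation")

print(next_to_develop())  # Churn numbers for finance, due Friday
```

The counter tie-breaker is a standard trick with `heapq`: it keeps same-priority requests in arrival order and avoids comparing the ticket strings.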
Standardized Project Structures
The next thing is standardizing project structures. This is also important, especially in big organizations. If there's too much variation in how projects are organized, it can make it really difficult to move resources around. Standardizing templates and access to resources makes a big difference. Here are two examples of how to organize projects: from the world of R, ProjectTemplate is a package that gives you a standard folder layout for a project, and Cookiecutter is a Python project that does the same thing. Once again, you do not have to adopt one of these. What's important is that you adopt a standard — there is no single best standard out there. Adopt a standard, adhere to it, and improve it incrementally based on the needs of your organization.
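As an illustration of what standardization buys you, here is a minimal scaffolding script in the spirit of ProjectTemplate and Cookiecutter. The folder layout itself is just an example; what matters is that every project looks the same:

```python
# Minimal project scaffolding: every new project gets the same layout.
# The layout below is illustrative, not a recommendation.
import tempfile
from pathlib import Path

STANDARD_LAYOUT = ["data/raw", "data/processed", "notebooks", "src", "reports"]

def scaffold_project(name, root="."):
    """Create a new project directory with the standard subfolders."""
    project = Path(root) / name
    for folder in STANDARD_LAYOUT:
        (project / folder).mkdir(parents=True, exist_ok=True)
    (project / "README.md").write_text(f"# {name}\n")
    return project

with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_project("churn-analysis", root=tmp)
    print(sorted(p.name for p in project.iterdir()))
```

With a script like this (or a Cookiecutter template), anyone joining a project already knows where the data, the notebooks, and the source code live.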
One of the final things to keep in mind for process is embracing notebooks. This might be a highly biased view, but based on what I've seen in the data science world, notebooks are one of the most important technologies to have changed how people work. Notebooks allow you to combine code, text, and output in a single, reproducible document, as opposed to copy-pasting things from different places. I would strongly recommend you take a look at notebooks; they are a really powerful tool for a team to adopt. There are two broad notebook ecosystems out there. JupyterLab and Jupyter Notebooks are open-source projects from the Python community, though they support more than just Python. In the world of R, we have R Markdown notebooks, which are very similar to Jupyter notebooks; there are some technical differences in how the documents are stored, but they do the same thing in many ways.
Just to give you an example, Netflix embraces notebooks in a super big way. While we're all familiar with using notebooks for interactive analysis, Netflix takes it a step further: pretty much every single thing they put in production, they take through notebooks. They use nteract, an open-source UI for notebooks, as the tool for everyone to push their analyses into production.
The last couple of things I will touch upon concern sharing knowledge. While notebooks are great for sharing work across data people, not everyone is going to fire up JupyterLab to look at something, so it's important to have a way to share knowledge more broadly. Again, there are multiple ways of doing this. We use a product called Kyso, which allows sharing knowledge articles. Airbnb uses an open-source tool called Knowledge Repo; their internal Knowledge Feed is built on that open-source project. Long story short, it's important that there is a clear channel for communicating and sharing knowledge.
Embrace Version Control
Finally, while these last two are more technical, I think they're equally important from a process standpoint. The first is embracing version control. It has fundamentally changed how software engineers work, and it's why software engineers have been able to deliver so many products while working from many different places. Data scientists have also embraced version control, and I think it's important that you, as an organization, embrace it in a big way.
This is an example of what's called a commit on GitHub. As you can see, it makes it very clear to anyone looking at it exactly what changed and what was there previously. That enables things like figuring out what went wrong, diagnosing it, and understanding, if there is a bug, when it was introduced — a huge advantage.
Adopt Style Guide
Finally, there's always a lot of talk about function over form, and this is one area where I might be opinionated. Form matters when you have a lot of people writing code, or doing anything, in their own different styles. It's really important to adopt a style guide, or a common way of doing things, because it makes things efficient. People write code very differently, which makes it a lot harder for one person to follow and read code written by another. Style guides remove some of the cognitive overload created by having multiple styles. This is an example of a style guide for SQL that I am a big fan of; it makes SQL really easy to read. Again, there is no one style that fits everybody. What's critical is that you adopt a style guide, more than which style guide you adopt.
To wrap up before we move into Q&A, we talked about the IPTOP framework, we talked about infrastructure and tools in the first part, we talked about people, organizations, and processes in today's webinar. The goal of the framework is to be able to get an organization from just a team of data experts to a 100% data fluent organization.
Questions and Answers
Question: When you think about organizational models for data science — for example, centralized versus decentralized — is there a specific type of organization that could benefit more from one model than another?
Answer: Yes, excellent question. Getting the right model is really key; as I said, it's a much harder problem to solve than probably some of the real data science problems. It's important to think about where an organization is and where it wants to get to. If you're an organization starting out with data, then I would argue that having a strong, centralized group championing the cause of data in the company is really important — in fact, more important than having data scientists embedded, because you need to create a data culture first. But as an organization evolves, the optimal setup for most organizations is somewhere along that continuum. As I indicated, the key question I would ask is: are people driving day-to-day insights more, or are people creating products at scale that make it easier for everybody to make their own data-driven decisions? The more you are doing day-to-day analysis, the better a decentralized model looks. The more you are driving productization and tooling, the better off you are with a more centralized model.
Question: You've been at the forefront of thinking about how to scale data science within the organizations you've been in, and you've also managed different data science teams. So, from your side, what do you think is the first process that should be put in place when you set up an analytics team and try to make its workflow better than before?
Answer: This is like a million-dollar question. The way I usually think about it is by focusing on the three roles that are fundamental to any data team. I think a data engineer is indispensable, especially when you're starting out with building a data team, because a data engineer is the one who can get all the data and put it into a warehouse. I know a lot of companies tend to hire a data scientist first because they want to get to the insights as early as possible, but if you don't have a data engineer, and you don't have data easily accessible, data scientists really cannot do much. So, the first role I would hire is the engineer, because it really helps with laying the groundwork. The other two roles, I would say, are a data scientist and a business analyst. The reason is that a business analyst is someone who has more domain expertise and understands the problems of the business really well. If you really want to prove the value of data in an organization, the best way to do it is to take a problem that is important, and create the infrastructure and the tooling required to solve it. Of the three roles I talked about, data engineers bring in the infrastructure, business analysts bring in the domain, and the data scientist is the bridge, combining the data with the domain to get insights. There's a fourth role, which is basically the champion. This is essentially a leader who manages the team, but somebody who has the ability to manage upwards, downwards, and sideways. Data roles are pretty complex from this perspective, so having someone really strong doing that is key. So these are the four roles that I would look at.
Question: What skills and competency level are recommended for a middle management position, like a product manager or product owner, for a team composed of data scientists and data leaders that is building AI products? This is really at the heart of the work that we do.
Answer: This question hits pretty close to home, in the sense that this is exactly the kind of problem that we're trying to solve. Let's talk about skills, and then go into competency. For someone in a middle management position, what's really important is to have a breadth of knowledge of how data science works. I don't expect such a person to roll up their sleeves and start coding, but I think it's important that they have a broad understanding of the various pieces: how the data science workflow operates, what the various steps are, what machine learning is, what forms machine learning takes, how it gets applied, what deep learning is, and what kinds of problems deep learning can solve. In short, they need breadth across all of these areas. For example, we have a track on DataCamp called Data Science for Business Leaders; something like that would be perfect to take you through that breadth. For competency level, I think the most important requirement in such a role is that you need to be credible to the data scientists and engineers you manage, either sideways or downwards. The way to bring that credibility is to display an understanding of what supervised learning is and what its challenges are, so that when your engineers come and talk to you about a problem they're facing, you can be empathetic and problem-solve on what needs to be done.
Question: Can you explain how sharing knowledge can be balanced with data privacy, and also intellectual property, for example, for proprietary algorithms.
Answer: Just to be clear, I think sharing knowledge happens at two levels: internal and external. The part I touched upon in today's talk is internal knowledge sharing, and there I don't think privacy is an issue, because everybody in the company is working towards the same goals. With sharing knowledge externally, what we've seen in the data world is that proprietary algorithms do exist, but there aren't many that are deeply proprietary. Data is the oil, right? For example, the reason why companies like Facebook and Google have open-sourced all the amazing deep learning frameworks they have built, rather than keeping them to themselves, is that without the data required to train those models, they cannot be used to compete against Google. Yes, to some extent proprietary algorithms are involved, and you can always keep some things private. But the big advantage of sharing and open-sourcing is that it's great from a brand-building perspective, and it's great from a talent-hiring perspective, because people want to work in places where they can share their knowledge without being locked down.
Question: Is the DataCamp Signal tool free?
Answer: Anyone's first assessment is free on DataCamp Signal, and I would also like to signal that this week is DataCamp Free Week, so all of the content on DataCamp is free, whether it be projects, practice, assessments, or courses as well. So make sure to check that out.
Question: When considering people identifying personas and scaling, what do you think is the most important when working with a young company, or small data team where one person wears multiple hats?
Answer: This is a great question. I joined DataCamp three years back, and I can certainly see how we've grown over time. This whole one-person-wearing-multiple-hats situation happens more often than not. I think the trick is to figure out the right compromise between breadth and depth. I'll give you an example. Let's say you're the only data scientist or data engineer on the team. From a skill perspective, I think the most important skills you can gain and leverage at that point are data engineering skills. If I were to join a young startup as the only data person, the first thing I would do is data engineering. If I don't have the data engineering skills, I will try to pick up the least amount I need to do the job. My goal is not to become an expert; my goal is just to learn enough to get the job done. I think what's important is to think backwards: if I need to deliver insights, what is the path I take, what are the skills to learn, and what depth of skill do I need? Taking a very agile mindset here helps. It's all about getting the job done, not about how you get it done.