So before we talk about scaling, I think it's important to understand how data science is typically viewed in the common media. When people talk about the data science revolution, they usually talk about all these difficult-to-achieve, highly complicated artificial intelligence applications. So you hear about self-driving cars, deep learning, voice synthesis, natural language processing, robo-advisors, and Go-playing programs. This is definitely one of the core advancements that data science has provided, right? This is typically viewed as the data science revolution. However, one of the things that people miss, and that is often not discussed, is that behind all of this there is also a hidden data science revolution.
If you really think about what data science enables, I would like to look at it in terms of two different questions.
How hard is something to do?
How many people can do it?
If you think about the data science revolution that is out there and people talk about, it's usually about making the impossible possible. That's essentially where all these exciting self-driving car and robo-advisor applications come in. But that part of the revolution is usually done by a handful of experts, and essentially, it also solves only a handful of problems out there.
The hidden part of this revolution, in my opinion, is about how do you enable many, many different people to work with data and essentially do data science. This is all about making the possible more widespread, and here's where you can think about dashboards, simple forecasts, reports, and data insights. What we're going to be talking about in today's session is, how do you kind of enable this hidden revolution of data science, which is actually a lot more powerful, for a wide range of companies because, essentially, it enables more people within the company to be data-driven and data-focused.
So what do we really mean by scaling data science? At DataCamp, we tend to think about this in terms of four different levels.
Data Reactive: This is where no one really uses data in their daily work, and companies rarely report on data. Of course, these days, there's hardly a company that operates entirely at this level, although in some domains there are probably still companies that just go with intuition, gut feeling, and so on and so forth.
Data Scaling: This is where there are a few people who have the skills to analyze, report on, and present data. You would have observed in your own organizations, or the organizations you have worked for, that a few people, essentially a data science group, act as the gateway to data and insights.
Data Progressive and Data Fluent: Data fluent is essentially what every company aspires to be. A data fluent company is a company or organization where everyone knows how to access the data they need, and everyone knows how to process the data and create insights.
Remember, we're not talking about the hard-to-achieve, making the impossible possible. Here, we are talking about how we can get the entire company to be data fluent and use data to gather good insights. So how does an organization think about data scaling? There are many different ways to think about it. One of the most powerful ways, which we've developed over time at DataCamp, is to think about it in terms of a framework that we have coined "IPTOP."
IPTOP breaks scaling data science into five different levers. You have two fundamental levers at the bottom, Infrastructure and People. Essentially, there is no way you can scale data science without scaling infrastructure and data access and, from the people perspective, scaling data skills.
While these two are the core pillars, there are three other pillars that sit on top of these two pillars. Those are Tools, Organization, and Processes. In many ways, the organizations that truly excel at scaling data science build a solid foundation of infrastructure and people and ramp it up with tools, organization, and process.
In this webinar, we are going to look at a quick overview of the IPTOP framework. I'm going to be talking about each of these five pillars. I'm going to be giving a couple of examples, and the idea is to give you, in a nutshell, what you should think about in terms of scaling data science.
Let's start with the first pillar, Scaling Infrastructure. So what does infrastructure really mean in the data context? One way to think about this is in terms of this AI hierarchy of needs, from Monica Rogati. The AI hierarchy of needs, similar to Maslow's pyramid for human needs, essentially talks about the different nested levels of how data science works in any organization. At the bottom of the pyramid, you have data collection. As you keep going further, higher and higher, you're going to get to AI, deep learning, and all these applications.
Essentially, infrastructure is what enables you to start at the bottom of the pyramid and advance all the way to the top. In other words, infrastructure is really, really critical. In fact, I would argue that it’s something that every organization needs to start thinking about before really thinking about doing data science. Because in the absence of infrastructure, your data scientists won't have any clear path to access the data or move up this hierarchy of needs.
Let's talk about scaling infrastructure. So what does scaling infrastructure really mean? Data science infrastructure, in our opinion, is essentially something that should guarantee easy and reliable organization-wide access to data and enable everyone in the organization to share their insights through dashboards or reports. The data science infrastructure should power this whole process so that people can just focus on the analysis and not have to worry about the boilerplate typically associated with moving data across the pipeline. So let us now look at how to think about data infrastructure.
Now, this is a highly abstracted view of data infrastructure at a company. If you remember the AI hierarchy of needs, this infrastructure diagram here essentially takes that hierarchy of needs and casts it into different buckets, right? If you remember, data science is all about how you kind of create insights and make decisions from data. In order to do this, your data needs to move across multiple layers.
On the very left, you have raw data collection. This is how organizations collect their raw data. Typically, raw data comes in many different forms and shapes. For example, you could be tracking clicks on your website. You could be tracking who's logging onto your website, who's signing in, right? So all this data gets stored somewhere. There are lots of application databases that have information about users. Let's say you sign up for a credit card; your information is captured in one of the app databases. And if you think about more recent developments, IoT and sensors, incredible amounts of data are collected all over the place.
One of the very first pieces in the data infrastructure puzzle is how companies should think about collecting the raw data. There are different tools and technologies available here that we will be talking about in greater detail. Once the raw data is collected, in order to analyze it, it's often really important that all of this raw data be pulled into a data warehouse, right? Why is this even needed?
One of the central arguments here is that data collection is often optimized for transactions. Data collection is not optimized for analysis. What do I mean by that? Typically, for analysis, let's say you want to understand: OK, once a customer clicked on this page, are they more likely to make a purchase or not? To analyze that, you essentially want a table that contains all the fields you need. But typically, the raw data is not collected this way. There might be a couple of databases that track your clicks and a different database that tracks purchases. It's incredibly hard to analyze raw data collected this way.
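To make the transaction-versus-analysis point concrete, here is a minimal sketch. The click and purchase records are invented for illustration; the point is that two transaction-shaped sources must be combined into one analytics-friendly table before a question like "do clicks lead to purchases?" becomes answerable.

```python
from collections import defaultdict

# Hypothetical transaction-shaped sources: clicks and purchases live in
# separate stores, as they typically would when data collection is
# optimized for transactions rather than analysis.
clicks = [
    {"user_id": 1, "page": "pricing"},
    {"user_id": 2, "page": "pricing"},
    {"user_id": 2, "page": "docs"},
]
purchases = [
    {"user_id": 2, "amount": 25.0},
]

def build_analysis_view(clicks, purchases):
    """Join the two sources into one analytics-friendly table:
    one row per user, with click counts and purchase totals."""
    view = defaultdict(lambda: {"clicks": 0, "spend": 0.0})
    for c in clicks:
        view[c["user_id"]]["clicks"] += 1
    for p in purchases:
        view[p["user_id"]]["spend"] += p["amount"]
    return dict(view)

# One flat table now relates click behavior to purchasing behavior.
print(build_analysis_view(clicks, purchases))
```

In a real warehouse, the same reshaping would happen in SQL or a processing notebook, but the idea is identical: denormalize transactional sources into one view per analysis question.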
The first step to analyzing the data is essentially the data warehousing layer. You could think about it as centralized data storage, and usually the idea is that you pull all your raw data into a data warehouse. Once again, there are many tools available. Some of the more popular ones here are Google Cloud, AWS, and Azure. Once data has been collected into centralized data storage, it's important that this data be processed to create views that will enable analysis. So, back to the question I raised earlier: for example, how does click behavior affect purchasing behavior? To answer those kinds of questions, you need to be able to combine the various pieces in your data warehouse and create datasets, which I would call analytics-friendly. A very important step in the journey of taking data from raw data to insights is data processing.
The idea here is, you take the data in the warehouse, combine it, and create these views, these datasets that are relevant to the organization and relevant for analysis. In this step of data processing, again, there are many different tools that are used. If you live in the world of Python, Jupyter Notebooks are typically a good way to process this data, or you could do it with Python scripts. You could use SQL. If you are an R user, you would use R Markdown or simply R scripts. Another candidate here is Polynote, which is a polyglot notebook built by Netflix. Long story short, data processing enables taking the raw data that has accumulated in the warehouse and creating analytics-friendly datasets.
The next step is data access. All this wonderful data has been curated by your data team. The next step is to make it easy for anyone in the organization to access the relevant data for analysis and insights. Again, this can be done with tools like R, Python, Jupyter Notebooks, and R Markdown; they all play a role here. There are also cloud-based tools like Google Cloud or AWS. Again, you will see these usual suspects in many places.
The final piece in this whole puzzle is, I would argue, why people do data science and why people need to be data-driven. If you just stop at data access, it doesn't really create any value for an organization. The value lies in insights. Insights, again, come in different forms and shapes. An insight could be a dashboard that tracks some metrics. It could be a deployed machine learning model that enables people to make decisions. It could be just a knowledge repository with shareable analyses, and this is essentially where the data journey ends.
Here again, there are many different tools that allow this layer of work.
Commonly used: R, Python
Dashboarding perspective: Shiny is a powerful framework for R. Streamlit and Dash are really powerful dashboarding frameworks in the Python world.
Low code environment: Power BI, Tableau, Excel.
These tools enable people to take data in an organization and transform it into actionable insights. This is a pretty simplified view of how data flows through an organization. As I explained, the goal is to give you a full perspective of what data infrastructure typically involves at a company.
Scaling data science, or scaling data infrastructure, means that you need to think through each one of these layers and think through design decisions. Like, what are you going to use to collect your raw data? How are you going to move it to centralized storage? How are you going to process it? What tools are going to be used to do that processing, right? So it's really important to think about this from the get-go so that you're able to build infrastructure that will scale as your organization evolves.
This is essentially a view of the tools that we use at DataCamp. For example, our clickstream data is collected in a tool called Snowplow. So every time you're doing a course, and you're clicking on a button, that data gets stored in Snowplow. When you're doing courses, and you're submitting exercises, your responses are kind of logged into a Postgres database. Our accounting team, for example, uses Google Sheets for information. Our content team uses Airtable.
As many organizations do, we have data in all shapes and forms, and we move this data around using Airflow, which is a tool for scheduling data pipelines. We use Amazon Redshift as a data warehouse. We're big users of R Markdown and SQL to process data. Once data is processed, again, we access the data from Redshift and Amazon S3. And finally, we use Shiny dashboards with R to share our insights with the company. We also use Metabase, which is an open-source database access tool.
We'll be talking a lot more about each of these layers in detail in the next webinar. If you have questions, definitely feel free to raise them, and I will take them at the end of the session. So, this is all about data infrastructure. Once you have data infrastructure, the next challenge within your organization is: how do we enable discovery?
Discovery here is all about the fact that most organizations have so much data out there. Even if you have good infrastructure, it's critical to make sure that people know what there is to find and can access it. So along with infrastructure and enabling data access, it's critical to enable discovery. This discovery could be about data, about insights, about dashboards. As an organization grows, it's even more important to curate these well. So how does data discovery typically work? Well, here are a couple of examples.
So at DataCamp, we're big users of Shiny. We built our own Shiny dashboard that surfaces all the common datasets that people use. People can search based on tags, on descriptions, or on columns. People can get a sense of whether a dataset is used by a lot of people, which in a way speaks to the quality of that particular dataset. We also place a high emphasis on documentation: every column, every field, every table is documented clearly, so people can find what they're looking for.
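As a rough illustration of the kind of discovery tool described here, this sketch searches a small catalog by tag, description, or documented column. The dataset names, tags, and columns are all invented; a real catalog would sit behind a dashboard and a proper search index.

```python
# Hypothetical dataset catalog: each entry carries tags, a description,
# and documented columns, mirroring the discovery dashboard described.
CATALOG = [
    {"name": "course_completions", "tags": ["learning", "core"],
     "description": "Weekly course completion events",
     "columns": ["user_id", "course_id", "completed_at"]},
    {"name": "marketing_spend", "tags": ["finance"],
     "description": "Monthly ad spend by channel",
     "columns": ["month", "channel", "spend"]},
]

def search_catalog(query, catalog=CATALOG):
    """Return names of datasets whose tags, description,
    or column names match the query (case-insensitive)."""
    q = query.lower()
    return [d["name"] for d in catalog
            if q in d["description"].lower()
            or any(q in t for t in d["tags"])
            or any(q in c for c in d["columns"])]

print(search_catalog("course"))   # finds the completions dataset
print(search_catalog("channel"))  # finds the spend dataset via its columns
```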
Now, there are several tools out there. Again, open-source tools are available. We will go over those in the next session. But essentially, data discovery is a very critical layer.
Now that we have talked about the first fundamental pillar of scaling data science, Infrastructure, let's turn to the next pillar, which, in my opinion, is probably even more important: People. And what does scaling people really mean? Scaling people here is essentially all about: how do you scale the data skills that people have? There are, again, different ways of thinking about scaling.
One way to think about it, again, is to mirror what you see on the data infrastructure side. If you look at how data flows through an organization and map it to essential skills, the picture becomes a lot clearer. For example, if you need to enable raw data collection and data warehousing, you need people who have data engineering and programming skills. Those are the skills that allow scaling and enabling raw data collection and data warehousing. Similarly, if you look at data insights all the way to the right, the skills required to create good insights are typically data manipulation, machine learning, data visualization, probability and statistics, and reporting. So, mapping infrastructure to these skills is a really good starting point.
Once you map these skills, you can think about the goal of scaling people, right? It's basically moving from a few data experts to organization-wide data fluency. One way to do this is to think about it as a four-step process.
Identify the data personas: So it's very common to kind of hear, like, OK, business analyst, data scientist, data engineer, but it's really important to go a step beyond and really identify data personas.
Map the skills by persona and role.
Measure competencies. In other words, it's saying, hey, these are the skills required. How does the team measure against these skills?
Come up with a personalized learning path: one that takes people from the skills they currently have to the skills they need for their role.
So scaling people, the first step is identifying the data personas. What is a data persona? Again, there are many different ways to think of that here. One way to look at it from a pretty high level is to kind of think about people in your company as in four different buckets.
Data consumers: these are usually people who have a nontechnical role but still need to understand and interact with the data.
Data analysts: these are usually domain-specific. They are all about helping and enabling business decision-making. Again, these roles are typically named differently: business analysts, marketing analysts, data analysts. You have all these different naming possibilities.
Data scientists: these are essentially the people who use advanced analytics to answer business questions.
Leaders and managers: these are the people who consume analytics and essentially make strategic decisions.
When you think about scaling data skills, it's really important to bucket people into these four personas. Of course, some of these personas can be broken out into more granular frameworks. For example, you could be a data scientist doing more statistical work, or a data scientist doing more machine learning work, right? So the clearer you are about the persona, the easier it is to approach scaling.
So let's take the next step here. Once you identify the personas, the next step is to map the skills by role. The whole idea here is that, for somebody to be doing a good job as a business analyst, these are the skills they need for their role.
This is a mapping that we developed at DataCamp. For example, if you're a data scientist, the skills that you typically need are: you need to be good at data manipulation and advanced statistics, and you need to know a certain amount of programming. Whereas if you are, let's say, in a supporting department like finance, accounting, or marketing, you need to know how to do reporting and understand some basic statistics. Mapping out skills by role is a critical next step.
Once you've mapped these roles, the next question is to see to what extent people in these roles have the skills they need. Here it's both an issue of depth and breadth. For example, the people you hire for a particular role may need more breadth: they may need to gain skills in areas where they currently don't have them. Secondly, depth: they need to become better at the skills they already possess. This is critical to achieving scaling of data skills. How do you do this?
So the way to do this, to measure competencies, is essentially two steps. You assess skills using different frameworks. Once you assess these skills, you identify the gaps. For example, if someone needs data manipulation, machine learning, and statistical skills, once the skills are assessed, there's a question of skill gaps. Like, OK, hey, you're really good at machine learning, you're really good at data manipulation, but you need to upskill yourself on the data engineering side of things.
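The assess-then-identify-gaps step can be sketched in a few lines. The role, the skill names, the 0-100 scale, and the example scores below are all invented for illustration; a real assessment would feed measured competencies into the same comparison.

```python
# Hypothetical skill bars per role (0-100 scale, invented numbers).
REQUIRED = {
    "data scientist": {"data manipulation": 80,
                       "machine learning": 70,
                       "data engineering": 50},
}

def skill_gaps(role, assessed, required=REQUIRED):
    """Return {skill: shortfall} for every skill where the assessed
    level falls below the role's required bar."""
    return {skill: bar - assessed.get(skill, 0)
            for skill, bar in required[role].items()
            if assessed.get(skill, 0) < bar}

# Example assessment: strong at ML and manipulation, weak in engineering.
scores = {"data manipulation": 85, "machine learning": 75,
          "data engineering": 30}
print(skill_gaps("data scientist", scores))
```

The output of the comparison is exactly what a personalized learning path would be built from: the skills with a shortfall, and how large each shortfall is.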
So here is where the fourth step comes in, which is personalized learning. In other words, the assessment is geared towards figuring out what a person needs to learn to achieve the goals of their role and do their job better. Here is where we advocate for personalized learning. Once an organization has wide visibility over data competencies, it's important to create a personalized learning path for each role. Creating a personalized learning path, again, can be thought about in two steps.
One, of course, is to provide access to education. For example, DataCamp hosts a lot of courses, a lot of projects, a lot of different ways to learn a topic. The other is to create a culture of continuous learning. I cannot emphasize enough the importance of this. The more people learn, the better they become, and the more efficient they become. The more efficient they are, the more they can contribute to the organization, and the more they contribute, the more your organization as a whole can achieve.
Personalized and continuous learning is really, really critical to scale data science in any organization. So at the end of this, if you've moved all the pieces correctly, you get a data fluent organization: an organization where everybody in the company has the skills they need to be data-driven and make decisions. We'll talk more about this in the third webinar, where we will talk about people.
So, those are Infrastructure and People. The first of the three pillars on top of them is Tools, so we're talking about tools now. What exactly are tools? If you think about the data science workflow, you can think about it as: you import the data, clean and store it, process it, explore it, build models, deploy them, and communicate results. If you think about this process, there are a lot of tasks that are highly repetitive. As an organization, you will find that if more than three people are writing the same kind of code to do an analysis, it usually means repetition.
Data science tools are all about recognizing this 80/20 in work and creating tools that sit on top of your infrastructure and enable people to handle these in a clean way. I'll give you a couple of examples here. So for example, at DataCamp, to import the raw data, which is sitting in various databases, typically, people would have to write multiple lines of code to connect to the database, essentially, query the right table, and bring the data into either SQL or R or Python.
And what we did is we built our own packages, datacampr and datacamp-py, that abstract all of this away into a single line of code. This provides huge leverage because instead of many people writing many, many lines of code, we now have people able to write a single line of code. The same goes for processing data and for creating business metrics, and we created a package for dashboarding. So essentially, tooling is all about creating frameworks and abstractions that sit on top of your infrastructure. This is not just for programming-language-based workflows, either. It could be for Excel, where you could have templates. If you're using Tableau, it could be dashboarding templates that get people up and running fairly fast. Here is an example around metrics.
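The datacampr and datacamp-py packages mentioned here are internal to DataCamp, so this is only a sketch of the general idea, using an in-memory SQLite database as a stand-in for the warehouse: hide the connect/query/fetch boilerplate behind one call so that pulling a table becomes a single line for the analyst.

```python
import sqlite3

def load_table(name, conn):
    """One-line table access: return all rows of `name` as a list of
    dicts, hiding the cursor and row-conversion boilerplate.
    `name` is assumed trusted here; a real package would validate it."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {name}").fetchall()
    return [dict(r) for r in rows]

# Demo against an in-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, plan TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "pro"), (2, "free")])

print(load_table("users", conn))
```

The leverage is exactly as described in the talk: the connection handling, querying, and row conversion are written once, and every analyst afterwards writes one line instead of a dozen.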
One of the common types of questions the data science team gets asked over and over is: I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track. Before we had a tool or a framework, everyone had to write lots of code to accomplish this for a wide range of questions.
But given the commonality, that all of these are questions where somebody wants to track a number over a period, broken down by some dimensions, you can build tools, and that's exactly what we did. We built a tool that reduces answering these questions to just a few lines of code. As an added bonus, it allows people to create dashboards. This is an example of the dashboard that we provide for our instructors, which allows them to visualize their courses; building these tools on top made all of this a lot easier. So in my opinion, it's really important to think about tools: they really supercharge the whole data analytics workflow. We'll be talking more about tools in our second webinar.
Finally, we're going to talk about the last two levers: scaling organizations and scaling processes. So, yes, you have people, you have infrastructure, you have tools. But for any organization to achieve its goals, it's important to think about how people are organized, especially the people who are working with data. There are two common models that we've seen in practice.
Centralized model: this basically means a central data science team that takes questions from all the various departments and answers them.
Decentralized model: here you have the data scientists embedded within each department, sitting a lot closer to the decision-making unit.
The big advantage of a centralized model is that you have all the data scientists working as a center of excellence.
The data science manager has domain knowledge, and it's easier to move resources around.
It's easier to build tools.
On the downside, it limits coordination between the stakeholders and the data scientists.
There is a risk of misalignment between the units.
I think the most important risk is that the data science team can be isolated as a support function. So people just see it as throwing it over the wall. Hey, I need these questions answered; give me the answer. So, there is no feedback loop that is in place.
The decentralized model essentially alleviates all the risks associated with the centralized model. But pretty much all the advantages of the centralized model have corresponding issues in the decentralized model.
It's much harder to move resources between teams, since you don't have a data science center of excellence.
There is a possibility of repetition across teams.
It also creates incentive incompatibility at times.
So, what's the model that you'll usually see work best? It's the hybrid model. In the hybrid model, you have a core data science team that functions as a center of excellence, and then you have data scientists distributed across the departments. The idea is that the hybrid model takes the advantages of the centralized model and adds on top of them the advantages of the decentralized model.
Of course, depending on your company, the size of your core data science team versus the size of the embedded data science teams is an important lever that can then be decided.
Let's come to the final piece in this whole puzzle. We talked about infrastructure and people, the two fundamental pillars, and we built on top of them tools and organization. The final piece is probably not the most interesting piece for anyone; processes are usually the ones that get glossed over. But I cannot emphasize enough the extent to which processes can really, really help you scale. So once again, what does scaling processes really involve? It can be divided into four parts:
Define a project life cycle
Standardize project structures
Share knowledge
Embrace version control
So scaling data science and scaling processes is all about standardizing things. You're taking some decisions out of the equation, like creating templates and boilerplates. While there are many things to think about in scaling processes, the four fundamental ones, in my opinion, are defining a project life cycle, standardizing project structures, sharing knowledge, and finally, embracing version control.
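As a small illustration of what standardizing project structures can mean in practice, here is a sketch that lays out the same skeleton for every new analysis project. The directory names are an assumed convention for illustration, not a DataCamp standard.

```python
from pathlib import Path
import tempfile

# Assumed standard layout: same skeleton for every project, so anyone
# can pick up a colleague's project and know where things live.
SKELETON = ["data/raw", "data/processed", "notebooks", "src", "reports"]

def create_project(root):
    """Create the standard project layout under `root` and return the
    relative paths that were created."""
    root = Path(root)
    for sub in SKELETON:
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "README.md").write_text("# Project overview\n")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    print(create_project(tmp))
```

Tools like Cookiecutter templates serve the same purpose at larger scale; the point is that the structure is generated, not reinvented per project.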
So, revisiting the data science workflow: what this first step, this first lever, means is that you need a clear life cycle for a project. In other words, where does a project start? Where does it end? Who are the stakeholders?
There are many different frameworks that support this. One example is a framework from Microsoft called the Team Data Science Process. As you can see here, the steps basically talk about understanding the business problem and acquiring the data, followed by modeling and deployment happening in tandem, and then a couple of other layers. Now, there is no one framework that's going to be appropriate for every company. It's really important to build on top of existing frameworks and think about a project life cycle of your own.
The next part is about standardizing project structures. This is actually something that's not often enforced in companies, because everyone has their own way of working. And while, yes, flexibility is good, if you're thinking about scaling, the idea should be that, at any point, you should be able to have someone take over a project that someone else has been working on. One really critical way to ensure that everyone can get up to speed fast is a standard project structure and standardized access to resources.
The third part is about embracing version control. If you are a software engineer and have done this for a long time, you realize how powerful version control is; it's one of the things that has really changed how people work, especially software engineering teams. Data work is, at the end of the day, also like software engineering in some respects. It's really important to embrace version control to allow people to collaborate. It will allow you to make changes without really having to worry about things just falling apart. There are excellent tools available to achieve this.
On the slide, you'll see a screenshot of what's called a commit, or a pull request, where we changed certain things in the code. Someone who's looking at it can clearly see what changed, and it allows you to track things back if something breaks. At this point, we've talked about all five different layers. We'll be talking more about these layers in future webinars, but now I would like to open up the arena for questions.
Question: When you think about scaling different levers of organizations in order to be able to become more data fluent, how would you imagine ownership of these different levers? Should it be a top to bottom approach or a bottom-up approach, in your opinion?
Answer: That's an excellent question, Adel. For any initiative to succeed, I think it has to be top-down. You also need a champion; the higher the level, the better. So, ideally, a C-level executive. And the reason is that the whole idea of scaling data science is to make your organization data-driven and data fluent. Any initiative that doesn't have a champion at the highest level is going to fail at the last layer, using data to make decisions and create insights. So I think it has to be C-suite driven. Usually, you'll set up a champion there, and then, of course, you have someone in middle management, like a manager or a head of data science, who is able to execute and build everything.
Question: When you think about how to scale data science, what do you think are the major obstacles or the most challenging lever to be able to scale in this situation?
Answer: Again, an excellent question. Usually, when organizations start to do data science, many organizations start by hiring data scientists. One of the issues with that is a data scientist needs to be able to access data to be able to do anything, right? So, infrastructure actually needs to come before data science. I think one of the key obstacles is not thinking through the order in which you need to put things in place. That's why, if you think about the framework, infrastructure comes before people, in a way. Then, of course, you need people to build the infrastructure. So it's really critical to think that through. Not thinking it through can result in frustration, where you have a data scientist who says, OK, I was hired to do analysis, but I don't know where the data sits; I can't access the data; it's too big to download and use on my laptop. All of these kinds of issues then make things pretty frustrating. I think it's really important to plan the whole thing out carefully, plan the various levers out carefully, and have a clear articulation of what the objectives are and what the vision is.
Question: Between Jupyter Notebooks, Markdown, and SQL, which one do you think scales better for complex data processing workflows?
Answer: Again, an excellent, excellent question, and one that I spend a lot of time thinking about. I would split this question into two parts. One part is the tool you use to process the data: R, Python, SQL, Looker, Tableau, and so on; these are all great tools. Then you have the other layer, which is notebooks. Jupyter is a notebook that allows you to run R, Python, SQL, et cetera, and R Markdown works in much the same way. Now, I'm probably a little biased here, but I would strongly argue that notebooks are a great way to do this work in a clear and consistent way across roles. At a company like Netflix, for example, pretty much everything done on the data side is done with notebooks; there are some excellent articles about it. So at the end of the day, I would argue that notebooks are a really good way to do this. Whether you use R, Python, or SQL within the notebook is a matter of each data scientist's preferred tool.
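To make the two layers concrete, here is a minimal sketch of the kind of cell-by-cell workflow a notebook supports: a "SQL cell" that aggregates close to the data, followed by a "Python cell" for downstream analysis. Python's built-in sqlite3 stands in for a real data warehouse, and the table and column names are made up for illustration.

```python
# Notebook-style workflow mixing SQL and Python. sqlite3 is a
# stand-in for a real warehouse connection; the schema is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (day TEXT, plan TEXT, count INTEGER);
    INSERT INTO signups VALUES
        ('2021-01-01', 'free', 120),
        ('2021-01-01', 'pro',   30),
        ('2021-01-02', 'free', 150),
        ('2021-01-02', 'pro',   45);
""")

# "SQL cell": aggregate in the database, close to the data.
rows = conn.execute(
    "SELECT plan, SUM(count) FROM signups GROUP BY plan ORDER BY plan"
).fetchall()

# "Python cell": downstream analysis on the query result.
totals = dict(rows)  # {'free': 270, 'pro': 75}
pro_share = totals['pro'] / sum(totals.values())
```

The point is not the specific tools but the pattern: each role can work in the language it prefers, while the notebook keeps the whole analysis in one shareable, reproducible document.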
Question: How would you suggest assessing the skills of these personas, and with what tools?
Answer: At DataCamp, we think about this with a product called Signal. Signal is a highly adaptive way to assess skills: you answer a series of questions, and the difficulty of each question adapts based on how you answered the previous ones. The whole idea is to arrive at a skill level and identify gaps. So this is one way to assess skills, and I'm sure there are other tools in the market that also give you an objective assessment of a person's skill level.
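Signal's internals aren't described here, so the following is only a toy illustration of the general idea of an adaptive assessment: a bisection-style staircase where each answer moves the next question's difficulty toward the test-taker's skill level. The scoring scale and step sizes are invented for the example.

```python
# Toy adaptive assessment (hypothetical; not how DataCamp Signal
# actually works). Each answer shifts the difficulty of the next
# question toward the test-taker's skill level.

def adaptive_assessment(answer_fn, n_questions=10, start=50.0, step=16.0):
    """Estimate a skill score on a 0-100 scale.

    answer_fn(difficulty) -> True if a question at that difficulty
    was answered correctly. `step` halves after each question, so
    the estimate converges like a bisection search.
    """
    estimate = start
    for _ in range(n_questions):
        correct = answer_fn(estimate)
        # Correct answers make the next question harder; wrong ones easier.
        estimate += step if correct else -step
        estimate = max(0.0, min(100.0, estimate))
        step /= 2
    return estimate

# Simulate a test-taker with true skill 70: they answer correctly
# whenever the question difficulty is at or below their skill.
score = adaptive_assessment(lambda d: d <= 70, n_questions=10)
```

Real adaptive tests use item response theory rather than a fixed staircase, but the principle is the same: a few well-chosen questions locate a skill level far faster than a fixed quiz.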
Question: How do you think about prioritizing the work that data teams take on? For example, we've discussed that top-down sponsorship is important, but projects often hit technical or data hurdles that are difficult to solve. How do you approach solving and prioritizing data tasks for the data team?
Answer: An excellent question. While executive sponsorship is really critical, this is where middle management plays a really important role, just like in any other task. A lot of middle management's job is to balance technical complexity and on-the-ground realities against what C-suite executives want from an aspirational standpoint. The critical layer here is clear communication, and a strong middle-management layer of people who understand the complexities and can dive in and get their hands dirty when they need to go deeper, while at the same time managing expectations with higher management. Say we're trying to build a recommendation system and we hit some technical hurdles. The manager's role should be to ask: if we need to complete this in the next month, what is realistic to achieve? Then descope the work a little and communicate to management: if you need it wrapped up in a month, you won't have these features; with one more month, we can add them. Then management can make those decisions. At the end of the day, it's about clear communication and a solid understanding that there is always going to be a tradeoff between time and what you can achieve. The better you are at understanding that tradeoff and communicating it, the less frustration there is for everybody involved.
Question: What common issues or bottlenecks do you see organizations face when they run data transformation programs to become data fluent?
Answer: Again, a very interesting and very important question. I can think of a couple of ways to look at this. The first bottleneck is not having a clear articulation of what it means for the company to be data fluent. It's really important to start with the end in mind: if we did become data fluent, what would we want everybody in the company to be able to achieve? That's one of the very first fundamental bottlenecks I see. The second issue is that, given there are both technical and non-technical aspects, it's really important that a data strategy is inclusive and diverse. For example, if you're building tools for data access and you only provide code-based access, you're essentially alienating a group of people. For success, an organization needs to think about the consumers of data: how do you enable them? What would be the best way for them to access the data, and to make sense of it? You especially hear this in all the R versus Python language wars; not respecting the fact that different people like to do things differently is another really important issue. The organizations that thrive are the ones that recognize this and enable access in multiple forms, inclusively. I think these are the two top issues to solve in order to make an organization data fluent.
The intersection of emerging technologies like cloud computing, big data, artificial intelligence, and the Internet of Things (IoT) has made digital transformation a central feature of most organizations' short-term and long-term strategies. Data is at the heart of digital transformation, providing the capacity to accelerate it and reap its rewards ahead of the competition. Thus, a scalable and inclusive data strategy is foundational to successful digital transformation programs.
In this series of webinars, DataCamp’s Vice President of Product Research Ramnath Vaidyanathan will go over our IPTOP framework (Infrastructure, People, Tools, Organization and Processes) for building data strategies and systematically scaling data science within an organization.
The first part of this three-part webinar series will provide an overview of the IPTOP framework and how each element in the framework fits together to enable scalable data strategies. The second and third sessions will take a deeper look at each element of the IPTOP framework, covering best practices and best-in-class industry examples of how to scale infrastructure, the tradeoffs of adopting different organizational structures, key data roles for the 21st century, and more.
Fill out the form to access the webinar recording, slides, and extended Q&A video.
Data Science Evangelist at DataCamp
VP Product Research at DataCamp