View the webinar slides.
Scaling Data Science at Your Organization: Part 1 Recap
Before we get to tools and infrastructure, I wanted to give a quick recap of what we saw in the last webinar. We talked about the hidden data science revolution. When you talk about data science and what it's made possible, you could think about it in two different ways.
One is that it's made a lot of impossible things possible: anything to do with artificial intelligence, such as DeepMind's Go-playing computer and self-driving cars.
There's also a hidden data science revolution that doesn't get talked about as much, and that is all about how to make the possible more widespread.
This is essentially enabling a larger chunk of people in your organization to do data-driven work, leading to things like dashboards and forecasts. The whole idea of scaling data science in an organization is about how you can enable more people to be data-driven. While there are many different ways to think about it, in our last webinar, we introduced the IPTOP framework.
The IPTOP framework means there are five levers to scale data science. The two fundamental ones are Infrastructure and People. We'll talk about infrastructure in detail in today's session. On top of Infrastructure and People, there are three other supporting pillars: Tools, Organization, and Processes.
In today's webinar, we are going to focus at length on Tools and Infrastructure. In a way, you could view these two as the technical levers, whereas People, Organization, and Processes are the human levers to scale data science. In today's session, we're going to be expanding on what we discussed when it comes to tools and infrastructure. Let's start with what infrastructure is and what scaling infrastructure means. To understand that, we need to understand what data infrastructure really means.
If you think about the AI hierarchy of needs, by Monica Rogati, it is similar to Maslow's pyramid of human needs. The hierarchy of needs talks about what needs to happen in order to apply AI in an organization. At the very bottom, you have the collection of data through logging and sensors. Moving up, the data gets aggregated, and all the way at the top you have the application of deep learning, advanced ML, and statistical algorithms to make sense of it. For any organization to be successful in navigating this pyramid, it's important to put a solid data infrastructure in place.
One way to think about data infrastructure is that data infrastructure is the fundamental building block that allows data to move from the bottom of the pyramid, all the way to the top of the pyramid.
Enabling data access
The first part of scaling infrastructure is enabling data access. To move through the pyramid, a company needs to ensure that everybody in the company has access to the data they need in an easy and reliable way. How do we go about making this possible within an organization? This is a highly simplified, although reasonably accurate, view of how data typically moves through an organization.
Raw data collection
Raw data collection is how data gets collected in a company. The thing these days is that data comes in all sizes, shapes, and forms. No matter what company you look at, data gets stored in databases. There's also streaming data: for example, every click that somebody makes gets stored somewhere.
On top of that, there's always ad hoc data that every company has: for example, data logged in Airtable, Excel, or Google Sheets. The first important step in any data infrastructure is making sure that the data you collect is stored using the right platform.
The reason this matters is that when you have things in one place, and that becomes a single source of truth, it's a lot easier to combine various sources of data to do analysis. For example, let's say you have accounting data in Excel sheets and product data in a SQL database, and you want to do an analysis that combines both of these. If they stay in Excel and SQL separately, the analysis is a lot harder, whereas bringing them into a data warehouse and making it the single source of truth makes it a lot easier to handle these kinds of questions.
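To make this concrete, here's a minimal sketch of what a single source of truth buys you, using an in-memory SQLite database as a stand-in for a real warehouse. The table and column names here are invented purely for illustration:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the table and column
# names are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounting (customer_id INTEGER, revenue REAL)")
conn.execute("CREATE TABLE product (customer_id INTEGER, courses_completed INTEGER)")
conn.executemany("INSERT INTO accounting VALUES (?, ?)", [(1, 99.0), (2, 150.0)])
conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, 4), (2, 7)])

# Once both sources live in one warehouse, combining them is a single join,
# instead of a manual export-from-Excel, import-into-SQL dance.
rows = conn.execute(
    """
    SELECT a.customer_id, a.revenue, p.courses_completed
    FROM accounting a
    JOIN product p ON a.customer_id = p.customer_id
    ORDER BY a.customer_id
    """
).fetchall()
```

The point isn't the specific database; it's that once both datasets live in the same system, a one-line join replaces a cross-tool data wrangling exercise.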
Data warehousing is a pretty important step for any organization in their whole data infrastructure and data access endeavor. There are multiple tools that people typically use; the top three usual suspects are Google Cloud, AWS, and Microsoft Azure, which have a pretty high market share among data warehousing tools.
Data processing is all about using the data in the warehouse, and combining them to answer various questions. I previously gave you an example of accounting data in Excel and product data in SQL. The data processing step basically means combining these datasets to create something that's more analytics-friendly. The reason this becomes critical is because when data is collected, it's usually optimized for transactions and collection. Analysis is like a whole different ball game, and analysis almost always involves combining multiple data sets. There is hardly any analysis that you will see done in an organization where you can rely on a single table.
Data processing is a critical step in the flow where you take different datasets in your warehouse and combine them to make them analytics-friendly. Again, there are multiple tools that people use; R and Python are two of the most common coding tools. Of course, you have the notebooks built on top of R and Python: Jupyter notebooks, R Markdown, and Polynote. These are all the code-based tools. There are also a lot of low-code tools that have come about these days that can be used for this step.
Once we have the data processed, the next important step is to make sure that this data is accessible to people in the organization. The important thing to keep in mind is that people access data in different ways. If you're a data scientist who's comfortable working with R, then you're more likely to access data from R. Whereas if you prefer low-code tools, then you might prefer more of a graphical user interface.
So the key thing when you think about data access is to think through the different groups of people who want to access data and make it available the way they would like to access it. I cannot emphasize enough how important this step is. For example, if the only way to access data is writing SQL queries, then you're excluding a lot of people in the company. To make sure that there is inclusion, and people are able to access the data, it's very important to think through these tools in a very clear way.
All these steps are important, but if you really think about it, the reason why companies invest in infrastructure is because they want insights. Here again, there are like a bunch of different tools that companies typically use. So you have R, Python, Power BI, Tableau. There’s a whole bunch of dashboarding tools.
The critical aspect here is to think about this flow end to end. If you don't get all the way to insights, then the whole purpose of what you're doing is defeated. So this is how data typically flows through an organization.
How data flows through DataCamp
To give you an illustration of how things happen at DataCamp: again, we follow this flow, though the tools we use are slightly different. For data collection, we use all the conventional tools, including Snowplow for collecting clickstream data. We use Amazon Redshift for our data warehousing, and we're big users of R Markdown and SQL.
For our data insights, again, we're big users of R, with Shiny as a dashboarding framework to provide insights into our data. No matter what, I think data infrastructure should always start with thinking about these five aspects:
How are you going to collect your raw data?
How are you going to warehouse your data?
How are you going to process your data?
How are you going to access your data?
And finally, how are people going to derive insights from that data?
So while this is a highly simplified schematic, I think in the real world you can always map different elements of a data infrastructure strategy to this flow.
Let’s talk about another really important element in this whole game. If you really think about moving data across the collection, warehousing, processing, access phases, you're also dealing with incredibly large amounts of data. Typically, when you're combining datasets, you have dependencies in place.
Let's take the example of the product and accounting data again. For that combined analysis to be done, the data in the Excel sheet needs to be pulled into the warehouse first, and the data from the product database needs to be pulled into the warehouse. Only after both of these happen should any query that builds an analysis dataset from these two sources be run. There are a whole bunch of tasks that need to be done, but they all have dependencies. In other words, it's very important to make sure that you're running tasks only after their dependencies have run. This can get ugly and complicated very quickly.
Scheduling Tasks Using Pipelines
One of the tools that companies use when they think about the data infrastructure is scheduling tools. So here's an example. You can think about Task B, and Task C, as getting the Excel sheet and then getting the SQL database into the warehouse. Task D can be thought about as combining these two, to create an analytic-friendly dataset, and Task E could be thought about as pulling this dataset and displaying it in a dashboard.
How do you make sure that these tasks get executed in the right order? Also, you can parallelize a lot of these tasks. For example, if Task B and Task F have no common dependencies, then you could run them simultaneously so that everything finishes sooner. The tools that typically help here are called pipelining tools. There are two classes of pipelining tools to think about. One is the class of tools for data pipelines, which are all about taking data from the collection stage, to the warehouse, to processing, to the accessible state. The most commonly used tools are shown here.
Apache Airflow is one of the leaders in this area. It's an open-source Apache project. Then there is a tool called Luigi. This is built by the folks at Spotify. Here's another tool called Oozie. All of these allow you to specify your tasks as a pipeline. Once you specify your dependencies, these tools figure out what is the best order to run all these tasks, and how to optimize the whole process so that it runs in the least possible amount of time.
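At their core, these tools solve a topological-sort problem: order the tasks so that every task runs only after its dependencies. Here is a minimal sketch of just that ordering logic, using Python's standard-library graphlib; the task names are hypothetical stand-ins for the Excel/product example above:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the set of tasks it
# depends on. The two loads feed the analysis table, which feeds the
# dashboard refresh.
deps = {
    "build_analysis_table": {"load_excel_sheet", "load_product_db"},
    "refresh_dashboard": {"build_analysis_table"},
}

# static_order() yields one valid execution order; tasks with no common
# dependencies (here, the two loads) could also run in parallel.
order = list(TopologicalSorter(deps).static_order())
```

Real tools like Airflow layer scheduling, retries, and parallel execution on top of exactly this kind of dependency graph.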
Now, a new set of tools has also emerged recently. For a lot of companies, the data flow is one flow, but another flow they typically have is around machine learning models. Machine learning models often involve cleaning the data, feature engineering, model selection, model scoring, hyperparameter tuning, and all these elements.
So recently, there’s been a slew of tools that have come about that allow pipelining of machine learning workflows. These are two of the most popular I've seen out there. Again, these are both open-source.
Metaflow: Built by the folks at Netflix, who use it in a big way to schedule their machine learning pipelines.
MLflow: A tool built by the folks at Databricks.
Data Pipelines Using Airflow
The big message I want to give here is that when you think about your data infrastructure, it's really important to go with a solid pipelining tool. Without one, it's nearly impossible to negotiate the complexities involved in your data pipeline. For example, at DataCamp, we're big users of Apache Airflow. We use Airflow to schedule all our tasks.
This is just a schematic example. In this chart, there are tasks that collect data from database one, database two, and external APIs. These collected datasets get combined and further transformed. Then you have the reporting part, where you're building a report using the data, and then you're doing some alerting.
This is a highly abstracted example, but pretty much any pipeline will have many of these tasks. Airflow is one of the leaders in this space, and it's open-source, so I would definitely recommend checking it out if you're thinking about building a data infrastructure.
Data Infrastructure at Other Companies
Data Infrastructure at Airbnb
To give you more information, I thought it would be good to take a look at the infrastructure at a few more companies, some of the big ones, so that we can kind of map some of the similarities and differences. This is a highly simplified view of Airbnb's data infrastructure.
If you compare, data collection happens through two high-level elements: events get collected with Kafka, which again is an open-source Apache project, and there is also a lot of SQL data that gets collected. They pull all that data into Hive, which is another open-source framework, and they use Presto to query this data from the Hive clusters. At the end, once the data can be queried, access is through tools such as Tableau, Panoramix (which is now called Superset), and Airpal, which is a visual way to query Presto. They use Airflow in a big way, and that's why you'll see the Airflow scheduling layer on top; Airflow is the one that orchestrates all the various elements of the flow.
Data Infrastructure at Netflix
Here is again, a view of data infrastructure at Netflix; it’s a very highly simplified view. You can see the common elements, and they use Kafka as well for a lot of the event infrastructure. They use Teradata, and Redshift for data storage. They use Tableau, D3, and RStudio for a lot of data visualization. Similarly, they use Jupyter and Python for data exploration. For data processing, you have Presto, Spark, and Python.
One thing I want to emphasize here is that you'll typically see bigger companies use a wide array of technologies, and there is usually a reason why each tool is used. When you think about your own company or organization, in general, the fewer the tools, the better, because fewer tools are easier to maintain and easier to handle. It's very easy to go about this by chasing the best tool out there and changing your infrastructure every time a new tool comes into the market. But I think it's very important to stick to fundamentals: if a tool satisfies what you need, and you think it's going to keep satisfying your needs for a good amount of time, then it's important to stick with it and not switch. I just wanted to give that piece of advice here, because all these various tools can feel intimidating and overwhelming sometimes, so it's important to keep that in mind.
Data Infrastructure at Uber
This is basically the infrastructure of Uber. They have their own custom platform called Michelangelo for machine learning work. This is a schematic of how they orchestrate their models. The one reason I wanted to put this in here is to show that typically, there is also a machine learning pipeline in companies that use machine learning in a big way.
Here, you see the machine learning pipeline is basically getting data, training, evaluating, and deploying. This is highly online, or in other words, real-time processing. It's important to choose infrastructure in the right way that suits the job.
Enabling Data Discovery
We've talked about data access, and we've talked about data infrastructure. As important as it is to enable data infrastructure, it is equally important to make sure that people in an organization are able to find the data that they want. It's not just important that the data is there; people need to be able to find it. That brings me to the next important part of data infrastructure, which is tools that allow data discovery.
Data Discovery at DataCamp
Data discovery is basically all about making sure that people are able to search through datasets, find datasets relevant to their departments, and look at the popularity of datasets. In other words, it's like a search engine where people should be able to discover, and find what they need. At DataCamp, we have a fairly simple approach.
So, we have a Shiny app, backed by a data preprocessing step that pulls together all the tables, their documentation, and where each table is used. The user is presented with a searchable table with controls, where they can find the datasets they need based on column names, tags, owner, or the description in the documentation. Our approach has been fairly simple: a centralized data dictionary and a lot of documentation.
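Our actual app is built in R and Shiny, but the underlying idea is simple enough to sketch in a few lines of Python. The catalog entries and field names below are invented for illustration:

```python
# A toy data dictionary; the entries are invented for illustration.
catalog = [
    {"table": "courses", "owner": "content", "tags": ["product"],
     "description": "One row per course, with technology and topic."},
    {"table": "invoices", "owner": "finance", "tags": ["revenue"],
     "description": "Recurring revenue by customer and quarter."},
]

def search(query, entries):
    """Return entries whose table name, owner, tags, or description match."""
    q = query.lower()
    return [
        e for e in entries
        if q in e["table"].lower()
        or q in e["owner"].lower()
        or q in e["description"].lower()
        or any(q in t.lower() for t in e["tags"])
    ]

hits = search("revenue", catalog)
```

A centralized data dictionary is really just this: one searchable index over every table's metadata, instead of tribal knowledge scattered across teams.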
Data Discovery Tools
There are a number of tools available out there, and I've just listed a small set here. These are all data discovery tools, and many of them have come out recently.
Amundsen: created by Lyft
Datahub: created by LinkedIn
Metacat: created by Netflix
Databooks: created by Uber
Dataportal: created by Airbnb
This table over here gives you an overview of what each of these discovery tools supports. So, for example, one important aspect that a good discovery tool should support is data profiling.
This means that once you look at a particular dataset, it should quickly tell you what all the columns are and some properties of those columns. That allows you to understand the dataset a lot better, even without inspecting it in full. Similarly, I think top users and top queries are important. Search is one way you discover data, but you'll also want to know: what are the datasets that people in my department use the most? That lends a little more confidence that you're using the right data for the decisions that get made.
Finally, lineage, which is super important. If you look at the data flow that I described, by the time data gets into a form where it can be used, it's gone through a lot of transformations. So, it's very important to be able to answer the question: hey, I'm using this column from this dataset; where does it come from? This is a question that gets asked very commonly. It should be possible to trace the origin of a column through the whole pipeline until you can identify that it comes from this raw dataset and this raw column. While choosing a tool, it's very important to look at all these features that a good data discovery tool can support, and make your choices based on that.
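As a rough illustration of what data profiling means, here is a hypothetical sketch that computes missing counts, distinct counts, and a mean for numeric columns. Real discovery tools compute much richer statistics, and the dataset here is invented:

```python
from statistics import mean

# A hypothetical dataset: one dict per row. A None marks a missing value.
rows = [
    {"course": "intro-python", "completions": 120},
    {"course": "intro-sql", "completions": 80},
    {"course": "intro-python", "completions": None},
]

def profile(rows):
    """Minimal per-column profile: missing count, distinct count, and a
    mean for columns that are entirely numeric."""
    out = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        out[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            out[col]["mean"] = mean(present)
    return out

stats = profile(rows)
```

Even a summary this small tells you a lot about whether a dataset is trustworthy before you query it in full.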
Lyft's Amundsen is a pretty good tool. We don't use it yet, but we have plans to start using Amundsen because it provides a lot more features than our home-grown Shiny dashboard can. Here's a quick screenshot of Amundsen; this is what the interface looks like. One screen shows all the popular tables. Once you select a particular table, it gives you the full details: all the columns, the documentation for each column, who the frequent users of this data are, who owns the dataset, and what the tags are.
I would strongly recommend if you're really thinking about data infrastructure and data discovery, to explore the tools that I've described here. There's also a link to a blog post that does a slightly more detailed comparison into what all this means.
Now that we've talked about infrastructure, let's talk about tools. I did touch on tools when I talked about data discovery and data infrastructure, in the sense that you could think about everything as a tool. But what I'm talking about here are tools that you build on top of that infrastructure: custom tools that get built in order to support work and make things more efficient. I'll walk you through tooling and give you a number of examples to help you understand this better.
The data science workflow
So this is the typical data science workflow. For those of you who attended the last webinar, you would have seen this. The data science workflow usually starts with data access: you import your raw data, and then you process your data, cleaning and storing it. Then you get into a loop: you explore and visualize the data, you model it, and this loop keeps going on and on. Finally, you deploy and communicate the output. Now, if you really think about it, even if you have the best data infrastructure in place, and even if you have the best data discovery in place, there is almost always a lot of boilerplate involved in every one of these steps.
For example, if you want to import raw data from a database, you need to make sure that you're able to connect to the database, authenticate against the database, access the right tables, and bring the data into whatever tool you're using: R, Python, Alteryx, or any low-code tool as well. There's almost always boilerplate involved there. Similarly, if you are visualizing data and want to stick to corporate themes, then there is code that needs to be written to make sure that everything is styled accordingly. The goal of tools in an organization is to simplify this boilerplate and abstract away the most commonly solved problems so that everybody can benefit and be efficient.
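As an illustration of abstracting that boilerplate away, here is a minimal sketch of a connection helper, using SQLite as a stand-in for a real warehouse. In practice the settings would come from environment variables or a secrets manager, and authentication would be handled inside the helper too; everything here is hypothetical:

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical connection settings; a real tool would load these from
# a secrets manager or environment variables, not hard-code them.
SETTINGS = {"database": ":memory:"}

@contextmanager
def warehouse_connection():
    """Hide the connect/authenticate/close boilerplate behind one call."""
    conn = sqlite3.connect(SETTINGS["database"])
    try:
        yield conn
    finally:
        conn.close()

# Analysts then write only the query they care about.
with warehouse_connection() as conn:
    version = conn.execute("SELECT sqlite_version()").fetchone()
```

Every analyst writing this boilerplate themselves, slightly differently, is exactly the repetition that internal tooling should absorb.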
The data science workflow at DataCamp
For example, at DataCamp, our tools are primarily built in R. We have R packages that handle every element in this data science workflow. We have a package called datacampr and a Python equivalent that simplifies the access. It gives access to all the tables. For all the processing data we have a package called dcmetrics that does all the heavy lifting. We have packages for plotting and modeling as well as documentation. To emphasize, we use R, so we built our tools in R. It could be tools built on any other language. It could be Python tools or tools built on low code platforms as well.
Access data with datacampr
So what does datacampr allow us to do? If you think about data access, datacampr makes it simple to connect to the database and provides access to every single table in the database. This has been a game changer for us in a big way, because prior to datacampr and these features, we had to know what table names existed, look up documentation to say, OK, this is the table I need to use, and then type it in the R console to get access. With datacampr, our data scientists can access tables directly through autocomplete. If I want to look at a table on courses, I just need to type tbl, then _course, and there'll be a drop-down that shows all the various possibilities.
So that's one problem it solves for us. The second problem it solves is that databases usually have documentation stored in the database itself. You could use the data discovery tool, but a lot of times, data scientists prefer having access to information right where they work. For example, if you're using Jupyter Notebooks, you're most comfortable just pulling up the help to look at what a particular function or dataset does. Similarly, if you use RStudio, you'd want to type a question mark and then see the details.
Basically, datacampr does that for us by pulling all the documentation from the databases into the package. If I need to look at what a dataset is about, I can just type a question mark and pull up the help page for that particular dataset, along with details of all its columns. So, to emphasize, this tool allows us to go above and beyond and make things more efficient. We could do things with infrastructure alone, but tools make the common tasks so much faster.
Airbnb, again, is another company that uses tooling in a big way; pretty much every company out there does. The next example I want to give you is this: at any company, when you think about the questions that data science typically answers, there is often a commonality.
Processing data with datacampr
So this is an example of a question from our VP of Content, saying, hey, I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track. I'm sure you get questions like this, day in and day out. Here's another question from our VP of Finance. Jelle asked us to track the recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography.
One way to answer these questions is to kind of go back to your data infrastructure, to data access tools, and write custom code to answer these questions. On the surface, they do look like very different questions. However, if you really think about it, there is a commonality. The commonality is that in both cases, both of them want to track a number, they want to track it over a certain period, they want it aggregated by a certain interval, and they want it broken down by a certain set of dimensions.
Once you recognize the commonality, it becomes a lot easier to say, why do I need to solve these differently? Why can't I just build a tool that would allow a data scientist in the company to answer these by writing a lot less code. And that's exactly what we did.
We built a tool that allows answering a question of this kind: a metric, over a time period, aggregated by a time interval, broken down by dimensions. The package, called dcmetrics, allows the questions you just saw to be answered with 40 lines of code, starting with: this is the table with the course data, I want to enrich that data with technology, topic, and track data, I want to summarize with these dimensions, aggregated by these periods, and finally I want to summarize it and create the output. By building this tool, we collapsed what would typically take a week of combining all the data and getting it into a visualization into a 5-to-10-minute effort that, at the end, gives us really beautiful and informative data visualizations and dashboards.
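dcmetrics is an internal R package, so the following is only a hypothetical Python analogue of the underlying idea: one function that answers any "track a metric over a period, broken down by dimensions" question. All the data and names here are invented:

```python
from collections import defaultdict

# Hypothetical event-level data: one row per course completion.
rows = [
    {"week": "2020-W01", "technology": "R", "completed": 1},
    {"week": "2020-W01", "technology": "Python", "completed": 1},
    {"week": "2020-W02", "technology": "R", "completed": 1},
    {"week": "2020-W02", "technology": "R", "completed": 1},
]

def track_metric(rows, metric, period, dimensions):
    """Sum `metric` per (period, *dimensions) bucket -- the common shape
    behind both the course-completion and the revenue question."""
    buckets = defaultdict(int)
    for r in rows:
        key = (r[period],) + tuple(r[d] for d in dimensions)
        buckets[key] += r[metric]
    return dict(buckets)

result = track_metric(rows, metric="completed", period="week",
                      dimensions=["technology"])
```

The VP of Content's question and the VP of Finance's question are both just different arguments to the same function, which is exactly why building the abstraction once pays off.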
Experiment Reporting Framework at Airbnb
Here are a couple more examples of tools. At Airbnb, experimentation and A/B testing are a really big part of how they work. For example, Airbnb might want to decide: if I show a slider for prices versus drop-downs for price ranges, what works better at improving booking rates? That could be an experiment they want to run and understand. At any point, there are many experiments running, and every experiment needs to be analyzed.
So rather than taking this view that data scientists would analyze the experiments separately, they went about building an entire reporting framework, where once an experiment is set up and deployed based on some conventions, there is an automatic dashboard that pulls all the data from the experiments and provides a visual interface for the product managers to look at how the experiments are doing. Once again, these tools kind of save an enormous amount of time for companies.
One thing I want to point out is that, while it might seem these kinds of tools benefit the bigger companies the most, because there is a lot of leverage, I would strongly argue that even if you're in a smaller company just starting out with data, it's important to think about tooling from the get-go. It will really simplify scaling in the long run.
Hidden Technical Debt in Machine Learning
Let me talk about one more example, which is all about machine learning. Everybody talks about all these cool algorithms, but if you really think about machine learning in practice, the actual learning code you write is a small fraction of the code required to solve the problem. In fact, there is a beautiful paper by Sculley and colleagues at Google on hidden technical debt in ML systems, and the whole argument is that to do ML in production, you need all this surrounding infrastructure, such as configuration, serving, and monitoring.
Machine Learning Workflow at Airbnb
Once again, tools can be pretty helpful here. Look at one of the key steps, feature engineering, where you generate features for your model. If you're a big organization, very soon you might find people reinventing features. One way to solve that problem is to build a tool where anybody who builds a feature and finds it useful can log it into a central database. And that's exactly what Airbnb did.
Zipline: Scaling Feature Engineering
They have a tool called Zipline that allows their data scientists and data engineers to write features into a database. The advantage is, if somebody else is building a model, they can just come in and look at what features have already been built for this particular kind of problem and then, not have to reinvent the wheel every single time.
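I don't know Zipline's actual API, but the core idea of a central feature registry can be sketched like this; every name and function below is hypothetical, and a real system would back the registry with a database rather than an in-memory dict:

```python
# Hypothetical in-memory feature registry, in the spirit of a tool like
# Zipline; a real system would persist this to a shared database.
registry = {}

def register_feature(name, description, compute):
    """Log a feature definition centrally so others can discover and reuse it."""
    if name in registry:
        raise ValueError(f"feature {name!r} already exists -- reuse it instead")
    registry[name] = {"description": description, "compute": compute}

register_feature(
    "nights_booked_last_30d",
    "Total nights a guest booked in the past 30 days.",
    lambda bookings: sum(b["nights"] for b in bookings),
)

# A second modeler discovers the feature and reuses it instead of
# rebuilding it from scratch.
feature = registry["nights_booked_last_30d"]
value = feature["compute"]([{"nights": 2}, {"nights": 5}])
```

The duplicate-name check is the whole point: the registry makes reinventing an existing feature an explicit error rather than a silent waste of effort.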
Identify Which Tools Work Best For You
So, long story short, tools are really important to accelerate the whole analysis pipeline in any company and make it effective. Again, we talked about a lot of coding tools, but I want to emphasize that this is not just about code. For example, if you're using Power BI, it could be writing plugins that simplify database connections. It could be Tableau templates, or Excel templates; it doesn't matter. The whole mindset behind tooling is: if something gets repeated many times, by many people, it is worth abstracting into a tool.
That brings me to our concluding section. We've talked about IPTOP, as well as Infrastructure and Tools, the two fundamental layers. We will talk about People, Organization, and Processes in a later webinar.
Questions and Answers
Question: When you think about infrastructure and tools, especially at an organizational level, this could be different for organizations of different sizes. What do you think you need to take into consideration when choosing different types of infrastructure and tools?
Answer: I think for any tool you consider, you need to look at two aspects: the technical aspect and the people aspect. It's always important to choose something that is good on both fronts. Of course, there is also the cost element involved, so it's a combination of these three. But I would start with people. For example, if a lot of people in the company use R, then I would make sure that the tools and infrastructure you build allow those people to function efficiently. I would say that is the most important aspect. Then, once you are able to handle the people side of things, the question is: technically, what's the better tool to do something? Again, every organization is different, there are competing requirements, and so the choice of tools typically involves quite a bit of thinking; that's another factor that will be key.
Question: When you think about scaling infrastructure, are there any skill sets you have in mind when trying to invest in scaling infrastructure and tools?
Answer: If I go back to the slide on data infrastructure, the skill sets required actually vary considerably. For the raw data collection and data warehousing layers, you typically need data engineers. Data engineers are essentially the warriors, the backbone of any company, ensuring that this gets done in a clean way. Once you get to data processing, data access, and tool building, you need data scientists, but data scientists with more of a product mindset. So, basically, I would say it's a combination of these two skill sets.
Question: When you think about pipelining tools, which one is the most popular and versatile, for example, between Airflow and Metaflow?
Answer: One distinction I want to make here is that Airflow and Metaflow solve slightly different problems. Airflow is basically a data pipelining tool, whereas Metaflow is more specific to machine learning, so each has its own place. Of the tools that do data pipelining, Airflow has the most mindshare, at least from what I've seen. You can see on Airflow's website that a lot of big, really good companies use it. We use Airflow ourselves; it's an amazing tool for pipelining. Having said that, if you're looking more at workflows involving machine learning, Metaflow would be good to check out. Airflow is a general-purpose tool, whereas Metaflow is built specifically for machine learning workflows. If you're doing a lot of machine learning in your organization, Metaflow might actually be worth looking at.
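To make the "data pipelining" idea concrete, here is a minimal sketch, in plain Python, of what a tool like Airflow manages for you: tasks declared with upstream dependencies and executed in dependency order. This is purely illustrative; the task names and functions are hypothetical, and real Airflow uses its own DAG and operator API rather than this code.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

results = {}

def extract():
    # In a real pipeline this might pull rows from a warehouse.
    results["extract"] = [1, 2, 3]

def transform():
    # Depends on extract's output.
    results["transform"] = [x * 10 for x in results["extract"]]

def load():
    # Depends on transform's output.
    results["load"] = sum(results["transform"])

# Each task lists its upstream dependencies, much like edges in an
# Airflow DAG. The scheduler's job is to respect this ordering.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
tasks = {"extract": extract, "transform": transform, "load": load}

run_order = list(TopologicalSorter(dag).static_order())
for name in run_order:
    tasks[name]()

print(run_order)        # ['extract', 'transform', 'load']
print(results["load"])  # 60
```

Beyond this core idea, a real orchestrator like Airflow adds scheduling, retries, parallelism, and monitoring; Metaflow layers on ML-specific concerns such as versioning data and models across runs.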
Question: What's the difference between events data and operations data on the Netflix slide? Can you give some examples of what events data or operations data could be, from your perspective?
Answer: The way I see it, events data is data that's continuously streaming in and being collected; it's essentially high-frequency data. For example, the quality of the video you're watching, or the fact that you clicked on something: those kinds of metrics get collected at a much higher frequency. I don't know enough about how Netflix thinks about it, but most of the time operations data is data that gets stored and doesn't change at the same frequency, for example your account information or the movies you've watched. So one is streaming and the other is not. That's the best way to look at it. For us, for example, Snowplow is where we collect every single click; that's the events data. Whereas when you complete a course, that's something that gets stored in a SQL database for us.
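A minimal sketch of that distinction, with hypothetical field names chosen just for illustration: an event is a small, timestamped fact that gets appended to a log at high frequency, while an operational record is stateful and updated in place.

```python
from datetime import datetime, timezone

# Events data: high-frequency, append-only facts
# (the kind of thing a clickstream collector like Snowplow captures).
click_event = {
    "event_type": "click",
    "user_id": 42,
    "target": "start_course_button",
    "ts": datetime(2021, 5, 1, 12, 0, 5, tzinfo=timezone.utc),
}

# Operations data: stateful records, updated in place
# (the kind of thing stored in a SQL database).
account_record = {
    "user_id": 42,
    "email": "learner@example.com",
    "courses_completed": 7,
}

# Events accumulate in a log; operational state is mutated.
event_log = [click_event]
event_log.append({**click_event, "target": "next_exercise_button"})
account_record["courses_completed"] += 1

print(len(event_log))                       # 2
print(account_record["courses_completed"])  # 8
```

The practical consequence is that the two are stored and processed differently: event logs feed streaming or batch aggregation pipelines, while operational records live in transactional databases and change row by row.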
Question: As you scale your tools and capabilities, you start touching an exponentially broader user base of scientists who might not have the skill set to jump in. What are ways to avoid scaling linearly by just hiring more data scientists? Do you have any ideas about training or interface tools?
Answer: That's actually a pretty tricky question. There are a couple of ways to look at it. One is simply hiring more data scientists, but you can actually achieve a lot of this with tooling. The way I would approach this problem is that not everybody needs to have all the skill sets. However, you need a core team with a strong engineering mindset that can handle the data engineering side, the tool building side, and the access side. Once you have that, your data scientists can work around it to solve problems. I would take that mindset. And again, tools can help in a big way. For example, there are low-code tools like Looker and Alteryx, and Microsoft has products that let you do drag-and-drop analysis. However, there's no getting around the fact that you need this core team of people in the company to build the data infrastructure. By choosing a strong core team, you can ensure that your data scientists and analysts don't need to have the full skill set; they can focus on problem solving and leave the infrastructure and tooling to the experts.
Question: Given an existing enterprise architecture, and limited resources dedicated for data science, if you have a small team, what do you think is the first thing to invest in, a data catalog, for example?
Answer: I think you should always look at it from the point of view of what's going to drive the most value. For example, if the biggest bottleneck in doing data work is accessing the right data quickly, without having to go through a lot of boilerplate, I would argue that's the first problem to solve. If that problem is already solved and working well, then the data catalog becomes important, because unless you know all the data that's out there, you won't be able to answer questions intelligently. The way I would think about tooling is always to ask: what is the bottleneck? What is preventing me from going up this AI hierarchy of needs? Then identify the most pressing problem and build or invest in the tool that solves it, rather than taking a tool-first approach.