Towards Self-Service Data Engineering with Taylor Brown, Co-Founder and COO at Fivetran

Richie and Taylor explore the biggest challenges in data engineering, how to find the right tools for your data stack, defining the modern data stack, federated data, data fabrics and meshes, AI’s impact on data and much more.

Oct 3, 2024

Guest

Taylor Brown

Taylor Brown is the COO and Co-Founder of Fivetran, the global leader in data movement. With a vision to simplify data connectivity and accessibility, Taylor has been instrumental in transforming the way organizations manage their data infrastructure. Fivetran has grown rapidly, becoming a trusted partner for thousands of companies worldwide. Taylor's expertise in technology and business strategy has positioned Fivetran at the forefront of the data integration industry, driving innovation and empowering businesses to harness the full potential of their data. Prior to Fivetran, Taylor honed his skills in various tech startups, bringing a wealth of experience and a passion for problem-solving to his entrepreneurial ventures.

Host

Richie Cotton

Key Quotes

Self-serve data is the holy grail. This is the democratization of data. Anyone can ask any question. That is the big promise. The challenge that I've seen in practice is that even when you implement this, people don't always know what to ask. They don't know what questions they're supposed to be asking. So there's one side, which is the human element of it, and AI is going to play a big role in this.

One of the biggest challenges in data engineering right now is this explosion of tooling that's happened over the last few years. Where should we invest? Where should we not invest? What are the right things to build on top of? Ultimately, data is one big stack and you have to build on top of each layer. On top of that, the layers that you're building on top of, have to be layers that can survive for a long time.

Key Takeaways

Rather than building custom pipelines, leverage tools like Fivetran to automate data extraction and synchronization from various sources, reducing bottlenecks and improving speed to insight.

Instead of creating separate strategies for data and AI, integrate both into a unified approach by first ensuring your data is centralized, clean, and well-governed—this will make it easier to apply AI and machine learning tools effectively.

Encourage decentralized analytics teams within departments for faster insights, but keep data centralized to ensure security, governance, and consistency across the organization.

Links From The Show

Fivetran

Career Track: Data Engineer in Python

Rewatch sessions from RADAR: AI Edition

Transcript

Richie Cotton: Hi, Taylor. Thank you for joining me on the show.

Taylor Brown: Richie, great to be on the show with you today.

Richie Cotton: So I want to talk about problems to begin with. I'm just wondering, like, what do you think are the biggest challenges in data engineering right now?

Taylor Brown: I think there are a few, like, massive challenges. The first one is just, there's just so many new tools that have come out and there's so much emphasis towards AI and AI tooling that it's very difficult to understand what folks should be doing and what folks should not be doing. To explain a little bit more about what Fivetran does, just as a quick background: I'm the co-founder and chief operating officer of Fivetran, and we move data for companies.

We're the global leader in data movement, you know, loading data for Fortune 500 companies all the way down to tiny startup companies, such as OpenAI, Snowflake, Databricks, Morgan Stanley, NAB, and thousands of other customers. And so I often think about the world in terms of data movement.

And so probably a lot of today's conversation will skew that direction, just so the audience understands that background. But yeah, to answer your question more directly, I think it's just this explosion of tooling that's happened over the last few years and understanding, like, where should we invest, where should we not invest, what are the right things to build on top of. Because ultimately, data is like one big stack, right, and you have to build on top of each layer, and the layers that you're building on top of have to be layers that can survive for a long time. That's the biggest problem, I would say.

Richie Cotton: That's absolutely fascinating. There are so many tools and it's something we struggle with at DataCamp as well. It's like, which tools do you teach, because are they going to be around in a few years? Which ones are the most important ones? So do you want to elaborate on that? Like, how do you go about deciding which tools you need in your data stack?

Taylor Brown: When you think about the overall data stack, the biggest problem is basically that every company has a ton of data, and now that data is all over the place. And so, at the end of the day, you want to use and utilize this data.

And that's the overall goal of building a data platform. So you're using it for AI and ML, now that's like the biggest one. You're using it for BI, for building applications. These are some of the use cases. But then the question becomes, okay, well, where's all this data?

Why is this data everywhere? How do we get access to this data? Do we centralize it? Do we decentralize it? Who has control over that data? And I think some of the biggest challenges that our customers and the market face today are, first and foremost, scalability, like how do I get access to that data? The second biggest one is security, and the last one is compliance and governance. Like, who has access to the data today? Who doesn't have access to the data? And as you've seen, there's been a ton of these types of problems over the last few years. Uber just got fined 300 million, or nearly 300 million, for passing data from Europe to the U.S.

That is a problem that they're facing with governance. They don't have the right tooling in place. They shared it. These are some of the challenges that companies are trying to face, and how you face that, in my opinion, the thing that companies should be using is what I would call the modern data stack. Are you familiar with this, Richie?

Richie Cotton: Yeah. So, the modern data stack is a very well known concept, but a lot of different guests seem to define it differently. So, uh, I'd love to hear your take on this. What, what's in the modern data stack?

Taylor Brown: It's funny because we, I mean, we're, I would say, part of the original group of tooling for the modern data stack, now almost nine years ago. And it was even like a post modern data stack, and we can talk about that shortly. But, you know, what I would define the modern data stack is, is a cloud based data warehouse or storage layer, generally data warehouse and, you know, you're taking data and loading that into the data warehouse from all the different places that you have it.

And then you're building tooling on top of data warehouse and you're doing more of an ELT versus ETL type workload. So in other words, order of operations is different than what it was prior, which is you extract, you load, then you transform and you build this whole ecosystem on top of the data warehouse and that's in my definition would be the modern data stack.
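The extract, load, then transform ordering described above can be sketched in a few lines. This is a minimal illustration, not Fivetran's implementation: an in-memory SQLite database stands in for the cloud data warehouse, and the table names, column names, and rows are all hypothetical.

```python
import sqlite3

# Hypothetical source rows: (order_id, region, amount).
source_rows = [
    (1, "EU", 120.0),
    (2, "US", 80.0),
    (3, "EU", 50.0),
]

# In-memory SQLite is only a stand-in for a cloud data warehouse.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: copy the rows, unaggregated, into a raw table.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: build a metric inside the warehouse, after loading.
warehouse.execute(
    "CREATE TABLE revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue FROM raw_orders GROUP BY region"
)

# Because the raw rows were kept, another metric needs no re-extraction.
warehouse.execute(
    "CREATE TABLE order_count_by_region AS "
    "SELECT region, COUNT(*) AS orders FROM raw_orders GROUP BY region"
)

revenue = dict(warehouse.execute("SELECT region, revenue FROM revenue_by_region"))
order_counts = dict(
    warehouse.execute("SELECT region, orders FROM order_count_by_region")
)
```

The design point is that the transform step is just SQL run against data that is already in the warehouse, so new questions do not require touching the source system again.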

Richie Cotton: So the idea is you've got lots of data all over the place in different sources. You bring it all into the data warehouse and that's like a single point. And then that's the point where you transform things and then you can build something useful. So you mentioned this idea of, like, ETL versus ELT. How does the workflow change there?

Taylor Brown: Maybe I'll start with just a history of databases and data warehousing as a very brief way of thinking about it. So, in the very beginning, let's say in the 60s, 70s, 80s, companies would have maybe one database, and that was like your point of record for all the things that the business is running.

It's like your sales, your marketing, your accounting, potentially your product. And it's just in one big monolithic database. But the problem then became, okay, well, how do I run reporting on this? And so it's like, okay, I could just build a spreadsheet on top of it. I could do reporting, but it's really slow.

So then, folks have always wanted to say, okay, well, I need to get data out of these systems. Or maybe not, and I can just report right on top of it. But if you report right on top of your source system, you can sometimes slow it down, because the types of queries that you're running are analytical by nature and, you know, a relational database is not indexed and not performant for doing aggregations.

Basically there's this wish, like, okay, I wish I had a database that could just store everything and that could be really fast for analytical queries. This is probably like early 90s, right? But there was no such thing.

And so what a lot of companies did is they would just take another relational database and they would load some of the data from their monolithic system into that other system. And they would do what I would call ETL, which is they'd extract the data out. They'd transform it, doing those aggregations.

And then they'd load the aggregated data into that database, which made it performant because it was aggregated. But, you know, you define maybe 10 metrics and maybe the business is like, hey, I need an 11th metric. And it's like, oh, I'm sorry, we don't have that. We didn't store that data. You've got to go back and take it out.

And so it was very cumbersome to actually get a lot of insights. And so there was always this wish of, like, man, I wish I could just load all the data and not aggregate this data. But that sort of ETL workload has survived, you know, since the early 90s, probably even earlier. And I think in maybe the 2006, 2007 timeframe, then came Hadoop. It was like, all right, wow, now we have this place where we can just dump everything. And so it's like, okay, let's take all the data, let's dump it into Hadoop. And Google's doing this, so this obviously must be the right thing. And so everyone just sort of dumped it in. It's more of like a data swamp where you just have tons of data.

There's a lot of duplicate data. You're doing snapshots every day and just dumping it in. And then you spend a ton of time on top of that trying to make sense of that data, right? And I would say this is the first step towards the modern data stack, but the problem is, because the data is not organized, you end up spending a tremendous amount of time, energy and effort just, essentially, ETL-ing the data once it's been loaded there. So that's like 2006, 2007, and then eventually, 2012, 2013-ish, Redshift comes out. So Redshift is the first cloud-based data warehouse. And actually I skipped a little bit of stuff. There were a few other on-premise data warehouses that are column-store data warehouses, designed for analytical queries.

So the way in which the data is actually stored is designed for doing this analytical querying. And so there were some like Netezza, Vertica, things like that. So that was kind of what everyone focused on in the early 2000s. But the problem with them is that they were constrained by storage and compute, because it's an on-premise box. It's in your basement. And you're like, okay, I can only drive as much as possible off of the box that's in my basement, similar to on your computer. And so once you load it up with too much data, or you're trying to do too many queries against it at the same time, you end up with really slow queries.
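To make the column-store idea concrete, here is a toy sketch of the same records laid out by row and by column. It only illustrates the storage layout; real column-store warehouses add compression, vectorized execution, and much more. The records and field names are hypothetical.

```python
# Hypothetical sales records stored row by row, as a transactional database would.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# The same records stored column by column.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 50.0],
}

# Row layout: the aggregation walks every record, and every field comes along.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the aggregation reads one contiguous list and nothing else.
total_from_columns = sum(columns["amount"])
```

Both layouts give the same answer; the difference is how much unrelated data an aggregation has to read to get there.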

Some people are like, oh, hey, I'm waiting forever to get my data out of this system. And everyone said, hey, I want these column-store databases, but I want them in the cloud, right? So that I could scale them out horizontally. That was like the holy grail.

So that's where eventually Redshift takes an on-premise database and makes it available in the cloud. All of a sudden everyone's like, oh my gosh, this is the future. This is where we're going. And the really great part about that was that it somewhat democratized the ability to use these column-store databases.

Because in the past, Netezza and Vertica and all these other data warehouses were so expensive that most companies couldn't actually afford them. And so now you have this massive rush towards, hey, let's just move everything in. And the workflow then became: take the data, load it into the warehouse, and then do your analytics on top of it.

And you know, but when you load into data warehouse, make sure it's organized appropriately. And, the thing that this then allowed for businesses to do is to automate this first step. And in the previous world, the types of aggregations they were doing was always custom to your company, so you couldn't automate any of it.

So you had to build out your own ETL from scratch all the time, every time. And so now you have this like automatic replication of getting data into the warehouse. And then from there you build, the transformations that you run on top of that, the data modeling is very unique to your company, but instead of spending your like engineering time building these pipelines, you could basically automate that piece and spend engineering time just doing the actual like analysis within the warehouse.

And so that's where Fivetran came in, as the replication product loading data directly from all the different places that you have it into your Redshift cluster. And that's what I would say is the modern data stack. And then on top of that, there were more modern BI tools, like Looker, that became available that were running directly on top of the database, versus in the previous world, where you'd ship some data to your computer and then run these local queries on it, more like a spreadsheet.

And so those are some things that were happening. I think the other big trend that we saw over that time frame was that you went from these monolithic databases, where you have like one database, one SAP system or one ERP system, to having thousands. Because as soon as the cloud took off, then it was like, all right, great, now there's like a million different services to solve every problem that you have. And so companies went from maybe one to five data sources, to, in the 2000s, let's call it five to 10. And then in the 2010s it's like a hundred, to like a thousand now or even more.

We have customers who have like 10,000 different sources and databases that they're running, or more. And so that problem becomes even more valuable to solve in an automated way, to get all that data into a centralized place. So that's like the brief history. And then there's the neo age of this, which I guess I might as well cover since I'm talking about it, if you don't mind.

I'm going to keep going a little bit, which is what happened in 2015 is that Snowflake became available. And Snowflake is really the first, or BigQuery, but Snowflake particularly was the first commercial data warehouse that was designed for the cloud. In other words, it was not constrained by compute and storage.

Most databases in the past were always like, you have a box and that's how much storage and compute you have, and you're constrained by that. And so Snowflake was like, no, no, we're going to use S3 for our storage layer, and we're going to use EC2 for our compute engine, but we're going to scale it out infinitely.

Like, you want to spend more money? We will let you spend until the cows come home, and you're going to have the fastest queries ever, but you're going to pay us a lot of money for it. And so that was really truly the first step towards, all right, now we have a fully elastic, very automated cloud data platform or data warehouse that I can just load everything into.

And that was like a huge change. And, you know, I think that really got everyone, you know, that's maybe a hundred times better than the on-premise databases. And so everyone was like, wow, we've got to go do this, right? Fast forward to today, and the big thing that's happening now is that a lot of these companies are saying, look, I just want to store data in my own S3.

But then I want to query it inside of Snowflake. I want to query it inside of Databricks. Or I want to query it inside of, like, Redshift or Trino or wherever. And so I think that has been the next wave of this, essentially saying, okay, let's store it all in a repository that's my own repository, because all these services like Snowflake are using S3 anyways, and then allow me to query it in different places.

And what this does is it gives the benefit of having sort of a future-proof commodity place you can put all of your data, that you can also then use in different places. So you don't quite get the lock-in that you had in the previous world, where you sort of get stuck with one vendor.

Yes, you may be stuck to a certain cloud, but you then can use various different compute engines on top of that, depending on what the workload is. So it might be an AI type workload, so you want to use something different, or it might be like a classic BI type workload, so you just want to use Snowflake, right?

And so that gives a lot more optionality to enterprises as they move forward, and that is the really big shift. And so, you know, how Fivetran fits into this new paradigm: I think we were fundamental in helping the shift towards automation into Redshift and Snowflake in the initial modern data stack.

And now we've built a connector for our customers to load data into S3 or into a data lake in an ACID-compliant file format, using Iceberg and using Delta. But the difference between the data lakes of 2007 and the data lakes of 2024 is that the data is not a swamp.

It's highly organized. You can do history on it. You know, you get all the upside of a database, but you also get the upside of a data lake. And that's the holy grail of what people have been wanting since the early nineties. So that's a really exciting thing that we're seeing right now.

It's still early in that second wave, the postmodern data stack. But anyway, I just went on for a whole long time there, Richie, but hopefully that gives some context to the conversation today.

Richie Cotton: Right. So it sounds like the dream then is you can scale essentially infinitely. You decide where you're storing your data and where you're doing your compute. And then in theory, the pipelines to get from where your data is stored to where things are being computed, that's easy to figure out.

And so, how do you get to that dream then? So you said the technology is kind of there. I guess, what processes do you need to put in place in order to realize this dream?

Taylor Brown: The way in which Fivetran has worked to solve this problem is extraordinarily simple. Like, you want to get data from 5,000 places into a single location, whether that's a data warehouse or a data lake, we can do that for you. You basically go and you set up all these different connections. To put it in perspective, setting up a Salesforce connection takes, I think, like five clicks, right? You authenticate your Salesforce. You point it at what data warehouse you want. We go and grab all the data from Salesforce. We create the tables and load it into the data warehouse for you, and then we keep it up to date, incrementally syncing that data every minute or whatever.

And if there are schema changes that happen on the source, we persist these through to the warehouse. And basically we take care of everything. So you just end up with like a mirror copy of your data within the warehouse. That makes it extremely easy to access and move all this data. But then there are some more challenging things around compliance and security and all those things that I mentioned.
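As a rough illustration of incremental syncing with schema changes carried through to the destination, here is a minimal sketch. It is not how Fivetran works internally: the `sync` function, the `updated_at` cursor field, and the `accounts` table are all assumptions made for the example, and SQLite stands in for the warehouse. Identifiers are interpolated into SQL directly, which is only acceptable here because the names are hard-coded and trusted.

```python
import sqlite3


def sync(source_rows, warehouse, table, state):
    """Copy rows changed since the last sync, adding new columns as needed."""
    cursor = state.get("cursor", 0)
    changed = [r for r in source_rows if r["updated_at"] > cursor]
    known = {info[1] for info in warehouse.execute(f"PRAGMA table_info({table})")}
    for row in changed:
        # Schema change on the source: add the new column to the destination.
        for column in row:
            if column not in known:
                warehouse.execute(f"ALTER TABLE {table} ADD COLUMN {column}")
                known.add(column)
        names = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        # Upsert on the primary key so repeated syncs stay idempotent.
        warehouse.execute(
            f"INSERT OR REPLACE INTO {table} ({names}) VALUES ({marks})",
            list(row.values()),
        )
    if changed:
        state["cursor"] = max(r["updated_at"] for r in changed)
    return len(changed)


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
state = {}

source = [
    {"id": 1, "name": "Acme", "updated_at": 1},
    {"id": 2, "name": "Globex", "updated_at": 2},
]
first = sync(source, warehouse, "accounts", state)  # initial load

# Later: one record changes and the source gains a new "industry" field.
source = [
    {"id": 1, "name": "Acme", "updated_at": 1},
    {"id": 2, "name": "Globex Corp", "industry": "Energy", "updated_at": 3},
]
second = sync(source, warehouse, "accounts", state)  # incremental run
synced = warehouse.execute(
    "SELECT id, name, industry FROM accounts ORDER BY id"
).fetchall()
```

The second run moves only the changed record and adds the new column on the way, which is the behavior that leaves the destination as a mirror of the source.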

And so, this is a problem that every company is facing: who's allowed to access what, how do I make sure that I don't send GDPR-specific data to the U.S., and things like that. And so Fivetran helps very deeply with each piece of this in a slightly different way.

With the security piece, I mean, security is sort of like table stakes, right? You have to have the best security. We have the best protocols for doing the connections. Everything is encrypted end to end. We now have a product called hybrid deployment, so you can deploy all of Fivetran behind your firewall.

So the data processing all happens within your region and your cloud, so data never touches Fivetran servers. We have all the different options for helping our customers in whatever way they need to keep the data secure. On the compliance side, we use role-based access control to be able to assign access all the way down to the table level and the connection level.

So you can say, hey, I want to give this group access only to the marketing connectors. Or I want to give this group access only to the ERP system, or this group access only to the sales systems, or whatever. And then you apply it at the table level, and you can apply it at the data warehouse level, and then you can apply it even at the embedding level.
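A role-based access check at the connection level and the table level can be sketched as follows. The `Role` class, its fields, and the role and connection names are hypothetical and are not Fivetran's API; the sketch only shows the shape of the rule.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """A hypothetical role granting access to connections and tables."""

    name: str
    connections: set = field(default_factory=set)  # whole-connection grants
    tables: set = field(default_factory=set)  # "connection.table" grants

    def can_read(self, connection, table):
        # Access comes from a whole-connection grant or a table-level grant.
        return (
            connection in self.connections
            or f"{connection}.{table}" in self.tables
        )


marketing = Role("marketing_analysts", connections={"marketo"})
finance = Role("finance_analysts", tables={"erp.invoices"})

marketing_reads_leads = marketing.can_read("marketo", "leads")
marketing_reads_erp = marketing.can_read("erp", "invoices")
finance_reads_invoices = finance.can_read("erp", "invoices")
finance_reads_payroll = finance.can_read("erp", "payroll")
```

In practice these grants would live in the warehouse or a governance tool rather than in application code; the point is that the same rule can be expressed at more than one level of granularity.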

So you can really get fine-grained about that. And we don't necessarily do those controls or create those policies, but we integrate with all the tools that do, to make it very easy for our customers who need to have control over all this to do that still in a very automated fashion, because of the scale.

You really can't do any of this manually.

Richie Cotton: So it sounds like there's sort of quite a lot of different moving parts there. You've got to worry about the compliance side of things and the security, as well as figuring out which bits of data need to connect to where, and things like that. So I'd love to go through this maybe slowly and just figure out where you get started.

So let's think of a common problem. For example, almost every organization is complaining, okay, we've got some data stuck in silos. The right people don't know how to access it. What's step one in solving this problem?

Taylor Brown: I think the first step is to figure out a use case within the organization that you are trying to solve, specifically around data. Hey, what questions do you have? What product are you trying to build? And then figure out, okay, what is the data that we need access to? Then you go and grab that data.

You'll still need to select a data warehouse or a data lake. And then, ideally, use Fivetran. You don't necessarily have to use Fivetran. You can build this yourself, but there's always a build-versus-buy trade-off, and frankly, you'll end up getting faster results through a product like Fivetran, and more accurate results and more security and all the things that I talked about.

Basically, I'd say, pick a use case. So, say you want to understand your sales velocity. And you can't really get a great understanding of sales velocity because you don't necessarily have everything within Salesforce. You want to understand the velocity all the way from the top of the funnel to the bottom of the funnel. And so you have to combine your marketing data from, say, Marketo with your Salesforce data. And in order to do this, you need both of those systems. So I would start by setting up a sync for both of those sources into the warehouse.

And then start by solving one or two questions. Hey, let's build a model on top of that that helps us understand the velocity from the MQL all the way to closed-won.
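Once both sources are in the warehouse, a first velocity model can be a single join. The sketch below assumes heavily simplified, hypothetical tables (`marketo_leads`, `salesforce_opportunities`) keyed by email; the real Marketo and Salesforce schemas are different and much richer. SQLite again stands in for the warehouse.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE marketo_leads (email TEXT, mql_date TEXT);
    CREATE TABLE salesforce_opportunities (email TEXT, stage TEXT, close_date TEXT);
    INSERT INTO marketo_leads VALUES
        ('a@example.com', '2024-01-01'),
        ('b@example.com', '2024-01-10'),
        ('c@example.com', '2024-02-01');
    INSERT INTO salesforce_opportunities VALUES
        ('a@example.com', 'Closed Won', '2024-01-31'),
        ('b@example.com', 'Closed Won', '2024-01-20'),
        ('c@example.com', 'Negotiation', NULL);
""")

# Average number of days from marketing-qualified lead to closed-won.
avg_days_to_close = warehouse.execute("""
    SELECT AVG(julianday(o.close_date) - julianday(l.mql_date))
    FROM marketo_leads AS l
    JOIN salesforce_opportunities AS o ON o.email = l.email
    WHERE o.stage = 'Closed Won'
""").fetchone()[0]
```

Neither system can answer this alone: the MQL date lives on the marketing side and the close date on the sales side, which is why both syncs come first.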

Richie Cotton: Yeah. So I do like the idea of starting with use cases rather than maybe just like go and saying, well, we've got all these different data sources, let's just hook them up at once. So I guess there's going to be some data sources where it's like, if no one cares about it, then there's not even any point in trying to hook them up.

Okay. Yeah. So, I like that. So you have the sales example: you've got your Salesforce data, Marketo data, whatever other tools you're using. And then around that use case, those are the things you need to sort out. Now, there've been a few ideas for solving this that have been fairly well hyped recently.

So, things like data fabrics, data meshes. Do you want to talk me through what these things are and when they might be useful or not?

Taylor Brown: In that whole history, now that I've given the background, I can explain this a bit more. There's been this concept of federated data. And so federated data, like the data fabric, is this concept where, say you have hundreds of different data sources, instead of taking all that data and centralizing it, you query it in real time.

So you'd say, hey, I need this Salesforce object. This is the exact same use case. Hey, I need this Salesforce object, and I need this one Marketo object, and I'm going to just query them in real time and get that data. And then usually, in a virtual layer, you will combine that data and you get your answer. It's like, this is an amazing idea, because you're like, hey, I could just get all the data I needed at any given time, and I can just go get it. It's like this holy grail promise, though, that doesn't usually end up paying off, because getting data out of these systems takes so much time. So Marketo, for example, can take up to like a few months to get your initial data set out because of the API limits that they impose on you.

And so you've got, you know, like, hey, I'm going to write this query, ask for it, and like two months later I'll get the results. So that's where the whole concept of getting data out of each system and putting it into a data warehouse is more valuable. So you're not beholden to the source systems. You get all the data out, and now you can run it at whatever speed you need.

And so that is like the initial federated thing. Now, the fabrics that are coming out now are kind of a version of, we'll take some of the data and load it in different clouds, and then you can kind of join it together, virtualized. But when doing the joins, you can't actually get the results that you want; you really can't join the actual full data together.

So we still strongly advocate for getting all your data into a single location. The concept of a data mesh is that each individual team owns their data stack, and then, you know, you sort of share that data around, so you'd say, hey, like, your marketing team has its own data stack and it builds its own thing, and you have, like, your finance team does their own thing and everyone has their own things.

And I think there's always a trade-off between centralization and decentralization. With decentralization, you're going to get faster results. You're going to get probably a little bit more customer happiness within whatever organization you have, but you're trading that off against security challenges.

Like most recently, there is an example where Morgan Stanley had a huge lawsuit because they had a bunch of decentralized data on computers. And then they sold these computers because they were end-of-life, but they didn't clear the data off of them.

Right. And so all this data was sold, and that is hugely problematic. And that's because you have all these decentralized owners of the data. And so we generally believe you should get all the data together. And you can still have different owners and different philosophies around modeling, but you can put them all together within the same data center, right?

And I think there's a lot of benefit in being able to join that data together. So again, if you haven't figured it out, data centralization is the approach which we still highly advocate, and that's obviously how Fivetran focuses on moving data for customers, but we can support meshes.

We can support other ones. If you decide, I want to have Fivetran load to Redshift for one team and Snowflake for another team, or a data lake over here for another team, we can do all of those things. It's just not what we advocate as the best practice.

Richie Cotton: Okay. So it sounds like centralization, you mentioned, is better for security. So anything that's highly sensitive, you probably want to centralize, and then maybe decentralization gives you sort of speed advantages for important data sets, something like...

Taylor Brown: You know, I mean, it's more speed and workflow, right? Because imagine that you're on the marketing team and you have a question about a marketing thing you want to go get. You're like, oh, I can just tap the guy next to me who is the marketing analyst, and I'm going to ask him for this thing, and he's going to go and just create it quickly and give me the answer, right?

But the question then becomes like, one, is this person even supposed to have access to this? Like, Central IT, who's responsible for understanding access controls, has no idea what's happening over here. So security is, like, challenging. And then two, just quality. Like, how do you know that it's gone through the quality and QA, that it's actually correct, right?

And so there's some, like, trade-off. You want to have your fast and dirty stuff, where you're like, hey, this is directionally the answer that I want. But your hardcore dashboards should probably go through more of a centralized process where you are evaluating, like, hey, one, are the metrics correct?

Are they governed? Two, like, the sensitive data, who's it being accessed by, et cetera, et cetera, right? And so when you centralize it, you take all this data, and your central IT team and your security team can understand what data is getting moved. You can use Fivetran, for example, to say, hey, I want to block PII data, or I want to block PHI data.

I don't really want certain tables to load from this highly secure thing into the centralized location, because it's really important. Or, hey, what a lot of people do is they use the medallion system, where you take all the data and you put it into a raw layer that is highly secure, and that's the bronze layer.

And then you go to your curated silver layer that your IT team and everyone else has combed through, and you have good governance on it. And you're like, great, okay, now other people can go and access this and can build the gold layer on top of that. But they're not accessing our critical sensitive data, and they can apply the role-based access controls there.
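The bronze, silver, and gold layering can be sketched with one raw table and two views. This is a minimal illustration under assumptions: the table and view names, the choice of `email` as the sensitive column, and the sample rows are all hypothetical, and SQLite stands in for the warehouse, where access to each layer would be enforced with real permissions.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    -- Bronze: raw copy of the source, restricted access, may contain PII.
    CREATE TABLE bronze_customers (id INTEGER, email TEXT, country TEXT, spend REAL);
    INSERT INTO bronze_customers VALUES
        (1, 'a@example.com', 'DE', 100.0),
        (2, 'b@example.com', 'DE', 40.0),
        (3, 'c@example.com', 'US', 75.0);

    -- Silver: curated layer with the sensitive column left out.
    CREATE VIEW silver_customers AS
        SELECT id, country, spend FROM bronze_customers;

    -- Gold: business-level metric built only on the curated layer.
    CREATE VIEW gold_spend_by_country AS
        SELECT country, SUM(spend) AS total_spend
        FROM silver_customers GROUP BY country;
""")

# Column names exposed by the silver layer.
silver_columns = [
    d[0] for d in warehouse.execute("SELECT * FROM silver_customers").description
]
gold = dict(
    warehouse.execute("SELECT country, total_spend FROM gold_spend_by_country")
)
```

Analysts who build on the gold and silver layers never see the `email` column, while the raw bronze copy stays available to the restricted group that needs it.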

Richie Cotton: I like that sort of easy color-coded system for, like, how important is this? How high-risk is it if something goes wrong here? I'm curious as to, does what you do with your data affect the org chart? Do you need to change your team structures depending on your data strategy?

Taylor Brown: I think org structure and strategy do need to go hand in hand. For example, if you invest heavily in a decentralized set of teams and you want to have a centralized strategy, you're at odds with that, right? Because you're never really going to have the right team structure and organization to support a centralized team.

Now what we sort of advocate for is that you have a large and strong centralized team who is helping to both bring all the data in and curate it to those different levels. But then you still end up having small analyst teams that are decentralized.

And what this allows for is you have someone in that organization who sort of speaks the language of data, who can communicate back and forth and say, hey, these are the key metrics that I wanted. These metrics are not quite defined right, or whatever. I'm going to run a really quick and dirty thing here. And so you get this fast response, but then also someone that can communicate back to the central team.

So that's what we've seen success with. But again, every company is different, and the size of every business and their sophistication is sort of different. So you have to sort of think through those.

Richie Cotton: Am I right in thinking then it's going to be the data engineers who are more centralized, and then the analysts are kind of a bit more spread out and closer to the commercial teams?

Taylor Brown: That's what we've seen success with, frankly. But you know, when you get to really big sizes, you end up having multiple central data teams that focus on different areas. And so, again, there's always a decentralization-centralization challenge that you go back and forth on over time, depending on the size of the business.

But for us, we're about a 1,250-employee company right now. We find that centralization is a really good path forward for the future to come. And what I would imagine is that we'll continue to have a centralized data engineering team for many, many years. But the analyst teams in each of the groups will probably start to grow over time, just to help support each of the different functions directly.

Richie Cotton: You mentioned the idea of having sort of quick and dirty analysis where you can just ask your colleague for some results. I guess another one of these sort of holy grails of analytics is having self-service capabilities, where anyone can ask a question about the company data and be able to get their own answer.

How do you get closer to that?

Taylor Brown: I think this is the holy grail, right? This is the democratization of data. Like, anyone can ask any question. That, I think, is the big promise. The challenge that I've seen in practice is, even when you implement this, people don't always know what to ask. They don't know what questions they're supposed to be asking.

So there's sort of one side, which is, you know, the human element of it. And I think AI is going to play a big role in this. Now, there's the infrastructure of it. And so, from our perspective, having a centralized location for all of your data then puts it in a place where you can have people ask questions of all of the data.

So if you have this decentralized setup where you're like, marketing can only ask about marketing, sales can only ask about sales, well, what happens when you want them to think about these other things? Like, sales might have a really interesting idea about this, or, hey, product might have a really interesting idea about what's happening in marketing.

So that's why, like, you have it all centralized. Everyone is able to ask questions against it from an infrastructure perspective. What we see oftentimes is there's another piece to this, which goes back to that centralized and decentralized state.

There are times when the central IT team says, hey, I want to make sure that we have control over the data that's moving in and out. But if the central IT team is actually responsible for always setting up every pipeline or moving that data, they become a huge bottleneck. So what they want is this thing where, like, I can control what types of sources and what you're allowed to move, and I can see what data is being moved, but I'm going to give control to the various groups to do this.

And so Fivetran actually has, we've built a tool around this. We call it Connect Cards, or Powered by Fivetran. What it allows is, the centralized IT team runs the Fivetran account, but then they can give smaller accounts to each of the different teams. And since it's so easy to set up, they can say, hey, marketing, you're allowed to only load marketing connectors, but you can add whatever you want within the marketing route, and then they just go and add as much as they want.
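The delegation pattern described here, where central IT decides which source types each team may add and the team does the adding, can be sketched as a simple allowlist check. This is a hypothetical Python illustration only: `ALLOWED_CONNECTOR_TYPES`, `request_connector`, and the connector type names are assumptions made for the example and do not represent Fivetran's actual API.

```python
# Hypothetical sketch of team-scoped, self-service connector setup.
# Central IT owns this allowlist; none of these names come from a real API.
ALLOWED_CONNECTOR_TYPES = {
    "marketing": {"google_ads", "facebook_ads"},
    "sales": {"salesforce"},
}

# Central IT can still see everything that was requested.
audit_log = []


def request_connector(team, connector_type):
    """Let a team add a connector only if central IT allows that source type."""
    allowed = ALLOWED_CONNECTOR_TYPES.get(team, set())
    if connector_type not in allowed:
        raise PermissionError(f"{team} may not add {connector_type} connectors")
    record = {"team": team, "connector_type": connector_type}
    audit_log.append(record)
    return record


request_connector("marketing", "google_ads")  # allowed, no ticket to central IT

try:
    request_connector("marketing", "salesforce")  # outside marketing's scope
except PermissionError as error:
    print(error)
```

Central IT stops being the bottleneck because it maintains the policy once, while each team sets up its own sources within that policy.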

It automatically flows through central IT to the central location, but maybe gets put into a schema that they have access to, and then shows up immediately, and then they can start querying it. You know, you have to think through that, but now it's possible, where it wasn't before, to get this sort of self-service aspect to data in a centralized way, which is very exciting.

On the other side, just training people how to ask the right questions is hard. You need to be really promoting education within the organization around what questions are the right questions and what are good analytics. Frankly, we spend a lot of time on stats. Our CEO, George, is a big statistician-type person.

He spends a lot of time educating folks around, like, hey, when we're thinking about experimentation or other things, you really need to be thoughtful about, do we have enough data to actually look at the results here with stat significance, and things like that. And so that just comes back to education. I think the advent of AI and the rise of AI is going to be the thing that makes this easier for your average employee within the organization.
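The question of whether there is enough data to read an experiment with statistical significance can be illustrated with a standard two-proportion z-test. This is a minimal Python sketch using only the standard library; the function name and the conversion numbers are made up for illustration and do not describe how Fivetran runs its own experiments.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same observed lift (10% vs 15%), very different sample sizes (made-up numbers).
z_small, p_small = two_proportion_z_test(10, 100, 15, 100)
z_large, p_large = two_proportion_z_test(1000, 10000, 1500, 10000)

print(f"small sample: z={z_small:.2f}, p={p_small:.3f}")
print(f"large sample: z={z_large:.2f}, p={p_large:.3g}")
```

With 100 users per group the lift is not distinguishable from noise at the usual 5% level, while the same lift with 10,000 users per group clearly is, which is the "do we have enough data" question in practice.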

Richie Cotton: I like this idea that, well, teaching people how to ask good questions is maybe the hardest thing. And so you do need some level of education around the company. I'd like to get into that a little bit more. So, in terms of skills that people need in order to be able to ask good questions, what's like the one thing that you would try and train everyone on?

Taylor Brown: So I was at an event last year where I ran a roundtable with a bunch of CIOs, and there were all different levels of companies and all different levels of, I would say, data readiness across those companies. And some folks were trying to figure out, like, how do I build the modern data stack?

How do I get through each of these pieces? And some were like, I built it years ago. We're good to go. And the folks who had already built this, the advice that they had for the folks that had not really built this, and how they changed the culture, was this.

What they found is, a lot of times, the central IT team, say they're responsible for building analytics, would go to each of the departments and say, hey, what data do you want? What metrics do you want? And a lot of times the response was, well, what data do you have? It was like the teams hadn't really spent any time thinking about data.

And so they don't want to look stupid, so they were like, well, just give me everything, right? And the way in which the central IT teams have found success in doing this is you go and say, hey, how do you know if you're being successful? In which ways do you know, like, I just was successful?

And teams find that very easy to answer. Hey, like, if I hit this particular thing, or our revenue grows to this amount, or this particular activity. And so you start with very basic ways of trying to think about measurement, and you figure out the metrics around that.

And then you start to build that in. Hey, let's start with those first few metrics. Let's get those delivered. Let's start looking at those, and then you just continue to grow from there. And I think there are a lot of books around this. And so, I don't know if that quite gets back to the question.

Richie Cotton: No, that's absolutely brilliant. 'Cause I think cross-team communication is definitely one of the hardest things, especially when you've got commercial teams speaking to data teams, and just being able to say, well, what constitutes success for you? That's a great sort of line in terms of trying to get to what data you need.

And going right back to the start of this, you were talking about how the tooling landscape, particularly for data engineering, is just evolving incredibly quickly. So, with everything changing, what do you think are the most important skills for data engineers right now?

Taylor Brown: I would say that just learning about the new tooling is like the most important thing. I think there are certainly new tools that are widely accepted today: the modern data stack. I would say, hey, if you don't know about the modern data stack, if you've never used it, I highly suggest going and spending some time learning it.

There's a tremendous amount of material online. Go learn about it on YouTube or otherwise. So that's the first one. I think data lakes, in my mind, are the future as well. And so, we've obviously invested a tremendous amount of energy in this.

I would say we're one of the only vendors that provides flawless access to data within a data warehouse that is highly organized. And you know, I'd say spend time learning about the modern data lakes. On top of that, I think trying various different modern data stack tools on top of it, and just understanding what all is out there, because there's so many now.

Like, if you think about it, the modern data stack was one big combination of tools that all work well together, that create the solution for the customer. It's now like thousands of little tools that are doing very little pieces here and there. And so there's a lot to learn. And at some point, I'm sure there will be more of a consolidation.

But just understanding, like, what does each of these things solve? Is it worthwhile? Is it not? I think a lot of engineers oftentimes think, well, hey, I should just build this whole thing. Why would I use any tools for any of this? And that's the constant build versus buy. And I'm not suggesting that you should buy everything, but I think that there are a lot of tools in the modern data stack that engineers sort of pooh-pooh because they say, hey, I just want to build this one myself.

I want to build this all myself. And what I think is that, just like Stripe or other types of infrastructure tools, you can build and have tooling at the same time. And so, you know, I'd say Fivetran is similar to that. If you want to build a data application and you're a data engineer, instead of spending all your time building these connections, hey, you should probably just leverage a tool like Fivetran. You know, you can do all of it through the API, you can build it all underneath the hood, and just like Stripe, it can just run all your connections for you for your application.

Or, one of the areas where I've seen success, or that's sort of interesting, is this. A lot of applications that were built in the 2000s, even cloud applications, had to build a reporting layer into them. You know, imagine you're building a simple application, and you spend half of your time building this complicated reporting layer that's not very good.

Well, that's like half your R&D effort that's just being spent on building out something that people are maybe going to use or maybe not. And so what the trend is now is, like, just don't build a reporting layer. Just send the data to Fivetran, and that way the customer can build their own data layer.

But you as an organization can spend more of your R&D effort on building the application itself. And that's going to make you much more competitive and faster at building. So, I mean, hopefully that gives the readers or the listeners a bit more direction.

Richie Cotton: Okay. Yeah. So, if you want to work in data engineering, probably don't spend too much time worrying about the reporting side of things, just focus on, I guess, the data movement side of things and then data warehousing or data lakes. And I guess that seems like a place to start.

But you also mentioned the postmodern data stack. So do you need the postmodern data stack tools as well as the modern data stack tools?

Taylor Brown: They generally sort of play together. So, like, you could just use a cloud data warehouse, like Snowflake or Databricks, and build on top of that. Or, you know, if you add a data lake, you're adding it underneath in the stack, so you could still use S3 and then Snowflake and then build on top of that.

Or you could do a data lake and then not use Snowflake, and just put your tooling directly on top of it. And so, I don't think that it necessarily changes the stack that much. It just gives more optionality for using additional tooling. It gives more ownership of the future of where your data resides and how you're going to leverage it, right? So when new tools come out, you're like, oh, great, I can just throw this right on top of here, and I'm good to go.

Richie Cotton: I guess you're not going to run out of tools to learn, anyway.

Taylor Brown: There are so many tools today, especially with the advent of AI. And I think, just as a super quick comment on AI, the really cool thing about AI, and the thing that we're hearing so much right now, is that, in my mind, it opens up the ability to do aggregations across your text data or your video data or whatever. Whereas before it was like, hey, I want to count the number of humans that signed up last month. It's like, okay, well, I have to have that count somewhere, and it's going to aggregate and whatever.

Now it's like, hey, I want to know the number of people whose favorite color is blue, and it can just grab that, right? Things like that are much harder to get because they're deep within the actual text data. And so that is going to open up a whole lot more information. That's something I'm super excited about.

And we're seeing a massive uplift of our customers and new prospects who are using Fivetran to get all this data together. And generally, the first step in building an AI strategy is building a really solid data strategy. Back to the CIOs that I met with last year, you know, the ones that were really advanced in their AI strategy were very advanced in their data strategy.

They had a centralized strategy, they had governance put in place, they had replication fully in place, they had modeling in place, and then adding AI on top of that is actually quite easy, because you're just using the same infrastructure that you already have. I think the mistake that a lot of companies fall into is that they try and build a separate AI strategy with a whole separate set of tools.

And what you end up realizing is you just build it all from scratch, and you spend all of your time doing the infrastructure rather than just using all the tools that are already available and then building on top of that. And Fivetran is like a core part of that infrastructure, and that's why I think we're seeing this big uplift and success with all of our customers who are using us for AI, like Saks Fifth Avenue.

Richie Cotton: Okay, so, I agree with you that it's important to have good, well-governed data in order to be successful with AI. Does that mean that your data strategy needs to come first and then your AI strategy follows it, or is it the other way around?

Taylor Brown: I just think that you should think about them as one strategy, right? Which is, you're getting all the data from all different places. If you take all this data and you centralize it, then you have it in one place, and you can use some of this data for BI, you can use some of this data for AI, you may use some of the same or overlapping data, and you may use different tools. So, like, again, the data lake strategy, and using a company like Fivetran to load all your data into your own S3, then gives you the freedom to say, I'm going to use Snowflake for this set, I'm going to use Databricks for this set, I'm just going to build my own RAG application here on top of the data that is conjoined between the two of these, right?

And that freedom is really important. But I think, again, what we've seen is sometimes people will say, I'm going to take all my data for my AI strategy and load it over here, and then all my data for my BI strategy, load it over here. And then you sort of end up with different answers, because you've built it in different ways, with different teams, and different ways the data is replicated. You sort of lose this whole governance component as well. And then you have to rebuild all the same things: the same governance, the same control, the same security controls, the same quality levels. Why not just do that once versus having to do that twice?

Richie Cotton: Okay. Yeah. Doing things once rather than twice does sound like a very appealing idea. And because generative AI has been incredibly hyped over the last few years, there have been a lot of executives and boards kind of pushing for more generative AI everywhere. Does that take away from sort of funding for data initiatives? Is there a competition between data and AI, do you think?

Taylor Brown: There is, unfortunately. Last year, there was a large company we were working with in Europe where they went to the board and said, hey, I need funds to build a data platform. And the board said, no, there's no money for data, but you can have unlimited funds for AI. And so this person was like, okay, so they went back, revised their plan, came back and said, hey, I need an AI platform.

And that was basically just building a data platform. And they just changed the language, and they're like, yep, we're good to go. And so I think, like, it's more confusion. I think there's a lot of folks in the market who don't necessarily understand what a good AI platform looks like and what a good data platform looks like, and the fact that they're sort of the same.

And I think there's just like a craze to be like, let's spend a ton of money on AI, let's just go build it. It doesn't matter. Let's go crazy. So you end up with really weird things. Like, we've worked with a few vendors, because we're doing a lot of stuff internally on our own AI and we like to always understand what folks are doing.

And so we built our own internal chat. We talked to a few vendors along the way, and we were like, hey, can you build a chat for us? And they said, yes, absolutely. Step one, take all your data, put it in a PDF, and send it to us. We're like, wait a second, you're taking, like, snapshots of data over time.

And that's what you're using for your live chat app? Versus, like, why don't you just use Fivetran, load all the data in, have it updated every minute, and then your data is live to that level. And so I think the whole world has just not really caught up on the AI side to what has happened on the data side over the last, like, 10 years.

That will happen quickly, and it is happening. But I think a lot of education is just needed at this moment.

Richie Cotton: I have to say that's an absolutely fascinating story. Like, a genius hack: just taking your proposal for a data platform, find and replace "data" with "AI", and then suddenly it gets funded. I mean, I'm sure you can do that with a lot of things as well. Take anything, replace it, put AI in there and you'll get more

Taylor Brown: True, true. For sure. And I mean, AI is, I think, sort of coming into all different areas of the business, so you could kind of do that. But especially with your data infrastructure, that is the key to having a good AI story.

Richie Cotton: I see. Do you have any other stories of AI insanity?

Taylor Brown: No, I mean, no crazy ones that I think I can speak about. One interesting one that we've done goes back to, how do we help the broader mass of people within the organization work with data?

And I said AI is one of those areas that I think will actually help a tremendous amount. When we're building all these different connections, we have over 500 automated connections today, and we're adding something like 200 a year. Well, part of that is, we actually have to build the integration into an API.

So we have to go figure out what the API is, we have to have engineering figure out all these pieces. We wanted to figure out how we can scale this, so we built a machine learning module that you can point at the API. It grabs the docs and, based on the docs, creates the actual integration in a copilot setup, and then we have a human jump in and look at it.

But instead of an engineer, we have more of an analyst, right? Because the analyst is the one who actually knows what the data model should be. They know more context around what the application does. And then they can very easily say, oh, no, we've got to tweak this, tweak that, tweak this.

We're good to go. And that's made our ability to create new connections significantly faster, like from weeks to, like, hours. So, you know, I think that is not exactly analyzing data, but it's working with data, and I think this is going to happen all over the organization.

That's one way in which we're heavily leaning in to make sure we can leverage this as well, and support every single connection across the planet for our customers.

Richie Cotton: Okay. So again, yeah, it's scaling stuff. So, every bit of data gets connected somehow. I like that.

Taylor Brown: Right. There's like thousands, like 10,000 different applications across the world, probably even more, probably 50,000 startups all over the place who have some sort of important data for our customers. And so Fivetran is the single platform for enterprises, large and small, for centralizing data from everywhere.

And so you have to do things like this to be able to access all those, right, and build all those connections. And one of the other things is, we're coming out with a product later this year that we're very excited about, for our customers to leverage a lot of the core technology within Fivetran but build their own connectors.

So imagine it's like probably 25 percent or 10 percent of the work of someone just building it themselves, because they can use all of our centralized functions. But they have access to build whatever they need. So now they can go and add these very quickly. It's another thing that just helps our customers.

Richie Cotton: I like it. Yeah. I have to say I do love this idea of just, like, basically pointing and clicking and creating a pipeline. It's like not having to write vast amounts of code. Stuff just working. It's very appealing. So, just to wrap up, what are you most excited about in the world of data engineering right now?

Taylor Brown: You know, I have to say that, again, this is something that I'm excited about that's coming out from Fivetran. And so it's a bit of a, you know, a plug, but in about a week and a half, we are launching the Fivetran feature called hybrid deployment, which takes our fully managed, cloud-based tooling and makes it available behind the customer's firewall for replicating.

And while this sounds like, oh, that makes sense, you know, it's very easy, it's pretty complicated, in the sense that it's very easy for our customers. They get this cloud experience because it's still controlled from the cloud. They still go to fivetran.com and they can access whatever. But the actual processing happens behind their firewall.

And so to make that really seamless and very easy, it took a tremendous amount of work. And I think this is really the future for the industry in terms of the architecture, for security purposes. And I'm, you know, I'm very excited that we're on the cutting edge of it.

Richie Cotton: Yeah. So, I like the idea of a good experience for customers in the cloud, but also it's involving security stuff. And it's just kind of, it's just an engineering problem. My engineers hate when I say things like that, but yeah, I like that. Do the hard work, and then the customers don't have to.

Taylor Brown: I will say the last thing is, we tried to make our product so easy. Like, literally, when we first launched it in 2015, I had my mom go through the workflow to set it up, because I wanted it to be like that easy for customers to set up. And the cool part is, now it's that easy through the UI, or it's that easy for engineers building on top of Fivetran, where they can go and very quickly get set up, and they can run and set up all their connectors.

Like, you know, some of our larger customers, like Rappi, which is, you know, sort of like an Amazon of South America, there's thousands of different databases that they're running through the system, and it's all set up programmatically, right? Hey, whenever I set up a new microservice within a new region, spin up Fivetran, run all the data for it, get it centralized.

And so, it just becomes part of their engineering stack at the end of the day. Which, you know, I think, hopefully, for this audience should be quite intriguing.

Richie Cotton: Any final advice for organizations that want to improve their data engineering capabilities?

Taylor Brown: Modern data stack, postmodern data stack. I would say learn about those, figure out a strategy for moving pieces over to it. And I would urge people to listen to your podcast so they are learning all the new things that are coming out from all the interesting folks.

And use Fivetran. We make it extremely easy.

Richie Cotton: All right. I do like the idea that listening to DataFramed is the solution to everyone's problems. Excellent. Yeah, and I love all the different names for the data stacks, modern, postmodern. There should be like an Art Deco data stack. That'd be, uh, wonderful.

Taylor Brown: Postmodern is something I sort of threw out. People have talked about it a bit, but I'd say the modern data lake is more what I've heard, or what we've coined. But again, like, who knows? Everyone talks about things differently.

Richie Cotton: All right. Wonderful. Thank you so much for your time, Taylor.

Taylor Brown: Richie, you as well.

Topics

Data Engineering

Big Data
