Is Big Data Dead? MotherDuck and the Small Data Manifesto with Ryan Boyd, Co-Founder at MotherDuck
Ryan Boyd is the Co-Founder & VP, Marketing + DevRel at MotherDuck. Ryan started his career as a software engineer, but has since led DevRel teams for 15+ years at Google, Databricks and Neo4j, where he developed and executed numerous marketing and DevRel programs. Prior to MotherDuck, Ryan worked at Databricks and focused the team on building an online community during the pandemic: helping to organize the content and experience for an online Data + AI Summit, establishing a regular cadence of video and blog content, launching the Databricks Beacons ambassador program, improving the time to an "aha" moment in the online trial, and launching a University Alliance program to help professors teach the latest in data science, machine learning and data engineering.
Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
Key Quotes
We're seeing this need for a combination of local compute and remote compute. Fiber is only so fast, right? You're limited by the speed of light. So if you want to get real-time awesome visualizations of your data that are dynamic, that you could filter dynamically, why not run that on your local machine?
The small data manifesto is about focusing on the simplicity of our data infrastructure so that we can get the joy out of our data and the utility out of our data. This is really about modern hardware being a beast. You can do so much just on your laptop. This even carries over to the AI world. The more compute cost there is, the more data centers that have to go up, the more my Nvidia stock goes up!
Key Takeaways
Instead of building complex, distributed systems, focus on using smaller datasets and simpler tools that match your needs, which can reduce costs and increase efficiency.
With tools like DuckDB and WebAssembly, you can now run queries locally on your machine, reducing latency and allowing real-time analysis without always relying on the cloud.
Larger datasets and models can have diminishing returns. Focus on using smaller, more targeted datasets and AI models, which can be just as accurate and much more cost-effective.
Transcript
Richie Cotton: Hi Ryan, welcome to the show.
Ryan Boyd: Hey, thanks, Richie. It's great to be here.
Richie Cotton: To begin with, I want to talk about the size of data. For years there have been these extreme, mind-boggling statistics about how much data there is in the world and how it's growing exponentially. Do you think that's going to continue?
Ryan Boyd: You know, I was kind of part of that messaging. I came from working on Google BigQuery for a few years, about ten years ago or so, and I certainly said, hey, data's going to grow exponentially. No matter the definition, we knew that the size of your data was going to grow huge, bigger than you could possibly imagine, and BigQuery was going to handle that, of course.
And I believed that message at the time, but I don't think it's actually grown as much as we expected. I think that Moore's law, compute and memory and distributed storage and things like that, have all grown at a much faster pace than the data that we're collecting, or at least the data that we're using.
Richie Cotton: Okay, so there's sort of different sizes of exponentiality there, so the data's grown, but maybe computing power has grown faster than that.
Ryan Boyd: Totally. I mean, the data has definitely grown, right? You know, we're starting to collect more and more. I mean, there's obviously security and privacy regulations and things like that that put some constraints on that. But of course, in the end, the amount of data out there has grown, but there's only so much that people can analyze, and only so much that is useful to analyze. So with that combination of factors, it hasn't grown as much as we thought, plus there's only a subset of it that we actually care about analyzing.
Richie Cotton: So plenty of data there, but maybe not as much really interesting data.
Ryan Boyd: Yeah, and in terms of the overall company landscape, the data that belongs to any given company, most businesses out there don't have tons of data. At Google BigQuery, 95 percent of our customers had less than a terabyte of data.
At SingleStore, where one of my co-founders worked, 80 percent of the customers wanted the smallest instance. One of the biggest requests we got when I was at Databricks was for single-node Spark clusters. Most enterprise data warehouses are less than 100 gigs. So I think standard businesses, the ones that aren't the massive online businesses Silicon Valley is building technology for, still don't have as much data, right?
It's grown, probably exponentially, but it started at a very small size for those businesses. And even among the startups that we've talked to: Andreessen Horowitz, one of our investors, surveyed their portfolio companies, and on the B2B side, most companies had less than one terabyte of data.
On the B2C side, most had less than 10 terabytes of data. And nowadays, one terabyte of data... I've seen SD cards advertised for my camera that are one terabyte. So I think a lot of it has to do with our bias in Silicon Valley and in the tech industry.
There are companies that have collected and analyzed much more data, of course. And the people that are in that environment, in San Francisco or other tech cities like New York, Seattle, Amsterdam, hear about all these big data use cases all the time.
But that's not the vast majority of companies.
Richie Cotton: That is kind of interesting. I suppose if you're Google and you own YouTube, you've got just petabytes of video data. You're going to...
Ryan Boyd: You're in the data 1%. You're the data 1%. And actually, it's more like 99.95%, or sorry, 0.05 percent is the elite group of people. And, you know, it's not just our background. The Redshift folks released an academic paper recently, and they're actually talking at our upcoming Small Data SF conference.
Redshift is one of the more popular data warehouses, and only 0.03 percent of the queries that are run query more than 10 terabytes of data. 0.03. Like, it's tiny. Other people are starting to see this, starting to realize this. We published the Small Data Manifesto on smalldatasf.com, which is the website for our conference. And there's a huge roster of amazing speakers coming who really believe in this message: that the technology has allowed us to simplify our infrastructure and work with smaller data, because that's what customers have.
Richie Cotton: That is absolutely fascinating that it was like 0.3 percent of queries running on
Ryan Boyd: 0.03%!
Richie Cotton: 0.03, or even less than that. So basically everyone's just running relatively small, sub-10-terabyte dataset queries.
Ryan Boyd: Exactly. So just because you have the data doesn't mean you're actually querying it. And that's partly just what data you care about, but it's also partly that the data technology has gotten better, able to do partitioning and sharding of the data and things like that, so that you only have to touch the columns and the row groups that you really care about.
Richie Cotton: All right. So you mentioned the small data manifesto. Can you tell me a bit more about what that is?
Ryan Boyd: The small data manifesto is basically our belief that we need to think small as an industry. And we believe in the simple joys of small data. But, you know, some of this is also facetious, if that's the right word: what was big data in 2012, when a lot of the technologies that we're using today were created, is now small data. So it's a play on words in a way. We say small data and people are like, you mean just Excel spreadsheets or whatever? I did learn that Excel spreadsheets now can have two billion rows, I believe, but that's beside the point.
When we say small data, we do mean sub 10 terabytes or something like that, which is what most people have. But the manifesto is about focusing on the simplicity of our data infrastructure so that we can get the joy out of our data and the utility out of our data.
And this is really about modern hardware being a beast. You can do so much just on your laptop. And this even carries over to the AI world: the bigger the model in an LLM, the more compute cost there is, the more data centers have to go up, the more my NVIDIA stock goes up. If you think about it, this isn't going to be subsidized forever. One of our speakers at Small Data SF is actually my brother-in-law, Gilad Lotan at BuzzFeed, and he runs all the data work at BuzzFeed. They can call these large models, but the small models are 95 percent or something like that as accurate, and cost a tenth of the amount.
So if you think about that, we can have small data, we can have small models. We can recognize that more data doesn't necessarily equal better results; you get diminishing returns with more and more data that you collect. We can recognize that single machines like your laptop are efficient and powerful.
Like, my laptop is over two years old, but it has, I think, 32 gigabytes of RAM and sits 95 percent idle. Why not use that compute? And, you know, we've seen people go back to that, where developing locally is so much nicer than developing with remote systems and such.
And the technology allows it. Especially in the database world, we're seeing folks where you can run DuckDB locally as a developer, you can push to prod, and it's running the exact same database in prod on the analytics side. And even on the transactional side, there are a lot of startups like Turso, who's one of the co-organizers for our conference, using SQLite, and you can use SQLite locally and push SQLite into production in the cloud.
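To make that develop-locally, push-to-prod idea concrete, here is a minimal sketch using the duckdb Python package. The file name, table, and the commented-out MotherDuck "md:" connection string are illustrative assumptions rather than details from the conversation.

```python
# A minimal sketch of "develop locally, push to prod" with DuckDB, assuming the
# duckdb Python package. Table contents and names are invented for illustration.
import duckdb

# Local development: an in-process database stored in a plain file on disk.
local = duckdb.connect("dev_analytics.duckdb")
local.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount DOUBLE)")
local.execute("INSERT INTO orders VALUES (1, 19.99), (2, 5.00)")
print(local.sql("SELECT count(*) AS n, sum(amount) AS total FROM orders").fetchall())

# "Prod": the same engine and the same SQL, just a different connection string.
# The "md:" URI is how MotherDuck is typically reached (requires an auth token).
# cloud = duckdb.connect("md:my_database")
# cloud.sql("SELECT count(*) AS n, sum(amount) AS total FROM orders").show()
```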
Overall, we believe that small data and AI is more valuable than you think. And we believe that the technology changes that have happened really allow us to use the compute the way we want to use it, to have access to things locally, to not have that latency between us and the cloud for a lot of things, or to supplement the cloud.
For instance, with MotherDuck we have this thing called dual execution, and we can run queries that decide which parts of the query should run locally and which parts should run in the cloud. Right now it's a fairly simplistic algorithm for deciding that, but we have some of the world's leading database researchers working on more cost-based analysis of that.
What is the bandwidth currently? What is the compute on both sides of the pipe? What is the CPU? And, you know, I have to acknowledge this is kind of a new style of distributed compute in a way. It's still distributed, just between your local machine and the cloud, but it gives you that advantage of being able to have really low-latency access to your data on your laptop.
Richie Cotton: That's interesting, that this trend of everything going to the cloud is now being reversed, and you've got the idea of hybrid computing. So some of it's in the cloud, some of it's local.
Ryan Boyd: Yeah, I mean, honestly, the whole technology industry is a pendulum. It swings back and forth. When I started my career, you'd have, like, local desktops sitting in your office doing stuff. But largely, everything was in the data centers. Not in the cloud, but in your local data center. And you would remote desktop into those data centers when you wanted to do stuff, or you would do distributed SSH to push changes to a bunch of machines. And it was all about developing remotely in this data center environment.
And then the cloud came along and it made it a lot easier to build things. But now we're seeing this need for a combination of local compute and remote compute. I mean, fiber is only so fast, right? You're limited by the speed of light. So if you want to get real-time, awesome visualizations of your data that are dynamic, that you can filter dynamically and things like that, why not run that on your local machine?
And especially with technologies like WebAssembly, that's possible. DuckDB can run in WebAssembly, SQLite can run in WebAssembly. And that means it's actually running the same database in your browser. In DuckDB's case, it's less than a 20 megabyte executable, and frankly, you see 20 megabyte images on websites nowadays.
That's horrible. But, you know, it's tiny, right? And that really gives us the power to go back to doing some things locally when we get advantages out of it. Like, the pendulum shouldn't swing unless the dependent variables swing, but dependent variables also swing in our capitalistic society, I guess.
Richie Cotton: That's interesting. So I'm curious as to which bits of the data workflow you think should be done locally, then. It sounds like it's going to be maybe late-stage analysis, where you're dealing with smaller data sets, and then visualization. So should it be like dashboards, but local, in your browser? Is that the part?
Ryan Boyd: So that is part of it, right? A lot of the BI tools nowadays embed DuckDB into the BI tool to enable those local dashboards. And this includes the standard consumer, user-facing BI dashboards that the average data analyst can build in something like Hex, for instance, but it also includes more of the dashboards-as-code tools, like Evidence.
With Evidence, you basically write the dashboard in Markdown, and it will actually give you a WebAssembly-embedded DuckDB to speed up the local results of that dashboard. Because in my past world, working at one of the world's leading data companies, we had Tableau hooked into our lake house.
And I would drag a slider and, literally, we had to go get a cup of coffee before the view refreshed. We don't want that. We want super fast access to your data that really empowers people to do more analysis of their data, because they're not waiting, because they're not frustrated.
You can just look at your data in different ways. In MotherDuck, we launched a thing called Column Explorer, which is a similar type of interface to what some other data warehouses have. Basically, it shows you the distribution of your data in different columns.
What type is it? How many different unique values are there? It allows you to drill down and such. We're able to make it a lot faster because it's actually running DuckDB in the browser. So it's taking your result set back from the server, and it's saying, okay, now let's figure out what the column distributions and things are.
And it's storing that so that you can drag sliders and such and see instantaneous results. And my favorite part of this is people developing tools like this no longer need to do all that data manipulation in JavaScript, because JavaScript's a pain in the butt, if you ask me.
So you can use queries and allow people to have that level of interaction. These are the types of things that we're seeing. But we also see other things, like people saying, well, there's this particular data set that I really don't want to store in the cloud for some security reason, for example.
And so you can take the data that is in the cloud and join locally against that other data in one SQL query, as if the data was sitting in the same place. And, you know, that's amazing, whether it be ID number lookups, not that social security numbers are private anymore, because all of them for the U.S. leaked recently. But things like that, right? Where you're saying, hey, I have this local Excel file, and I want to analyze that and join it against what's in my data warehouse. You can do that nowadays. It just enables a new type of flexibility for analysts. Some of the stuff I've been talking about is MotherDuck, but that's really been the philosophy around DuckDB as a whole: increase the ergonomics of working with data.
It doesn't matter where that data is sitting. You should be able to use the same type of queries to access it.
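As a sketch of what that local-plus-cloud join can look like in practice: the file, table, and database names below are hypothetical, and the MotherDuck "md:" connection is an assumption rather than something stated in the conversation.

```python
# A hedged sketch of joining a local file against a warehouse table in one query.
# Assumes the duckdb Python package; paths, table names, and the "md:" URI are
# invented for illustration.
import duckdb

# Connecting with "md:" (plus a MotherDuck token) would make cloud databases
# visible alongside local data; a plain duckdb.connect() covers the local half.
con = duckdb.connect("md:")

con.sql("""
    SELECT w.customer_id, w.total_spend, l.segment
    FROM read_csv_auto('local_segments.csv') AS l   -- data on your laptop
    JOIN my_warehouse.main.customers AS w           -- table in the cloud
      ON w.customer_id = l.customer_id
""").show()
```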
Richie Cotton: I like the idea of the ergonomics of data analysis, just smoothing things out, making sure you're not going from one tool to another and having to do different things in different places. And actually, since you were mentioning some of our competitors, I should probably also mention that DataCamp's Datalab productivity tool also makes quite heavy use of DuckDB under the hood.
So that's, again, working with SQL in a browser, and then, yeah, you can...
Ryan Boyd: Apologies that you had to, but honestly, the greatest thing that I like about working in the data industry is, sure, we have competitors, but we're all friends. Everyone gets along so well in the data industry, and it adds smiles to the day.
Richie Cotton: Alright, so, just while we're on the subject of hybrid computing, I think one of the big pushes for cloud computing was just that it's kind of a pain managing software on your own machine. I think one of the big perks for me from moving to cloud was like, you don't have to like argue with IT departments about whether you're allowed this tool that you want to use.
So how, I guess, is it going to affect software management if you're working locally? And are there any processes that need to change if you're setting up hybrid computing?
Ryan Boyd: This is another case where the pendulum swings. Like, it used to be you go and create an AWS account and you have free rein on that AWS account. We recently went through SOC 2 compliance at MotherDuck, and I'm a co-founder and a leader in the company, but the same rules apply to me as everyone else.
There are so many of the AWS APIs that I'm not even allowed to call without getting explicit permission and making sure it's written in wherever the documents say that I'm allowed to call it. And, you know, basically, software over time becomes a little more bloated, a little more geared towards the enterprise security and compliance side of things. That obviously has great benefits and is necessary in many ways, but privacy and usability are sort of at opposite ends of the spectrum, right?
You're either geared towards really good privacy and security, or towards usability. Google APIs are a great example. They used to be so much easier to use back when I worked on them. And over time they added more and more API keys, and restrictions on how those API keys work, and verification by Google of your application, and all sorts of things like that.
And now they're kind of a pain in the butt to use, if I'm being totally honest. And it's sad because I worked on auth at Google, worked on their proprietary stuff called AuthSub before OAuth existed. I worked on OAuth, I wrote a book on OAuth, but I find it very hard to use these APIs.
So I don't necessarily think we're at the point right now where cloud makes it so much easier. I think that if you have completely unrestricted access to the cloud, sure, it's super easy. But one of our co-organizers at Small Data SF, I think this was Ollama, their founders actually created Docker Desktop.
And that was the era where they were like, oh, we need to make sure that the same software that is running on your server is running locally. And so you would run Docker, you'd put containers in it. It was somewhat heavyweight to access, and we've seen lighter-weight things over time. But the general gist is you can run the code that you would be running on your production servers on your local machine and develop against that.
And we want the same thing to be true for our databases. That's been much slower to happen, as the cloud data warehouses and the lake houses and things like that don't just have an easy Docker image or a container that you can run to work with your data. And that's why things like DuckDB and SQLite and others are so valuable, because you can run that same thing.
Now, I can say the same thing for Postgres. I think a lot of people do run Postgres locally in their development environments. But we also find that a lot of people find it painful to get Postgres to do what they need to do for analytics, because it's not designed for analytics. It's designed as a transactional database. But because of that property, that you can run it in your containers and your dev environment and your prod environment all as the same database, a lot of app developers just start there and start building things with Postgres. Then they're like, oh, I need some analytics for my users, and they build it on top of Postgres. And then they eventually hit a performance wall where they have a choice: a larger machine in Postgres, more time spent on indexes and optimizing all the queries, or going to a true analytics database.
So we want to see the same thing happen in analytics, where you can build locally. You can push to cloud if you want. You can push to a local data center if you want. And technologies are enabling that in the same way that Postgres in Docker containers enabled it in the transactional world.
Richie Cotton: Okay. Yeah. Postgres is just sort of amazingly popular. I mean, it's everywhere. so you mentioned that because Postgres is a transactional database, it doesn't perform well for analytics. Can you just for people who don't know, can you give us an overview of like what's the difference between transactional databases and analytic databases?
Ryan Boyd: Yeah, so the words can get kind of confusing, because, for instance, in our analytical database we also have transactions. But when you hear transactional database, what that typically means is that it stores the data in a row format. So for each bit of data you collect, you populate one row.
You can think of, like, a web server log, or a new customer, or a customer's purchase that has a bunch of different values, and you want to store that all at once, so you create a transaction, usually ACID compliant, so you have your durability and reliability of the data, and you store it to your database, but it's stored one row at a time.
And if you think about it from an analytics standpoint, usually what you're looking at is columns. You're looking at, for instance, what is the total amount of tax paid by our users? And that is just one column of your transaction, if you're a retail shop, for instance.
You don't want to have to load the entire row into memory; you want to load just that column into memory. And so that's exactly what the analytical databases are really good at: storing the data in columnar formats. And in addition to that, doing things we call vectorized execution, which is basically breaking those columns up even further for analysis and taking advantage of modern CPU architectures and execution.
It's something called SIMD, but that gets into the details. Basically, the difference between columnar and row-based storage affects a lot of how performant and reliable a system is for transactions versus analytics.
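To see the column-pruning point in a runnable form, here is a small sketch with the duckdb Python package; the file and column names are made up and the data is synthetic.

```python
# Illustrates why columnar storage helps analytics: the query below only needs
# the 'tax_paid' column, so a columnar scan can skip the rest of each row.
# Assumes the duckdb Python package; data and file names are synthetic.
import duckdb

con = duckdb.connect()

# Write a million synthetic "transactions" to a columnar (Parquet) file.
con.execute("""
    COPY (SELECT i AS order_id,
                 i % 50 AS store_id,
                 random() * 100 AS tax_paid
          FROM range(1000000) AS t(i))
    TO 'orders.parquet' (FORMAT parquet)
""")

# A row store would materialize whole rows; here only tax_paid chunks are read.
print(con.execute("SELECT sum(tax_paid) FROM 'orders.parquet'").fetchone())
```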
Richie Cotton: Okay. Yeah. I mean, certainly I spend probably a bit more time working with Python and R compared to SQL, and the data frames in both of those are always columnar. So it's good that SQL is kind of matching that.
Ryan Boyd: Well, I mean, those data frames are usually meant for analysis, right? Like, if you're working in Python and you're trying to do some data science or data analysis, this is one area where, as I mentioned with ergonomics, DuckDB can access data wherever it sits.
The ergonomics of using DuckDB in Python are that you can actually analyze your data frames in SQL. So you can have a data frame that's in Python or R and write a SQL query against it. Or, nowadays, the Arrow format is also really popular, and you can run SQL against that. But you can also access the data that's in your SQL database as if it is a data frame, so you can go kind of both directions.
And that's the beauty of it: no matter what way you're used to interacting with your data, it allows you to do that. Even if it comes down to CSV files, the world's most boring format but the most accessible format. With some databases, it's really hard to work with a CSV file or JSON; DuckDB actually makes it so much easier. And that's one thing I love, and I'm throwing a lot of praise on DuckDB. Just to make it clear, I'm from a company called MotherDuck. We partner with the folks at DuckDB Labs who build DuckDB. They're a shareholder in us, but they're a completely separate company.
And it's a company that I'm kind of in awe of, both the company, I guess, and the creators of DuckDB, which is largely Hannes and Mark, the two folks that created it.
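For readers who want to try that "both directions" point, here is a small sketch assuming the duckdb and pandas packages; the data is made up.

```python
# SQL over a pandas DataFrame, and the result pulled back as a DataFrame.
# Assumes the duckdb and pandas packages; the data is invented.
import duckdb
import pandas as pd

events = pd.DataFrame({
    "user_id": ["a", "a", "b", "c"],
    "event":   ["view", "click", "view", "view"],
})

# DuckDB can query the in-memory DataFrame by its variable name.
per_user = duckdb.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

# ...and hand the result back as a DataFrame for the rest of the Python workflow.
print(per_user.df())

# Files work the same way, e.g. a CSV queried in place with no loading step:
# duckdb.sql("SELECT * FROM 'events.csv' LIMIT 5").show()
```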
Richie Cotton: It's very cool technology, I have to say. And I have some more questions for you about the small data manifesto, but maybe we'll talk about DuckDB for now. Some of the changes to the SQL language incorporated in DuckDB are, I have to say, very cool. Like, in regular SQL it's just a real pain to do simple things like calculate a median, and DuckDB just has a median function.
It makes it easy. Do you want to talk me through some of the other changes that DuckDB's making to SQL?
Ryan Boyd: Yeah, one of my colleagues, Alex, has written a couple of blog posts on this that maybe we can remember to include in the links with this podcast. The blog posts are "Friendlier SQL with DuckDB" and "Even Friendlier SQL with DuckDB." DuckDB has taken a philosophy of: let's make things useful.
Let's improve these ergonomics so that people don't have to do repetitive tasks and things like that. So they started with a base of what is largely the Postgres version of SQL, and the Postgres parser, I believe, is still in DuckDB. Then they said, okay, SELECT STAR FROM, how many people have done that? Everyone has done a SELECT STAR FROM. Why do you need to say SELECT STAR? If you just say FROM table name, they return the results, and it just skips the whole SELECT STAR thing. It's things like that that really make it easier to work with your data. It's not necessarily things you don't have to remember, but things you don't have to type.
So your fingers get a little happier. And they've also done a lot of other things, for example with GROUP BY: you don't need to remember all of the different column names when you do a group by, you can just say GROUP BY ALL. That's a SQL feature which I believe started with DuckDB, but then rolled out to the BigQuerys and Snowflakes of the world.
So one thing DuckDB is doing is really building simpler ways to work with your data in SQL, and those are also rolling out into other database technologies. And honestly, it's about time, because, you know, SQL, I think the '96 standard or the '99 standard, those are kind of the last ones that had any significant changes in them.
And, you know, it's time for us to keep on improving our lives and how we work with SQL and our data.
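A quick sketch of those friendlier-SQL features, run through the duckdb Python package; the table and columns are invented for illustration.

```python
# A brief sketch of the "friendlier SQL" features discussed above, assuming the
# duckdb Python package. The table and columns are made up.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM (VALUES
        ('north', 'widget', 10.0),
        ('north', 'gadget',  4.0),
        ('south', 'widget',  7.5)
    ) AS t(region, product, amount)
""")

# FROM-first: no need to type SELECT * just to peek at a table.
con.sql("FROM sales").show()

# GROUP BY ALL: group by every non-aggregated column without re-listing them.
con.sql("""
    SELECT region, product, sum(amount) AS total
    FROM sales
    GROUP BY ALL
""").show()

# Also handy: the median aggregate mentioned earlier is built in.
con.sql("SELECT median(amount) FROM sales").show()
```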
Richie Cotton: Absolutely. I mean, yes, SQL had its 50th anniversary earlier this year. I've had Don Chamberlin on the show talking about it. But yeah, it was amazing at the time in terms of simplifying things, but things have moved on since the 70s.
Ryan Boyd: Yeah, no, I mean, I'm a strong believer in SQL, and it's funny because, like, I used to work at Neo4j for years. And they have a language called Cypher for interacting with graphs. It's a graph database. I still very strongly believe that graphs are the most natural way for our brains to work with data.
Unfortunately, commercially, that hasn't won out. And, you know, commercially, people still think of SQL and they think of tabular data. Cypher is now reflected in the latest standard as GQL, or something like that, but people still work with data in tables, whether I like it or not, whether I think it's natural or not.
That's life. That's what we build with DuckDB and MotherDuck. Even though you can actually use DuckDB to make graph-like queries and all, and there are some great blog posts on that. And I guess I should do a call-out: we have a website called duckdbsnippets.com where we put all sorts of things from the community, where they find interesting SQL queries, or interesting Python scripts for using DuckDB, or R scripts. You can submit any of that, and there are, I don't know, probably a hundred different snippets out there right now. We're always looking for more contributions from the community of just fun things that you can do in SQL, whether it's DuckDB-specific or not.
So check that out: duckdbsnippets.com.
Richie Cotton: That sounds like a fun educational resource. Now, you've mentioned Python and R a couple of times. And I'm curious, do you think there's an influence of Python and R on SQL now? Is that influencing the direction of the language?
Ryan Boyd: I think it just comes down to the fact that a lot of people who work with data have worked with those languages. And SQL is having somewhat of a renaissance, in terms of people realizing that, okay, the NoSQL movement, yeah, it was good for some things, but SQL still persists. Then the Python and R folks are starting to pick up SQL to be able to interact with their data in their data warehouses or other places.
And so tools like DuckDB actually started as an embedded database for R. The creators of DuckDB, Hannes and Mark, just saw with pain how the R users were interacting with their data, and they were like, why don't you use a real database? And they're database professors, right?
Like, PhDs in databases. And they saw this pain of their colleagues and said, we're going to fix that, because the colleagues were saying, well, databases are hard to install, or databases you have to connect to over the network and the network is sometimes slow, and all of this.
And that's really the origin of DuckDB: how to make life easier for analysts in R. Eventually the Python version became a little bit more popular with DuckDB, but the origins really were there. So I'm sure, whether consciously or unconsciously, the Python and R data analysis stuff is rubbing off onto SQL.
Richie Cotton: It's quite a nice story, the fact that all these analytics communities that were traditionally separate are now interacting, and good ideas are bleeding from one group to another.
Ryan Boyd: I actually have a hat over here that says "data person," because, you know, we say data analyst and data scientist and machine learning engineer and data engineer and all of that, and the vast majority of people we talk to end up doing components of all of that. So there really are data people out there. And I think, especially in these economic times, people are trimming down: you're expected to know how to do all this stuff if you're on a data team in most organizations.
So yeah, the Python and R stuff rubbing off on SQL, and SQL rubbing off onto the Python and R users, I think it's just naturally going to happen, except maybe in some places in academia or what have you. And we do have people registered for our conference, the Small Data SF Conference, who are oceanographers and folks who are deep scientists in various areas. They're starting to use SQL, where they might have been in MATLAB 10 years ago, right? Or things like that.
Richie Cotton: Yeah, that's cool. I do love how almost every industry has some sort of weird thing, weird role where you don't expect there to be data use, but somehow actually data's just invaded that space now.
Ryan Boyd: I think everyone within an organization won't necessarily have a data title, but will need to access data. And we need to make it easier for those types of people to access data. I mean, right now, MotherDuck is young. We just went GA earlier this year.
So I'm not going to claim that we're there yet, but the aim is that every data analyst who just knows SQL should be able to use MotherDuck without a ton of data engineering experience and things like that, because they have data that's sitting on their computer, or they have data that's sitting in HubSpot, or in Stripe, or whatever.
And they might want to just download that and upload it to MotherDuck before they end up doing full data pipelines and things along those lines, or hooking into Fivetran, et cetera. And I think that comes from everywhere within the organization. Like, I've worked in database companies where I didn't have access to the central data lake house, largely because I didn't feel like filling out all the forms for the security stuff.
I was a senior director, but there's still a lot of paperwork to fill out to get access to things. But I had access to the raw sources. So I could go into our marketing systems and download spreadsheets or CSVs of the data, and then upload it to a different bucket and use it.
I probably shouldn't admit all of this, but, you know, I think that everyone kind of works around and uses what tools they have at their disposal. And some of that is just working around policy restrictions.
Richie Cotton: Yeah, I'm, I'm sure next time we do a security episode, there's going to be someone grumbling about what you've just said. But yeah,
Ryan Boyd: There are smart people who grumble about it, but it's true; that's how organizations work. People want to get stuff done. Yeah, well, I was about to swear. GSD, you can look that up. But people want to do their jobs and get home earlier in the evening and play with their kids, so why not make that easier?
And that's all the way down the stack, from what they're running on their machines to what they're coding their analyses in. And that's why I've always believed in, well, developer ergonomics is what I've called it, but "developer" is in air quotes: developer, data analyst, data scientist, whatever it is.
Like, let's make this easier so everyone can go home at night.
Richie Cotton: Right, not very many people want to mess about with installing software or getting stuff set up. They just want to get the insights from that data as fast as possible.
Ryan Boyd: Neo4j, when I worked there, was purely a local-install database, right? They didn't have a cloud version when I was there. And the first thing I did was, okay, downloading a database. You'd expect this download and install to take hours or days, because a lot of people at that time had worked with Oracle, for instance, or even worked with MySQL and phpMyAdmin and all of that space, right?
And it took time to install all these things, configure them, run them. One of the things I did with Neo4j was just download and install and run my first query in 57 seconds, and I created a video out of that just to convince people that we can head to this new world where it's easier to download and install.
And on the DuckDB side, pip install works great. It's one command to run, or brew and tools like that. So, yeah, I think we've gone from the really painful world, through slightly less painful with things like Docker, through today, having these frameworks for installing and managing our software like brew or pip or what have you.
Richie Cotton: Okay, yeah, writing one line of code. That sounds doable for a lot of people. Just, yeah, prove it.
Ryan Boyd: chat.openai.com. You can run one line of code for pretty much anything now.
Richie Cotton: Absolutely. So I'd like to get back to the small data manifesto. I want to test some of the limits of this. You said 10 terabytes is sort of the limit of small data these days. Are there any types of data that aren't going to be that small? Things like image data and video data seem fundamentally big, and everyone's saying that it's an official data type now; data isn't just numbers anymore, there are all these other types. So what do you do about those?
Ryan Boyd: It depends on what type of work you're doing with that data. So, I have a bunch of security cameras at my house, and they all upload to an S3 bucket and become Glacier objects and things like that. And I'm sure that is many terabytes. Am I accessing those security cameras on a regular basis?
The last time I needed to access them was when my bike was stolen out of my garage, and the police didn't even care. So do I really even worry about that data? You have to think about that: what type of analysis are you going to do, and when are you going to be doing that analysis?
I do do AI on some of the security cameras, for people detection, car detection, what have you. But I only need that in the moment, right? I'm not going back into history doing that. So, I did say 10 terabytes, but the reality is the number will shift over time.
I don't want to say 1.44 megabytes on a floppy disk, and beyond that it's big data, or 10 terabytes, and beyond that it's big data, in perpetuity, because the hardware is changing. The max size of an EC2 machine is 25 terabytes of RAM. So if you're running on that machine, maybe you can work with a lot more than 10 terabytes of data.
And then there also tends to be confusion between the amount of data stored versus the amount of data analyzed. Those are very different things, and the requirements for the software are very different. But, as you mentioned, binary files and things like that: nowadays we have great cloud storage mechanisms that provide the distributed nature of storage, which allows us to actually run services on a single node, because we have that underlying distributed storage network.
And so I do believe in things like the lake house, for instance, providing an organized way to store all the data, and the binary data and things like that, especially the older, archival data, what I might put in Glacier storage on S3. And that's the one nice thing: both DuckDB and a lot of the data warehouses nowadays allow you to work with the data in a lake house and also the recent data in the warehouse. That, I think, is important, having easy ergonomics to access that other type of data.
But the distributed storage systems are great. You can throw whatever you want at them. We just published documentation about Cloudflare R2 and some of the new things that they allow you to do. The distributed storage systems are getting better.
So it doesn't worry me that some people do have, quote unquote, big data. You still have to ask the question: what are they doing with it? What is the timeframe that they care about? Is it just today, or last week, or the quarter, with everything else aggregated before that? There are nuances that you get there, but I don't think it disproves the idea of small data, even though you have some of these blobs sitting around.
Richie Cotton: You might have vast amounts of security camera footage, but you're probably not going to crunch numbers on all of it unless...
Ryan Boyd: I mean, I've debated doing that just for the heck of it, because it might be fun to know, over time, how many times do I leave and return to the house? But honestly, these aren't the biggest problems that we deal with, I guess.
Richie Cotton: Okay. Actually, I was curious as to how much data DataCamp has and what our biggest data sets were, so I asked our head of data. And he said that by far the biggest data set we have is web events data, because of everyone clicking on the website. That's a huge raw data set, but you don't actually analyze the whole thing; individual clicks, no one cares about.
So it's always aggregated. So what does the MotherDuck setup look like, then, if you've got big raw data sets but you want them processed in some way, or aggregated, before you analyze them?
Ryan Boyd: I mean, I think you do sort of standard aggregate tables. That's what we have for our own data warehouse: we have separate tables that are aggregates. The raw data is still there and can still be accessed. Oftentimes it's just like the event data that you're talking about.
Yeah, it might be super large. But also, for instance, you might have the user agent string in there. That's super huge, right? But you might not care about that. Usually we don't care about the browser the person is using nowadays. We care mostly about what the event is and who the user is.
Two columns, maybe time, three columns, right? So I'd be curious, if you asked your data people, if you broke it down by columns, is it sub 10 terabytes that you access on a regular basis? I would imagine so. And maybe DataCamp is in the data 0.05 percent or whatever. But there are a lot of companies out there which don't have the amount of web traffic and all that you folks have.
So, I don't know. I think that from a MotherDuck standpoint, it's things like logs, data that goes through our Datadog and things like that.
That's the type of data that's large for us. I imagine for a lot of other organizations, it's things like sensor data from connected devices and such. But like you said, you're not analyzing all of that all the time, right? You're analyzing a subset.
So the question is, what is the data size of those two columns for the last month? And then can I build aggregate tables for each of the other months? How far back do I need to store daily aggregates? And how far back do I need to store all of the events? There still might be cases where you say, okay, you know what? I stored all of this stuff and I forgot to analyze X, Y, Z, and I want to analyze X, Y, Z on all this historical data. Those situations exist, and I would say for some cases, maybe you'll use a distributed compute engine. Maybe you'll use Spark or something like that for that.
But people are also finding new and creative ways to use things like Lambda functions. We've seen people run thousands of DuckDB instances at a time for ETL jobs in Lambda functions. And there's actually a great story from Jake Thomas, I believe, at Okta, who has built their entire ETL infrastructure using DuckDB, using single compute instances.
Now, granted, he's doing it in a distributed way, but in a distributed way where the ergonomics have gotten a lot easier due to things like Lambda functions.
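As a rough sketch of the general shape of that pattern (one DuckDB instance per serverless invocation, each handling one partition), something like the following; this is not Okta's actual pipeline, and the bucket names, paths, and event format are invented, though httpfs is a real DuckDB extension for S3 access.

```python
# A hedged sketch of Lambda-style ETL with DuckDB: each invocation aggregates
# one partition of raw events from S3 and writes the result back. Not any
# specific company's pipeline; bucket names, paths, and the event shape are
# invented. httpfs is DuckDB's extension for reading/writing S3.
import duckdb

def handler(event, context=None):
    day = event["partition_date"]  # e.g. "2024-06-01", supplied by the trigger
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # S3 credentials would normally come from the execution role / environment.
    con.execute(f"""
        COPY (
            SELECT user_id, count(*) AS events
            FROM read_parquet('s3://raw-events/date={day}/*.parquet')
            GROUP BY ALL
        )
        TO 's3://aggregates/daily/date={day}.parquet' (FORMAT parquet)
    """)
    return {"partition_date": day, "status": "done"}
```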
Richie Cotton: Running thousands of instances of DuckDB in parallel, has he just reinvented BigQuery?
Ryan Boyd: It depends on how those DuckDB instances need to communicate with each other. The real problem with a lot of these distributed systems is the overhead of communicating with each other. So BigQuery, for instance, when I was there, and I'm sure these things have changed: every query ran across 1,100 different machines, and there would be 550 different sets of data that each of those machines had access to. Every bit of data would actually be analyzed twice, and they would throw away the slowest one so that you didn't have one machine bogging down the whole query.
So there were some fun things like that. And as a geek, sure, that's exhilarating. But you don't need that for the vast majority of queries; that's kind of the point, especially as hardware has grown. At the time, when the max size of memory on an EC2 machine was, I think, 38 gigs or 41 gigs, something like that, you wanted to have tons of machines if you were going to analyze terabytes of data.
It's just that the world has changed. And so, yes, folks like Okta are doing something that's pretty out there, but they are in the data one percent. They do the authentication and authorization for the vast majority of the Fortune 1000, I think, and they have lots of data. And they just looked at it and said, you know what?
We can break this data up, have DuckDB analyze it, write it back to storage, and then do the aggregates like that. Watch the video; I think we'll include some links here too, we can share that. But it's an innovative way that those folks have approached taking this very simple tool and using it in a fun way that does fit largely what you would call a big data use case, a distributed use case, but they found it easier to use than something like Spark.
Richie Cotton: Suppose you've convinced some people about the small data manifesto, what advice would you give to them in order to adopt a small data ethos?
Ryan Boyd: Not to steal an Apple slogan, but think differently. Look at your data, look at your infrastructure. Are you building your infrastructure for the size of data that you have, and for how fast the compute technology is growing? We really do want people to have that thought process.
Where could I get better ergonomics for my analysts, for my engineers? Do I need the complexity of some of this software that I've been working on? Do I need the maintenance costs of some of the software that I've been running, or could I use something much easier to work with, something much simpler?
You know, I think the phrase that we now have on our own page is "the ducking simple data warehouse," and that's because a lot of these technologies, as they get built over many years, at some point just become so complex that it's painful to use them. And a lot of that is due to the distributed nature of the technologies.
So there are some people that are still not going to believe it. My co-founder Jordan wrote a blog post called "Big Data is Dead." I thought we were going to get all sorts of crap for it from the community, but I would say 80 percent of people believed it, 10 percent of people said, you guys are full of crap, and 10 percent of people were like, huh, I'm confused, but I'll look into it.
And there are still the believers, people that hold a definition in their head of what big data is that's different from everyone else's. But to me, and to the 80 percent of people that I've talked to, we really do need to focus on simplicity instead of focusing on how distributed we can make our technology.
Richie Cotton: Okay, I like that. Pick the simplest solution that's going to fit your use case. Yeah. Okay, excellent. On that note, I think we'll wrap up. Thank you so much for your time, Ryan.
Ryan Boyd: Absolutely. Thank you, Richie. It's great talking with you.