Data Driven Venture Capital with Andre Retterath, Partner at Earlybird VC

Richie and Andre explore the concept of data-driven venture capital, the data-driven VC process, the human element in VC, the challenges and opportunities of early-stage investments, the importance of early identification, and much more.
May 2024

Andre Retterath

Dr. Andre Retterath is a Partner in Earlybird’s Munich Office, focussing on enterprise software with a particular interest in developer, data and productivity tools, alongside AI-centric products and robotics. Before transitioning into VC in 2017, he gained more than 5 years of experience as a process automation and predictive maintenance engineer at ThyssenKrupp and further insights as a management consultant at GE North America. Andre also has his own VC, AI & data newsletter, Data-Driven VC.

Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

For us, it's very important to scrutinize the team to fully understand if they can and are willing to go the whole journey to build this multi-billion dollar plus company. That's number one.

We have this new movement of data-driven VCs that leverage data. And then the data-driven VCs among themselves can be split into the camp of augmented VCs. So those are VCs like myself, like Earlybird, that we believe in a combination of data-driven approaches and human in the loop. We call it an augmented VC approach because eventually it's always the human making decision. We believe that founders do not want an algorithm on their board. We believe they want to work with humans. If stuff hits the fan, they want to call someone.

Key Takeaways


Collect comprehensive primary and secondary data, including web scraping and integrating APIs, to enrich your investment funnel with diverse information sources.


Implement AI models to automate the screening process of potential investments, which can significantly improve recall rates and reduce false negatives.


Despite the advantages of automation and AI, ensure that human judgment plays a crucial role in the final investment decisions to account for nuances that data may not fully capture.


Richie Cotton: Hi, Andre. Thank you for joining me.

Andre Retterath: Hi, Richie. Thanks for having me.

Richie Cotton: Excellent. So just to begin with, what exactly is data driven venture capital?

Andre Retterath: It actually started from me writing. So back then, out of my research, I started writing more regularly on Medium. And then I just wanted to give it a name at some point.

So I translated it into a Substack newsletter and it was called Data-Driven VC, because it was really about leveraging data and AI in the venture capital industry. And today it has grown beyond just a newsletter. It's a community. We have different kinds of streams, from the first online conference to different kinds of reports and so on.

So it's really a community and quite an ecosystem around that movement.

Richie Cotton: That's very cool. But, um, I think maybe coming from the world of data, it seems fairly obvious that you ought to use data in order to invest in things. So why doesn't every VC make a lot of use of data?

Andre Retterath: Well, that's the same question I asked probably seven years ago.

I come from a data engineering, data science kind of background. And when I randomly stumbled into VC, I was really surprised how manual and inefficient all of the workflows were. I couldn't believe back then that the people who back the most visionary, disruptive founders in the world still work themselves like in the 1950s, or mostly the same.

I think one digitization step that happened in the industry is really from pen and paper to PowerPoint and Excel. But that's really it. Some people are like, yeah, why do we call it data driven? Of course we use data to inform our decision. But when I think about data driven, it's really leveraging modern data technology, the modern data stack, and then also AI, more intelligent models and algorithms, to first of all collect the data and then process the data to eventually get to a better decision, to be cautious of biases, to make sure you have all of the information available to make an informed decision, and so on.

And when I first got into venture in around 2017, it was mostly manual. It's people sitting here and there. So one example: if you go to a conference, you download the conference list, you have 400 participants or something. And then you start looking them up. You start with Crunchbase, Dealroom, PitchBook.

You look them up on LinkedIn. You look for news mentions and stuff like that. And this manual research process of just collecting the data takes days and weeks. And the same is then true once you have the data available at hand: you need to somehow structure it, put it into your internal format.

You need to create a memo. And then you also need to balance these data points based on your previous experience. That is where all of the different kinds of biases, like recency bias, similarity bias, confirmation bias and stuff like that, come into play, which is partially important for investment decision making but oftentimes also neglects the more widely available sample of information. So just because you had this specific experience with this one founder that might have been bad does not mean that this is to be generalized as a bad experience with a founder. And with these more comprehensive data driven approaches, you can suddenly consider a wider spectrum of data.

And that was just not the case. Like when I got into venture, there were a few funds, like really a handful of funds that have tried it when big data was a thing and they stopped because, uh, people said, number one, it's all private data. So venture capital is a sub sector of private equity. So it's private investors investing into private companies.

And then by nature, this data is not as available as for public companies. Number two, we invest mostly in early stage. That means data is more qualitative versus quantitative. If you invest in public companies or growth scale-up companies, the data, specifically the traction data, commercial data, is more quantitative by nature.

And those were the arguments that people had and said, look, data driven VC as an approach in early stage venture will just not work.

Richie Cotton: Yeah, that certainly makes sense that if you're investing in an early stage company, there's less data available compared to a publicly traded company that's been around for decades.

So I'd love to get into some of the details on where data is used, but maybe just at a high level: you mentioned you can use data for researching companies or founders, but can you talk me through all of the different areas where you do actually use data?

Andre Retterath: Sure. I think to start with, what's most important is to really understand the process.

So for those who are not too familiar with venture capital, we first need to collect money from someone. These are what we call the limited partners, the LPs. So we have our own fundraising process where we collect the money. Once we have the money available, we go into the investment process, and the investment process you can dissect into different stages.

The first one is sourcing. That means we need to build up the funnel of all opportunities, and we need to enrich this funnel. So we need to collect as much information as possible about every single company in the funnel. Then we have the screening, which is the second stage. It means we need to go from tens and hundreds of thousands of opportunities, top of funnel, and narrow it down to a number of opportunities that's manageable by humans.

This is where different kinds of screening algorithms, but then also specifically the enrichment data, which can be time series data or quantitative or qualitative data of whatever nature, will be used and processed in different models to attribute a likelihood of success to these specific companies.

So the idea is really to say, this company is more interesting than this company. And this is how we narrow down the funnel of opportunities. Once we have a number that we can manage as humans, we get into the due diligence process. So we meet the founder, we do proper diligence on a specific company.

We create competitive landscapes. We benchmark the metrics and all of this stuff. So this is what happens in the due diligence, where we also do customer references, references on the founder, and so on. And once we have all of this data available, we prepare it in an investment memo. And this investment memo is what we use as a baseline to inform our wider team and the investment committee to make a positive investment decision.

And then after the investment decision, it actually flips around, because so far we've been diligencing the founders, but then we need to convince the founders to do potentially competitive deals with us: we are the right partner for you here. And once we have this transaction closed, this is where the fourth part starts.

This is portfolio value creation. So we do introductions to customers, to potential hires. We do introductions to follow on investors for follow on financing rounds and so on. And then hopefully if things go well after whatever, six, seven, eight years, we will have an exit and then we can return the money to our investors.

So this is the process: sourcing, screening, due diligence and closing, portfolio value creation, and exit. And along these lines is how we structured our modern data stack. For the sourcing, it's all about building up the funnel. So we need to collect data, which I dissect into primary and secondary data.

Secondary data comes from established data aggregators, so companies that act as an intermediary and collect different information from whatever providers. This is the classical Crunchbase, Dealroom, PitchBook, CB Insights, VentureSource, Tracxn, and so on, you name it. I've benchmarked them in a study before in terms of coverage and accuracy.

And we are just about to publish an update on this. The second dimension is then the primary or verticalized data. This is, for example, LinkedIn data, buyer intent data, customer references, news data. So very specific data about all of the companies. And then we merge this primary and secondary data together. The secondary data you can just buy.

So you can integrate it via an API. The primary data providers oftentimes don't have an API, so this is what we crawl mostly ourselves. We have built very scalable web scrapers. Or, legally, if we need to buy a license, we buy a license. We use different kinds of proxy networks to emulate human behavior, collect this data, and then merge it together with the secondary data.

And then in addition to primary and secondary data, we also have our network data. So we integrate into our email and calendar to get this knowledge graph of different kind of, uh, nodes in our network. And then it's really about entity matching, merging all of that together and creating a single source of truth.
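To make the entity matching step more concrete, here is a minimal Python sketch. It assumes every source yields dictionaries with a `website` field and uses the normalized domain as the match key. The records, field names, and matching rule are hypothetical illustrations, not a description of Earlybird's actual pipeline.

```python
from urllib.parse import urlparse


def domain_key(website: str) -> str:
    """Normalize a URL to a bare lowercase domain, used here as the match key."""
    url = website.strip().lower()
    if "://" not in url:
        url = "https://" + url  # urlparse needs a scheme to locate the host
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host


def merge_sources(*sources: list[dict]) -> dict[str, dict]:
    """Merge records from several sources into one record per company.

    Earlier sources win on conflicting fields; later ones only fill gaps.
    """
    merged: dict[str, dict] = {}
    for records in sources:
        for record in records:
            entity = merged.setdefault(domain_key(record["website"]), {})
            for field, value in record.items():
                entity.setdefault(field, value)
    return merged


# Hypothetical records standing in for aggregator, scraped, and network data.
secondary = [{"website": "https://www.example-startup.com", "funding_usd": 2_000_000}]
primary = [{"website": "example-startup.com/", "headcount": 14}]
network = [{"website": "HTTP://Example-Startup.com", "warm_intro": True}]

companies = merge_sources(secondary, primary, network)
```

All three records collapse into a single entry keyed by `example-startup.com`. Real entity matching usually also needs fuzzy name matching and conflict resolution rules, which this sketch leaves out.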

So this is how we start top of funnel. And then obviously, it's more about algorithms for screening and making the data actionable.

Richie Cotton: That's absolutely fascinating. And it also seems like if you're one of these primary data source providers and you don't have an API so other people can make use of it, you're probably missing a trick there.

Um, Okay. So, uh, you mentioned like there's these four different stages, your sourcing, screening, due diligence, and then the value creation. Um, if you're getting started, like which area should you target? Like where is data going to add the most value to begin with?

Andre Retterath: That's a great question. And I think there are different approaches to it.

One, let's say a bit more academic, top-down approach is to just look at the value creation. There are different studies showing that in early stage venture capital the majority of the value, and more specifically studies say about two thirds of the value, is created in the sourcing and screening stages of the investment process, which means venture capital is a finding-and-picking-the-winners game. By the way, in venture capital you also typically have a power law distribution in returns. So if we build a portfolio as an early stage investor of 30, 35 investments, we have this power law distribution. For those who don't know what that means, you might know the Pareto principle, the 80/20 principle.

And early stage investing is similar to the 80/20 principle, but with a higher alpha coefficient, which means we rather have like 10 percent of the portfolio delivering 90 percent of the returns. That being said, it's completely different to later stage investors or private equity investors, who have more of a normal distribution in their returns.

So if we look at it for us, it's this power law distribution in the returns. And for us, we start with getting comprehensive coverage top of funnel and then narrowing down. And this power law is specifically important because for us, the worst thing that can happen is not to invest in a company that we need to write off that gets into insolvency whatsoever.

That's part of the business. The worst thing that can happen is that you're in front of an outlier, but you misjudge them and deselect them. So it's really the false negatives that we need to reduce. In data science speak, we need to improve the recall rate. And this is why most funds, based on this value driven approach, now start with the sourcing and screening process and work their way through the value chain to due diligence, then into portfolio value creation, and then beyond the stack into the back office.
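As a small illustration of why recall is the metric that captures missed outliers, here is a Python sketch using the standard definitions of recall and precision. The counts are made up for illustration.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual winners the screen kept: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Share of shortlisted companies that were actual winners: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Hypothetical screen: the funnel contains 10 eventual outliers, the screen
# shortlists 200 companies, and 9 of the outliers are among them.
tp, fn, fp = 9, 1, 191

screen_recall = recall(tp, fn)        # 9 of 10 outliers survive the screen
screen_precision = precision(tp, fp)  # only 9 of 200 shortlisted are outliers
```

A screen tuned this way accepts a long shortlist with low precision, because humans can still filter it in due diligence, while a deselected outlier (a false negative) is gone for good.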

Richie Cotton: Okay, so if I understood this correctly, because you're dealing with early stage startups, most of these startups you invest in are going to fail, you get nothing, and then some are going to get an enormous hundred times, thousand times return. Is that the idea?

Andre Retterath: Yeah, I think our idea is really to multiply our funds several times.

So you get into the top quartile if you deliver around 20 percent net IRR. And if you do the math, a typical venture fund runs 10 years. So if you do 10 years without compounding, you just take 20 percent, that's 10 times 20 percent, so it's 200 percent. And you need to return the initial money, so that's 300 percent, which means you need to return three times the fund to get into the top quartile of performance.

And obviously, our ambition is higher than that. So we aim for like five, four, six times on every single fund that we invest in. And from this, essentially, we have this power law distribution. I said 30 to 35 investments. So it's not a majority, as you might assume, but probably a third of the investments are written off.

So it's zero. And then you have around half of the investments that collectively return the fund once. And then you probably have another handful of investments that return more than like three, four, five X, but then you will hopefully have whatever two or three that are the major outliers in the fund that return 1, 500 times the initial investments.

And that drives the returns for the fund to make the difference between the two X fund and the six, seven, eight X fund.
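The back-of-the-envelope fund math from this answer (20 percent a year over 10 years, no compounding) can be written out in a few lines of Python. The compounded figure is added only for contrast and is not a claim from the conversation; a real IRR also depends on when capital is called and returned, which this sketch ignores.

```python
def simple_multiple(annual_return: float, years: int) -> float:
    """Fund multiple without compounding: initial capital plus yearly gains."""
    return 1 + annual_return * years


def compounded_multiple(annual_return: float, years: int) -> float:
    """Fund multiple if the same annual return compounded every year."""
    return (1 + annual_return) ** years


# The rule of thumb from the conversation: 10 years at 20 percent.
top_quartile_simple = simple_multiple(0.20, 10)          # 1 + 10 * 0.20, i.e. 3x
top_quartile_compounded = compounded_multiple(0.20, 10)  # 1.2 ** 10, about 6.19x
```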

Richie Cotton: Okay, so when you got this huge difference in outcomes, it sounds like the selection process can be incredibly important. Can you talk me through what that involves?

Andre Retterath: Totally. So if we look at companies, first of all, I like to describe the manual processes and then how we map that into our data infrastructure.

So if we look at early stage ventures, the most important one is always the team. If we do postmortems, like why did companies not take off, it's in most cases something related to the team. So the CEO or the management didn't grow into the new responsibilities, there were conflicts among the founders, and so on and so forth.

And there's tons of research on that available. And that's the most important because it's something that yes, you can rely on external credentials, like which university did they go to? Are they serial entrepreneurs? Um, did they work for a tier one employer before and so on and so forth. So these are the external credentials.

But what's super important is to sit together with the founders, look each other in the eyes and really understand, like, do they have the passion? Are they going for the big outcome that we all need? Like for us as venture investors, because of this power law distribution. Every single company needs to have the potential to become a multi billion dollar company.

Otherwise, we would not invest. And for us, it's very important to scrutinize the team to fully understand if they can and are willing to go the whole journey to build this multi billion dollar plus company. That's number one. Number two is then obviously related to the market. So we need to look into the market and understand, is it even possible to build a multi billion dollar company?

If you take conservative multiples today, so let's say for an enterprise software company, to take it easy, it's a 10x on the revenues. And if you want to build a 5 billion company, the company needs to make 500 million in revenue. Which means at, let's say, an aggressive market share of 10 percent, the market needs to be at least 5 billion or something.

So we need to understand what is the absolute market size, what are the relative market dynamics, what is the distribution in the market, are there incumbents, is it white versus brown field, and so on. So those are all market related question marks. Then we look into the product and the technology, to really understand: is this different?

Is this differentiated from the others? And we spend tons of time diligencing that. It's also what I enjoy most, really looking into the product and technology with the founders to see what they have built or are about to build, given that we also invest at the very earliest stages. And from that, you also get to the question of competition: how is this different from what X, Y, Z is already doing?

So we spend significant time thinking about the competitive landscape: how it's different, how are others funded, what is your new approach? And from that, we get into another bucket that we call defensibility. We split defensibility up into backward looking defensibility, which means once you've signed a customer and are working with your customer, how difficult is it for this specific customer to switch away?

So this also means switching costs. Can you increase switching costs for your customers? And then the second one is a forward looking lock-in effect. That's everything from economies of scale to, for example, network effects, data network effects, and stuff like that, where every additional customer makes it even easier to win the next customer.

And you can build up these very strong lock-in effects that prevent others from catching up. In the data sphere, that means specifically you have access to unique data that helps you train better models. With better models, you can attract new customers. With new customers, you can collect more data.

And that's the data flywheel. So that's something we look into. And then obviously, if the company is already a bit more mature, we look into traction, commercial traction metrics, absolute, relative, and so on and so forth. So those are, on a higher level, the different kinds of buckets or dimensions that we look into when assessing the company.

And now it's about translating that all into the data stack. How can we collect the data to automate parts of this assessment or automate parts of the data collection for this specific assessment?
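The market-size arithmetic from this answer (a 10x revenue multiple, a 5 billion target valuation, and a 10 percent market share) can be sketched in Python as follows. The function names are hypothetical and the inputs are the round numbers used in the conversation.

```python
def required_revenue(target_valuation: float, revenue_multiple: float) -> float:
    """Revenue needed to justify a valuation at a given revenue multiple."""
    return target_valuation / revenue_multiple


def required_market_size(
    target_valuation: float, revenue_multiple: float, market_share: float
) -> float:
    """Minimum market size if the company captures `market_share` of it."""
    return required_revenue(target_valuation, revenue_multiple) / market_share


revenue = required_revenue(5e9, 10)           # 500 million in revenue
market = required_market_size(5e9, 10, 0.10)  # a market of roughly 5 billion
```

Lowering the assumed market share or the multiple raises the required market size, which is why both inputs are worth stress-testing.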

Richie Cotton: Okay, so, uh, last one is there, and it seems like some of these things, like understanding the total addressable market or the other market dynamics like defensibility, ought to be sort of straightforward to do in a quantitative way.

Other stuff, like you mentioned the team and the founders, how do you assess that in a quantitative way? Like, how do you get data on whether this person's really committed to building a multi billion dollar company?

Andre Retterath: Yeah, that's a great question. And that's also where we try to prove people wrong who initially said it's not possible for qualitative data.

To give you an example, let's assume there are two founders and we scrape the LinkedIn profile of the CEO. From the LinkedIn profile, which is initially one data source, we can extract her previous academic experience. We can extract the previous work experience. We can extract interactions with other posts, like liking, we can look at the sentiment, and so on.

We extract these features and then we process these features. So let's look into the academic experience. From the academic experience, we can look into the highest degree. Then we can look at the university and the subject. We have built dictionaries to represent university rankings. So all of the university rankings that are out there, both from which kind of papers within a specific domain and so on, but then also the subject, because there are some universities which might classify as tier two, but if you study a specific subject at this university, that's actually considered tier one. So it's a combination of the subject and the university that helps us, with hard coded dictionaries that are obviously dynamically changing over time, to assess the academic credentials.

Then from the year of your first degree, we know with statistical significance your age. So we can infer your age. Let's continue the same logic for your previous work experience. So let's assume you worked at Google. You would typically say that's amazing, but if you, let's say, worked at the front desk at Google for two years, that might not be as useful for an enterprise software company versus if you have done full stack development for six years at Google.

So you can look at the duration someone spent at a specific position, you can look at the progress across seniorities. You can look at the combination of position at a specific company, which also implies expertise. Then you can also look at previous founding experience, with the same founding team or a different founding team.

And then you can also look at the third dimension, into their social interactions. So, which stuff did they like? There are some research streams that we combined. One is around Cambridge Analytica. So you can actually take the social media interactions from people and predict, with statistical significance, the big five character traits.

That's one research stream. And that means if you like, I don't know, Adidas, for example, you're more of an introvert, but if you like Nike, you're more of an extrovert, to simplify it. And we take this research stream and couple it together with another research stream in marriage research. It's from a U.S. professor. In this, with the big five character traits of a couple, you can predict with statistical significance if this couple will stay together. So we merge these different kinds of research streams to predict if there's a likelihood for founder conflicts and so on, just to give you a feeling beyond, is this an introvert or extrovert?

Um, how active are they? And so on and so forth. And I just wanted to describe this example in a bit more detail to give you a feeling of what we can actually extract just from one data source being LinkedIn. And there's tons more, like we create hundreds of features just from LinkedIn that is by nature all qualitative, but can be semi quantified and then leveraged in our algorithms for prioritization of opportunities.
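The kind of feature extraction described in this answer can be sketched in Python as below. Everything concrete here is hypothetical: the tier dictionaries, the university and employer names, the assumed graduation age of 22, and the feature names are illustrations of the idea, not the features Earlybird actually computes.

```python
from dataclasses import dataclass

# Hypothetical tier dictionaries; real ones would be built from published rankings.
UNIVERSITY_TIER = {"Example Tech": 1, "Sample State": 2}
# A (university, subject) pair can override the university-level tier.
SUBJECT_OVERRIDE = {("Sample State", "Robotics"): 1}

ASSUMED_GRADUATION_AGE = 22  # assumption used to infer age from the first degree


@dataclass
class Role:
    employer: str
    title: str
    years: float


def academic_tier(university: str, subject: str) -> int:
    """Tier for a degree; unknown universities fall back to tier 3."""
    override = SUBJECT_OVERRIDE.get((university, subject))
    return override if override is not None else UNIVERSITY_TIER.get(university, 3)


def inferred_age(first_degree_year: int, current_year: int) -> int:
    """Rough age estimate from the year the first degree was completed."""
    return current_year - first_degree_year + ASSUMED_GRADUATION_AGE


def profile_features(university, subject, first_degree_year, roles, current_year):
    """Turn qualitative profile data into a flat dictionary of numeric features."""
    return {
        "academic_tier": academic_tier(university, subject),
        "inferred_age": inferred_age(first_degree_year, current_year),
        "engineering_years": sum(
            r.years for r in roles if "engineer" in r.title.lower()
        ),
        "longest_tenure": max((r.years for r in roles), default=0.0),
    }


features = profile_features(
    "Sample State",
    "Robotics",
    2012,
    [Role("BigCo", "Front Desk", 2.0), Role("BigCo", "Full Stack Engineer", 6.0)],
    current_year=2024,
)
```

In this toy profile the subject override lifts a tier-two university to tier one, and only the six engineering years count toward `engineering_years`, mirroring the front desk versus full stack example from the conversation.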

Richie Cotton: I shouldn't be surprised that it actually takes hundreds and hundreds of different features just to quantify a single person, because people are quite complex. Talking about the big five character traits, so that's things like extroversion and how conscientious you are and how open you are to new ideas.

Are there any of those character traits you found, um, make for good leaders or make for good interactions between leaders?

Andre Retterath: Yep. And I have published some of that research through Data-Driven VC, my newsletter, before, and the answer is, it depends. We have something that in the statistical world we call the moderator effect, right? So we have the dependent and the independent variables. Let's say the variables we are looking at are, I don't know, the team and so on, and, let's say, the market and then the business model. And actually the market and business model can act as a moderator for which team characteristics are required to build a successful company.

More specifically, that means if we want to train a model to identify e-commerce founders, and we have tons of data about successful e-commerce founders, we will most likely find that those are people that are stronger in execution and excellence, as we call it. So most likely people that spent time at, I don't know, McKinsey, BCG, Bain, consulting, IB, Morgan Stanley, JP Morgan, and so on.

Studied at a business university, did an MBA, so more execution heavy people. And they are more likely to be successful in an e-commerce business, just by the nature of the business. Now, if we look at the moderator effect and change it to, let's say, core fusion, then suddenly people that are PhDs, professors, postdocs, like very research heavy, have a higher likelihood to be successful in that space than someone with, let's say, a consulting and business background. So we see this moderator effect, and it's strongest for the market and business model combination, which determines the impact of the input variables, like the team, like the competitive landscape and so on, on the success of the company.
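One common way to express a moderator effect is a linear model with an interaction term, where the slope of one input depends on the moderator. The Python sketch below shows that idea with a binary sector flag. The coefficients are invented purely for illustration and are not estimates from any data mentioned in the conversation.

```python
def success_score(
    research_score: float,
    is_deep_tech: int,
    b0: float = 0.10,
    b_research: float = -0.05,
    b_sector: float = -0.05,
    b_interaction: float = 0.30,
) -> float:
    """Linear score with an interaction term: the sector moderates the
    effect of a founder's research background."""
    return (
        b0
        + b_research * research_score
        + b_sector * is_deep_tech
        + b_interaction * research_score * is_deep_tech
    )


def research_slope(
    is_deep_tech: int, b_research: float = -0.05, b_interaction: float = 0.30
) -> float:
    """Effect of one extra unit of research background in a given sector."""
    return b_research + b_interaction * is_deep_tech


# With these made-up coefficients, research background slightly lowers the
# score in e-commerce (flag 0) and clearly raises it in deep tech (flag 1).
slope_ecommerce = research_slope(0)
slope_deep_tech = research_slope(1)
```

The point of the sketch is only the structure: the same team feature gets a different weight depending on the market and business model, which is why a single model trained across all sectors without interactions can mislead.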

Richie Cotton: Okay, that's absolutely fascinating that different industries require sort of different personality types and different experience. Maybe, now I've said that out loud, it should be slightly obvious. But, um, it's good

Andre Retterath: There's one common ground, actually, and we call it the entrepreneurial compass.

And this is very difficult to quantify. This is really something like, are these people willing to hustle through everything? What's the intrinsic motivation? Are they driven by money? Are they driven by intrinsics? Is it something like back in their childhood, they need to prove their father wrong and stuff like that?

This is very difficult to project from the outside, but those are the very intrinsic motivations. And we work our way through collecting more data, more innovative approaches, to predict as much as possible of that. But eventually it always requires, in my perspective, humans sitting together, spending time together, and then making a decision like, we actually want to build this together. And this is why I personally believe we have the traditional world of venture capital, where everything is done manually and you meet your friends at the whatever golf house and you make a deal and it's like, we will invest in this company.

This is like the world that we've seen probably 50 years ago, and still see today for a majority of the industry, to be frank. And then we have this new movement of data driven VCs that leverage data. And then the data driven VCs among themselves can be split into the camp of augmented VCs. So those are VCs like myself, like Earlybird, that believe in a combination of data driven approaches and human in the loop. We call it an augmented VC approach because eventually it's always the human making the decision. We believe that founders do not want an algorithm on their board. We believe they want to work with humans. If shit hits the fan, they want to call someone.

And then there is the camp of full quant funds. There are very few of them, it's very early, but they say you can even do early stage venture without any human involved. Like, the human is the root cause of all problems, of wrong investment decisions, and we remove the human from the equation. So we have the traditional and then the data driven world, and within the data driven world we have the augmented VCs and the pure quants, which are just very few and starting out. And we see ourselves as a very early mover in this augmented camp.

Richie Cotton: So you've got data to assist your decision in the augmented camp, but then, as a human, you're sort of making that final decision.

Andre Retterath: 100%.

Richie Cotton: It's interesting that you mentioned the quant approach, saying like humans are responsible for all the bad decisions. And you mentioned some biases before.

Can you talk me through a bit more on like, uh, what are these biases that might make you, um, make wrong investment decisions?

Andre Retterath: It's a super interesting topic, which, funny enough, I also wrote about. I wrote a dedicated article about these biases among investors. And if I recall correctly, it's a bit more than 20 biases, I think 25 or something, to which investors are most prone. And it's like a similarity bias.

So the founder went to the same university and, like, I invest in someone from my alma mater. Something like that. It can be something like confirmation bias. So you have a thesis on a market and then the founder tells you this thesis, and you feel like, oh, this founder is amazing. He's just confirming, or she's just confirming, what you had in mind already.

Or you have something like recency bias, where you've worked with a founder that has done something fantastically right, and then you're looking for exactly the same kind of characteristics in new founders and so on. And that might be good. Some of the greatest investors probably have an amazing gut feeling.

So as a venture capital investor, it's not only about analytics. I think that's the hygiene factor. Everyone needs to be analytically strong, but there are also people who have operational experience that have different experience and just a great gut feeling for people. They just know they can read people in a way of like, this is a great entrepreneur.

And if you have this gut feeling, that's very difficult to replicate and scale up. So these are also the people that strongly believe it's a cottage industry. It's all like handcraft and you cannot scale this up. But the problem is that all of these biases are created through a very limited sample.

So if you look around as an early-stage investor: I mentioned we have a portfolio of 30 to 35 companies that we build up over three to four years. So as a partnership, we do, ballpark, 10 to 12 investments per year. Obviously we add follow-on fundings on top of that, but let's say over a time period of five years, every partner probably does some 10 investments as an early-stage investor.

And if you look at the success and failure rates that I mentioned before, over the course of five years, if you do a good job, you probably see one or two really successful companies, some mediocre ones, and some you will write off. And if you project that even further, venture is a long-term business.

So even though the company might raise follow-on funding and we see early indications of success, oftentimes it takes six, seven, eight, nine, ten years or more to actually return money. So for us as early-stage venture capital investors, it takes a decade to know if we are any good. You can hit one; that can be luck, that can be whatever. But you need to replicate it. You need to do it twice or three times.

And the problem is that until you get to this point, like 15 or 20 years out, your biases will be shaped by a very small sample size, and you strongly believe that these biases are right. Now, with data-driven approaches, you can rely on the global sample of all companies that ever became successful, and you can look at the mutual patterns that these successful companies had in common.

And for me, it's a combination of both. I think the pure data-driven approach, in my perspective, also doesn't work, because eventually you need to sit down with the founder and you need to trust your gut: does it feel right? Do I want to work with this person for the next five to ten years?

And the other side, for me, is just looking at a limited part of the sample, which can be the right one if you work with the right founders and you have the right biases. But in most cases, you miss some part of the wider reality.

Richie Cotton: Okay, so it seems like if you've got someone who's really, really good at reading people, they can probably go with their gut more. Certainly I've met people like that, and they are incredibly good most of the time, but when they get it wrong, they won't change or won't listen to the data because they believe they're right all the time.

Okay, let's talk about how you make this happen at your own company. If you want to make your company more data driven, who do you need to hire?

Andre Retterath: It's a great question. Just for context, Earlybird is one of the most established, one of the oldest venture capital firms here in Europe. The firm has been around since 1997, so it's 27 years in the market, and it went through different kinds of ups and downs.

That means we also have a legacy, positive and negative, and there is an established culture. I came in for the first time in 2017, when I joined in parallel to my PhD, and then full time in early 2018. The established culture was already very analytical and technical in a way.

That's also why I joined Earlybird in the first place, because it was not like your traditional consumer investor. It was very technical in nature already, with a strong belief in more structured processes. So for me, it was comparably easy to convince the other partners that we would need to leverage more data-driven approaches.

In the first place, I took my PhD research and built an MVP myself, across the data collection, the entity matching, and how you represent the data in the knowledge graph. I built a first simple MVP version, and in 2020 I presented it to my partners and said, look, for me it's very clear that this is the future of venture capital and we need to go in this direction.

And to be frank, it was an easy sell. That was also a learning that I had from other VC firms: you need buy-in from the general partners. The general partners are the ones that own the firm, so eventually, if you make additional investments in whatever direction, it comes out of their pockets.

And if you don't have this buy-in, you can stop right away. So you need full buy-in from the budget owners. From there onwards, I decided to first hire a generalist software developer. And then I had many learnings along the way on how to build a team: I hired two people with deep expertise, I hired two generalists, and so on.

So there was a wide spectrum of learnings. But today we have a team of constantly around eight people, which we call our engineering team within Earlybird. This team is autonomously building our platform, which we call Eagle Eye. It's the sourcing and screening platform that everyone in the investment team uses, and it has already helped us identify a few very promising investments.

Richie Cotton: Okay. So you've got a mix of people who are experts in one particular area and people who are generalists, and hopefully between them you've got the right level of coverage in all the areas that you need. You mentioned the importance of getting managers on board in order to get the right culture for this data-driven approach.

Do you have any advice on how to get that management buy-in and how to get that cultural change to encourage more use of data?

Andre Retterath: Yeah, I think one is just values and how you think about the world in the first place. That needs to be aligned no matter what. If this is off, if people believe it is all just handcraft and "I don't care about data, I know it better than the data," if this is the mindset of the people, then you don't need to start finding arguments.

If you have people who are receptive to these arguments, then provide them with the arguments. This is also what I tried with Data Driven VC: to provide as many arguments as possible, of a quantitative and qualitative nature, for us internally to actually get going. One argument from people back then was: okay, let's assume we can collect this data. Let's assume we can represent these companies and find the digital footprints. But how can we trust a machine with the most important part of our investment process?

I mentioned before that two thirds of the value is created in sourcing and screening, and screening is really how to narrow down the funnel. And then you have this power law distribution: if you weed out the wrong company, you might just miss this outlier company. So people struggled to trust this selection process. What I did is train different kinds of machine learning models. Back then I worked mostly with an XGBoost classification model to classify companies into successful and unsuccessful.

And what we did is a dedicated study benchmarking humans against the machine, which is super interesting. Essentially, we took input data from, I think back then it was 2015, about a sample of companies. We followed these companies until 2019 or 2020, and then we looked at which of them became successful by investor standards and which failed.

So we classified them into one being all the success cases and zero being all of the failures. Then we used the input data from 2015 to predict the successful cases as of 2020, and with that, we built different kinds of classification models. We took this model and provided it with information from companies that the model had not seen before.

We took the same information, anonymized it, put it into one-pagers, and presented it to human investors. We always asked them, do you recognize this company? Because these are real companies. If people said yes, they were deselected from the study. And if they said no, we asked them: out of 10 companies, which are the five that you think will be successful, and which are the five that you think will be a failure?

The split in the sample was, in reality, five-five. We had around 110 or 120 investors, including all of the top names; all of the funds participated. And we found that even the best investor in the sample missed one of the successful opportunities.

Our machine learning model also had a recall of 80%, meaning the model also missed one of the successful opportunities. The average for the humans, however, was around 60%, and the median was also in this ballpark. So what we could show is that the model performs at least as well as the best human investor that we had in our sample.
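
To make the recall figures in this answer concrete, here is a minimal Python sketch. The ten-company panel and the two sets of picks are invented for illustration and are not data from the study; only the 5/5 split and the "missed one winner" outcome follow the description above.

```python
def recall(actual, predicted):
    """Share of truly successful companies that were also picked as successful."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(actual)
    return true_positives / actual_positives


# Hypothetical panel of 10 companies: 5 became successful (1), 5 failed (0).
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# A screener that picks five companies and misses one of the five winners.
best_pick = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

# A screener that picks five companies and finds only three of the winners.
average_pick = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]

print(recall(actual, best_pick))     # 0.8
print(recall(actual, average_pick))  # 0.6
```

Missing one of five winners gives 4/5 = 80% recall, and finding three of five gives 60%, which matches the figures quoted for the model, the best investor, and the average investor.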

And this model does that on a recurring basis, and it performs significantly better than the average or median investor out there. So what we can say is that with these screening models, even the most untrained investors will be capable of identifying promising opportunities if you enable them with these models.

When we did the study, I presented it to my partners and everyone was like, okay, if this is really true, then it's a no-brainer. Why would we look through tens of thousands of opportunities? The only way to scale it up back then was to just hire more interns. We had more deal flow; we moved from 8,000 to 10,000 to 12,000 opportunities per year.

And the only way to handle it was, okay, one intern can handle whatever thousand opportunities a year, so we just hire four more interns. This approach is way more scalable and also trustworthy. Doing studies like that, very tangible ones, helped me convince the partners and get full buy-in, and then also roll it out within the team with top-down support from all of us as partners.

Richie Cotton: That's very cool that the models are getting the same performance as a top investor. I also think it's interesting that the performance of this model, or at least the cost, is measured in interns-per-year equivalents.

Andre Retterath: Because that's the only way to really scale it up. There are two options. One is to not reply. We distinguish between inbound, meaning founders reaching out to us, which was the majority historically, and outbound, where we're doing desk research and using data-driven approaches to proactively reach out to founders. And we found that the majority of this inbound is actually not a great fit, because founders don't know about our investment criteria and so on and so forth.

And the only way to scale it up is to hire more people. So I remember back then we also had, at times, significantly more interns, because we wanted to handle the whole volume and we wanted to get back to every single entrepreneur who reached out to us.

Richie Cotton: Just getting into some of the techniques you're using, you've talked a lot about using traditional machine learning models for the screening, like trying to predict success. And then for the sourcing, it was web scraping and making use of APIs. I'm curious if there are any other techniques you're using. Generative AI has been a big thing in the last year or two, so I'm wondering whether you're making use of that anywhere in the process.

Andre Retterath: 100%. Also, as full disclosure, we are thankful to be one of the very early investors in a company called Aleph Alpha, which here in Europe is one of the leading companies in the GenAI space.

They mostly serve enterprises and clients with very sensitive data, and we consider ourselves one of them. So we started playing around with large language models already in 2021 and integrating them into our systems. To give you a feeling for how we use them, I'm happy to share some use cases.

So number one: within our team, we also have sector expertise. Me personally, I'm looking into enterprise software, AI developer tools, data tools, and so on. And we have a huge matrix of responsibilities within our team. We actually started in 2018 to manually label all of these opportunities.

So we have manually labeled more than 50,000 companies into these different kinds of sectors. And then we did a manual classification model back then, which, to be frank, did not work properly. Today you can use large language models and classify the different kinds of industries with, whatever, 98 percent accuracy.

So we use large language models to classify the industries, we classify the business model and so on, just based on the semantic description of the company. Then another use case is that we vectorize the description. We also scrape the company websites, public register registrations, and all of this stuff.

We vectorize that together. And then with these vector embeddings, we look for similar companies, so we can automate what I described earlier, the competitive landscape mapping, which back then was days and weeks of research going through all of these databases, looking for similar companies from left and right, trying different angles to find them.
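
As an illustration of the similarity lookup described here, the following Python sketch ranks companies by cosine similarity of their embedding vectors. The company names and the three-dimensional vectors are made up; real embeddings would come from a language model and have many more dimensions, and nothing here is meant to represent Earlybird's actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(target, embeddings, top_k=2):
    """Return the top_k companies whose embeddings are closest to the target's."""
    scores = [
        (name, cosine_similarity(embeddings[target], vector))
        for name, vector in embeddings.items()
        if name != target
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical embeddings of company descriptions (toy values).
embeddings = {
    "AcmeDevTools": [0.9, 0.1, 0.0],
    "BuildFast": [0.8, 0.2, 0.1],
    "FarmRobotics": [0.0, 0.1, 0.9],
    "DataPipeCo": [0.6, 0.6, 0.1],
}

for name, score in most_similar("AcmeDevTools", embeddings):
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the two nearest neighbours of AcmeDevTools are BuildFast and DataPipeCo, while FarmRobotics scores close to zero. At the scale of a real company database, a vector index would normally replace the brute-force loop.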

Today, we can do that at the click of a button. For all of the companies in our system, we have already automatically created a competitive landscape based on the vector embeddings of the descriptions of the respective companies. So we always know live what the landscape around these companies looks like. And then we also have the enrichment data: who are the investors? What's the funding? What are the news mentions? Who are the clients? What is the headcount? And all of this stuff. So that's the second dimension.

Another dimension that I can talk about is the interaction with founders. Within our system, Eagle Eye, we have trained different kinds of LLMs on our historic email and other data.

So we can, essentially just at the click of a button, generate an outreach email to the founder. Again, for us it's very important to have a human in the loop here, because the worst thing that can happen is a system that randomly reaches out to people and burns our brand. So we proactively decided to keep the human in the loop, and we have several other use cases where we use LLMs across the value chain.

Richie Cotton: That first use case in particular maybe sounds a little bit pedestrian, using generative AI to clean your data, but I can imagine that's an absolutely huge time saver and cost saver: improving your data quality and not having, well, I guess interns again, clicking on "this company is in this industry."


Andre Retterath: I think one that is really at the core of our infrastructure is what we call entity matching. Just take this example: there is a founder who, I don't know, posts something in a Reddit forum like "I'm looking for a co-founder," or whatever kind of wording. This founder will get an entry in our database.

Now, let's say a week later, she decides to put "working on something new" or something fancy on her LinkedIn. So she will get a second entry. Then she will go to the public register and register the company. She will get a third entry. Then they might launch something on GitHub, like a new project. She will get a fourth entry, and so on and so forth.

So suddenly we have tens of different entries for the same company, and we need to merge them together. This is the deduplication and entity matching. For that, we also use some modern models where we can look at the vectors and compare similarity across these entries to do both inter- and intra-source entity matching, so both across sources and within sources.
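
A rough Python sketch of the deduplication idea follows: records whose vectors are similar enough are merged into one entity using a small union-find structure. The record ids, vectors, and the 0.95 threshold are all hypothetical choices for illustration; the actual models and thresholds used in Eagle Eye are not described in the conversation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cluster_records(records, threshold=0.95):
    """Group records whose pairwise similarity meets the threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine(records[i]["vector"], records[j]["vector"]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record["id"])
    return list(clusters.values())


# Hypothetical entries: three footprints of one founder plus one unrelated project.
records = [
    {"id": "reddit-post", "vector": [0.90, 0.10, 0.10]},
    {"id": "linkedin-update", "vector": [0.85, 0.15, 0.10]},
    {"id": "public-register", "vector": [0.88, 0.12, 0.05]},
    {"id": "github-other-project", "vector": [0.10, 0.90, 0.20]},
]

print(cluster_records(records))
```

With these toy vectors, the three footprints of the same founder end up in one cluster and the unrelated GitHub project stays on its own. The pairwise comparison is quadratic in the number of records, so a production system would typically add a blocking or indexing step first.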

Richie Cotton: Okay, so again, it's just making sure that your data is in good form, and that seems like an incredibly important use case. Do you have any big success stories where maybe your models have found some company that was missed otherwise, or you've otherwise had a big win?

Andre Retterath: I'm happy to share that, but I think generally, for us, it's about identifying the right opportunities as early as possible. Timing as an early-stage investor is incredibly important. You don't want to see an opportunity once it's obvious to everyone, because then it might be incredibly expensive or competitive to get into the deal.

So for us, it's really about identifying these companies as early as possible. This is also why we look into forums. We look into stealth founders. We follow thousands of stealth founders across Europe if we have seen a digital footprint suggesting that they are about to start something or are thinking about starting something.

So we start at the earliest point. One great example is Aleph Alpha, which I mentioned before. With Aleph Alpha, we scraped when the company got registered. The CEO is a serial entrepreneur. He sold his previous company to Apple. From there onwards, he was in the Valley. He saw a lot about GPT-2 and different kinds of transformer models, got back to Europe, and then started his company with his incredibly strong technical co-founder.

The two of them started this company, with initially very little information available. We led the Series A round then, in summer 2021, and we had identified the company initially through its public register registration. They raised a seed round with some local investors with whom they had a previous relationship.

Once they launched their website and the round was also public, we had sufficient data points to prioritize the opportunity, and then got in touch with them through our network. Six or seven months after their first round, we then led their Series A round. So that's one example where we most likely would have seen them significantly later.

Another example is a company called EthonAI, an ETH spin-off in Zurich. We scraped the company initially when it had just been registered. It's two founders who did their PhDs in different areas at ETH Zurich. We looked at the pre-seed round, which back then was a bit too early.

And then we co-led the seed round together with our friends at La Famiglia, now General Catalyst. So that's another example, and there are a few others. But we see some early evidence that it's actually delivering great deal flow earlier than the majority of the market can identify them.

Richie Cotton: Okay, that's pretty cool stuff. Although I guess you mentioned you need to wait 10 years just to get the true results. It's too early to tell.

Andre Retterath: It's too early today. So we have good initial evidence. Aleph Alpha, for example, just raised a 500 million Series B round in November last year.

So, good early evidence. But as we know, for early-stage investors it's too early to tell. It only counts when you return money to your LPs.

Richie Cotton: All right. Super. We'll bring you back in, like, 2034 and you can give us an update, and we'll have a few more examples to talk about.

Andre Retterath: Absolutely.

Richie Cotton: So, it seems like you've spent quite a few years building up this capability. Can you talk me through any important lessons you've learned along the way? What mistakes have you made that other people should avoid?

Andre Retterath: I think what I mentioned earlier: really the buy-in from different kinds of stakeholders, but specifically the people who own the budget.

I've seen many funds, and people within these funds, complaining that the budget owners just do it for window dressing. And there was really a shift in the market. When I started this whole stuff, like seven years ago, there were very few people in the market doing it, and the majority just said, oh, it's not possible because of X, Y, and Z. I mentioned two reasons before.

And there was a shift, I'd say probably around the time of ChatGPT, where suddenly everyone was like, ah, productivity, efficiency, and so on. And now everyone is telling more or less the same story. So there's also a completely new industry evolving that I call investment tech.

In this investment tech, we have players that build some of the components that we build internally and provide them externally to other VCs. So they buy it off the shelf, and then our LPs are like, yeah, guys, it all sounds the same, what you're doing. But in reality, it's a huge, huge difference.

The majority of VCs today get pressure from the LPs, from the market, that they need to do something. And for them, it's like, yeah, we have this one data engineer, data scientist, product manager, whatever. For them it's really window dressing. And I know from several of these people within the organizations: if you don't have the full buy-in, it just sucks.

It really sucks. So this is the most important one, number one. Number two, I'd say, the learning is that 80/20 applies here. Don't over-engineer it, like we Germans oftentimes do. I don't need a perfect version. Just get going. And I think one point to start off with is really automation: simple automations, just freeing up time, and then it can get more complex over time.

And I think the third one that comes to mind is really on the hiring part. I would rather hire an engineer with three, four, five years of experience who is a bit of a Swiss army knife than someone who has a PhD in NLP and 10 years of postdoc experience, because what we need is people who can execute and build products really fast.

And it's also about speaking at eye level with the investment team, because that's what you collectively need: top-down buy-in from the management who own the budgets, and stakeholder management. You need to deliver something really fast, that's the 80/20, and you need to have the right engineers who really understand what the investment team is doing, because otherwise, and that's the ultimate blocker, it's culture.

You only have one chance to try it out with the investment team. If they don't like it and you have delivered a premature version, they will never change their investment workflows. And that's the biggest struggle that I can tell of, thankfully not too much from my own experience, but from what I heard from many others who have tried.

Richie Cotton: Okay. It sounds like these are actually quite related ideas. So if you hire someone who can build something quick and provide some value fast, that's going to help you get that executive buy-in, which is then going to help you scale your program.

Andre Retterath: Yeah, it's a bit of a chicken-and-egg problem. I think I bridged this chicken-and-egg problem back then with some research and stuff that I built myself next to my investment job. If you need to solve the cold start problem right now, probably start with exactly this generalist to build something really fast. Try it for 6 to 12 months, see if it delivers the results that you expect, and then double down.

Richie Cotton: Excellent. All right. Do you have any final advice for anyone who wants to become more data driven or help their company become more data driven?

Andre Retterath: I think there are too many pieces of advice to put into this conversation, but the easiest one is just to check out datadrivenvc.io. This is where I write about lots of this stuff.

I'm pretty transparent on it. I am a strong believer in open source and in community. This is why I'm sharing the majority of this stuff. Obviously we keep our secret sauce, so the stuff I'm writing about is probably something that we thought about a couple of years back, but it's just great to start these conversations with the community.

And I've benefited a lot from this community, so I also want to give back there and, together, make this industry more efficient, effective, and also inclusive.

Richie Cotton: Wonderful. It's great that you're sharing all your content and your ideas. So yeah, I would say to our listeners, please do check out the Data Driven VC site. Excellent. All right. Thank you for your time, Andre.

Andre Retterath: Thank you so much, Richie.


