
Building Multi-Modal AI Applications with Russ d'Sa, CEO & Co-founder of LiveKit

Richie and Russ explore the evolution of voice AI, the challenges of building voice apps, the rise of video AI, the implications of deep fakes, the future of AI in customer service and education, and much more.
Jan 27, 2025

Guest
Russ d'Sa

Russ D'Sa is the CEO & Co-founder of LiveKit. Russ is building the transport layer for AI computing. He founded LiveKit, the company that powers voice chat for OpenAI and Character.ai. Previously, he was a Product Manager at Medium and an engineer at Twitter. He's also a serial entrepreneur, having previously founded the mobile search platform Evie Labs.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

The human brain is 75% dedicated to visual processing. If you look at the actual distribution of neurons in the brain, there's a reason why they say humans are visual creatures. And so we are very keen on differences, changes, and inconsistencies in visual information as we process it. The bar is higher for video than it is for voice AI. It's also a technically more difficult problem just because video is so many times more data.

I think what's changed recently, and has elevated voice AI into the zeitgeist, with a lot of developers building different types of voice interfaces, is that AI has just become a lot better.

Key Takeaways

1. Voice AI has become more feasible due to advancements in AI's ability to parse and generate natural language, and improvements in latency, making real-time interactions more viable.

2. Developers can leverage platforms like LiveKit to simplify the use of WebRTC, allowing them to focus on building applications without needing deep expertise in the underlying infrastructure.

3. The quality and realism of AI-generated voices are crucial in applications like language learning and patient support, where empathy and natural interaction significantly enhance user experience.


Transcript

Richie Cotton: Hi, Russ. Welcome to the show.

Russ d'Sa: Hey, Richie, thanks so much for having me. 

Richie Cotton: So there are lots of different kinds of multimodal AI, and I think voice AI is one of the most interesting. It's been around for a while with things like Alexa and Siri, but it seems like there's a lot happening in the space. So first of all, can you tell me what trends you are seeing in voice AI?

Russ d'Sa: I think voice AI has been around quite some time, right? I have a Google Home in my office back home, and I've been using Siri for a while. I think Siri came out in like 2010 or something like that, so it's been around for quite a while. I think what's changed recently, and has kind of elevated voice AI into the zeitgeist, with a lot of developers, you know, building different types of voice interfaces, is, I think, a couple of things.

I think the first thing that has changed is that the AI has just become a lot better. You know, you had these expert systems, or, you know, big if-else trees or conditional trees 15 years ago, for how to process queries that someone is forming in natural language. And now you have an AI model that can read and write convincingly in natural language.

And so I think the AI just being a lot smarter about how it parses input and how it generates output is the first kind of unlock that has made voice AI a bit more feasible than it was 15 years ago, when you were just frustrated talking to your home assistant device.


And then the second thing that has really changed is the latency. So these models have become a lot faster at being able to do inference. There are also specialized models now that are very fast at transcribing speech into text and also at generating speech from text, which has made the full end-to-end flow of talking to a computer, instead of typing to it or texting with it, so much more viable.

And I think ultimately both of those things, the models getting smarter and the latency of how quickly they can process input and generate output getting faster, together have kind of solved the UX gaps that we had in devices like the Google Home and Siri before it.
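To make that end-to-end flow concrete, here is a minimal sketch of the cascaded pipeline described above: transcribe speech, run a language model, synthesize a reply. The stage functions are placeholders standing in for whichever STT, LLM, and TTS providers you plug in, not any particular vendor's API.

```python
import asyncio

# Placeholder stages -- stand-ins for whichever STT, LLM, and TTS
# providers you use; these names are illustrative, not a real vendor API.
async def transcribe(audio_chunk: bytes) -> str:
    return "hello, what's the weather like?"   # pretend STT output

async def generate_reply(prompt: str) -> str:
    return f"You asked: {prompt}"              # pretend LLM output

async def synthesize(text: str) -> bytes:
    return text.encode()                       # pretend audio bytes

async def handle_turn(user_audio: bytes) -> bytes:
    """One conversational turn: speech in, speech out.
    Each stage adds latency, which is why faster inference and
    dedicated STT/TTS models made voice AI feel viable."""
    text_in = await transcribe(user_audio)     # speech -> text
    reply = await generate_reply(text_in)      # text -> text (LLM)
    return await synthesize(reply)             # text -> speech

print(asyncio.run(handle_turn(b"...")))
```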

Richie Cotton: Absolutely. Yeah, I certainly remember when Siri first came out, you'd have to ask it the same question six times before it was like, oh yeah, this is what you want to do. And it has got a lot better in the last few years or so.

Russ d'Sa: Totally, yeah. And I think, you know, trend-wise, there's two bits of low hanging fruit, right? One is just folks that are building a much better version of Siri and Alexa. So even if you look at OpenAI's voice mode and Advanced Voice now in the application, that's like having your own kind of personal assistant that has a memory and can speak to you with a very convincing human voice and process your inputs very quickly.

And there are other companies that are working on these different types of voice-based assistants. So that's one area that you're seeing a lot of energy go into. The other one that I would say is a popular trend right now is in the telephony space. So any workflow or system, mostly on the B2B side, where you have to call a customer support line, or patient intake, or the front desk of a hospital, et cetera. Any kind of system where you are making a phone call and there's normally been another human being, or a really terrible phone tree system where it's like, press one if you want to connect to this department, press two for that, or type your full credit card number into the phone using the touch-tone keypad, any of those kinds of systems.

So either a human answering the phone, or one of those kinds of older IVR systems answering the phone. Those are getting replaced very quickly with these new kinds of voice-based AI models. So that's another popular use case that's just recently started to grow in the past year.

Richie Cotton: Yeah, certainly calling customer service and getting stuck in those choose-an-option trees, that's a special kind of purgatory. So I'm very glad that those are being improved.

Russ d'Sa: Yeah, me too. Me too.

Richie Cotton: So if you're building these things, then it sounds like it's going to be harder than these sort of text AI things.

So chatbots are sort of fairly well developed at this point. If you're building voice AI, what are the challenges there?

Russ d'Sa: I sometimes talk to developers about this specific question, right? And I almost hate the fact that what I'm going to say is basically the plot of the Silicon Valley TV show, but it's not false that the internet as it exists today, as we've been using it for the last 20, 30 years, wasn't really designed for real-time media streaming.

So when you're building a voice AI application, what you're doing is you're streaming speech from your device to an AI model somewhere. And then you're streaming speech again, the response, from that AI model back to me, you know, on my device. And in some ways you can think of it as like having a Zoom meeting with an AI model.

That's really what it is. And Zoom is built on top of technology that is very different from how most of the web is built and how most of the internet has been accessed or used for the last 20, 30 years. The internet itself was built around a protocol called TCP, and there's a protocol on top of it called HTTP.

I'm sure everybody has heard of that, right? Like you go into a web browser and you type in someone's URL, it starts with HTTP colon slash slash. HTTP is the Hyper Text Transfer Protocol. So you'll notice it's not the Hyper Audio Transfer Protocol, or the Hyper Speech Transfer Protocol, or the Hyper Video Transfer Protocol, it's Hyper Text.

And so it was really designed for transferring text between computers. There's another protocol, and I won't get into all of the details of it, but another protocol called WebRTC that was designed specifically for transferring media between computers, not text. Media being audio and video, rich media.

And so when you think about building a voice AI application, building a Zoom meeting between you and an AI model, it's a very different architecture from building a web application, because you're not transferring predominantly text, you're transferring predominantly speech or voice.

And so I'd say, like, that's the first kind of high-level challenge that people face when they want to build a voice AI application: it's just a different paradigm for most people than what they're used to when they're, you know, building a Rails app or a Next.js app or anything like that.

It's just a different paradigm. And so that's the first challenge to kind of overcome: okay, there's a new set of acronyms and a new set of infrastructure that I have to understand a bit more deeply in order to be able to build something that can scale to millions of users around the world.
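A rough way to feel the difference between the two transport families mentioned here, using only the Python standard library: TCP gives you an ordered, reliable byte stream (what HTTP rides on), while UDP just fires individual datagrams, which is the behavior that real-time media stacks such as WebRTC build on. This is only an illustration of the transport layer, not of WebRTC itself.

```python
import socket

# TCP: connection-oriented, ordered, retransmits lost data.
# Great for documents; adds delay when packets drop.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("example.com", 80))
tcp.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024)[:60])
tcp.close()

# UDP: connectionless datagrams, no retransmission or ordering.
# A late audio frame is useless, so media protocols prefer to drop
# it and move on -- which UDP allows.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"\x00" * 160, ("127.0.0.1", 5004))  # e.g. one small audio frame
udp.close()
```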

Richie Cotton: So that sounds like everything has to change from fundamental web protocols upwards. You mentioned this WebRTC idea as being an alternative to HTTP. Is this something that, I presume, software developers need to know about, or does anyone else in your organization need to care about this?

Russ d'Sa: You know, ordinarily, and not necessarily a plug for LiveKit, but part of our goal is to actually make WebRTC easier for people to use by effectively not having to know about it. So there is a protocol called WebRTC and it's built on top of another protocol called UDP. Really, on the internet, there's two protocols.

There's TCP and there's UDP. And most of the internet, as I mentioned, works on TCP, and for media streaming and things like that, you should actually be using the other protocol, UDP. WebRTC is another layer on top of it that adds some more abstractions, some nice things that you'd have to build anyway if you were building this all from scratch.

But even WebRTC, with its more abstract offerings on top of UDP, is still quite complicated. It's fairly low level, there's a lot of moving parts, and there's a lot of things that you have to build even on top of WebRTC just to make it possible to use in your application. And so LiveKit, my company, started as an open source project, and we are really built around making it easy to use WebRTC such that you don't even have to think about it.

In essence, it's what Stripe did for payments processing. You could go and integrate taking payments in your application if you wanted to by plugging into credit card processing gateways and things like that, but again, it's very complicated, there's a lot of code you have to write, and it's infrastructure that you don't necessarily want to be thinking about. You're trying to figure out how to build your application and how to make something that users like to use or want; that's enough work. And if you add having to understand the infrastructure and how to plug into payment gateways on top of that, you know, it's time you're going to be spending, or attention you're going to be taking away from working on your application. And so Stripe came in as this very nice API layer that abstracts away all the complexities of working with those processing gateways underneath.

You can think of what we do as doing the same thing, but for real-time communications. So how do you send all this data back and forth? How do you measure the network in real time? How do you deal with multiple data centers and users connecting from around the world to each other?

All of that kind of complexity gets tucked away behind a really simple, nice set of APIs that make it simple for you to actually build these types of voice AI applications. And so, you know, in a sense, if a developer were going to go do this themselves, yes, you would have to understand WebRTC, how that protocol works, how to build your own signaling layer to communicate between computers.

There's a lot of stuff that you'd have to do, and there's a lot of acronyms underneath WebRTC itself that you would have to understand. So we make it so that instead, you don't have to understand WebRTC, you just have to understand LiveKit, three easy primitives, and then you're off to the races. So part of our effort here is going and making that simpler for people.
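For a flavor of what "three easy primitives" (rooms, participants, tracks) looks like in practice, here is a rough sketch of joining a session and publishing microphone audio with LiveKit's Python SDK. Treat the exact class and method names, plus the URL and token, as assumptions based on the general shape of the SDK rather than a verified API reference; check the LiveKit docs for the real signatures.

```python
import asyncio
from livekit import rtc  # assumption: LiveKit's Python realtime SDK

async def main() -> None:
    room = rtc.Room()  # primitive 1: a Room

    # URL and token are placeholders you would mint server-side.
    await room.connect("wss://your-project.livekit.cloud", "ACCESS_TOKEN")

    # Primitive 2: Participants -- you, the agent, anyone else in the room.
    print("joined as", room.local_participant.identity)

    # Primitive 3: Tracks -- streams of audio/video published into the room.
    source = rtc.AudioSource(48000, 1)  # 48 kHz mono audio source
    track = rtc.LocalAudioTrack.create_audio_track("mic", source)
    await room.local_participant.publish_track(track)

asyncio.run(main())
```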

Richie Cotton: This sounds a lot like how there's been a big shift to cloud computing, because caring about your infrastructure is sort of a niche hobby. And so for most people, it's better if someone else does it. And it sounds like the same situation here. If you want to do real-time media streaming or real-time AI, then you don't necessarily want to care too much about the infrastructure.

You want to have someone else do it for you. Is that about right?

Russ d'Sa: Yeah, that's like 100 percent right on the mark. We had cloud computing, exactly as you said, and I've sometimes said that we have this new paradigm that is starting to come about, which I call AI computing, or real-time computing is another way to think about it.

But effectively, as AI gets smarter and smarter, the interface to that AI becomes more and more human, right? If you're making the computer more and more human-like, that also means that the inputs and the outputs become more human-like as well. So gone are the days, or will be gone the days, that you're typing and moving a mouse to interact with an application; one that's driven by AI is going to be interacted with using your natural human I/O, which is your eyes, ears, and mouth: your microphones, speakers, and cameras for the eyes. And so it's a paradigm shift. It's an infrastructure shift underneath as well. So I kind of lump it all under AI computing. It's a new kind of infrastructure layer that you need to build these kinds of applications.

And that's, yeah, really the space where we're trying to be the picks and shovels for folks, to make it easy to build applications for that new paradigm.

Richie Cotton: Okay. So everyone else gets to focus on the business ideas rather than focusing on the infrastructure.

Russ d'Sa: That's the idea. Yeah.

Richie Cotton: So you mentioned more human-like AI, and that got me thinking. I just saw someone in my social media feed again today complaining that OpenAI deprecated the Sky voice, even though that happened months ago.

And it seems like people can get very attached to particular voices for the AI. Do you have a take on what's happening with having these more realistic, human-sounding AI voices?

Russ d'Sa: Yeah, I think it's interesting, right? Because I won't comment on the OpenAI Sky voice. Sky was also my favorite, too. When we were working on voice mode with OpenAI, even from the very start, well, it wasn't called Sky back then. But I won't say too much about any of that kind of detail. But yeah, it's a great voice.

And it was not ScarJo's voice, I can tell you that for a fact. But the thing about these realistic voices is that for certain applications they're very, very important, and then for other applications I don't think they're necessarily as important. So let me give you some examples here. Well, backing up, I think if you can have rich, expressive voices for free, effectively, if there's no trade-off, I'm not sure why you would take the one that is not higher fidelity versus the one that is. But one use case is customer support, when you're calling an IVR system, right, the press-one-for-more-options type of system. For that use case, my personal hypothesis is that the quality of the voice is not the most important thing. The most important thing is that it works well, it's reliable, and it's fast. When I've personally called customer support lines, I'm fine if I have to talk to a computer system. The part that I'm not fine with is having to repeat my credit card number five times because it didn't pick up on it. That's the main pain point with these systems: they're not reliable, they don't work the same way every single time, and they're not fast. They're really slow at how they process and generate outputs, and at how many steps it takes to get through some of this stuff.

I have to frame my queries very specifically for the system to be able to understand what I'm saying. And so I don't think the voice quality is as important for that particular use case. However, there are other use cases, you know, from folks and developers we've been working with on some things as well. Language learning is one example here.

You want the model to feel conversational. You want it to feel like a native speaker in the language that you are learning. And the only way really for that model to express itself in that fashion is that it has to understand tonality and accents and cadence of speech. It has to understand those things and be able to respond in a way that, you know, does truly feel like you're having a conversation with someone who is a native speaker of the language you're trying to learn.

I think another one that's really important is any kind of patient intake. There's a company that is using LiveKit to build a suicide hotline, or a support hotline for folks. And I think that's just a very human experience too, right?

Like, you know, not even going as extreme as suicide, but even just for patient intake. I'm sure many of us have had health scares or called a doctor when we're just, you know, anxious about something. We want to get an appointment somewhere, and we have to describe what we're feeling to the person answering the phone to get the appointment scheduled.

And I think there's an element of, like, bedside manner, where you want to feel like you are talking to a voice, an entity, that can relate to you. And I think the way that voice sounds, the way it comes across, how expressive it is, all of those things go towards your ability to relate to the system that you're speaking to.

And so I think in those use cases and contexts, the quality of the voice, the realism of the voice, the expressiveness of the voice is really important. So it's use case by use case, I think.

Richie Cotton: Okay, so moving on from voice to video. I know there was a lot of hype at the start of 2024 around video AI. OpenAI was promising Sora, and I sound very negative on OpenAI today, but yeah, it's not quite materialized. Runway have got a new model out, so it's kind of gradually progressing, but what sort of trends are you seeing?

Russ d'Sa: Yeah, video is definitely starting to grow quite quickly. I think we just crossed the chasm on voice, especially with the Realtime API and doing this voice-to-voice, having a speech-to-speech model where the model natively understands speech being input into it, and it has joint training with text as well as audio data in one single model, kind of a shared embedding space.

Having that, and being able to have this voice that can process information very quickly, with 300-ish millisecond average round-trip latency, which is human-level speed, and then express itself in a way that sounds very human, I think we've just unlocked voice for real-world applications that can run at scale now, right?

From a UX perspective. And now video is this thing that we have kind of turned our attention to as the next thing that is not quite there yet. And so there's a lot of folks working on it. Of course, there is Sora, and there's other models now that have been coming out. Some are doing more general purpose video generation, kind of, you know, where you can use it to construct different scenes, maybe, you know, one day make your own movie using generative AI. And then there are specific verticals or niches that people are going after within video. So video avatars is one example. So that's kind of generative video.

And then on the input side, there's people that are working on scene understanding and computer vision types of use cases. And so I think on the generative side for video, there's two challenges. And it's largely the same two challenges that you saw with voice.

The first one is latency. So being able to generate these videos very quickly. If it's an avatar, for example, a video-based avatar, how do you generate that on the fly in real time? I don't specifically know any Sora numbers, but I think I was doing some research on the internet, and I definitely think it takes quite some time to be able to generate even a 10 or 30 second video with Sora.

And so getting that latency lower, I think, is the first step, though not specifically for what Sora is trying to do, general purpose video generation. It would be cool to be able to say, hey, give me a Netflix movie about a cool sci-fi, I don't know, like, it would be cool if I could generate the next show in the Dune universe on the fly. But, you know, if that takes like a day or something to generate my Dune show for me, I think that's not a big deal.

And so for general purpose generative video, I'm not as convinced that latency is as important of a problem. But for avatars, as an example, I think being able to do that at real-time latency is necessary, because you can do the voice at real-time latency.

So you have to kind of pair these two things together. So that's the first hurdle for real-time generative video. I think the second hurdle is the quality. So does the character in your generated video have six fingers or not, right? That's the classic issue with even image generation sometimes, which has gotten a lot better in the last year.

But it's still a big challenge for video. It's things like that. It's also the consistency of the video, which kind of falls under quality, where if it's a person walking through a forest, are the trees the same trees from the previous frame? Right? Are they in the same position?

You know, all of that stuff. The consistency in the video that's generated is another problem that people have to solve. But we're getting there. I think that there's some efforts from some startups now too in this area and, it's going to be exciting once we have it.

Richie Cotton: I certainly do like the idea of just being able to generate my own Dune movie just on the fly. That'd be very, very cool. Although I guess we are quite a way off from that. So yeah, I think you mentioned that consistency is kind of a problem, particularly when you have, like, longer videos.

And certainly, I guess, the hunter-gatherer part of your brain is very keen on sensing motion and noticing what's changed. So, yeah, that's going to be a tricky one to solve.

Russ d'Sa: Well, you know, the human brain is like 70 or 75 percent dedicated to visual processing. Like, if you look at the actual distribution of neurons in the brain, there's a reason why they say humans are visual creatures. It's true even down to the actual silicon of a human brain, down to the actual neurons.

And so we are very keen on differences, changes, inconsistencies in visual information as we process it. And so the bar is higher for video than it is for voice. It's also a technically more difficult problem just because video is so many times more data. Like, the amount of data that you have to generate, every pixel in a 4K image or an 8K image.

And with 4K we're not even talking about the resolution of the eyes, right? Just the amount of data that you have to be able to generate very, very quickly relative to audio or speech, it's a huge gap. And then on top of that, our ability to discern on the quality side between audio and video, the bar is much higher for video there as well.

So it's a very hard problem, and it's not a surprise that it's taking longer for video to get there than voice did, just because of these natural constraints that I mentioned.
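To put rough numbers on "so many times more data", here is a back-of-the-envelope comparison of uncompressed 4K video versus uncompressed speech-quality audio. The figures are illustrative; real systems compress both heavily, but the gap between the two media stays enormous.

```python
# Raw 4K video: 3840 x 2160 pixels, 3 bytes of color per pixel, 30 frames/s
video_bytes_per_sec = 3840 * 2160 * 3 * 30         # roughly 746 MB/s

# Raw mono audio: 48,000 samples/s at 16 bits (2 bytes) per sample
audio_bytes_per_sec = 48_000 * 2                    # 96 KB/s

print(f"video: {video_bytes_per_sec / 1e6:,.0f} MB/s")
print(f"audio: {audio_bytes_per_sec / 1e3:,.0f} KB/s")
print(f"ratio: ~{video_bytes_per_sec / audio_bytes_per_sec:,.0f}x more data for video")
```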

Richie Cotton: And in terms of what the use cases for these are: so far I've been seeing a lot of B2B examples, like people creating music videos, things like that. On the individual side, I've seen a couple of two or three second clips floating about on social media, but there's not many individuals trying to create longer videos.

So I'm wondering, audio and text have been adopted by everyone. Is video going to be a B2B thing, or will everyone start using video AI?

Russ d'Sa: I have my own opinion. I think it's possibly a controversial one, actually. But I guess I'll preface this with: there's going to be these AI-native use cases that we just could not have predicted, right? I think a lot of people have talked about this. If you recall, back when Snapchat came out, they called it a mobile-native application, right?

It was designed specifically for, and could only be enabled by, mobile phones. Uber was another example of this, a mobile-native application and a mobile-native use case. And I think there's going to be these AI-native use cases as well that just don't have an analogy to stuff we're already doing with computers, and they'll be net new, and I don't know what those are going to be.

And I don't think anyone else does either. But if you're just drawing analogies to the existing world and how I think video will be used, I think video generation, especially longer-form generation, doesn't actually need to be real time, and it probably won't be.

It'll be fast, but not necessarily real time. I think the place where that's going to be used a lot is in business, for sure, around what I describe as these Jarvis-style interfaces. If you watch Iron Man and you watch the way that Tony Stark interacts with Jarvis when he's in the workshop or the machine shop, he's using Jarvis as a co-pilot to do a lot of the mundane tasks, the mechanics of actually creating the content. The content in Tony Stark's case is, yeah, making an exoskeleton. But in the case of generative video, you're going to have video editing applications where you have this co-pilot, and you're kind of the director, and the co-pilot is the one who is generating the scenes for you and then tweaking them based on your feedback and all of that stuff.

I think that's going to be a very popular B2B use case for people that are generating video content or visual content with this co-pilot. For the consumer case, my feeling is that there's, again, going to be certain use cases where it's very useful. So I think, like, if you're doing telemedicine, talking to a doctor, you may want to have a visual representation for that doctor as something that feels a bit more empathetic, right?

And something that you can relate to a bit better. If you watch a movie like Contact, this old movie from the nineties, when Jodie Foster ends up in this alien world, there is an alien, and the alien has a certain form, but the alien turns itself into, I think it's her father, if I'm recalling correctly.

And the alien says, I don't know if it says it or not, I want to say that it did, that it kind of assumes this human form because it's just going to be more acceptable or relatable to Jodie Foster. And I think that you're going to have those moments as well in these real-world use cases, where, as you used the word empathetic, I think it's going to feel more empathetic to have a visual representation that feels like a human.

And I think for those certain use cases, it will be used there. I think the controversial take is that the most popular consumer use case for generative video is going to be adult content. That's my take. That's my controversial take on it. Just because that seems like the obvious one that people are gonna use it for.

But we'll see.

Richie Cotton: Okay. Yeah, that does actually sound very plausible. I suppose adult content's been at the cutting edge of technology for the last few decades, so yeah, that seems very likely. But I do like the idea of having avatars for any kind of customer service interaction, where you want to interact with what would otherwise be, I guess, back to that phone tree again.

So having a human, or human-like thing, that you can ask questions of, that does seem very useful.

Russ d'Sa: Totally. Another one that's maybe not as obvious: you know, I can imagine when you're calling customer support, it's not even just having a video avatar. I think that's one thing it'll be used for, but there's another area it'll be used for, and I think there's actually two. I think education is one it's going to be used for as well.

I think having a human form as a teacher is a bit nicer for students than a kind of abstract representation of AI. So I could see it being used in education quite a bit as well. But outside of just the avatar piece, think about calling a customer support line when you're having trouble with something. It doesn't have to be customer support for a router, where you just go and click a button to reboot it, which, you know, nine times out of ten is the issue.

But you can imagine, let's say, customer support for working on a vehicle or a car or something like that. You could imagine not just generating an avatar, but also generating a video of how to change or loosen a bolt in a certain spot on the car's frame, or how to work under an engine, looking at the engine bay of a car. Some people use YouTube for this kind of stuff.

Like, they search YouTube and they have to find the right video of how to change the oil on a Honda Civic from 2007. You can imagine the AI model generating a video of how that's done, or a video of the engine bay, and being able to help you identify a part within the engine bay. That's a kind of use case for generative video that I could also see happening.

Richie Cotton: Those how-to videos are incredibly useful, but often the actual human part of it feels unnecessary sometimes. So yeah, maybe an avatar is going to work just as well. And I guess related to that, in terms of sort of social content, you've got things like influencers. Do you think they're likely to be replaced by AI?

Russ d'Sa: I think so. I think that's, again, a great one actually that I didn't think to mention. That's one where I don't think it has to be real time; it probably won't be, is my guess. But imagine a marketing agency or some firm or a brand being able to go, Canva-style, and just kind of generate a video of a particular product in a place. Like, you know, I'm an umbrella company and I want to generate a video of this umbrella on the beach, and it's a beautiful day out.

And all of that. I think there's going to be those kinds of product placement types of videos that can be generated on the fly, that don't require having to hire a bunch of actors, or have a shoot, or, you know, do the actual camera work down at the beach.

Like, it's going to be a huge efficiency gain for marketing agencies as well.

Richie Cotton: Absolutely. I mean, it's kind of tough on content creators, and as a content creator, I'm like, ooh, am I going to be replaced? This gets discussed a lot, me being replaced by AI. So, yeah, actually I have to say one thing I've found quite fun is downloading the transcripts of DataFramed episodes, feeding them into NotebookLM, and then generating new AI podcast versions of them.

It's quite a fun thing for our listeners to do as well; you get the AI version of me. Very nice. Anyway, a lot of opportunities for replacing humans with AI, then. I'd like to know a bit about what the dangers are. So with text, it seems pretty simple. You have problems with hallucinations or problems with unsavory content.

Are there bigger dangers then for audio or video?

Russ d'Sa: I think that there are, not necessarily bigger dangers, but, well, there are definitely some bigger dangers. We can get into that as well. But I think there's also the same dangers around hallucination. So I think there's two categories here. I think one is the same class of issues that we faced with text-based AI.

Hallucination, for example. But what's tricky about hallucinations with audio or video is that it's much harder to verify or catch these hallucinations. And the reason why is that for text, you can effectively write assertions around it, because computers can process strings of text very easily, right?

Like, I know how a four-letter expletive is spelled, and I can actually write code that checks against the use of four-letter expletives in the set of tokens that are generated by my LLM. I can write deterministic code that executes the same every single time, you know, and it conforms to a certain set of rules for text. But for audio and video, it's much, much harder, right?

In audio, for example, if the model ends up hallucinating a swear word, how do I actually effectively check against that? Am I now maybe comparing waveforms? Or I have to wait for the audio that is being generated by that AI model.

I have to wait for it to be transcribed, and then check the transcription before I deliver the audio to the user. Because, you know, I check the transcription to make sure it didn't say something incorrect. I can write a program to check against text, but it's not as easy to check against voice.

And then video is a whole other ballgame. Imagine if a video is being generated and someone in it is holding up a sign, or they're flipping the bird or something. How do I actually verify that they're doing that, and do so before the video is shown?

One way that you could do it is this: a lot of these models for voice, and, you know, eventually for video, though not yet, can generate content faster than real time. So they can generate their output faster than the time it would take to play that audio out.

So, okay, generate 10 seconds of audio in one second. You could conceivably run these types of checks on the audio stream or the video stream ahead of time and then have some mitigations in place. But it's still a difficult problem, and it's a bit more fuzzy how to actually do the verification than it is with text.
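Here is a minimal sketch of the mitigation described above: because generation runs ahead of real-time playback, each audio chunk can be transcribed and checked with a deterministic text rule before it is released to the listener. The transcribe callable and the blocklist terms are placeholders, not a specific provider's API.

```python
import re
from typing import AsyncIterator

# Placeholder terms; a real system would use a proper moderation policy.
BLOCKLIST = re.compile(r"\b(badword1|badword2)\b", re.IGNORECASE)

async def moderated_audio(chunks: AsyncIterator[bytes],
                          transcribe) -> AsyncIterator[bytes]:
    """Hold each generated audio chunk until its transcript passes a
    deterministic text check, then release it for playback. This only
    works because generation runs faster than real-time playback."""
    async for chunk in chunks:
        text = await transcribe(chunk)   # audio -> text (placeholder STT call)
        if BLOCKLIST.search(text):
            continue                     # drop (or replace) the offending chunk
        yield chunk                      # safe: deliver to the user
```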

So that's the first class of issues: the same issues as with text around hallucination, but harder to verify and mitigate with audio and video. Then there's a new set of problems, and that's what I would characterize under deepfakes. And deepfaking, yes, you can deepfake text, but it's a lot harder to... you know, it sounds like you're going to say something.

Richie Cotton: No, no, no. Again, it's me being replaced by AI, I think. But yeah, tell us about deepfakes.

Russ d'Sa: Yeah. So, you know, it's one thing to share a quote from Barack Obama saying something in text, in a paragraph somewhere. I don't know how many people would take it super seriously. With written content, it's more obvious what the source of that content is and how much you trust it, right?

Like, if the New York Times prints a quote from Barack Obama, you tend to trust that Barack Obama probably said that, because it's from the New York Times. But if it's some random publication on the internet with a quote from Barack Obama, you're automatically just not going to trust it, by association.

Whereas, when you see a video and it looks exactly like Barack Obama and it sounds exactly like Barack Obama and the scene behind him looks like a scene that he would be in, our brains are not trained to debunk these types of things.

We kind of automatically believe it. You know, seeing is believing, as the quote goes. And so now that seeing can be completely fabricated or simulated, where does believing end up? So I think that's something that we haven't really tackled yet as a society. But it's starting to matter, you know, the technology is now here to be able to do this stuff.

And so we need the security and compliance around it to catch up. I think for some time now people have been talking about, oh, what if we attach digital signatures to content so you can verify the source that it came from? Maybe that's going to be a great real-world use case of blockchain, being able to sign these pieces of content so that you can verify where they came from.

Maybe that's one tactic that people might use. I think about the old VeriSign from back in the day with domain names. How do you actually verify that this content is legit? That's a new challenge that I think is exacerbated by video and voice AI.

Richie Cotton: Oh, man. Lots of challenges there. So, yeah, even with that first case of just how do you verify that the content is okay, I was thinking it sounds like there's a sort of latency-safety trade-off, where you can spend more time trying to make sure that the generated content is okay, but that means there's going to be a lag before the user gets the response.

I was thinking, even if you check the text, the AI could still say something in a sarcastic tone of voice, and that's still going to be the wrong answer.

Russ d'Sa: And this happens in real life, right? Like, I remember there was the wardrobe malfunction at the Super Bowl, right? Or whatever. I don't know if it was Janet Jackson or someone else, but there was this wardrobe malfunction and there was no tape delay. And so it's like, okay, well, this happened.

And so there are these issues that we have in real live events, you know, that humans are watching other humans partake in, where they've introduced tape delays for this reason, so that they can kind of clean up the content if things go awry. And at this current moment in time, I think AI models are more likely to hallucinate than human beings are. And so how do you mitigate these scenarios? It's a challenge that we have in real life with real humans as well.

Richie Cotton: And so on your other point about deepfakes, you mentioned that society is not really ready for widespread deepfakes. What needs to happen to make us ready?

Russ d'Sa: I really do think that the way to solve that problem is going to be around adding some kind of authenticity marking, some kind of marking somewhere that allows us to verify the source of content, right? If there is fake content everywhere, if you can't really trust anything that you see, how do you ascribe trust to something? You know, what is the mechanism for being able to do that?

And it really comes down to a certificate of authority in some way, right? Like, how do you trust that the website you're visiting is actually, you know, the website that you think you're looking at, things like that. I mean, we effectively use certificates to verify the authenticity of something, even when I buy, you know, a Louis Vuitton bag or a Rolex.

It's like, you know, I don't have either of those things, just for the record, but when someone buys them, they come with, I actually don't know because I've never bought one, but I have to imagine that they come with some kind of certificate of authenticity or verification.

The other way that people do this is, let's say, when someone buys a really expensive piece of art, you have an actual professional go through and verify with a magnifying glass that this is real and authentic, and that you're actually buying the real art piece that you're paying a lot of money for. There needs to be some kind of certificate that is generated, or stamp of approval, that the content you see came from its source, or is real or authentic, or belongs to the person featured in the actual content, or that they gave their permission for it to exist.

And so I don't know what the right mechanism is for being able to generate those kinds of certificates, but it's going to have to exist, because there's just going to be fake content everywhere in the future.
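One concrete shape the "certificate of authenticity" idea could take is an ordinary digital signature over the media: the creator signs a hash of the content with a private key, and anyone can verify it against the published public key. This is a generic sketch using the Python `cryptography` package's Ed25519 support, not a description of any deployed provenance standard.

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Creator side: sign a hash of the media file with a private key.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

content = b"...video bytes..."                 # placeholder content
digest = hashlib.sha256(content).digest()
signature = private_key.sign(digest)

# Viewer side: verify the signature against the creator's public key.
try:
    public_key.verify(signature, digest)
    print("content matches the key holder's signature")
except InvalidSignature:
    print("content was altered or not signed by this key")
```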

Richie Cotton: Okay. Yeah. And I guess you have the opposite problem, where something is genuine, but people don't believe it's genuine. They think it's been manipulated by AI, just because that's so easy to do.

Russ d'Sa: One thing that you've been seeing lately, that's been trending in the last few weeks, is not generating video scenes, for example, or generating avatars, but generating entire worlds. There's this one company called Decart AI, and they put out a generated version of Minecraft, where a model has been trained on many, many, many hours of Minecraft, and now it can actually, in real time, generate every frame of Minecraft as you move through it. So it's not perfect. It has artifacts. There's definitely aspects of it that you can look at and see that it's not a real world; there's weird kind of artifacting, things on the ground and in the trees, and there's a kind of fuzziness to the pixels. But you can actually navigate this world, and it uses your inputs, so if I turn left, it will actually generate the next set of frames for turning left and what that should look like when the player starts to rotate to the left.

And so I think a popular use case for generative video that hasn't been explored a ton yet, and we're even getting there on the real-time latency aspect of it, is being able to generate entire simulations or worlds. It's a bit scary too, just because you start to ask some existential questions like, oh, okay, wait, am I in a simulation or not right now?

Like, is this a generated world that I'm in at the moment? But philosophy aside, it is a really interesting space that I think is going to have implications for VR and AR types of use cases. Video games as well, I think, might change fundamentally if a video game world, or a new video game, can be generated on the fly and delivered as video pixels generated by an AI model instead of an actual game engine running on your phone.

So I think that's another really compelling use case that we're just starting to break into.

Richie Cotton: Yeah, complete worlds being generated. That's pretty incredible stuff. I like to think I could tell the difference between AI-generated Minecraft and reality, but once we get more photorealistic worlds, then, yeah. Well, I suppose people worry about which sci-fi dystopia is going to come true.

And I think the big one everyone's afraid of is Terminator, but maybe we're heading more for, what was the Tom Cruise movie? Vanilla Sky. Maybe that's going to be the thing, where you just can't distinguish reality from a simulation.

Russ d'Sa: Vanilla Sky, I was so down after watching that movie. I remember I watched it in the theater long ago, and I came out of it just so down. So hopefully neither Skynet nor Vanilla Sky.

Richie Cotton: Yeah, we can try and hope for a better outcome than either of those. All right. So given that multimodal AI is taking off, are there any skills that you think are important for people to learn to take advantage of this?

Russ d'Sa: I think an important skill is really digging into some of the newer models and how to use them, from companies like OpenAI with the Realtime API, and Anthropic with the new computer use API, which I think is another one that's going to be interesting, just around these new agentic, I still don't like that word, but these agentic workflows. I think they're starting to become a real thing, and the technology stack to enable them is starting to become real.

I think starting to jump in and play with these models, building some simple things, some simple demos, and just getting familiar with how to use them, and also just the new paradigm of how to architect applications based around voice and video streaming versus the kind of traditional request-response that a traditional web application is built around. I think that's a great way to start to get your feet wet and build some familiarity, and future models that you see come out are going to be directionally going down the same path, and they'll get better and better. And so starting to build your familiarity as an application developer with how to use these models will really grease the wheels, as those AI models improve, for the types and complexity of applications that you can build with this new skill set. Yeah, I think it'll be beneficial to start just building the familiarity, and then, once these models get better and better, by that time you'll have built the expertise.

Richie Cotton: I like that. Just have a play around, generate some content, have a go at building stuff, hands-on learning. I'm a big fan of that. Wonderful. All right, so just to wrap up, what are you most excited about in the world of AI?

Russ d'Sa: I'd say, at this current moment in time, the thing that I'm most excited about is agents, really. I think what we're doing with some of these new models that you're seeing coming out, like I mentioned, the Anthropic computer use one, and there's been folks writing about this thing called Operator at OpenAI, I don't know anything beyond what other people have read. What you're seeing is you're now starting to see a set of models that are trained on how to use computers in the same way, or a similar way, that humans use computers.

And when that's possible, it now allows you, especially for businesses, to effectively have the AI model help with tasks that ordinarily you'd have to train a human for a while on how to perform. You can now have an AI model that is able to perform those tasks on your behalf.

And so I think it's going to be this amazing unlock in terms of efficiency, and it'll also just free up people's time to do less mundane things, right? Like going and having to update the CRM after you had a great customer conversation: if the AI model just does that for me, that unlocks a bunch more time for me to go and have more great customer conversations, versus spending a portion of my time figuring out how to navigate a CRM, remembering what was said during the conversation, and entering it into the right place, and all of that.
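As a toy version of that CRM example, here is a hedged sketch of an agentic workflow: extract structured fields from a call transcript and write them to a CRM. Both the extraction step and the CRM endpoint are hypothetical placeholders; a real system would use a provider's tool-calling API and the CRM's actual REST interface.

```python
import json
import urllib.request

def extract_fields(transcript: str) -> dict:
    """Placeholder for an LLM tool call that pulls structured data
    out of a conversation; a real agent would fill this in."""
    return {"account": "Acme Corp", "sentiment": "positive",
            "next_step": "send pricing follow-up"}

def update_crm(record: dict) -> None:
    # Hypothetical CRM endpoint -- replace with your CRM's real API.
    req = urllib.request.Request(
        "https://crm.example.com/api/notes",
        data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

record = extract_fields("...call transcript...")
print(record)
# update_crm(record)  # would POST to the placeholder endpoint above
```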

So I think that computers understanding how to use computers, and some of these legacy systems, is going to be this really exciting boon for application developers that are going out there and building with that technology in these various workflows. And so I'm excited, from an infrastructure perspective, to be able to support those developers and provide the tools that make it easy to construct these types of applications.

Richie Cotton: Yeah, certainly I think everyone has to deal with far too many different pieces of software at the moment. It's just difficult to remember how all of them work. And so having computers automate some of that, that's definitely going to be a big bonus. And yeah, good luck creating all the infrastructure to help people do that.

Thanks. 

Russ d'Sa: Thanks. Yeah. 

Richie Cotton: Yeah, thank you for your time, Russ.

Russ d'Sa: Thank you so much, Richie. It was amazing to be here and I really appreciated all the questions. 
