Can We Make Generative AI Cheaper? With Natalia Vassilieva, VP & Field CTO of ML, and Andy Hock, Senior VP, Product & Strategy at Cerebras Systems
Natalia Vassilieva is the VP & Field CTO of ML at Cerebras Systems. Natalia has a wealth of experience in research and development in natural language processing, computer vision, machine learning, and information retrieval. As Field CTO, she helps drive product adoption and customer engagement for Cerebras Systems' wafer-scale AI chips. Previously, Natalia was a Senior Research Manager at Hewlett Packard Labs, leading the Software and AI group. She also served as the head of HP Labs Russia leading research teams focused on developing algorithms and applications for text, image, and time-series analysis and modeling. Natalia has an academic background, having been a part-time Associate Professor at St. Petersburg State University and a lecturer at the Computer Science Center in St. Petersburg, Russia. She holds a PhD in Computer Science from St. Petersburg State University.
Andy Hock is the Senior VP, Product & Strategy at Cerebras Systems. Andy runs the product strategy and roadmap for Cerebras Systems, focusing on integrating AI research, hardware, and software to accelerate the development and deployment of AI models. He has 15 years of experience in product management, technical program management, and enterprise business development; over 20 years of experience in research, algorithm development, and data analysis for image processing; and 9 years of experience in applied machine learning and AI. Previously he was Product Management lead for Data and Analytics for Terra Bella at Google, where he led the development of machine learning-powered data products from satellite imagery. Earlier, he was Senior Director for Advanced Technology Programs at Skybox Imaging (which became Terra Bella following its acquisition by Google in 2014), and before that was a Senior Program Manager and Senior Scientist at Arete Associates. He has a Ph.D. in Geophysics and Space Physics from the University of California, Los Angeles.
Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
Key Quotes
Cost pressure is going to drive innovation. A lot of the traditional AI work that's been done to date has been done on traditional general-purpose infrastructure. Those processors are suitable, but they're not optimal for this work. At Cerebras, we've built a completely new class of processor that's built from the ground up for AI.
The richness of applications and what can be done with AI in the future is so exciting. I believe that AI will change the world very significantly. It will become either much, much better than it is today or much worse than it is today. I'm an optimist, so I hope that we will be working on what we like to do and we won't do any boring tasks anymore.
Key Takeaways
Utilizing specialized AI hardware, like Cerebras’ wafer-scale engine, can significantly reduce training and inference times compared to general-purpose GPUs, improving efficiency and reducing costs.
MoE (Mixture of Experts) models can assign different parts of the model to handle different input types, allowing for more efficient computation and higher performance, especially when dealing with complex tasks.
Instead of relying solely on pre-trained models, fine-tune or build models specifically for your enterprise needs to optimize for accuracy, latency, and cost efficiency.
Transcript
Richie Cotton: Hi, Natalia and Andy. Thank you for joining me on the show.
Andy Hock: Happy to be here, thanks Richie.
Richie Cotton: Excellent. So we've seen a ton of, well, quite frankly, dramatic progress around generative AI in the last six years. Do you see this dramatic progress continuing much longer? Natalia, do you want to take this?
Natalia Vassilieva: Oh, yeah. I think we are just starting. I think there is still a ton of advancement to come in the fundamentals, in how generative AI models can be trained and how we can make them more efficient. But also, I think we're just tapping into what those models can do. There is a lot of excitement these days around, I'd say, educational and entertainment use cases. All of us have probably tried using ChatGPT, and you can look at image generative models. But in terms of how all of those models can be realized in real-life use cases, I think there is still a ton of discovery to be made. That will drive requirements for the models, and it will drive a lot of additional progress, both in training those models and in the applications.
Richie Cotton: Okay, that's interesting. So it's really going to be the wider adoption of use cases that's going to drive the technological advancement. And Andy, do you have a similar take on this or have you got a different opinion on how progress is going to go?
Andy Hock: Very similar take. I think we're just scratching the surface. I found myself reflecting as Natalia was responding...
I think there's a huge proliferation of applications brewing right now in both enterprise and research. And I think we're just at the beginning of understanding which ML techniques, network architectures, and data types are the most amenable. So I'm thrilled to be where we're at, particularly at Cerebras on the AI computing side, but I'm really looking forward to where the field is going over the next one, two, ten, twenty years.
Richie Cotton: That gaming analogy is kind of cool. It's like we've just reached the PlayStation 1, Xbox era. It's like, wow, stuff's in 3D and it looks really cool, but you can't imagine what, you know, PlayStation 5 looks like.
Andy Hock: Yeah, absolutely. You know, we make this analogy pretty frequently. Cerebras is well known as an AI computing company. We spent a lot of our time focusing initially on software support on our hardware systems for training, but particularly around large model inference, we make this analogy that we're in some sense in the dial-up era of the web for inference in particular.
Today we're in the moral equivalent of the dial-up era, where we're waiting for the modem to dial up before we get our answer. But we're not yet in the modern-day network equivalent of instant responses. So I think both in terms of technology and the applications, both on the training and inference compute side, we're just scratching the surface, man.
Richie Cotton: okay. Yeah. I think nobody really misses that kind of modem dial up noise where you're waiting for things to load. Um,
Andy Hock: Exactly.
Richie Cotton: yeah. So hopefully, hopefully we get broadband soon.
Andy Hock: That's right. It's coming.
Richie Cotton: Nice. So, lots to be excited about. And I think one of the sort of caveats is that, as a lot of generative AI is starting to get put into production, people kind of go, well, actually, it's kind of expensive at the moment.
There's a lot of costs around just running these models, doing inference, like beyond trying to train things yourself. And so, do you think these costs are going to affect the development of future LLMs?
Andy Hock: Happy to. So I think, look, a couple of things are going to happen. First of all, cost pressure, I think, is going to drive innovation. But second, in some sense, a lot of the traditional AI work that's been done to date has been done on traditional general-purpose infrastructure.
And that's one of the big cost factors in innovation, right? So people are building models on large clusters of small, general-purpose processors. Those processors are suitable, but they're not optimal for this work. At Cerebras, we've built a completely new class of processor that's built from the ground up for AI.
And as a result, it's faster. But it's also more efficient. So that means that if you're, say, training a model, you can go an order of magnitude faster than you could on an equivalent GPU system. And it's going to be more efficient, not only in terms of developer iteration, being able to test more ideas per unit time, but it's also going to be more power efficient underneath the hood.
And then, given the right infrastructure for AI research, researchers are going to build not just larger models, but models that are more efficient and use more efficient methods like sparsity. And given the right processor, you can harness those natural advantages, like sparsity in a neural network, and turn them into performance advantage and cost savings.
A practical example of this, and it's something that we're pretty excited about at Cerebras in the very near future, is AI purpose-built hardware for inference. We're launching inference, and just as a data point, inference is going to be 20 times faster than GPU, opening up new applications, but at only about a third of the cost. So given the right infrastructure and new methods, I think cost pressure is going to drive innovation. But the right infrastructure is also going to make it more accessible.
Richie Cotton: Okay, so that's cool that you're not just going for "let's try and help the developers"; it's also going to be "let's bring this to a wider audience." Now, it seems like there are two levels here. You've got the hardware level, where you talk about having custom chips rather than GPUs.
And then there are also some of the software techniques. I'd love to get into both of these, so maybe we go with hardware first. What makes these chips different from a GPU?
Andy Hock: At Cerebras, we approached the problem of AI compute from first principles. What I mean by that is we actually looked at the workload of neural network training and inference.
And we asked, okay, what kind of compute operations, what kind of precision, what kind of communication and memory access patterns does this work have? And we built our processor around that. So, obviously, this chip is big. Most computer chips are the size of a fingernail or a postage stamp. The chip that powers our system is the size of a dinner plate.
We call it the wafer-scale engine because it's the scale of a full silicon wafer in chip fabs. It's about eight and a half by eight and a half inches on an edge. So how is this different from a GPU? It's not just that it's bigger. AI compute requires massive computation, sparse tensor-based linear algebra operations, and the ability to handle data in motion, so high communication bandwidth and high memory bandwidth. That's why we built our wafer-scale engine the way we did, and that's what the wafer-scale engine is best at. This has 900,000 cores on one chip, all designed from the ground up for sparse tensor-based linear algebra operations, mostly at lower precision, because that's what AI compute needs. All those cores are directly connected to one another on the wafer, and there's 44 gigabytes of on-chip SRAM, like an L1 cache. So there's lots of memory close to compute, and the cores can talk to each other very quickly. All those attributes of AI computing that I talked about, that's what this thing is built for. And it's fundamentally different from a legacy general-purpose processor that's small and far away from memory, like a GPU.
Richie Cotton: Okay, so really, because you're not having to solve many different problems like a general-purpose GPU is, you can focus on optimizing for just the linear algebra operations that are needed by the neural network.
Andy Hock: Bingo. Yeah, and the result is, you know, an order of magnitude faster on both training and inference, depending on the workload, and more efficient and easier to program. And that trilogy of benefits all acts to accelerate innovation and accelerate compute on the inference side for models in production.
Richie Cotton: So you mentioned a few ideas around how large language models can be done better from a design, or neural network architecture, point of view. So maybe let's get into some of those. The one you mentioned first was sparsity. So Natalia, can you tell me, what is sparsity in a neural network and why would you want it?
Natalia Vassilieva: In general, sparsity is when not all of your weights or activations are non-zero. So sparsity is when some of the elements of your matrices are zeros; it doesn't matter which matrices they are. When we're talking about sparsity as it applies to training neural networks and how it can help, it can reduce computational cost, and it can also reduce the memory requirements to run inference or to train these models.
I think overall in the community there is broad acknowledgement that sparsity is one of the ways to make training more efficient, to make our models less expensive to train and to run inference with. There are several approaches to how you can implement that during training or during inference.
You can think about activation sparsity or weight sparsity. When you train a neural network, there is always an interaction between the parameters of the model, the weights, which are one of the operands that participate in the computation, and the data, or activations, which are the second operand. They combine together, essentially through matrix multiplications and the operations on top of that.
So you can think about inducing sparsity, making either one operand or the other sparse, and you can take advantage when you do either of those. There are many methods which allow you to, again, reduce the cost of compute and reduce the memory footprint if you induce zeros in your model parameters.
Bottom line: introduce some zeros either into your weights or into your activations, and with that you reduce the cost of computation and the memory footprint.
Richie Cotton: So I guess the idea is that the weight of each of the neurons in the neural network is some sort of measure of its importance. And if it's quite small, you can just pretend it's a zero, and that lets you effectively ignore it in some calculations. Is that right?
Natalia Vassilieva: Yeah, if you think about the whole process of training a neural network, you're starting with a huge pool of parameters, and it's very well known that all of the current neural networks are over-parameterized, so we throw in more parameters than are needed to actually solve the problem. And then the goal of the training is to understand which weights should be large, so essentially which weights are meaningful, and which weights are small and don't bring any value.
So if we can figure out how to not train those smaller weights, which don't bring any value, from the get-go, we would be better off.
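To make the magnitude-pruning idea concrete, here is a minimal sketch in PyTorch; the layer size and the 50% sparsity level are illustrative assumptions, not anything specific to Cerebras or to the discussion above.

```python
import torch
import torch.nn as nn

# A toy dense layer; the size and the 50% sparsity level are arbitrary.
layer = nn.Linear(1024, 1024)
sparsity = 0.5

# Magnitude pruning: zero out the weights with the smallest absolute values.
with torch.no_grad():
    w = layer.weight
    threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)

print(f"fraction of zero weights: {(layer.weight == 0).float().mean():.2f}")

# Storing only the non-zero values (plus their indices) shrinks the memory footprint,
# and hardware that skips zero operands also skips the corresponding multiplies.
sparse_w = layer.weight.to_sparse()
print(sparse_w.values().numel(), "non-zero values instead of", layer.weight.numel())
```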
Richie Cotton: So it's just a way of sort of simplifying the computations for bits of the neural network that you don't need, you just sort of ignore them. And then the other aspect was about using sort of smaller numbers, this idea of like quantization. So can you just talk me through again, what does that mean and how does it work?
Natalia Vassilieva: Yeah, that's another approach to reduce both compute cost and memory footprint for your model: use lower precision to represent either the weights or the activations in your model. If you recall your computer science courses, every single number a computer can read can be represented at different precisions and takes a certain number of bits.
So you have single precision, double precision, half precision, four-bit precision, two-bit precision, maybe binary, where you just have zeros and ones. The lower the precision, the less precise your computation is. With lower-precision numbers you can represent a smaller range of values, and you also get less resolution for each of them.
If you go to lower precision, again, it takes less space in your memory and it's also less expensive to compute, but it's less precise. There are a lot of approaches to making the training of neural networks work with lower precision. Today, when you train large language models, you typically do that in mixed precision.
People typically say half precision, but in reality you still have to have some operations in single precision, 32 bits. There is a lot of research going on into how to make training stable at lower precision. And people have figured out much better how to reduce the precision of weights after the model has been trained.
So let's say I've trained my model and I have all my weights in half precision. It's now much clearer how to reduce the precision of those trained weights without a huge impact on accuracy, which allows cheaper serving and cheaper inference.
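As a concrete illustration of post-training quantization, here is a minimal sketch of symmetric per-tensor int8 quantization in PyTorch; the tensor size and the scaling scheme are the simplest illustrative choices, not the recipe of any particular framework.

```python
import torch

# Pretend these are trained weights stored in half precision (2 bytes per value).
w_fp16 = torch.randn(1024, 1024).half()

# Symmetric per-tensor int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = w_fp16.float().abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w_fp16.float() / scale), -127, 127).to(torch.int8)

# At inference time the int8 weights are dequantized (or consumed directly by int8 kernels).
w_dequant = w_int8.float() * scale

print("memory: fp16", w_fp16.numel() * 2, "bytes vs int8", w_int8.numel(), "bytes")
print("max abs error:", (w_fp16.float() - w_dequant).abs().max().item())
```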
Richie Cotton: Okay, so that's interesting. So you train with bigger numbers, these half precision numbers, and then afterwards you say, okay, well, maybe some of these values, we can sort of shrink them even more because you don't care about the precision in this area of the neural network.
Natalia Vassilieva: Yeah, one additional caveat: sparsity is great and quantization is great from the theoretical, algorithmic point of view, but to take full advantage of all of those in real life, you also need hardware that is capable of working with low-precision numbers, and hardware that is capable of working with those sparsity patterns.
And that's also a very important notion to keep in mind.
Richie Cotton: Okay, interesting. Yeah. So you need to make sure that the hardware supports what you're doing with this. And Andy, I think you mentioned before that the Cerebras chips are designed to work with lower-precision numbers. Is that correct? So it's designed especially for this kind of neural network problem.
Andy Hock: That's exactly right. So as Natalia mentioned with sparsity, most of the data and most of the trained models that exist have a lot of zeros in them. On a traditional compute platform like a CPU or GPU, when you run the model, usually what you're doing is multiplications and additions. And a traditional dense compute platform, a CPU or GPU, in particular a GPU, will take that zero value and actually execute the multiply-by-zero operation. So it takes time, it takes compute cycles, it takes power. But we already know what the answer to multiply-by-zero is. And so if you instead had a hardware platform that could detect those zero values and just avoid them and move on to the next non-zero value, you would end up accelerating the compute and saving yourself time and power.
And that's what the Cerebras platform does. We acknowledged in the design of our hardware that the nature of the compute workload involves a lot of zero values, and we built that in, so that we, and our users, can actually take advantage of that natural aspect of the data and neural networks.
Similarly, on the precision side, we support in our hardware today 32-bit, 16-bit, and 8-bit numeric formats. So we're really in the sweet spot of the numeric representations that are common for both training and for inference, and we can execute those at a massive performance advantage over traditional machines on both the training and the inference side.
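As a rough back-of-the-envelope illustration of the zero-skipping idea, the NumPy sketch below counts the multiply-accumulate operations a dense engine would execute versus one that only touches non-zero weights; it is a toy, not a model of Cerebras hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix with roughly 80% of entries set to zero, and a dense input vector.
w = rng.standard_normal((1024, 1024))
w[rng.random(w.shape) < 0.8] = 0.0
x = rng.standard_normal(1024)

dense_macs = w.size                 # a dense engine multiplies every entry, zeros included
nonzero_macs = np.count_nonzero(w)  # a zero-skipping engine only touches non-zero weights
print(f"dense MACs: {dense_macs}, non-zero MACs: {nonzero_macs}, "
      f"theoretical speedup: {dense_macs / nonzero_macs:.1f}x")

# Performing only the non-zero multiplies gives the same answer,
# because multiplying by zero contributes nothing to the sums.
rows, cols = np.nonzero(w)
y_skip = np.zeros(w.shape[0])
np.add.at(y_skip, rows, w[rows, cols] * x[cols])
assert np.allclose(w @ x, y_skip)
```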
Richie Cotton: So yeah, it's very convenient that when you multiply any number by zero, the answer is going to be zero, and you can take a little shortcut there. Okay. Nice. So I'm curious, are there any other techniques that are used in order to make large language models more efficient or more performant?
Natalia Vassilieva: Yeah, I think we've discussed a couple of those: using sparsity, and the different types of sparsity. Remember, I said that you can introduce those zeros either into weights or into activations, so activation sparsity or weight sparsity. There are also different techniques for how regular those zeros are among the weights, so we can talk about structured and unstructured sparsity.
Maybe some of you have heard about models called mixture of experts. That's a perfect example of models with structured block sparsity. I think for all of those techniques, the goal is: if I have a certain budget of compute, how can I get to the most accurate, highest-quality model?
Given that amount of compute, it's been shown that with sparse models you can actually get to better models by spending the same amount of compute compared to what you would potentially get with dense models. Mixture of experts is one example of those structured sparse models.
Richie Cotton: Yeah, do you want to talk us through like, how mixture of experts works?
Natalia Vassilieva: When you hear Mixture of Experts, maybe you think about the common approaches which in the past have been called ensembles of experts, where you have completely separate models, completely different experts or predictors, and then you take a vote and decide what the accurate answer is.
A Mixture of Experts model is not exactly that. You still have those experts, but they are built into a single model. From the computational point of view, when you train a dense model, every single data point influences every single weight in your model.
So you need to do the computation and see how those two react to each other for every single pair. A mixture of experts is an example of a weight-sparse model. You still feed all of your data points through your model, but not all of the model's weights interact with all of the data points in your training data set.
The model selects: okay, this batch of data, this set of training examples, will only influence this portion of the model, and that next set of examples will influence and change the values of only that portion of the model. With that, you remove this all-to-all dependency, and only a subset of the weights gets influenced by certain data points.
And that's what reduces the computational cost in the end.
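Here is a minimal sketch of a mixture-of-experts layer with top-1 routing in PyTorch, just to make the "only some weights see each token" idea concrete; the expert count, layer sizes, and the simple gating are illustrative assumptions, not the architecture of any production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # decides which expert sees each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)     # routing probabilities per token
        top1 = gate.argmax(dim=-1)                   # pick one expert per token
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i                          # tokens routed to expert i
            if sel.any():
                # only this subset of tokens touches expert i's weights
                y[sel] = gate[sel, i].unsqueeze(-1) * expert(x[sel])
        return y

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # each token is processed by exactly one expert, not all four
```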
Richie Cotton: Okay, so I guess rather than having one giant large language model, if giant and large go well together, I'm not sure but you have a few smaller models that are essentially, I guess, glued together. Does, does that sound about right?
Natalia Vassilieva: No, I think you still have this large and giant model, but not all parts of the model take inputs from all of the data points. The model is still the same, but inside that model it decides that every single part of the model will look only at some parts of the data.
So not all parts of the model are created equal, I guess. And that's why people call it a mixture of experts: you can think of those parts of the model as being different experts, and each is only responsible for reacting to certain data points. Does that make sense?
Richie Cotton: Okay, so you got one model, but just different parts of the model are responding to different kinds of
Natalia Vassilieva: Yes. The different kind of inputs. Yes, that's exactly right.
Andy Hock: Yeah, I'm gonna go out on a limb here, because I'm really enjoying this conversation, but I'm gonna go out on a limb with a potentially really bad analogy. I keep thinking in my head of an octopus. I can imagine a prompt, an input, coming into the octopus, coming into the head and the brain, and that does something.
And maybe it only needs one tentacle to respond, right? So an input comes in and, even though the octopus has eight tentacles, the model is still large, to your question, Richie, but maybe a particular input only requires a single part of the model to respond. And then a different input comes in and requires a different tentacle of the model, or a different expert in the model sense, to respond to it.
And therefore, when you're training, you're still training a large model with all the different tentacles. But at inference time, maybe you don't need all the experts to respond. You only need a fraction of the experts, and therefore the total computational cost at inference time is significantly lower.
Richie Cotton: I really liked that analogy. Yes.
Andy Hock: Well, I think, I think maybe you're being gracious, right?
Richie Cotton: All right. Mixture of octopus models is uh, the next generation.
Andy Hock: I'm going to pay for that, man.
Richie Cotton: All right. So, one of the aspects of creating large language models that seems expensive is that after you've trained the model, you need to get some sort of feedback from humans on whether or not it performs well, doing this reinforcement learning step. And whenever you involve humans, I guess things get expensive.
So is that technique going to stay around for a while or are there ways to avoid having to use humans?
Natalia Vassilieva: There are several alternatives to that, although I think the jury is still out on how accurate they can be. There is a notion of RLAIF, where you use not human feedback but AI feedback. So if you have a smart model which has already been trained, you can use that to help you align your new model.
There are also techniques like direct preference optimization, which don't require the same human-feedback loop either. So I think the community is looking for other ways to avoid or minimize the involvement of human feedback, but the jury is still out on which ones are the best and whether we can get away without any human feedback at all.
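For reference, direct preference optimization trains directly on preference pairs, without a separate reward model or a reinforcement learning loop. Here is a minimal sketch of the DPO loss in PyTorch; the log-probability values in the usage example are made up for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected responses under the policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]),
                torch.tensor([-14.0, -10.0, -13.5]),
                torch.tensor([-12.5, -9.8, -11.2]),
                torch.tensor([-13.0, -10.1, -12.8]))
print(loss)
```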
Andy Hock: I'll say, maybe more generally, reinforcement learning with human feedback is basically one way to get a pre-trained model to behave the way the developer wants it to, based on some optimization function. We're all building models to try to do something. And currently there's a pretty wide portfolio of methods that a developer can use to help the model achieve better output.
Right? More accuracy for a particular task. So there are methods like fine-tuning, there are methods like reinforcement learning with human feedback, there are methods like DPO and RLAIF, like Natalia mentioned, and there's also building models that are augmented by specific databases, retrieval-augmented generation or RAG models. So there are a bunch of different ways people can take a model and basically help it be more accurate for a particular task. And they don't all require humans, but a lot of the models are serving functions for humans, so often when you're measuring a model's performance, you want that human feedback in there.
But certainly there are a lot of methods that don't require the sort of rich, interactive, time-intensive human feedback like RLHF. And coming back to this discussion about compute: if you're building a new model, it's not as though there is a specification someplace that just tells you, if you're an enterprise application developer or an ML researcher, "if you have problem X, use method Y or collection of methods Y" to both train your model and then make it work really well for your application. We're really learning that as a community as we go right now.
And I think one of the things that's really powerful about having the right high-performance AI computing platform underneath, and Natalia and her team, who do solutions engineering with some of our customers at Cerebras, know this very well, is that you can try a lot of different methods, right?
You can try training a model from scratch. Or you can try taking an open-source model and fine-tuning it. You can try other methods like retrieval-augmented generation, or RAG, or DPO. You can try these and iterate very quickly, because you have a fast engine underneath the hood, even for these compute-intensive problems.
What Natalia and her team have seen with our customers is that this combination of new methods and wanting to build models that are accurate for a particular set of domain-specific tasks comes down to being able to iterate really quickly, at both large scale and small scale, with different data representations and different techniques, and quickly move to building not just a model, but the right model for your application.
Richie Cotton: Okay. Yeah. So I guess it sounds like building a great large language model, or probably anything with AI, is very much an art form rather than just a procedural thing. So you've got to have that agile workflow in order to test a lot of different things and get to something that's great.
Andy Hock: Yeah, and we were chatting a little bit about this before. I mean, there is definitely an art in the sense that it's evolving and I think it involves a lot of creativity, but it's actually becoming very much a scientific process, right? Testing something and then learning from that test, developing a hypothesis, testing it, and going back. Being able to quickly iterate,
like I was mentioning, with different data sets or different hyperparameter settings or large models versus small models, that's, I think, what gets people to really proficient and accurate models for their specific application. But that's hard on traditional systems.
Richie Cotton: Yes, definitely. If you have to wait for hours or days to get the results of one experiment because things are running too slowly, then it's not going to encourage you to do lots of different experiments.
Andy Hock: Exactly. I often joke about this, but if it takes you, you know, a couple of days or a week to train your model in grad school, maybe that's okay, right? Because then you go for a ski weekend, or maybe take a little vacay, or work on another project. But if you're building a model to help predict optimal drug combinations for cancer,
or if you're building a model to help, say, detect fraud in time-series data for financial transactions, or if you're building the next great chatbot to improve customer service or retail engagement, and that's part of your core business, waiting doesn't work. And so development velocity is really important.
I think that's what's really driving the field forward. And that's back to your first question, why we're really excited about pushing into the next one to ten, twenty years and seeing what people will build given the right tools.
Richie Cotton: Absolutely. And so I'd like to talk a bit about the business side. I mean, you mentioned time to value is incredibly important. I guess the trillion-dollar question is: how do I do generative AI cheaper? I think there are a lot of chief financial officers pushing for this right now. So maybe we'll start with companies that have some kind of generative AI in their products and are doing inference.
What are the main things companies can do in order to reduce the cost of inference?
Andy Hock: I think, to start: build the right model. Given a particular inference platform or a particular performance target, you could choose to train a larger model or a smaller model that will have, say, higher or lower accuracy, but also higher or lower cost.
And so step one is to think about the business problem that you're trying to address. Think about the performance metrics that matter in terms of, say, responsiveness or throughput, how quickly your model needs to answer questions and how accurate it needs to be.
And then build the right model for just that. I think what we're seeing right now is a lot of enterprises starting on their generative AI journey by just taking, say, a general-purpose pre-trained model from open source and giving that a shot, or maybe trying a third-party proprietary model that's available through a service and giving that a shot.
And they're using that to test the value hypothesis and figure out if a generative AI solution will help them drive revenue or drive efficiency. But then, after they get past that initial value-testing process, what we often see at Cerebras is enterprises coming and saying, yeah, we tried a couple of models,
and now we would like to build a model that's purpose-built for our application. You could still start with an open-source model, but maybe fine-tune it on your own company data so that the model is really tuned for your application. And we work with enterprise customers every day to help them build models that are purpose-built for their application, including thinking about the right size to optimize inference performance and cost in production.
That was a long-winded explanation of step one: build the right model that fits your application and your inference performance and cost envelope. The second step, of course, is once you have a model that works, deploy it on an inference platform that delivers the performance that you need inside the cost envelope that you're targeting.
And as I mentioned earlier, we're really excited at Cerebras to be announcing, very soon, the launch of inference on our systems. It still uses the big chip, right? Same hardware. In fact, all the attributes of this chip that I talked about earlier that benefit AI training also unlock literally GPU-impossible performance at inference time.
So we're going to be launching inference on our systems very soon, and it's going to be 20 times faster than GPU at about a third of the cost. Build the right model and then deploy it on the right hardware.
Richie Cotton: when you say it like that, it sounds so easy. Build the right model, deploy the right model.
Andy Hock: It's been a great chat. Thank you, Richie. You don't have to say anything else. Yeah, it's definitely not that easy, and Natalia definitely knows this better than I do.
Natalia Vassilieva: I just wanted to add: not only build the right model and use the right hardware for inference, but also decide from the get-go which hardware you're going to use for inference. If at the end of the day you want to optimize for the cheapest, least expensive inference possible, that likely also means you will have to invest more into training.
But you can save money on inference.
Richie Cotton: Andy mentioned that sometimes like you're going to have different metrics. So sometimes you need something that's really low latency. So you get quick responses. Sometimes you're going to care about high performance. Can you just give me some examples of like, different generative AI applications where you might care about like one thing or the other, and like what might have a different price point or how do you go about deciding this sort of stuff?
Andy Hock: Yeah, absolutely. So let's talk about latency and throughput. Latency is how long you wait; throughput is how fast the answers come. Maybe just two really simple examples. One example is, let's say, a conversational chatbot. There, latency and interactivity are quite important, right?
Throughput is definitely important too, because how fast, say, the words print out matters. But latency is very important because you're seeking interactivity. On the other end of the spectrum might be a batch, offline use case, where somebody has, for example, built a very large, complex model that does things like trying to simulate how well particular candidate drug molecules might work on a particular disease.
And they may be searching tens or hundreds of thousands, or even millions, of different candidate molecular drug designs against a particular disease, but they can do it offline. They're not trying to be interactive with it. There, throughput really matters. Because let's say it takes me six months on a traditional machine to parse through those hundreds of thousands of compounds; if I could do that not in six months but in six days or six hours, then my ability to iterate, go faster, and find the right solution to my problem becomes dramatically faster.
Becomes dramatically faster. So that's just an example of where, like a latency intensive application versus throughput intensive application. What's becoming really exciting these days, though In language models is not just language models that power conversational chat, but language models that power things like software coding and the so called agentic models where you might have a conversational language model front end and then several agent models that go out and perform different tasks. for those applications, It's not just latency that matters. It's really overall throughput, because those models are processing a lot, a lot of data very quickly to, say, generate a new software program or go out and perform some complex, multidimensional task with many agents. So with our solution, right, going, say, 20 times faster on throughput.
Not only opens up far more responsive chat, getting us past this dial up era of Internet into the broadband era of inference AI, but it also opens the door to new language model inference applications beyond https: otter. ai
Inference applications as well as inference for far larger models, like multi hundred billion or even trillion per hour models.
Richie Cotton: Okay, so I think the term large language model is going to become sort of obsolete. Giant language model or something like that. Enormous language model. I'm not sure what the term is going to be, but yeah, it sounds big. Okay.
Andy Hock: I think you got it. Enormous is pretty good. Yeah.
Richie Cotton: Yeah. Okay. So, just to summarize what you're saying: I think the idea is that if you've got a chat feature in your product, then you're probably going to care about latency, because humans aren't going to wait a long time for a response. And if you're processing lots of data, then throughput is the thing that you care about.
Natalia Vassilieva: Yeah, maybe another way to think about that: chatting with one single person, how fast is the model responding? Or, can you afford to chat in parallel with thousands of people? What's more important to you: throughput, being able to chat with thousands of people in parallel, or latency, being really quick in responding to one single person? It's great if you can do both of those at the same time.
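To put rough numbers on the latency-versus-throughput trade-off, here is a small back-of-the-envelope calculation; every figure (tokens per second, batch size, answer length) is invented for illustration, not a benchmark.

```python
# Hypothetical serving scenario: the numbers below are illustrative, not measurements.
single_stream_tokens_per_s = 50    # tokens/s generated for one user at a time
batched_tokens_per_s = 2000        # aggregate tokens/s when serving many users at once
concurrent_users = 100
answer_length = 200                # tokens in a typical response

# Latency view: how long one user waits for a full answer.
latency_single = answer_length / single_stream_tokens_per_s
latency_batched = answer_length / (batched_tokens_per_s / concurrent_users)

# Throughput view: how many answers the system finishes per minute.
answers_per_min_single = 60 * single_stream_tokens_per_s / answer_length
answers_per_min_batched = 60 * batched_tokens_per_s / answer_length

print(f"one user at a time : {latency_single:.0f}s per answer, {answers_per_min_single:.0f} answers/min")
print(f"100 users batched  : {latency_batched:.0f}s per answer, {answers_per_min_batched:.0f} answers/min")
```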
Richie Cotton: Just thinking about like a podcast with a thousand people, that sounds terrifying.
Andy Hock: Well, that's why we're going to build, we're going to build a digital twin of you, Richie, so you can hand, you can, you can parallelize yourself and do, and do many podcasts in parallel.
Richie Cotton: That does sound fun. So I've had previous guests telling me that podcasting is about to be made obsolete by AI, because I'm going to be replaced by a bot. But I like that idea better: a digital twin of Richie doing thousands of DataFramed episodes in parallel.
Andy Hock: Well, you could ask digital twin Richie to do the podcasts that you didn't want to do as much, and then you can dive into the ones that you really care about.
Natalia Vassilieva: I thought you were going into reinforcement learning, where the digital twin and the actual Richie are talking to each other and getting better at asking questions.
Andy Hock: Next episode.
Richie Cotton: Okay, yeah, I'll try and throw that together. All right, one more thing I want to touch on: I came across this idea of the centralization-decentralization pendulum, where computing keeps switching backwards and forwards between stuff happening in the cloud and stuff happening locally on your own computer. At the moment, a lot of generative AI services are cloud-based.
Do you think that's going to persist or do you think it's going to swing back to having local hardware?
Natalia Vassilieva: I think it's all very intertwined with the requirements and the progress. I don't think on the training side we will move anytime soon to decentralized training. Yes, there is a lot of work on federated learning and how to make it work with a very large network of independent small devices.
But all of the models which are trained today, it's not even cloud; it's a very special partition of the cloud. If we are serious, all of the really fundamental foundation models are trained on on-prem hardware, or on a part of the cloud which has been carved out and specially designed, specially connected.
So you can think about that as a very, very centralized beast. I think training will continue to require a lot of centralized, extreme, exascale compute resources. For inference, it's more tangible to envision that it will happen on smaller devices, and we can potentially see that decentralized. And with all of the techniques that we talked about, quantization is easier to apply after the model has been trained. Things like distillation, things like speculative decoding; there are, I think, many more ways to reduce the cost of inference and make it work on smaller devices, compared to training.
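Since speculative decoding comes up here, below is a minimal sketch of the idea with toy random distributions standing in for the draft and target models; the accept/reject rule follows the standard speculative sampling scheme, while the vocabulary size and the toy "models" are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def toy_dist(context, temperature):
    """Stand-in for a language model: a random but context-dependent distribution."""
    local = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    logits = local.standard_normal(VOCAB) / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

def draft_dist(ctx):  return toy_dist(ctx, temperature=2.0)  # small, fast "draft" model
def target_dist(ctx): return toy_dist(ctx, temperature=1.0)  # large, slow "target" model

def speculative_step(ctx, k=4):
    """Propose k tokens with the draft model, then verify them with the target model."""
    proposals, draft_probs, c = [], [], list(ctx)
    for _ in range(k):
        q = draft_dist(c)
        tok = rng.choice(VOCAB, p=q)
        proposals.append(tok); draft_probs.append(q); c.append(tok)

    accepted = list(ctx)
    for tok, q in zip(proposals, draft_probs):
        p = target_dist(accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                # draft token accepted
        else:
            residual = np.maximum(p - q, 0.0)   # resample from the leftover probability mass
            accepted.append(rng.choice(VOCAB, p=residual / residual.sum()))
            break
    return accepted[len(ctx):]

# Several tokens can be emitted per expensive target-model pass, on average.
print(speculative_step([1, 2, 3]))
```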
Richie Cotton: Okay. So we're a ways off from training GPT-5 on our own laptops, but maybe some sort of inference is going to be local.
Andy Hock: Yeah, and there may even be some very small-scale local learning that happens on the device, like maybe some learning about personal habits or preferences. But I think Natalia is exactly right that the compute requirements for primary model training today mean it's happening on huge, dedicated clusters that are really built for that purpose.
Richie Cotton: before we wrap up, I'd like to know what are you most excited about in the world of AI? Natalia, do you want to go first?
Natalia Vassilieva: I think I'm most excited, again, about all of those use cases which we don't know about yet. When people ask me where AI can be used, my answer is usually everywhere, because I can imagine applications of AI in every vertical, in every industry, everywhere. But to me, the most exciting thing is when it's not a fantasy but is really implemented, and there are real things which we cannot even imagine today. The richness of applications and what can be done with this technology in the future is the most exciting part.
I believe that AI will change the world very significantly. It will become either much, much better than it is today or much worse than it is today. I'm an optimist. I hope that we'll be able to handle that, and our kids will live in a much better world, and we'll have things like personalized medicine, we'll all live longer, we will be working on what we like to do, and we won't do any boring tasks anymore.
Richie Cotton: No more boring tasks. I like that. And yeah, the goal is a widespread adoption of AI and a better world. That does seem utopian. I like it. Excellent. Andy.
Andy Hock: It's hard to top that, right? I...
Natalia Vassilieva: Bring some reality!
Andy Hock: Yeah, look, I think large-scale AI has this massive potential to transform the way we make decisions, the way we live and work. One of the cool things about working at Cerebras, given the nature of our hardware being the highest-performance training and inference systems in the world, is that customers tend to come to us not with tiny problems, not wanting to make incremental advances on existing solutions; they tend to come to us with these really grand and ambitious, world-changing ideas, right?
We have customers like GSK using and building fundamentally new AI models to accelerate drug development for human disease. We have customers like the U.S. Department of Energy, who are building large and small models alike, new foundation models for science, to better understand physics and biology for things like energy generation and the environment.
We've got customers and partners, like our big strategic partner in the UAE, G42, that are building fundamentally new language models in non-English languages that will reach a far broader population of the world and expand the population of inventors and users of AI systems, and also building fundamentally new models for, say, medicine and healthcare.
So I think the enterprise opportunities to build new models to improve revenue and efficiency in business are huge and transformative in themselves. But I see this untapped potential if we give people the right tools. And at Cerebras, we're tool builders, right? We're building computer systems that other people build upon.
Given the right tools, I think we have an opportunity to unlock the potential in all those really exciting domains. And back to the beginning of our conversation, we're scratching the surface today of what people can do. So I can't wait to see what the next generation of developers builds as systems like ours, as well as sister systems in the cloud, begin to unlock that potential across enterprise, public, and social-good applications.
It's gonna be exciting.
Richie Cotton: Okay, excellent. So, yeah, definitely agreeing with what you were saying there, Natalia. It's going to change healthcare and the environment and science and everything. All right...
Andy Hock: But I do think there is a huge opportunity to unlock that utopia Natalia was mentioning. I think we are optimists, but we're also pragmatists, right? We know that in order to achieve that optimal outcome, we need to give people the right tools.
We need to build the right tools so that we can not just build the exciting new applications, but also figure out how to do that in a responsible and safe way. I know we didn't talk a lot about that today, but you don't just need high-performance compute to build the right model for your application. You also need a high-performance computing platform to research different data-set mixtures and different techniques, to make sure these models are not just accurate but also responsible and safe, and built on energy-efficient platforms. All these factors come together, and we don't have time in just today's conversation to cover them all.
But I think building the right infrastructure underneath all this is a really important part of not just unlocking those new methods and applications, but also doing that in a way that preserves safety and responsibility and allows folks to do this in a cost-efficient and energy-efficient way.
Richie Cotton: Right. Nice. So, both very utopian visions. Yeah, I'm very excited about this; there's lots of good stuff to come. And I think that's more or less where we started the show, so I think that's where we'll end it as well. Thank you so much, Natalia. Thank you so much, Andy. It was great chatting to you both.
Natalia Vassilieva: Thank you!
Andy Hock: Really appreciate it.