
Creating Custom LLMs with Vincent Granville, Founder, CEO & Chief AI Scientist at GenAITechLab.com

Richie and Vincent explore why you might want to create a custom LLM, the development and features of custom LLMs, architecture and technical details, corporate use cases and much more.
Jul 15, 2024

Guest
Vincent Granville

Vincent Granville is a pioneer in the AI and machine learning space. He is Co-Founder of Data Science Central, Founder of MLTechniques.com, a former VC-funded executive, author, and patent owner. Vincent's corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences. Vincent has published in the Journal of Number Theory, Journal of the Royal Statistical Society, and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI”.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

Search is going to be powered by LLMs. It’s the end of PageRank. We are at a turning point where traditional search methods are being revisited and can be done much more efficiently with large language models. The future of search lies in leveraging the power of LLMs to provide more accurate, relevant, and faster results, transforming how we access and utilize information.

You might have heard terms like mixture of experts or multi-agent systems that are related to this idea of managing these different LLMs. You could create 15 different sub-LLMs, then have a system on top of them that manages all the LLMs. One of my clients actually coined the term LLM router, something to manage these LLMs.

Key Takeaways

1. Custom LLMs can provide specialized, precise answers tailored to specific professional contexts, making them more effective than general-purpose LLMs for corporate use.

2. Implementing an LLM router to manage multiple sub-LLMs allows for efficient handling of diverse topics and user queries within a single system.

3. Organizations should evaluate the trade-offs between building custom LLMs and using off-the-shelf solutions, considering factors like control, security, cost, and expertise required.


Transcript

Richie Cotton: Welcome to DataFramed. This is Richie. Between GPT, Claude, Gemini, Llama, and all the rest, there are a lot of powerful large language models available off the shelf. And yet many organizations decide that they need to create their own custom LLM. Today, we're going to look into when you might want to do that, how you go about it, and what the buy-versus-build trade-offs are.

Our guest is Vincent Granville. He's the founder and CEO of the research lab GenAI Tech Lab, and he's perhaps most famous as the founder of the data news platform Data Science Central. Vincent's a prolific author, having written six books on data and AI in the last three years, most recently "State of the Art in GenAI and LLMs" and "Statistical Optimization for Generative AI and Machine Learning". Vincent was previously the Founder and Chief AI Scientist at MLTechniques.com, and he's worked for Visa, Wells Fargo, eBay, and Microsoft. In short, he's doing cutting-edge AI research, he's a serial builder of businesses and products, and he likes to write books like they're going out of fashion.

I'm very keen to pick his brains on custom LLMs.

Hi, Vincent. Welcome to the show.

Vincent Granville: Thank you very much for the invitation and very happy to share what I've been doing with my LLM technology.

Richie Cotton: Wonderful. So I'd like to know why you might want to use a custom LLM when there are so many sort of standard LLMs out there.

Vincent Granville: Yeah, that's a good question. Initially, when I started to design my own technology, the origin was that I was looking for answers using Google search, OpenAI's GPT search boxes, and various websites like Wikipedia or Wolfram Alpha. And I noticed that a lot of the results I was getting were targeted at non-experts, at what I would call the common man.

So it would be difficult to find specialized references, or to focus on topics that are more technical. It really required either multiple prompts if using a tool like GPT, or quite some time using search tools like Google or Bing. That's what started the idea of creating an LLM from scratch.

Then the other part was that most of these tools use neural networks. The training can be difficult: it takes time and lots of GPU. I've also read that they're using billions of weights or tokens that need to be optimized during training in particular. There can be latency issues.

And I was wondering if it's possible to design something that runs much faster and that potentially could offer better results. And by better, I'm not saying here that my LLM is better compared to everything else. It's better in a specific context: for instance, for busy professional users who want concise, precise, advanced answers to their prompts.

So that's what I focused on initially. Actually, I crawled the Wolfram corpus; Wolfram is probably one of the best websites covering all of mathematics that you can find on the internet. And even without a neural network, I was surprised that I was getting answers to my questions that were, to me, much more useful than what I would get from any other sources, even from the Wolfram search box on its own website.

Or even if I do a Google search and ask Google for results only from Wolfram, I got much better results from what I've been doing. Over time I got a few Fortune 100 customers that were interested in what I was doing, and the benefits of my rather simple architecture became more and more clear.

So just to put some context: initially I focused on a specialized, simple LLM that covered one top category, which in that case was statistics and probability. That's one of the 15 top categories on the Wolfram website. The Wolfram website covers all of mathematics, and some computer science as well.

It's not that comprehensive on statistics, so it needs to be augmented, but that's something we can talk about later. So it started with one of what I would call sub-LLMs. And then, on the Wolfram website, like I said, there are 15 categories. Calculus is another one. Number theory is another one. So you could create 15 different sub-LLMs.

Then you have a system on top of it that manages all the LLMs. One of my clients actually coined the term LLM router for something that manages those LLMs. You might have heard terms like mixture of experts or multi-agent systems that are related to this idea of managing these different LLMs.
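
(To make the LLM router idea concrete, here is a minimal Python sketch of routing a prompt to one of several specialized sub-LLMs. The category names, keyword lists, and scoring rule are illustrative assumptions, not the actual xLLM implementation.)

    # Minimal sketch of an LLM router: score each sub-LLM against the query
    # and dispatch to the best match. Keyword lists are hypothetical examples.
    SUB_LLMS = {
        "statistics":    {"probability", "regression", "variance", "bayes"},
        "calculus":      {"derivative", "integral", "gradient", "limit"},
        "number_theory": {"prime", "modular", "congruence", "diophantine"},
    }

    def route(query):
        """Return the name of the sub-LLM whose keyword set best overlaps the query."""
        words = set(query.lower().split())
        scores = {name: len(words & keywords) for name, keywords in SUB_LLMS.items()}
        best = max(scores, key=scores.get)
        # Fall back to a default sub-LLM when nothing matches.
        return best if scores[best] > 0 else "statistics"

    print(route("how does gradient descent find a minimum"))  # -> calculus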

Also, what I wanted to do is not just return very wordy, long English sentences to the user, but concise bullet points in different sections. One section would be URLs, links or references. Another section would be related concepts. Another section could be synonyms or definitions, stuff like that. And for each of them, I offer the user a relevancy score for each item being returned, so that by playing with some of the parameters, the user can see how they impact the relevancy scores and the order in which things are returned, up to the point that two different users with the same prompt could get two different results if they use different sets of hyperparameters.
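
(As an illustration of the sectioned, scored output described above, here is a hypothetical sketch: results are grouped into sections, each item carries a relevancy score between 0 and 1, and a user-tunable threshold changes what is shown and in which order. The data and parameter values are made up.)

    # Hypothetical sketch: sectioned results, each item with a relevancy score,
    # filtered and re-ordered by a user-controlled threshold (a "hyperparameter").
    results = {
        "Organic URLs":     [("https://example.com/gradient-descent", 0.92)],
        "Related concepts": [("stochastic gradient descent", 0.88), ("learning rate", 0.61)],
        "Definitions":      [("iterative minimization of a loss function", 0.75)],
    }

    def render(results, min_score=0.5):
        """Print concise bullet points per section, sorted by relevancy."""
        for section, items in results.items():
            kept = sorted((i for i in items if i[1] >= min_score), key=lambda x: -x[1])
            if kept:
                print(section)
                for text, score in kept:
                    print(f"  - {text} (relevancy {score:.2f})")

    render(results, min_score=0.7)  # a different threshold gives a different answer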

But to get back to these clients, ABI actually is one of them: AB InBev, a brewery company in Brazil, which is a very big company. One of their concerns was that they want a local implementation. They are concerned about security, data leakage, and stuff like that. So that was an important thing for them.

Another thing was latency. Then there was the issue that if there are hallucinations, or poor results, or wrong answers, it can be a liability for them. It could end up with you being sued or stuff like that, if decisions are made on wrong answers. I mean, it's a liability. And then the ease of implementation: because there's no neural network, when you test, it's much faster to enhance the system, to test new features, stuff like that.

That's something they were interested in. I've mentioned other things, but those are the main arguments that I remember. What's interesting also is that I've been improving and developing this, I call it xLLM for extreme LLM, very fast because of that. There are many features that I've added over time, and I've seen many others, we can call them competitors if you want, but OpenAI, Meta with Llama and stuff like that, or Databricks, they started to kind of copy my ideas.

I'm not saying that they read my documents and then just copied. Some, maybe, and probably they invented things on their own as well. But I was happy to see that they do what I do, with a three-month delay or something like that, and it kind of validates what I'm doing. For instance, you might have heard the concept of a multi-token.

I invented that word like three months before; now it's used by Meta. Mixture of experts, I didn't invent that word, but the concept of a multi-LLM with a number of sub-LLMs, each of them specialized like an expert on some topic, turned out to be called a multi-expert LLM by some of the companies, just to give you an example.

Richie Cotton: It really seems like there are sort of two motivations then for customizing LLMs or creating your own custom LLM. The first one you mentioned was changing the style of the output to suit different audiences. So, for example, you said that you wanted more technical information, and a lot of the foundational LLMs

will talk to the common user. So having that sort of technical language seems like quite an important use case. And secondly, getting that domain-specific knowledge. So in this case, you said you wanted information about the stats and maths.

Vincent Granville: Yeah, by technical, it does not necessarily mean technical. For instance, the corporate clients, what they are interested in is dealing with their corpus. A company like InBev, they have information that employees are accessing regularly. So they have a lot of information regarding HR, marketing, finance, sales, tech, and stuff like that.

They're interested in a sub-LLM for each of them. And we can assume that the user who is asking for something specific, trying to retrieve documents, tables, images, and stuff like that, is not really a beginner. So it's not the common man; the user in this case would typically be an employee of InBev, so he knows what he's looking for and expects to get answers that are not basic, that are going to be really valuable to him or her.

The other issue was the cost as well. Tools like GPT-4 that are based on deep neural networks use a lot of GPU, billions of tokens. And I believe they've not been designed to serve needs like that, like for corporate clients; they may have been designed to replace a Google search or something like that. So they charge by the token, and they have a large number of tokens. There is an ecosystem around this where every API charges by the token, with tons of weights, and companies like NVIDIA are also making a lot from that. And I might be one of the few who went in the opposite direction and wants to make things as inexpensive as possible without jeopardizing the quality, actually even delivering better quality on specific topics.

One feature in the xLLM architecture is that, in addition to the prompt, you can have another box where the user can specify which sub-LLM he wants to query. And by sub-LLM, I mean a category, a top category. Are you looking for information about HR, or are you looking for stuff about legal? And sometimes, some stuff that the user is looking for in one sub-LLM is actually going to be found in another sub-LLM. For instance, take the statistics and probability sub-LLM, which has a mathematical flavor because Wolfram is a mathematical website. If you look for something like gradient descent in the statistics and probability sub-LLM, you're going to find little, maybe nothing.

But there's plenty of stuff like that in calculus. So that's where the LLM router is there to tell the user, explicitly or not, that it needs to get that stuff from a different sub-LLM. And that's one of the things the corporate client was also interested in. So you can have different types of LLMs based on the content, but you can also have different types of LLMs not based on the content.

The content might be identical, but they differ based on the type of information you want to provide to the user. Is it someone who is looking for very highly technical stuff, or more like definitions? To put it this way, the flavor of the LLM could result in different LLMs serving different goals. You might have heard the word "actions" being used in that context by some of the folks.

One thing that I would also like to talk about quickly at some point, I don't know if we have the time to mention it here: one of the key features of xLLM is that it's based on taxonomies and knowledge graphs from the very beginning. So it's kind of the bottom layer. These days you hear that knowledge graphs are becoming popular.

Everybody's starting to use them, but they are putting them on a top layer, trying to build a knowledge graph. In my case, the taxonomy, the knowledge graph, could be related concepts. It's embedded into the website or the corpus itself. For a corporate corpus, you have a breadcrumb.

You know where you are in the tree or hierarchy of categories. You also have related concepts on the web pages: you look for something, and it's going to point to some other links. So you have all this taxonomy and knowledge graph actually there on the website. You can simply crawl it.

So you may say it's kind of cheating; in this case, I've not built a knowledge graph from scratch. You can do that, but I don't have the time to go into details like that. You can augment the taxonomy. The taxonomy can be poor, some parts of the taxonomy can be poor, and you can improve that. But for instance, in the case of Wolfram, you had a pretty decent taxonomy that you can really crawl.

And one of the reasons other folks are not doing it is that with tools like the Python library Beautiful Soup, for instance, it may not be obvious how you're going to retrieve those taxonomy elements. But they are there. It requires a little bit of work, one or two hours, to figure out how to detect these pieces of information and rebuild the taxonomy.

But it can be done. And what is interesting is that all the major corpuses that I've looked at, corporate corpuses, Wolfram, Wikipedia, do have a strong structure. You always hear about "unstructured text"; actually, it's pretty well structured, and you can recover that structure.

It's very similar in all these corpuses. In the case of Wikipedia, for instance, machine learning would be a sub-LLM, and the taxonomy is good, with just some things that you can ignore; statistics, to the contrary, is weaker. What I'm saying here is that you can browse Wikipedia by looking at a kind of top page that tells you what the categories, subcategories, and so on are.

And then from these top pages, as a user, as a human being, you can go down. That's what I mean by the taxonomy that you can recover from the website. On Wikipedia, the statistical science taxonomy is pretty poor; it's better on Wolfram. But I find there are benefits to Wikipedia, in the sense that even if it has a poor taxonomy for statistical sciences, the content is pretty good.

It's more comprehensive than on Wolfram, but Wolfram has a better taxonomy. That's where, I probably don't have the time to go into the details, but in the end you're going to end up not only with content augmentation, but also taxonomy augmentation, or graph augmentation. And if by any chance you're working with a corpus that doesn't have these features, you can use some external taxonomy.
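
(As a rough illustration of recovering a taxonomy from breadcrumbs with Beautiful Soup, here is a sketch. The URL and the CSS selector are assumptions; every site marks up its category trail differently, which is exactly the hour or two of inspection work Vincent mentions.)

    # Sketch: recover taxonomy paths (breadcrumbs) while crawling a site.
    # The URL and the "nav.breadcrumb" selector are hypothetical; inspect the
    # real HTML to see how the site actually marks up its category trail.
    import requests
    from bs4 import BeautifulSoup

    def breadcrumb_path(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        trail = soup.select_one("nav.breadcrumb")   # site-specific selector
        if trail is None:
            return []
        return [a.get_text(strip=True) for a in trail.find_all("a")]

    # e.g. ['Probability & Statistics', 'Descriptive Statistics', 'Variance']
    print(breadcrumb_path("https://example.com/some-topic-page"))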

Richie Cotton: It sounds like there were two really interesting architectural decisions you made there in terms of creating your own LLM. The first one was around using this mixture of experts approach, where you've got a two-stage approach to the LLMs: the first layer just chooses which of the experts are going to give good results, then you've got these experts, which are smaller LLMs, which are going to provide the actual response.

And then the second one was the use of knowledge graphs. So I'd love to get into both of these in a bit more detail, but we'll cover the mixture of experts first. This is the approach that you mentioned has been championed by Mistral, and I believe I've heard rumors that GPT-4 from OpenAI is also based on a mixture of experts approach.

Can you talk me through how you decide what each of the individual experts should be?

Vincent Granville: So, for a corporate corpus, we're working on that right now, but the idea would be that you have pretty well-separated components. I mean, they overlap, of course, but you know, like HR, legal, marketing, sales, all the things that you find in corporations. And you can also have, we're not there yet, but you can imagine an LLM made of multiple sub-LLMs, and the sub-LLMs made of sub-sub-LLMs, a tree like that.

But I don't think we'll get to that point; too much granularity, in my opinion. The risk is with those sub-LLMs if they are too small. It brings up this idea of an optimal size. Too big, you potentially get hallucinations; you have billions of weights or tokens. And what I've found myself, because at some point I created multi-tokens, which are tokens where, instead of "data" and "science", the phrase "data science" is just one token. And then there's what I call a contextual token, like "data science" where the two tokens may not be adjacent; they could be found in the same paragraph, but not adjacent. But when you start working with these multi-tokens or contextual tokens, you can have an explosion in the number of tokens, even in my much smaller system.

What I found in the end is that very few are needed: typically I'm going to have maybe 5,000 tokens per sub-LLM. If I don't pay attention with these multi-tokens, it can grow to 10 million, something like that. And what I found is that most of them are never fetched from the backend tables when you serve a user query.

So they're kind of useless. And even worse, in the cases where they are actually fetched, most of it is noise or garbage; it can corrupt a little bit the answer that you're going to provide to the user. So I think big token or weight tables can cause problems, and I think tables that are too small are also going to cause problems. There's some kind of an optimum somewhere.
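
(To illustrate multi-tokens and contextual tokens, here is a small sketch: adjacent word pairs become multi-tokens, and non-adjacent pairs that co-occur in the same paragraph become contextual tokens. The counting scheme is an illustrative simplification, and it also shows why the token count can explode if left unchecked.)

    from collections import Counter
    from itertools import combinations

    def multi_and_contextual_tokens(paragraph):
        """Adjacent word pairs -> multi-tokens; non-adjacent co-occurring pairs -> contextual tokens."""
        words = paragraph.lower().split()
        multi = Counter(zip(words, words[1:]))        # e.g. ('data', 'science')
        contextual = Counter(
            pair for pair in combinations(sorted(set(words)), 2)
            if pair not in multi and (pair[1], pair[0]) not in multi
        )
        return multi, contextual

    m, c = multi_and_contextual_tokens("data science relies on statistics and on data")
    print(m.most_common(3))   # frequent adjacent pairs
    print(len(c))             # pair count grows fast: the explosion to watch for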

Richie Cotton: I find that approach really interesting in that it mirrors the way you would ask your colleagues for help. So if you think, okay, I've got a problem, I want to ask one of my colleagues what the answer is, and you think, well, who in my organization is going to be the best person to answer it? And then you ask that person.

So in a way, this sort of mixture of experts approach is mimicking that. You decide which LLM, or which LLM expert, you should call, and then that expert gives you the answer. And it sounds like in a corporate situation, each of these smaller expert LLMs is going to represent maybe the head of department for that particular team.

Like you said, you'd have a marketing expert, a sales expert, an HR expert, and so forth. So the second idea you mentioned was around the use of knowledge graphs. My understanding of these is that they're very helpful for letting the AI understand relationships between different objects. Can you talk me through how that knowledge graph interacts with the neural network part of the LLM?

How do they fit together?

Vincent Granville: Knowledge graphs, there are two things about them. First, they are made of high-quality words, high-level keywords that may be made of two or three tokens, but it's not like written text; it's high quality. Sometimes it's actually been created by humans, like all of these on Wolfram Alpha.

There might be like 5,000 of them, probably more than 5,000 categories, subcategories, and related concepts, which is the knowledge graph. They've been created by human beings, so they're supposed to be of high quality, and they consist of words or tokens that are very clean, of superior quality compared to the words that you're going to find in the text itself. So if you use them in your embeddings and you put high weights on these knowledge graph words, you're going to get better results in the output that you're able to provide to the user, including in the returned related concepts. So that was one of the things: to use them in the embeddings, and you may put higher weights on the tokens that come from the categories or knowledge graph as opposed to the ones that come from straight text. Then the other thing is all the relationships they provide: some users are going to want references or related concepts, and that's where these knowledge graphs can be particularly useful.
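
(A minimal sketch of what putting higher weights on knowledge-graph words could look like when building a simple token-to-weight embedding. The weight values and the set of graph tokens are assumptions for illustration, not xLLM's actual numbers.)

    from collections import Counter

    # Hypothetical: tokens known to come from categories / related-concept links
    # get a boosted weight compared to tokens coming from plain body text.
    KNOWLEDGE_GRAPH_TOKENS = {"variance", "probability", "regression"}
    GRAPH_WEIGHT, TEXT_WEIGHT = 3.0, 1.0   # illustrative values

    def weighted_embedding(tokens):
        """Variable-length embedding: token -> weight, boosting knowledge-graph tokens."""
        counts = Counter(tokens)
        return {
            tok: n * (GRAPH_WEIGHT if tok in KNOWLEDGE_GRAPH_TOKENS else TEXT_WEIGHT)
            for tok, n in counts.items()
        }

    print(weighted_embedding(["variance", "variance", "of", "a", "sample"]))
    # {'variance': 6.0, 'of': 1.0, 'a': 1.0, 'sample': 1.0}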

Richie Cotton: So, just to make sure that I've understood this correctly: the idea is that because the knowledge graph shows relationships between different concepts, it's then going to link to different related bits of text. So if you ask a question about one thing, and there's some text that sits in some part of the knowledge graph, then it's also going to return ideas that are in adjacent nodes in the graph.

Is that about right?

Vincent Granville: It is the ability to jump from a node to sub-nodes, to sub-sub-nodes, up or down: higher level, lower level, parallel. The risk is that you can return a tremendous amount of information to the user, even for answers that are extremely concise, like bullet items. If you ask it to go two levels deeper, two levels above, and stuff like that, you can end up with massive output returned to the user.

It could be of some value; in my case, actually, I allow for that, and you can save the results as a text file because the output can be pretty big. But the challenge is to find some kind of optimum, maybe to stick to the closest possible items in the knowledge graph. And everything here comes with a number, not necessarily an integer, a score, which is a real number, a positive number typically, or maybe normalized between zero and one.

And that score is going to tell you the relevancy to some original piece of your prompt. Your prompt might have multiple pieces, maybe multiple embeddings that you can derive from it. And if you have unrelated concepts, if you're asking for two different concepts in your query, it would be better to separate them.

But then you need to be careful about the amount of information you're going to return to the user because of the knowledge graph. A knowledge graph is essentially global context: from one query, you could return the whole thing. You know, it's like human beings.

You are connected to any other human being on earth within a six-link connection. You've probably heard the story that we are all six links apart from anybody else on earth. So, yeah.
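
(Here is a small sketch of the depth-control idea: a breadth-first walk of a toy knowledge graph, limited to a maximum number of links, with a relevancy score that decays at each hop. The graph and the decay factor are made-up examples.)

    # Sketch: breadth-first walk of a small knowledge graph, limited to max_hops
    # links, with a relevancy score that decays at each hop. Graph is illustrative.
    GRAPH = {
        "gradient descent": ["learning rate", "convex function"],
        "learning rate":    ["step size"],
        "convex function":  ["optimization"],
    }

    def related(start, max_hops=2, decay=0.5):
        """Return {concept: relevancy}, decaying relevancy by `decay` per hop."""
        scores = {start: 1.0}
        frontier = [start]
        for _ in range(max_hops):
            next_frontier = []
            for node in frontier:
                for nbr in GRAPH.get(node, []):
                    if nbr not in scores:
                        scores[nbr] = scores[node] * decay
                        next_frontier.append(nbr)
            frontier = next_frontier
        return scores

    print(related("gradient descent", max_hops=1))
    print(related("gradient descent", max_hops=2))  # a deeper walk returns much more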

Richie Cotton: Okay, that's interesting, because I think one of the most common complaints with these large language models is that you give them a simple question and they just return paragraphs and paragraphs of text, and you think, well, it answered that in the first sentence, I didn't ask for all the rest.

So I can certainly see how, making use of these knowledge graphs, you could end up with way too much text, and that's going to give really verbose responses.

Vincent Granville: And so with a knowledge graph, one of the issues that can cause it to return tons of output is if you focus on individual tokens. You look at all the information attached to, say, data science: you return everything that's related to data, you return everything that's related to science, but you should focus on what's related to data science.

Here I'm providing a simple example, but you get the idea. The problem with LLMs that are pretty small is that for some keywords you are going to have information for one individual token, but not necessarily for the joint token. That's probably one of the reasons for the push towards the biggest possible LLMs with a massive amount of input sources and weights: to be able to have, in your database, a word consisting of three or four tokens with enough information about it to return useful information.

Another way to do that is to have a specialized LLM, like an expert LLM on some topic, and then a bunch of them. Like I said, with the statistics and probability sub-LLM, you have something like 15 of them and you cover all of mathematics. And here the focus is knowledge management.

I'm not trying to produce code or stuff like that. It's more like, it's actually very useful as a search. And then with 2,000 of them, you essentially cover everything.

Richie Cotton: Okay, wow. So you're mentioning that you can consider the phrase data science as a single token. Suddenly something's clicked in my brain, because the recent announcements from Meta about Llama 3 and from OpenAI about GPT-4o both mentioned the idea that they're doing embeddings more efficiently, where the number of words you put in corresponds to fewer tokens than before. I hadn't really understood why that was the case, but it sounds like if you've got data science being one token rather than two, then that becomes more efficient in the model.

And then it's going to be cheaper for people to use as well, because I think, particularly with OpenAI, they charge per token of input and per token of output. So you're going to get more relevant results because it's considering the phrase rather than two words separately, and it's going to be a bit cheaper for you.

So that sort of all makes sense now. All right. So maybe we'll move on from the technical talk to how and why you might want to get started with a custom LLM. What sort of organizations are you seeing make use of custom LLMs?

Vincent Granville: Yeah, so the ones that have been most interested in what I'm doing are going to be clients, as opposed to companies that design AI engines. So I mentioned InBev; it's one of them. Morgan Stanley is another one that showed interest, and it was actually the same thing.

They were interested in managing their corpus and returning meaningful results to users, much better than what they already have in place, because they already have something in place. So, two things, now that I'm thinking about it. You mentioned efficient embeddings, so multi-tokens are part of it.

Being careful not to make the number of tokens explode by using multi-tokens, that's a risk. And one of the features in xLLM is variable-length embeddings. I did not invent the concept, but not many people talk about it. It might not be that friendly if you use vector databases, which assume a fixed length for embeddings.

But when you use variable-length embeddings, they can be much shorter on average, so it's going to be faster, because those fixed-length embeddings in the big systems might be 8,000 tokens per embedding, something like that. Then another component, and I've heard some people talking about this idea, though not many yet, is the prompts. So you use the prompts as input to augment the data, to put it this way. You have millions of queries. You can analyze them, and multiple interesting things come from that: you're going to see what the users are looking for.

This might help you refine your taxonomy, or find things like: oh, a lot of users are looking for this stuff, and this stuff is not in the corpus. So maybe we need to augment the corpus with external sources that answer it. Maybe we need to create new pages in the corpus to answer these questions for the users.

So that's another interesting component. And then the other component that some people are also talking about, you might have heard the word, is a caching mechanism. So you look at the prompts: what are the typical embeddings, what are the top 1,000 or top 5,000 embeddings that can be derived from the prompts?

And you just put them in a cache; there's a caching mechanism. In real time, when you deliver these results to a lot of users, you create embeddings derived from the user prompt, and then you need to find the ones in the backend tables that are close to those coming from the user prompt.

That can take time, especially if these tables are immense, so a caching mechanism can help as well. Variable-length embeddings can also increase speed. The way I've been doing this with variable-length embeddings is nested hashes. A hash is a dictionary in Python, so it's a key-value database, but here the value is a hash itself. It can go two or three levels down; the key can be a hash itself. It's very efficient for handling sparse data. What I mean by sparse is that if you have a million tokens or a billion tokens and you look at the associations between one token and another token, for the immense majority there's no association.

You pick one token at random and another token at random: there's no association. That's what I mean by sparse, and the tokens that are really connected somehow, associated, in these huge lists of tokens represent a minority. To put it a different way, let's say you have 1 billion tokens and you want to create a table of associations between tokens: 1 billion by 1 billion would be the dimensions of the table, but in reality there might be only 3 billion relationships that exist in the corpus. So those nested hashes are an efficient way to deal with it. It's a format that's a little bit similar to JSON, but you can also use graph databases to represent that kind of structure.
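
(A minimal sketch of the nested-hash idea for sparse token associations: only pairs that actually co-occur get an entry, so the billion-by-billion table never materializes. The sentences and counts are illustrative.)

    from collections import defaultdict

    # Nested hash (dict of dicts): assoc[token_a][token_b] -> co-occurrence count.
    # Only token pairs that actually co-occur are stored, so the structure stays
    # tiny even when the full pairwise table would be astronomically large.
    assoc = defaultdict(lambda: defaultdict(int))

    def add_sentence(sentence):
        tokens = sentence.lower().split()
        for a in tokens:
            for b in tokens:
                if a != b:
                    assoc[a][b] += 1

    add_sentence("data science relies on statistics")
    add_sentence("statistics relies on probability")

    print(dict(assoc["statistics"]))             # sparse row: only real associations appear
    print(assoc["data"].get("probability", 0))   # pairs that never co-occur cost nothing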

Richie Cotton: Okay, a lot to digest there. So, let me see if I've got this. It sounds like the most important use case, or the most common use case you've seen for these custom LLMs, is corporations that want to have some sort of Q&A over their own corporate knowledge systems. It seems like managing what you know as an organization is incredibly tricky.

And I like that you mentioned the sort of feedback there, where if the AI gives poor results, then it's probably going to tell you you haven't got documentation on whatever it is people want to know about.

Vincent Granville: Yeah, and with any type of documentation, which could be people looking for definitions, people looking for case studies, for examples, for code. Note here that my xLLMs are not generating code, but they're going to retrieve the code that users are looking for. Then there is search in general, which is becoming a big topic these days, especially advanced search or search for experts, essentially replacing Google.

You might have heard that OpenAI, I don't know where they are right now, but they seem to be interested in going to war against Google over search. And even myself, if you look, there's a sample of my output that I can share later; it's everywhere anyway, it's easy to find.

But the first section that I return to the user, not for the corporate clients, but in the kind of research version, is URLs, and the title is Organic URLs. What that implies is that revenue could come from sponsored URLs, and I actually do have a client who's very interested in this idea.

So that's interesting.

Richie Cotton: I'm assuming there would be an interesting economic model where you have paid or promoted search links within the AI response. I'm not sure people want that, but I suppose the economics of generative AI...

Vincent Granville: Yeah, there would be different layers. There would be one like pretty much everybody is doing in tech these days: you have a free version that comes with advertising, and you may have a paid version that comes with an SDK, or you get full access to the backend tables, the embeddings, and stuff like that.

It gets updated more frequently if you pay, just the way it works right now. But I wouldn't charge by the token; I want to avoid that.

Richie Cotton: And I suppose, thinking about it more, if I was shopping for new fancy shirts and asking the AI what to buy, then having some links to the actual products would also be helpful, rather than it just naming some brands.

Vincent Granville: Yeah, I'm not going in that direction. I will leave that kind of stuff to Google and the other folks. It would be more for scientific topics, in biology, medicine, computer science. I'm not planning, for instance, to handle queries looking for the best restaurant around your place or stuff like that.

So it's more like knowledge, essentially, as opposed to anything else.

Richie Cotton: Actually, while we're talking about the economics of this: a lot of the technical things that you mentioned, like the variable-length encoding, caching, and more efficient use of sparse data, all point to the fact that generative AI is still quite an expensive technology, both for training models yourself and for running these large language models. It's quite computationally expensive.

So can you maybe give me an overview of where these big costs come in? Like, what are the most expensive parts of working with generative AI?

Vincent Granville: Yeah, they are using billions of weights. A weight could be a number that's attached to a token, so there are a lot of updates. What it is, is that you have a deep neural network, and when you change a weight, it means you may activate a neuron in your neural network. That's what eats a lot of the time in these implementations.

That's what requires a lot of GPU, and that's why there's a lot of cost associated with it; that's the main cost. Now, if you have tons of users in real time and you need to deliver answers very fast, there's another cost, because you need an efficient mechanism. That's why you've probably heard terms like approximate nearest neighbor search, which is one of the techniques they use in a vector database to kind of match an embedding coming from a user prompt to an embedding in the backend tables.

But when you have that on a very large scale, you have those costs. Testing and evaluating is also expensive, because when you change these neural networks, retraining is going to take a lot of time. With my kind of implementation, you don't have those costs. I like to joke, I even mentioned it to my client, that you can download the whole internet, or 99 percent of what's going to be useful in terms of knowledge, onto your laptop, and not even the most expensive one; my laptop is, I don't know, maybe six years old.

I could do that on my laptop. What's going to take time in this case is the crawling. You can distribute it, you can make it parallel, but still. Crawling the whole internet is actually a little bit easier; when I crawled the Wolfram corpus, I think it was something like 15,000 URLs.

It's kind of small, so it can crawl really fast, check, check, check, check. It's really fast to extract the whole thing, but you get blocked if they realize you're always hitting the same website. If instead you have plenty of websites, you're not going to hit one website massively at a time.

It gets distributed, so you avoid these problems. And I know OpenAI, they don't even do any crawling; they use OpenCrawl or something like that. So you can also use those external tools. But it brings up another question: should someone like me, or anybody else, be allowed to crawl those websites?

And I mean, sure, it's not illegal, but is it ethical? Is it fair? Should I pay Wolfram? These are important questions, the ethics behind that, and what you're retrieving as well. Me, I'm essentially retrieving knowledge, so I have fewer problems than maybe companies like OpenAI.

If you retrieve personal information and stuff like that, then it's another challenge.

Richie Cotton: Absolutely. Certainly when trying to scrape the whole internet, or scrape everything from some particular website, you've got to be very careful about the legalities, and I'm sure most websites have a notice about what's permissible buried within them. Like, there'll be a robots.txt file or something to tell you what you're allowed to do.

But yeah, it's a very interesting idea that actually, if it's ethical and the website owners are okay with it, you can just get a load of data on something you're interested in, download it all to your laptop, and then go and create an AI, create your own bot from it.

Vincent Granville: Yeah, the case of Wolfram was interesting because initially they blocked me. Obviously I found some workarounds, but eventually they unblocked me. And I believe they must be very happy, because it got a lot of publicity; a lot of people have seen these results coming from Wolfram. But yeah, it was fun.

That's also one big thing: OpenAI is not going to tell you where that content comes from. Me, here, this is the URL. So I believe that in the end they were very happy with that, with me mentioning that it's one of the best websites for everything mathematical, if you want somewhere to start.

Richie Cotton: Yeah, that level of transparency is incredibly important, and it has been kind of lacking from a lot of the big players in the generative AI movement. Particularly OpenAI isn't very forthcoming about what sort of data they use. And I agree with you: the Wolfram site, with their glossary of mathematics, is absolutely amazing. Alright, so, just going back to the cost of things. It seems like a lot of organizations, if they want a custom LLM, have got this choice: do they build something themselves? Do they maybe buy something off the shelf and fine-tune it? Do they get someone else to try and create the LLM for them?

Can you talk me through how organizations should think about this sort of build versus buy decision?

Vincent Granville: If you design your own LLMs, it's not necessarily perfect. There are also some drawbacks that come with it. First, it's not been as thoroughly tested as stuff like OpenAI. If you have an expert who's developing the LLM for your company, what if that guy disappears, or leaves and it's not properly documented, or stuff like that?

So you're facing issues like that. Obviously, the benefit is that it's going to be, or should be, much less costly than the alternative, and there's increased security, and you have a local implementation, though these days many of these big LLM companies are also offering local implementations. So it's not black and white, one better than the other; there are drawbacks to both approaches.

Richie Cotton: Do you have a sense of who you need to create one of these things? Like, is it gonna be a single engineer? Is it gonna be a whole team of people? What do you need to get started?

Vincent Granville: Another issue, actually, with these specialized LLMs, especially if you want high quality, is that I use a lot of customized tables: stop words, do-not-autocorrect lists. If I use unspecialized autocorrect libraries from Python, they cause problems by autocorrecting stuff that should not be autocorrected. So there are tables of stuff not to autocorrect, and I don't think building them can be fully automated.

It's going to require some human interaction: you look at the results you get, and then you find out which stop words you need to put in your table. So there are human beings involved; there's a cost there. I actually made a computation one day that for the whole internet, if I was to cover the whole internet and make it a big, scalable company, I would need something like 20 full-time people

working just on optimizing, looking at results, doing that kind of stuff. So a team of 20 people focusing on that would be the biggest, most expensive component, more so than the engineers and IT. Because someone asked me the question: do you need a lot of human beings to fine-tune your system?

By the way, those companies also use a lot of human beings; I even receive offers myself for testing prompts and stuff like that. So the costs are completely different: for those companies, mostly it's going to be engineers. While in the end my 20 employees paid full time, not cheap employees, they would be experts, it's still fewer than the number of people a company like OpenAI must have.

I don't know how many employees they have. I would assume hundreds, I may be wrong.

Richie Cotton: Yeah, I think there might be thousands of people at OpenAI at this point. But that's interesting, that the biggest cost is what I believe is the reinforcement learning from human feedback step: people who are checking the results of the output and then marking them as good or bad.

Vincent Granville: Especially if you want to do it fast and cover the whole internet. For instance, in the case of a corporate client, I'm going to be doing it myself. I'm not going to spend many hours looking, because it's a very small part of the whole knowledge, so I do those kinds of tests myself. And I don't have those expensive GPU and training costs, because besides the engineers, of course, there's the computer time that OpenAI and Mistral and those companies are facing.

Richie Cotton: That's interesting. So, let's recap which skills you need then. It sounds like you're going to need some people who are involved in a lot of the data processing: data engineers, people who are experts in data management, up front. Then you need some AI engineers to actually build the model.

And then you need some experts to be able to test the model output. Is there anyone else that needs to be involved if you're wanting to create a custom LLM?

Vincent Granville: Obviously you need all the traditional people that you would find in a company. In my case, I'm pretty lucky. It's new, people haven't seen what I'm doing before, so I don't have a sales team; people are knocking at my door. Companies like InBev came to me, Morgan Stanley came to me. I'm also self-funded, but very well self-funded, from my previous company, so I don't have this pressure to try to make money ASAP because I'm going to run out of money. So I don't have that aspect, sales.

Richie Cotton: It sounds like you need all that kind of standard support, so your project managers and your...

Vincent Granville: I do have an engineer, actually, who is working with me designing web APIs for my product. The guy is in India; he was an Oracle principal staff engineer or something like that. But I think you need more ML dev engineers, as opposed to research scientists.

Well, I do a lot of research myself, but I would not need to hire another guy doing what I'm already doing. You need a guy like that, obviously, and if possible a guy who, as in my case, is not just a pure academic research guy: a guy who knows how to code, a guy who's going to be very close to the engineers, like a co-founder, which, you know, I am in this case.

And I think you're also interested, maybe, to know what new Python skills are needed. There's what I call smart crawling. It's not just what you would typically learn in a simple bootcamp or something like that: extracting the taxonomy, retrieving the knowledge graph, and doing it efficiently.

Simply using Beautiful Soup in Python is not enough, so you're going to need some training, possibly, in web crawling with Python. It's certainly very good to be familiar with all the NLP libraries. Of course, on that topic, I faced many issues with those libraries, and some of the stuff that I rewrote from scratch serves my needs in this context better than some of those Python libraries.

Yet you still need to know what they are doing and what drawbacks come with these libraries. In some cases these drawbacks are easy to fix, like do-not-remove-this-stop-word or do-not-autocorrect lists, stuff like that, and at the same time you still leverage those libraries.

Richie Cotton: So, just on that note: these Python natural language processing libraries, I presume you're talking about things like NLTK and spaCy. It's interesting that these are still very relevant in the age of generative AI, because I think some people thought, well, okay, now we've got these big large language models,

I don't need those older NLP tools. So that's interesting that they are still useful in the context of generative AI.

Vincent Granville: Any better tools, any new tools that come along that are going to replace these libraries are definitely worth investigating. It certainly would be nice to find out what new or more specialized libraries come out. Another example: I did use the LLM for clustering and predictions.

Everybody's doing that as well. But I tried the scikit-learn Python libraries to do some clustering based on text, so there's a similarity matrix there. Like I mentioned, I really leverage sparsity; you're really dealing with very sparse data. And scikit-learn, unless I'm missing something, is asking me to build a distance matrix.

So if I have a million tokens, that's a million-by-a-million matrix, and I need to do clustering on that matrix. It's very simple to create that distance matrix out of your similarity matrix based on hash tables, but the problem is you run out of memory. Maybe I'm missing something with scikit-learn, maybe there are some other libraries, but simple libraries like that are not going to work. In my case, I have a very efficient way to do clustering: connected components, which is a graph-related technique that leverages the sparsity.
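
(As an illustration of clustering tokens via connected components over a sparse similarity graph, rather than building a dense distance matrix, here is a sketch using a union-find. The similarity pairs and the threshold are made-up values.)

    # Sketch: cluster tokens by connected components over a sparse similarity graph,
    # avoiding the million-by-million distance matrix. Pairs below the threshold
    # are simply absent. Similarities here are made-up numbers.
    similar = {("mean", "average"): 0.9, ("average", "expectation"): 0.7,
               ("prime", "integer"): 0.8}
    THRESHOLD = 0.6

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in similar.items():
        if score >= THRESHOLD:
            union(a, b)

    clusters = {}
    for token in parent:
        clusters.setdefault(find(token), set()).add(token)
    print(list(clusters.values()))
    # e.g. [{'mean', 'average', 'expectation'}, {'prime', 'integer'}]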

So what I'm saying here, as a summary, is that if you just use the standard libraries, you're bound to run into problems. You need to go beyond that. And at the same time, you don't want to reinvent the wheel; that's a risk also, so you need to be knowledgeable about what's out there and what you can use. You can even use OpenAI to create some code, asking for code to solve some problems.

I did, and in some instances I got some great stuff, actually.

Richie Cotton: So, yeah, it seems like, an understanding of sparse data is going to be incredibly important for this and then understanding knowledge graphs as well. So there's quite a lot to learn. All right. So, before we wrap up, what are you most excited about in the world of generative AI right now?

Vincent Granville: We are at a turning point where a lot of old stuff, like Google search, is being revisited and can be done much more efficiently. I was just writing about this very recently: search is going to be powered by LLMs. It's the end of PageRank.

Richie Cotton: The end of PageRank? That's a bold statement. I think a lot of people at Google might be a little bit frightened by that.

Vincent Granville: Yeah. And you know, I've been working on some other projects as well; there's plenty of low-hanging fruit. People are complaining about how bad many of these tools are. I find this exciting: there's so much to improve, and it's going to be improved. There's a lot of work to do. There are jobs that can be created, and I create my own jobs. People are talking about this technology replacing everyone, but for now, my opinion is that there's plenty of low-hanging fruit. There is a lot of demand; people are talking about billions of dollars associated with the LLM market.

Costs are going down. These systems are being improved. I mean, it's just like the first cars: they were driving 40 miles an hour, you needed to know how the engine worked, and the roads were broken, with potholes and stuff like that. Now driving a car is nothing like that. I believe it's eventually going in the same direction.

Richie Cotton: Wonderful. That's a very optimistic take, I like that. The idea that if there are problems now, that just means there's something to be solved, so there are interesting things to do.

Vincent Granville: Yeah, and especially when some people, and even sometimes myself with mathematics, think that everything's been discovered already, that we're at the end of it, that there's nothing more to discover. That's not true at all.

Richie Cotton: That's wonderful. I love that. So thank you so much for your time, Vincent.

Vincent Granville: Thank you so much, Richie. Very nice talking to you. Thanks for the invite.
