
Accelerating Data Science with Nick Becker, Technical Product Manager at NVIDIA & Dan Hannah, Associate Director at SES AI

Richie, Nick, and Dan explore new battery technologies, the role of data science and ML in material discovery, the integration of NVIDIA's GPU technology, the balance between computational simulations and lab work, and much more.
Jun 1, 2025

Guest
Nick Becker

Nick Becker is a Technical Product Manager at NVIDIA, focused on building RAPIDS and the broader accelerated data science ecosystem. Nick has a professional background in technology and government. Prior to NVIDIA, he worked at Enigma Technologies, a data science startup. Before Enigma, he conducted economics research and forecasting at the Federal Reserve Board of Governors, the central bank of the United States.


Guest
Daniel Hannah

Dan Hannah is an Associate Director at SES AI Corporation. At SES, Dan leads a research program focused on discovering new battery materials using machine learning, chemical informatics, and physics-driven simulations. Prior to joining SES, Dan spent several years as a data scientist in the cybersecurity industry. Dan holds a Ph.D. in Physical Chemistry from Northwestern University and did a postdoctoral fellowship at Berkeley National Lab, where his focus was the discovery of novel inorganic materials for energy applications.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

Physics-based simulations, in most cases, don't tell you whether the molecule is going to be good in a battery. We're not there yet. The only way to know the molecule is good in a battery is to go put it in a battery. But, the same models can tell you, with a pretty high degree of fidelity, whether a molecule is going to be bad in a battery.

Most people are aware that GPUs are great for AI. But it's not as obvious sometimes why it actually makes sense if you're doing data frames or clustering or machine learning or graph analytics. Why do GPUs help there too? It turns out that the real key insight is that GPUs and accelerated computing are really well designed for parallel processing.

Key Takeaways

1

Take advantage of open-source libraries like NVIDIA's RAPIDS to seamlessly transition from CPU to GPU computing, enhancing performance without requiring significant code changes.

2

Adopt GPU-accelerated computing for data science tasks like dimensionality reduction and clustering to significantly reduce computation time, enabling more efficient data processing and analysis.

3

Use surrogate models to bypass expensive physical simulations, allowing for faster and more extensive exploration of potential battery materials, thus accelerating the discovery process.

Links From The Show

NVIDIA RAPIDS

Transcript

Richie Cotton: Hi there, Nick. Hi there, Dan. Welcome to the show.

Dan Hannah: Hey Richie.

Nick Becker: Hey, good to be here.

Richie Cotton: Brilliant. So, I'd love to hear the story about searching for new batteries. Dan, can you tell me what looking for a new type of battery involves?

Dan Hannah: There's actually a lot of moving parts that go into discovering a new battery, and some of those are my focus area, and others are something that's being done by other people at the company or other companies, just kind of depending on what part of the geometry and the architecture of the hardware you're talking about.

So I'll give a really quick overview of the different ways that you kind of plug into the art of discovering a new battery, and then I'll talk a little bit more specifically about what SES and my team at SES are doing. So overall, if you want to make a battery better, ultimately that usually means you either want it to last longer.

So if you're talking about an electric vehicle, I want to drive for a longer distance in between charges, or I wanna improve some other performance-related aspect of that battery. So that might mean for a given amount of weight, I would like to get more energy out, or maybe I'd like the battery to perform better at higher or lower temperatures, for example.

So with all of these performance metrics, there's different ways you can go about trying to improve the battery. So in some cases, those improvements are more at a hardware level. For example, if I can develop a more efficient way of packing battery cells into a hardware enclosure, I can get more energy for a given volume.

And so that's sort of a hardware-level improvement. On the other hand, there are more fundamental improvements as well that are related to the materials that we choose to make the battery out of. A great recent example of this has been a shift towards making batteries out of iron instead of, say, nickel, manganese, and cobalt, which are these rare, expensive minerals that we anticipate will see a lot of supply squeeze.

And so you've seen that there's been this big shift towards what people call LFP; that stands for lithium iron phosphate. Iron is a lot cheaper, and it's a lot more readily available. And this is fundamentally an achievement on the materials science side of batteries: we discovered a new material that we think can give us satisfactory performance at a much lower cost, with some added performance benefits as well.

It's the materials science aspect of this that SES, especially kind of my corner of SES, is really focused on. And in our case, what we're trying to do is improve what's called the electrolyte. In sort of a conventional battery architecture, the electrolyte is the liquid that's in the middle of a battery.

So if you take a battery out of any device in your house, like a AA battery, it's gonna have a negative side and a positive side, and we usually refer to those as the anode and cathode respectively. And then there's liquid in the middle, and that liquid in the middle is the electrolyte. And it has a few very important properties, some of which are often at odds with one another, that need to be simultaneously optimized.

So today, especially for a lot of next-generation battery chemistries, these are things that you'll often hear called lithium metal batteries or silicon anode batteries. The reason we develop these batteries is because they offer a much higher energy density. So what that means is that fundamentally, if I make a car battery out of one of these things, my car can drive a lot farther between charges for a given size or weight of that battery.

The problem is that these high energy density materials pose a big challenge for the electrolyte because sort of by virtue of the fact that they store a lot of energy, they are likely to want to give up some of that energy in the form of a chemical reaction with that liquid in the middle of the battery.

So fundamentally, we think of the electrolyte as the sort of limiting factor in being able to productionize these more advanced battery materials. For the anode and the cathode, we have good solutions. We know what we would like to do: we would like to use either pure lithium metal or silicon for the anode. And for the cathode,

I spoke to this advancement earlier: you're seeing a lot of people gravitating towards lithium iron phosphate. And so there's been some developments for the two ends of the battery. And then that liquid in the middle kind of remains a lingering challenge that we haven't solved yet.

And so at SES, what I'm trying to do is figure out what we should make that liquid out of. And when we zoom way in to what I specifically am working on, I'm trying to discover new molecules that we can make that electrolyte out of. So some people do what's called formulation optimization, and that's a different, but also very important, part of making that electrolyte.

So formulation optimization is when I sort of have a sense of what ingredients I would like to use; I'm just trying to figure out what ratio I would like to combine them in, or in some cases whether I'd like to add in additional things that may improve the performance in useful ways.

And then, even one step more fundamental than that: what ingredients should I be using? Are there new ingredients out there that are going to be a quantum leap in making that electrolyte more stable? And at SES, we believe the answer to that question is yes. That's kind of our fundamental bet: that somewhere out there in what we call the molecular universe, there are molecules that would be a better choice for the electrolyte.

And the reason we believe that is because if you really dive in and you look at what has been done in the past few decades, we've really only explored, in kind of a principled, detailed way, maybe a few hundred, maybe a thousand on the high end, different molecules to be used in this electrolyte.

And one of the reasons that's been so slow is that for a long time, the only way we could investigate these molecules was to procure them, either by synthesizing them or buying them, going into the lab, and in some cases actually putting them in a battery and measuring their performance. But essentially both our theoretical understanding of how these molecules behave and our access to computing power have increased exponentially, which is one of the reasons we're talking here today.

It's actually become easier to get a much better sense of how these molecules will perform just on a computer, before you ever do anything in the laboratory. So with the advent of computational materials science, and the sort of accelerant that ML and AI have been able to provide to make computational materials science go even faster than it used to, we're now at a point where we feel like we can start to meaningfully explore the molecular universe and maybe find some of those key winners.

So I'll pause there. I'm obviously happy to elaborate, but that, I think, is sort of the process that we're engaged in as we try to improve batteries at the molecular level.

Richie Cotton: Okay, thank you. That's such a fascinating problem. You've got batteries in almost everything, from cars to bikes to almost every kind of gadget. So I can certainly see how having a lighter battery or a smaller battery or something that's cheaper matters; there are a lot of different things you might want to optimize around these batteries.

It sounded an awful lot like this is materials science, so a lot of chemistry, a lot of physics. Can you go into a little bit more detail on what the data science involves? Like, where does the computation come in?

Dan Hannah: Absolutely. So as you noted, we're taking a physics- and chemistry-driven approach to this problem. At the beginning, a lot of this was just driven by human intuition. This was people who went to school for chemistry and physics and chemical engineering and things like that,

just applying what they had learned in courses to try and rationally design the best formulation or the best molecule for this task. Then computers got a little better and our theory got a little better, and we got to a point where we could run physics-driven simulations of how these molecules will perform in a battery.

And that was great for a while. But if you think about how fast these simulations go, or really what I mean is how long they take to run on a computer, and you think about the size of the molecular universe, which for our purposes you could think of as a few hundred billion molecules, probably, and I can unpack that, but applying some common-sense constraints to the size of the space,

you're still looking at about a few hundred billion candidates. At the speed of explicitly simulating those by the laws of physics, it would actually take way too long, even with our current computing architecture and hardware. And so that's where the data science comes in.

That's where the machine learning comes in. We need a way to both winnow down the number of molecules we'd actually like to give more scrutiny to, and at the same time, we also need a way to consider these huge datasets holistically. So if I wanna make a judgment call about a set of a hundred billion molecules, it's really data science that provides the tools of the trade for drawing conclusions about, say, what the distribution of some property is within a particular corner of this dataset.

So that's one piece: just even analyzing this dataset actually requires some non-trivial data science. The second piece where data science really plays a key role is in allowing us to bypass these expensive physical simulations. Instead of explicitly simulating how a molecule will behave in a battery, if I do that for a large enough number of molecules, and I do that for a sufficiently diverse set of molecules, I can reasonably train a machine learning model that just predicts the answer I want,

to within a useful degree of accuracy. So a lot of times people in the community will refer to these as surrogate models. Rather than running the physical simulation explicitly, I'm actually just training a machine learning model to give you the answer you would've gotten from running that simulation, but much, much faster.

And that speedup, which is often several orders of magnitude, these are non-trivial speedups we're talking about, that speedup is really key to allowing us to access those scales of billions, hundreds of billions, where we could credibly claim to have exhaustively explored a space to find the best candidates for a given application.

And that's true for batteries, but that's also true for various biological applications and various other kinds of molecular technologies.
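
To make the surrogate-model idea concrete, here is a minimal sketch with synthetic stand-ins for the real fingerprints and simulation outputs (this is not SES's actual pipeline): a regressor is trained on molecules that have already been simulated, then scores new candidates far faster than the simulation itself.

```python
# Minimal surrogate-model sketch. The fingerprint matrix and the
# "simulated property" are synthetic stand-ins for real molecular
# fingerprints and physics-simulation outputs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(10_000, 512)).astype(np.float32)
simulated_property = fingerprints[:, :32].sum(axis=1) + rng.normal(0, 1, 10_000)

X_train, X_test, y_train, y_test = train_test_split(
    fingerprints, simulated_property, test_size=0.2, random_state=0
)

# Train on molecules we have already simulated...
surrogate = RandomForestRegressor(n_estimators=200, n_jobs=-1)
surrogate.fit(X_train, y_train)

# ...then score unseen molecules at a fraction of the simulation's cost.
print("held-out R^2:", surrogate.score(X_test, y_test))
```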

Richie Cotton: So I'd like to talk a bit about the technology you're using. I know you're using NVIDIA technology. Nick, do you wanna talk about how NVIDIA tech is supporting this search for better electrolytes?

Nick Becker: So I think one of the big pieces here is that NVIDIA technology is broad, and it's really built on top of the CUDA platform. And on top of this CUDA platform, which many folks are likely familiar with, there's a whole ecosystem of what we call CUDA libraries, or CUDA-X libraries: hundreds of libraries.

And in this world, there are libraries for very specific things, like application-specific tools for molecular dynamics or computational fluid dynamics and things like that, and also more general-purpose libraries that are really about bringing accelerated computing into entire domains, regardless of what you're doing.

So in this space, one ecosystem of libraries is called RAPIDS. It's a set of open source libraries designed to bring accelerated computing into the world of data science and machine learning, to complement the acceleration that the world of AI already has, which we've built up over the past 10 years as a community.

And so this RAPIDS ecosystem has libraries for tabular data processing, machine learning, dimensionality reduction, clustering, all of these different tools and techniques that folks like Dan and Dan's teams can tap into, to be able to take the work that they're doing with their CPU-based tools and go 10, 50, even hundreds of times faster by using accelerated computing on the NVIDIA platform.

And what's really important with these libraries is that you don't need to reinvent the wheel. You don't need to change your code. You can actually tap into this using the tools you're already familiar with by flipping the switch, so to speak, with no code changes required. So this is just one part of the ecosystem of NVIDIA's accelerated computing libraries and tools.

There are other techniques and more; I'll mention one other in passing here that Dan and the team at SES are using, called NVIDIA ALCHEMI, which is really about making it easier and more effective to bring those AI techniques to bear for all the work that Dan was talking about just a few minutes ago.

And so the combination of things like NVIDIA ALCHEMI, RAPIDS for accelerated data science, and a whole bunch of different capabilities in the NVIDIA NeMo platform and ecosystem of frameworks around how you actually prep the data for these AI models: these are some of the foundational pieces that are available to everyone. But in particular here, we're excited to work with Dan and Dan's teams to help explore the molecular universe.

Richie Cotton: So CUDA is like your low-level tech, then, for interfacing with the GPU, doing kind of broad computations, and RAPIDS is the more data-science-focused stuff that's on top of that. So I'd love to hear more about when you actually need a GPU. Like, what sort of data science problems are amenable to GPUs?

Like, when do you need this stuff?

Nick Becker: Yeah, so I think this is actually a really counterintuitive point for a lot of people. I think most of us, and I suspect most of our listeners, are aware that GPUs are great for AI. And if you're using AI frameworks, of course it makes sense to use accelerated computing and GPUs.

But it's not as obvious sometimes why it actually makes sense if you're doing data frames or clustering or machine learning or graph analytics. Why do GPUs help there too? And it turns out, I think, that the real key insight here is that GPUs and accelerated computing are really well designed for parallel processing.

And it turns out that a lot of problems have latent parallelism that's not quite as clear at first glance, but when you decompose the problem, it becomes actually very obvious. So a perfect example of that: I think many folks here have likely used structured or tabular data before. If you're taking tabular data, you're processing that data, maybe you're using your favorite data frame library, depending on what programming frameworks you or your teams use.

Data frames frequently need to be connected together. Imagine people who are analyzing data from transactions or consumer applications or web impressions. Suddenly you wanna understand, well, how does the weather in this city affect this? I need to connect the weather dataset to my internal dataset.

That's often called a join or a merge. And when you think about that operation, you might be connecting datasets that have not just a few hundred rows or a thousand rows, but millions of rows or even billions of rows. It's a huge computational problem. And that operation, that computational problem, has a ton of underlying parallelism in it.

Think about joining datasets together: what you're actually doing is taking columns in different datasets, comparing rows in one dataset to rows in the other, looking for matches. If they match, you're gonna take some action, maybe connect them together. If not, you'll move on. But each of those comparisons is independent.

And so it turns out that type of pattern is something that pops up all over the place, whether you're doing AI, clustering, data frame operations like joins and group-by aggregations, and more. So what we've done is build up, over the last several years, an ecosystem of accelerated computing tools and libraries that make it possible to tap into the power of that parallelism you get with the CUDA platform on NVIDIA systems, without needing to be an expert.

You can stay in your favorite programming languages like Python, or C++, or Apache Spark if you're in that world, and just flip the switch and get going.
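
As a minimal sketch of the join Nick describes, here is plain pandas code on synthetic data; run unchanged under the cudf.pandas accelerator (for example with `python -m cudf.pandas script.py`), the same merge executes on the GPU.

```python
# The same pandas join runs on the GPU when the script is launched
# under the cudf.pandas accelerator; the data here is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000_000
transactions = pd.DataFrame({
    "city": rng.integers(0, 500, n),
    "sales": rng.random(n),
})
weather = pd.DataFrame({
    "city": np.arange(500),
    "avg_temp": rng.normal(15.0, 10.0, 500),
})

# Each row comparison inside the hash join is independent, which is
# exactly the latent parallelism a GPU exploits.
joined = transactions.merge(weather, on="city", how="left")
print(joined.groupby("city")["sales"].sum().head())
```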

Richie Cotton: Okay, so that's kind of interesting, the idea that something as common as a join can be done in a parallel way. I have to say, trying to reason about this myself, that sounds like something that's very hard to do. What do you need to do if you want to exploit this implicit parallelism?

Nick Becker: I think the first thing to say on this is, hopefully it's actually gonna be even easier than trying to reason about whether you can implement it yourself. Our goal at NVIDIA, and as part of this open source community that we work closely with, is to make it possible for you to just grab the tool that's gonna be able to handle all this for you.

So as a perfect example of that, we're talking about data frames and joins. Maybe you're using some of the most popular data frame libraries in Python today: pandas, which is maybe the world's most popular data frame library, or Polars, one of the more recent, increasingly popular libraries, or even the Apache Spark world for big data processing.

Whatever you're doing, you don't need to worry about how to handle that implicit underlying parallelism. You can rely on the fact that we've done the work to build this CUDA-powered foundation, the CUDA data frame world, and wrap it up so that you can actually just take advantage of it whenever you want to go faster to improve productivity.

Or the other side of that coin, if you have things that are running expensively, using accelerated computing can knock those costs down.

Richie Cotton: Okay, so the parallelism is something that's just sort of taken care of; I don't need to worry about that myself. That's very reassuring. Do you just wanna give us a quick overview of how you get set up? So suppose you're using pandas already; how do you switch to the RAPIDS version?

Nick Becker: So I think the first thing is, if you're using pandas already, awesome. You might have your favorite workflow. Maybe you're working on your laptop or a desktop, maybe you're in the cloud. Wherever you're doing your work, our goal, and what we've been able to create together with the open source community, is an experience that really has two steps.

The first step is to decide: is this something that I actually want to go faster? Maybe you're only processing a couple hundred rows and you're using pandas and it's great, and you're just totally happy. That's okay. But as you start to think about datasets getting bigger, that's when this starts to become really powerful.

And so as your datasets are growing into the tens of thousands, hundreds of thousands, millions, sometimes billions of records, and you heard Dan talk about a molecular space of millions and billions of candidates, at this point there's often no alternative. And so what you can do then is think: how can my organization get access to NVIDIA GPUs?

And maybe that's through centralized resources. Maybe you have a data platform team, maybe your workstation or your desktop has an NVIDIA GPU in it, or you're in the cloud. Wherever you want to do your work, this ecosystem is ready for you. And as soon as you wanna make that switch, no code changes are required.

You just need to get up and running on a system with an NVIDIA GPU, whether you're in a container or running bare metal; it's up to you. Just install the packages and you're off to the races.
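
For a pandas workflow, the "flip the switch" step Nick describes might look like the following sketch; the exact install command depends on your CUDA version, so treat the package name as illustrative and check the RAPIDS install guide.

```python
# Hedged sketch of enabling GPU acceleration for existing pandas code.
# Install (package name varies with CUDA version; see the RAPIDS docs):
#   pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
#
# In a Jupyter notebook, load the accelerator before importing pandas:
#   %load_ext cudf.pandas
# For a script, no edits at all:
#   python -m cudf.pandas my_analysis.py

import pandas as pd  # unchanged code: GPU-backed where supported,
                     # falling back to CPU pandas everywhere else

df = pd.DataFrame({"x": range(1_000_000)})
print(df["x"].mean())
```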

Richie Cotton: Dan, do you want to talk us through examples of specific data science problems you're solving, and how you make use of RAPIDS?

Dan Hannah: Yeah, absolutely. So one that I think is very relevant here, which is sort of what drove us to initially adopt RAPIDS in our workflow, was this problem of applying dimensionality reduction to a huge dataset. So what does dimensionality reduction mean? It means that I have a very high dimensional dataset; each row in my dataset is described by a large number of columns, effectively.

So for a scatter plot, you know, an X-Y plot, each row in your dataset is only described by two columns, your X value and your Y value. But datasets can get out to a very high number of dimensions; in our case, this is hundreds of dimensions, and I can explain what the dataset is in a moment.

But fundamentally, when you wanna visualize a dataset like that: we can't see things in 500 dimensions, we think in two or three dimensions when we reason spatially and look at things. So in our case, we were applying what's called a dimensionality reduction technique. That's a way to take that 500-some-dimensional space and compress it down to two dimensions.

And the trick is doing it in a way that preserves the structure from the high dimensional space. So what that means is that, say, if two of my rows of data are close together in the high dimensional space, I'd like them to be close together in the low dimensional space as well. And I would like for all the structural elements to be preserved. There are a number of different algorithms for doing this, each of which has its own set of trade-offs and advantages.

A lot of people are probably familiar with principal component analysis, or t-SNE, which is slightly more advanced and is sort of solving a slightly different problem, but fundamentally they're both dimensionality reduction techniques. So we're applying a more recent one known as UMAP; that stands for Uniform Manifold Approximation and Projection.

But the problem it's solving is the same: how do I take this high dimensional dataset and compress it down to two? And I can talk very briefly about the reason we wanted to do this, just to provide a little bit of context, because I actually think this is a really cool and sort of surprising problem.

We actually went down this road because we were trying to figure out how to inject human intuition into a pipeline we had already fully automated. So in our first version of this molecular discovery pipeline, it was headless. It was something where you would enter some things on a command line and you would get back a list of answers.

And there were all sorts of machine learning models and physics models chained together on the inside, doing various things to try and figure out which molecules would be best for your query, essentially. And we were getting back a list of molecules, and we were finding that functionally they were all too similar, actually.

When we went and discussed these results with our experimental scientists, they said, you know, these are great, but a lot of these variations seem to be variations on kind of a similar motif. And that's risky, because if there's something bad about that motif, fundamentally you're gonna waste a bunch of time if you screen all 100 of those in the lab.

So ideally you want to increase the diversity of the dataset. You want a lot of different motifs represented, to de-risk that massive time and monetary investment of actually testing these in the lab. So we started down this road of trying to automate that, saying, well, okay, how do we automate this idea of bringing in different motifs?

And it turns out that chemical intuition is extremely complicated. The series of decisions that an expert chemist makes in their mind when evaluating a molecule is nuanced, and it varies from molecule to molecule, even for a given property. Take melting point, for example: the underlying physics that determine one compound's melting point may be different from the physics that determine a different compound's melting point.

So effectively, where we landed was, our experimental scientists said: for now, this idea of the appropriate amount of structural diversity in an array of lab candidates, you just know it when you see it. So now we had to think, okay, well, what do we do?

So we actually needed a way to allow our human scientists to inject their chemical intuition into our screening pipeline. We wanted to bring some sort of human feedback into this. And to do that, we needed to give our human scientists a way to look at how all these molecules in the dataset related to one another.

And that's where we got into this business of dimensionality reduction, because ultimately, our pipeline produced a fingerprint for each molecule. A fingerprint is basically an array of numbers that describes each molecule. So for those of you who might be coming at this from more of a machine learning or data science background, you can think about word2vec.

Right? word2vec has something called an embedding space. That embedding space describes each word with an array of numbers. So you can pull out the latent space of your neural network and see that, oh, each word has a representation; that's just these parameters. And the molecular fingerprint is no different.

It's a huge series of numbers that we trained to generate an optimal representation of each molecule in the dataset. In our case, these were specifically fingerprints trained to capture chemical structure. So we thought, okay, well, maybe the right way to do this is to give our human scientists a map that they can actually navigate.

They could say, well, here's one area of the chemical and structural space; maybe we've adequately sampled it if we just pluck a couple out of it, so now I wanna move somewhere else. But the relationships that govern chemical structure are complex, and in many cases involve second- and third-order sorts of interactions.

And so that's why we wanted to build a map that was based on these rich representations. So in order for our scientists to actually visualize that, that's where this came in. We wanted to take this 512-dimensional dataset and squash it down to something our scientists could look at as a scatter plot, and that's where UMAP comes in.

Here's the problem: the molecular universe is big, even for a subset of molecules. Even if you zoom in on a very specific application and a very specific use case, you're still looking at, oftentimes, tens of millions of rows of data.

And 'problem' is maybe too strong a term, but one of the challenges with UMAP is that you do not know a priori what parameters are going to be best. Oftentimes with these unsupervised approaches, or I should say dimensionality reduction (clustering is part of this, but that comes later),

with these dimensionality reduction techniques, you're kind of looking for a result that matches your intuition. You have some kind of intuitive sense of how you would like to see this data, and you need to find the parameters that match that intuition. So what do you need to do? What does it amount to computationally?

It's a parameter sweep. I need to take the adjustable parameters in this UMAP algorithm, and I need to either do a grid search or some other kind of searching in order to find the best parameters for my dataset. This is why UMAP got complicated: on our data, with those 500-some dimensions and millions, in some cases tens of millions, of rows, running UMAP on a CPU would take four to five, sometimes six hours for a single run.

And if you think about a grid search over three adjustable parameters, that becomes prohibitive. And with the pace of materials development at SES and the types of problems we're trying to solve, it was basically a non-starter to sink over a hundred computing hours into every dataset that we needed to investigate.

So we were sunk in terms of that approach. We were very excited about this idea, but scaling it seemed impossible. And that's where RAPIDS came in, right? RAPIDS, as Nick described just a few minutes ago, gives us a way of doing this that is much faster: tens of times, in some cases hundreds of times, faster.

And now each UMAP run, instead of taking four to six hours, takes on the order of minutes. And now you can actually do that grid search appropriately and get a UMAP that you have confidence in. You get a visualization of this dataset where you're confident in the relationships that are being captured.

It's beautiful, because once it's tuned properly, you get these islands that have really interesting groupings of molecules, and it turns out to be a great way to bridge the gap between what our machine learning models are doing and what our human scientists are trying to tell them.

Ultimately, RAPIDS enabled this by making those scans fast; each set of parameters takes only a couple of minutes, so now we can get the UMAP working correctly. And as a bonus, Nick described this, but I wanna double-tap on it because it was a huge quality-of-life improvement:

you could do this with no code change. We already wrote the code to do this using the CPU implementations of these libraries, and it was beautiful. We just switched to RAPIDS and it works exactly the same way. It was not only much faster, but it was also very easy to use. So that was super cool.
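
A minimal sketch of the kind of parameter sweep Dan describes, using cuML's GPU UMAP; the random matrix here stands in for the real 512-dimensional molecular fingerprints, and the grid values are illustrative.

```python
# Hypothetical UMAP grid search with cuML's GPU implementation.
# The fingerprint matrix is a synthetic stand-in for real data.
import itertools
import numpy as np
from cuml.manifold import UMAP

fingerprints = np.random.rand(100_000, 512).astype(np.float32)

for n_neighbors, min_dist in itertools.product([15, 50, 200], [0.0, 0.1, 0.5]):
    reducer = UMAP(n_components=2, n_neighbors=n_neighbors, min_dist=min_dist)
    # Minutes per run on a GPU versus hours on a CPU makes the sweep tractable.
    embedding = reducer.fit_transform(fingerprints)
    np.save(f"umap_nn{n_neighbors}_md{min_dist}.npy", embedding)
```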

Richie Cotton: Okay, that's nice. It is an important point that if the technology just works, you can go faster as well; that's gonna speed up doing the science. Also, I'd love to give a shout-out to UMAP. I think a lot of people think all the innovation's happening in the AI space.

I know UMAP's a couple of years old now, but it's a data science algorithm that's new, and it's been kind of fundamentally game-changing.

Dan Hannah: Yeah, absolutely.

Richie Cotton: Alright, brilliant. Just to make sure I understood this sort of flow: you start with this huge dataset, what was it, a hundred million molecules or something?

Dan Hannah: Yeah. Oftentimes, tens of millions.

Richie Cotton: Yeah. So then you're doing some kind of embedding, word2vec for molecules. You've got a big matrix, I guess, and then you're doing dimensionality reduction to something that humans can visualize. So what do you find from this? What do you learn about these molecules?

Dan Hannah: Yeah, absolutely. So what's really interesting about this is, if you generate this map, so now you're looking at a 2D visualization of the molecular universe, or in many cases a subset of it, you actually find something really interesting about what has already been investigated.

I mentioned earlier that if you dig into the literature, you find it's really just on the order of hundreds, maybe a thousand molecules that have been meaningfully investigated. So when you get down to brass tacks and you try to visualize the space of electrolyte-relevant molecules, you actually find that, and I'll be concrete here,

there are 23 galaxies. Those are these clusters on this scatterplot where molecules have kind of segregated, and we're able to visualize the segregation, again, because what we're doing is applying that UMAP and then looking at that latent space.

And you find that of these 23 islands, these 23 galaxies, depending on what kind of metaphor you wanna lean on here, the molecules that have already been investigated fall really in just two of them. So one conclusion that we came to is that the space of molecules has been woefully poorly explored.

That was kind of confirming an earlier supposition that we had made. But what's more exciting is that using some of these more expensive physics-based models that I described earlier, you could start to spot-check these things, and you could start to look at, well, what molecules would at least be

potential candidates for use in a battery. So you have this kind of first-order filter that was used to build the whole dataset; that's how you have a hundred million molecules here. Then you can start to apply other filters once you run some of these physics-based simulations. And these physics-based simulations, in most cases, don't tell you whether the molecule's going to be good in a battery.

You know, we're not there yet. At the end of the day, the only way to know the molecule's good in a battery is to go put it in a battery. But they can tell you, with a pretty high degree of fidelity, whether a molecule is gonna be bad in a battery. They're very good at identifying things that are not gonna work, and then you're kind of left with the space of things that may work.

And so we started to do this. We actually took increasingly large subsets of this data, we ran some of these physics-based simulations, and we kept throwing out the bad molecules to see what's left. And what was really interesting is that you see these potentially good candidates distributed among many of these clusters, and there are some of these galaxies that are totally empty.

And so for those, it starts to suggest that's not a good area of the molecular universe to be searching, 'cause there's nothing there that seems to fit the bill. But what's exciting is that there are a lot of galaxies that are totally unexplored but, at least from a computational modeling standpoint, seem to contain molecules that are promising.

And so what came of this is that we're now able to develop much more robust and lower-risk sets of molecules that we actually want to see through and spend the time and money to test in the lab. Because now we can say, well, okay, we've looked at this galaxy, we've got kind of the best performers that we're aware of in this galaxy.

Rather than continuing to draw molecules from this galaxy, we're gonna make sure we get a dataset that samples all these galaxies, because there are so many things that can happen when you actually take this molecule and put it in a real, sort of warm and wet system where it can react with all kinds of different things.

There may be something fundamental about one galaxy where, oh, it turns out nothing in that galaxy works, but for reasons that are not captured by our current set of predictors and simulators. And this happens all the time, because when you're in a real device, there's a lot that can happen that is not captured when you're simulating the behavior of an isolated molecule or system of molecules.

So to de-risk that, to guard against that, to hedge our bets appropriately, what we're doing is actually trying to sample the universe in a way that is a little less biased towards one galaxy versus another, just in case some of those galaxies have showstoppers that we're not yet aware of.
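
A toy sketch of that sampling idea, with hypothetical cluster labels and scores: instead of taking the global top-N candidates, take the best few from every galaxy so no single motif dominates the lab shortlist.

```python
# Hypothetical "sample across galaxies" shortlist. Cluster labels would
# come from something like HDBSCAN on the UMAP embedding; scores from
# the surrogate models. All values here are made up.
import pandas as pd

candidates = pd.DataFrame({
    "molecule_id": range(9),
    "galaxy": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "score":  [0.9, 0.8, 0.2, 0.7, 0.6, 0.1, 0.5, 0.4, 0.3],
})

# Top two per galaxy, rather than the global top six, keeps the lab
# shortlist diverse even if one galaxy scores uniformly higher.
shortlist = (
    candidates.sort_values("score", ascending=False)
              .groupby("galaxy")
              .head(2)
)
print(shortlist)
```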

Richie Cotton: This is absolutely fascinating, and I like the workflow you use there. So you start off with the data science, 'cause it's relatively cheap, then you move on to physical simulation, which is a bit more expensive, and then after that, whatever's left, that's when you start doing lab work,

since lab work is even more expensive. So I'd love to get into the adoption process: how do you actually go about using GPUs? You mentioned before that you started with CPUs and then switched to GPUs. Nick, can you talk about what sensible workflows are for adopting GPUs?

I have to say, when I work, it's on a MacBook; there's no NVIDIA GPU in there, so I'm only gonna have access to a GPU some of the time. Talk me through what a sensible workflow for adopting GPUs is.

Nick Becker: I think it's ultimately gonna depend on what your initial setup is, right? So if you're on a MacBook Air, if that's where you do your best work, that's awesome. On that system, just like you said, you're not gonna be able to tap into that NVIDIA capability, and that's okay.

As soon as you think, I want to take advantage of that, now is this sort of decision point where you can decide where to go next. So depending on where your organization is, you might have a system that you can SSH into if you're an engineer or data scientist working with shared resources, or maybe you're in the cloud and you can just use the data platform

your team has set up, or spin it up yourself. Or if you've got a desktop system; increasingly we're seeing desktop systems being used for these kinds of things. And so depending on where you're starting, the workflow changes. And I think a really great way to look at this is: there are libraries for all sorts of different things. We started talking originally about AI libraries and machine learning libraries.

Then we pivoted a little bit to talking about data frames, and there's the CUDA data frame library. There's also the CUDA machine learning library, and that's what Dan was talking about; that brings you that accelerated UMAP and clustering and supervised learning and tree models and more.

So what you're actually doing is the other piece that matters here. You've got your workflow. If you've got an NVIDIA system today, and that's where you're doing your work, maybe you've got a desktop system, maybe you're using the cloud or consistently managed platforms; if you've got an NVIDIA system available to you, the software maturity of this data science ecosystem with accelerated computing

is pretty strong now, and it's really great to see that we've been working with the open source community for years to build this up. So I think it's actually fair to say right now, if you've got an NVIDIA GPU, for all these data science problems you should be default-on for accelerated computing, regardless of what you're doing.

And part of the reason for that is that zero-code-change experience, but there's actually a second piece. It's not just, as Dan mentioned, that the work they were already doing was using a bunch of code they'd already written for CPU libraries, and they could just flip the switch.

That's one half of the coin. The second half of the coin, which makes default-on the right starting point if you've got an NVIDIA system, is that if some of your code can't be accelerated, that's okay. The systems and the libraries are designed so that you can have a seamless experience with CPU fallback.

If, for example, you're using some tool or library, or maybe some specific API or function in your favorite library, that isn't GPU-accelerated, it's okay. When you turn on the GPU accelerator switch, if that individual operation can't be accelerated, under the hood it's gonna handle all that CPU-GPU fallback and moving the data around and going back.

So you don't need to worry about it. So I think if you've got an NVIDIA system, go default-on all the way for this accelerated data science world, and that's really exciting. It's honestly a real step change compared to if we look back years in the past as we were building this out with the community; looking around now, it's a whole different world, and that's great.

If you don't have an NVIDIA system, I think the right question to ask is: would we benefit as a company, as a team, or me personally, if my work were faster, or if we were saving money? And if the answer is yes, I think it's the perfect time to go explore: where's the easiest way for us to get access to these systems?

And once you get access, you can be off to the races again. You can keep using your same code, and whenever you're ready, that's when you can just flip the switch.

Richie Cotton: so, suppose I'm wanting to, , use stats models. One of these kind of like smaller machine learning, , Python packages it's not GPU enabled, so I don't need to worry about like, oh, well I'm writing some code on GPU and some code on CPU and having to manually switch between the two.

It's. All taken care of. Is that correct?

Nick Becker: Exactly. So for example, statsmodels, if folks are using that, is a very statistics-oriented modeling library for Python, a fantastic library. And there are other libraries in the machine learning space, things like scikit-learn, which is maybe one of the most common libraries for general-purpose machine learning.

There's of course UMAP, which we talked about. There's a bunch of libraries for CUDA machine learning. What we've done is take that foundation we've been building over the past five, six years with the community and bring a zero-code-change experience to this world. And this was actually launched recently.

We've had this for data frames for a while, and for graph analytics, but now we've finished the puzzle, so to speak, with machine learning. And that was actually announced about a month ago at GTC, in mid-March. And so with this experience, what that means is that you can be confident that when you flip this switch, all the things for Python machine learning that you're using, scikit-learn or UMAP or HDBSCAN, which is a very state-of-the-art clustering library,

those are all gonna be accelerated, but it's gonna be compatible with any of the third-party libraries you're using. So if you're using your favorite library, maybe your colleague wrote some super cool library and tools that everybody uses on your team, and it uses all this good stuff, it's all gonna be compatible, because it's designed to sit right in the middle of this ecosystem. So you don't need to worry about that. But on the flip side, if there's something that you wish were GPU-accelerated but maybe isn't today, that's the best thing about the open source community. All of us in the NVIDIA data science and RAPIDS group, we are open source contributors.

We are helping maintain a lot of these projects for GPUs and CPUs and the whole ecosystem. It's a community, and so if you wish there was something accelerated, let us know. Get in touch on GitHub; let's figure it out together.
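
The zero-code-change ML accelerator Nick is describing works along these lines; here is a minimal sketch with synthetic data, where the scikit-learn import is left untouched and the GPU comes in at launch time.

```python
# Sketch of running unmodified scikit-learn code under the cuml.accel
# zero-code-change accelerator announced at GTC. Launch with:
#   python -m cuml.accel my_script.py
# or, in a notebook, run `%load_ext cuml.accel` first.
import numpy as np
from sklearn.cluster import KMeans  # unchanged import; GPU-backed under cuml.accel

X = np.random.rand(100_000, 64).astype(np.float32)
labels = KMeans(n_clusters=8, n_init=10).fit_predict(X)
print(np.bincount(labels))
```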

Richie Cotton: Okay, I like that. Certainly, please send Nick a message if there's something you want. Nice. Actually, it seems like a lot of your focus at the moment has been around building up machine learning libraries. Can you give us a preview of what you're working on at the moment?

Nick Becker: Yeah, absolutely. So again, this ecosystem is broad. It is a full accelerated data science ecosystem, with CUDA-powered machine learning in NVIDIA's ML library, cuML; there's cuDF, the CUDA data frames library; and there's all sorts of additional things, including graph analytics and graph neural network accelerators.

Since we're talking about machine learning, and Dan has been talking about the work they're doing with UMAP and other key pieces like that, I think there's a couple of things there that would be great to highlight. So again, all this work is open source, so if you're interested, join the community.

Let's get involved, let's work together. But one thing that we announced as coming at GTC as well involves another library in this ML world that we haven't talked about yet, called XGBoost. For a lot of data scientists and engineers out there, structured data is bread and butter.

When you're processing structured datasets, or datasets that might have some non-spatial structure, we tend to see that tree models, in particular gradient-boosted tree models like XGBoost and LightGBM (there's a bunch of these toolkits), tend to be very effective, and people often rely on them for these kinds of tabular machine learning datasets.

Well, XGBoost is something that I think is one of the most popular, if not the most popular, of all these libraries in this space. And it's, again, an open source library that we develop in partnership with the community. And XGBoost 3.0 just came out, and as part of this, there's a key workstream we've been working on toward making XGBoost work better for larger-than-memory datasets.

So if you step back for a second, one of the coolest things about these deep learning workflows, when you're using PyTorch or things like that, is that you can process data in batches. So you can get a ton of data, batch by batch, bit by bit, through the model, and learn from huge datasets.

But for tree models, historically, it's been really hard to do that. It turns out you kind of need to get the data mostly in memory. There's a little bit of fluidity there, but broadly speaking, you couldn't do that batch training. So one of the things we've been working on is the ability to do that, and it's really powerful.

And the upshot of this is that with a single GPU now, with the right system, you can train an XGBoost model on over a terabyte of data, just on a single GPU. Now, I say the right system because, to do this efficiently, there's a whole bunch of work that goes into it under the hood, and it's designed to tap into the advances in hardware that make it possible to do this.

So, things like coherent memory between the CPU and the GPU on some of those NVIDIA Grace Hopper and Grace Blackwell systems. The takeaway is that XGBoost 3.0, which is now available via pip, with additional packaging coming, is right there. So if you are training XGBoost models on large datasets with large systems, you can now do more with less.

And that's really exciting to us, to help make large-scale dataset training more accessible. It's not magic, so you can't just batch everything through once; every tree still needs to see all of the data, but you can stream it in batches for each tree. And so that makes it a little bit more deep-learning-like, and makes it way more scalable.

And we're super excited to see what people do with that. Maybe just one more thing, on the same theme around scalability. UMAP is something that, as Dan said, we're trying to use to enable people to process the hardest problems, the universe-scale space, even that subsampled space that Dan was talking about of tens of millions or hundreds of millions of records. Running UMAP on that is a challenge on CPUs; on pure performance, you kind of hit the wall, so to speak. With GPU acceleration, we can make that possible and tractable, and one of the things we're working on is continuing to make it even more scalable. So again, this is open source work, but it's super exciting: what if you could do that on hundreds of millions of records? You can actually do that now, today,

if you go get the absolute newest of these releases. So these are two of the things that we're doing right now, really trying to push the envelope in accelerated machine learning.
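
The external-memory training Nick describes for XGBoost 3.0 looks roughly like this sketch: data is streamed shard by shard through a DataIter, and every boosting round still sees all shards. The shard files here are synthetic, and details such as the exact DataIter return convention can differ between versions, so treat this as illustrative.

```python
# Hedged sketch of XGBoost 3.0 external-memory (larger-than-memory)
# training. Synthetic shards are written to disk, then streamed.
import os
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
paths = []
for k in range(4):  # write a few synthetic shards
    X = rng.random((50_000, 32)).astype(np.float32)
    y = X.sum(axis=1) + rng.normal(0, 0.1, 50_000)
    path = f"shard_{k}.npz"
    np.savez(path, X=X, y=y)
    paths.append(path)

class ShardIter(xgb.DataIter):
    def __init__(self, shard_paths):
        self._paths = shard_paths
        self._i = 0
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        if self._i == len(self._paths):
            return False              # end of one pass over the shards
        shard = np.load(self._paths[self._i])
        input_data(data=shard["X"], label=shard["y"])
        self._i += 1
        return True

    def reset(self):
        self._i = 0                   # rewind for the next pass

# Every boosting round streams through all shards, batch by batch.
dtrain = xgb.ExtMemQuantileDMatrix(ShardIter(paths))
booster = xgb.train({"tree_method": "hist", "device": "cuda"}, dtrain,
                    num_boost_round=50)
```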

Richie Cotton: That's very cool. And I think in the last few years, neural networks have kind of stolen the limelight in terms of algorithms for machine learning, and the hype around them. But yeah, if you're working with structured data, then there's a ton of alternatives that often perform even better than neural networks.

Nick Becker: Yeah, and on that same note, it's about finding the right tool for the right problem. We've talked about it a little bit, with Dan mentioning those molecular fingerprints that are essentially like these AI embedding representations.

One of the coolest things about this AI revolution is that there's so much more going on with data processing, AI, and data science, and we're seeing these things coming together, right? The workflow we've been talking about for the past 45 minutes is a quintessential example of something that's only possible with AI, but also only possible with data science and data processing; that whole ecosystem coming together is what makes this new world possible.

And you know, I think the AI revolution is helping, you know, pull all of these things together.

Richie Cotton: Absolutely, yes. In order to solve problems like "how do I find a better electrolyte," you really wanna be concentrating on the materials science. You don't wanna have to worry about the infrastructure so much; it just needs to magically work for you. So on that, Dan, I'd love to know: does it magically work? How did you go about adopting your whole tech setup?

Dan Hannah: Yeah, so when we started down this road, both for the specific example that we've drilled down on here, which is that example of building a two-dimensional visualization of that embedding space, but for a lot of other problems as well, we started with a CPU-focused workflow. And that includes even some of these physical simulations I was talking about.

Right, we use physical simulations to actually make concrete predictions about how a molecule will perform in a battery. Well, originally that was driven by a CPU-centric workflow that we ran in a more traditional high-performance computing environment. So you could imagine there are controller nodes, and then there are compute nodes that are essentially aggressively leveraging CPU multithreading and multiprocessing-type optimizations.

And that ended up being not fast enough to meet our needs, actually, especially when you do some cost normalization. It turns out the physics of what we're calculating is actually very favorable in this case, because it involves an enormous number of matrix transformations and matrix operations that GPUs tend to be very good at.

So when we pivoted to GPUs: yes, if you think about renting a GPU, the cost per hour is higher than a CPU, but it's so much faster that there's a net cost savings in migrating to GPU. So you get your results faster, and you also save money. It's a win-win. So, you know, we started out in this conventional high-performance computing type of environment.

And when we realized that we would like to make a pivot to GPUs, there was this sort of challenge of, okay, well, we need different infrastructure. So in our case, I actually think the infrastructure was the harder part. A lot of the software, not only the software developed by NVIDIA that offers that zero-code-change solution,

but also, we were very fortunate that at least a couple of our physics-based tools have a GPU version of the same library that respects the same software contracts and interfaces. So the software piece was actually a pretty easy pivot. So what we did is we basically figured out what the places are that work for us to spin up and provision some GPU infrastructure.

And we built up a sort of GPU farm, where we're looking at a centralized computing workhorse that multiple people within the company can all log into, and there's a queue management system. So on the software end, this actually looks pretty similar to that high-performance computing environment, where there's a queue manager and you submit jobs and you can monitor them and things of that nature.

And the difference is that in this case, the CPUs are now just serving as controllers for the nodes, and then the nodes are, say, eight GPUs, whether that's an A100, an H100, and so forth. So we built up this sort of GPU cluster, and we worked with a couple of different places, including NVIDIA and their DGX Cloud offering, to do some of this work. You know, once we had the infrastructure in place, the pivot was pretty simple; everybody just changes their accounts. And then, thankfully, most of the software we're using invested a great deal of effort in making it very easy to turn on the GPU switch. And so we just had to do that, and then we were off to the races.

So we at SES really benefited tremendously from the effort that folks, both within NVIDIA and elsewhere in the open source community, put into making this software migration seamless. That was a huge boon to us, and a huge booster of our confidence that we could easily make this transition from a CPU-driven workflow to a GPU-driven one.

Richie Cotton: Okay, so it sounds like it's maybe not quite plug and play, but it's just a case of getting your cloud set up, with different machines in the background. You're still using the same scheduler and most of the same software, so that actually sounds a lot easier than I expected.

Dan Hannah: Yeah, it was great. The experience of a computational scientist at SES changed very little. I'm still submitting jobs to Slurm, I'm still invoking a conda environment and a Python script, and I'm still remoting into some sort of large computing cluster located in some physically separate location.

All of those things remained the same, and so that was great.
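
To make that concrete, here is a minimal sketch of a Slurm batch script requesting a GPU node, exactly the kind of job submission Dan describes. The job name, partition, environment path, and script names are all hypothetical.

```bash
#!/bin/bash
#SBATCH --job-name=molecule-sim    # hypothetical job name
#SBATCH --partition=gpu            # hypothetical GPU partition
#SBATCH --gres=gpu:8               # request all eight GPUs on a node
#SBATCH --time=12:00:00            # wall-clock limit

# Activate a conda environment and run a Python driver script,
# just as in the CPU-era workflow (paths and names are hypothetical).
source ~/miniconda3/etc/profile.d/conda.sh
conda activate chem-sim
python run_simulation.py --input molecules.smi
```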

Richie Cotton: Okay. You made a point before about how you want to check how much of a speedup you're getting, and the relative cost of cloud machines with GPUs versus without, so there are some cost-benefit calculations there. In general, how do you measure the return on investment for all these data science infrastructure choices?

Dan Hannah: Yes, that's a great question. In our case, the thing we were interested in was: all in, how much time does it take me to get, well, pick a number, say a million answers. Say I want to physically simulate the properties of a million molecules.

We ask ourselves, how long does it take to get those million answers, and what did it cost us? Both dimensions are important, because for various reasons we care about timing the market appropriately, so there's a high amount of value in getting things done quickly in our space.

So even if something is more expensive, if it's a lot faster, we make a call on a case-by-case basis about whether that added speed is worth it. In this case it was a no-brainer, because it ended up being both faster and cheaper. What we were looking at was, ultimately, how many molecules do I get per day if I'm cranking through a stack of work, and what does each day cost?

And GPUs won on both axes. Those were the two ways we were evaluating this. One thing that's interesting is that by the time our team started to engage with NVIDIA and Nick's team, we had actually already begun that pivot to a GPU-centric workflow. The adoption was initially driven by speeding up those physics-based simulations.
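
To make that cost-benefit calculation concrete, here is a minimal sketch of the kind of back-of-the-envelope arithmetic Dan describes. All the rates and throughputs are made-up illustrative numbers, not SES figures.

```python
# Hypothetical throughput/cost comparison for simulating a million
# molecules on CPU vs. GPU nodes. Every number here is illustrative.
cpu_rate_per_hour = 2.00       # $/hour for a CPU node (hypothetical)
gpu_rate_per_hour = 8.00       # $/hour for a GPU node (hypothetical)
cpu_molecules_per_hour = 50    # simulated molecules/hour on CPU
gpu_molecules_per_hour = 1000  # simulated molecules/hour on GPU

n_molecules = 1_000_000

for name, rate, throughput in [
    ("CPU", cpu_rate_per_hour, cpu_molecules_per_hour),
    ("GPU", gpu_rate_per_hour, gpu_molecules_per_hour),
]:
    hours = n_molecules / throughput
    cost = hours * rate
    print(f"{name}: {hours / 24:,.0f} days, ${cost:,.0f} total, "
          f"${cost / n_molecules:.4f} per molecule")
```

With numbers like these, the GPU is both roughly 20x faster and 5x cheaper per molecule, which is the "win on both axes" Dan describes.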

And I'll just repeat Nick's shout-out to the ALCHEMI team, also at NVIDIA; they helped us a bunch with moving all of our physics-based simulations over to something powered by CUDA and GPU hardware. So we already had this in place, and then we became aware of the RAPIDS library and said, oh, we can use GPUs for this dimensionality reduction bit as well.

And that's fantastic. But a lot of the original pivot was actually driven by that cost and time analysis related to the physics-based simulations we were doing.
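
For the dimensionality reduction piece, a minimal sketch of GPU-accelerated UMAP with RAPIDS cuML might look like the following. The random array stands in for real molecular embeddings.

```python
# GPU-accelerated dimensionality reduction with RAPIDS cuML.
# The random data below is a stand-in for real molecular embeddings.
import numpy as np
from cuml.manifold import UMAP

embeddings = np.random.rand(100_000, 256).astype(np.float32)

# Project the high-dimensional embedding space down to 2D for plotting.
reducer = UMAP(n_components=2, n_neighbors=15)
coords = reducer.fit_transform(embeddings)
print(coords.shape)  # (100000, 2)
```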

Richie Cotton: Okay, so describing just how much time you saved gives you some justification to pitch your boss on getting some cool hardware. Related to this, I think one of the most common audience questions is: okay, I'm a data scientist.

How do I justify what I'm doing, the impact of my work, to my boss? This is so important that I'm going to get both of you to answer. Nick, do you want to go first? You've not spoken for a while. If you're a data scientist, how do you justify the impact of your work?

Nick Becker: Yeah, I think it's going to vary, of course, depending on what you're doing. But I actually used to be a data scientist, so I've lived this problem. At NVIDIA, obviously, I'm more on the engineering and product side, but I used to be a data scientist. For me, one of the most important things to figure out is: what is the North Star?

Not just for the individual project I'm on, but for our group or even the company. Linking what we're doing to that North Star is always effective, because at the end of the day, we're all working with teams, colleagues, and different groups to figure out how to move our organization forward. If we want to get buy-in for something, explaining, and really building consensus around, how what we're doing is going to move the needle on some of the most important things is often very effective.

So when you're thinking about an individual project, or if you want to go explore something: what are we going to learn from this project, and how does that tie into some of the most important things we're focused on as a company or as a group? The same applies if you're not talking about a project but about some tool or technology you want to try.

What is the possible outcome if we do this? What are we going to gain if we put in a little investment in figuring out whether a new technology can make us 10 times faster at something we're doing every day, or save us 30 or 40 or 50% of the cost of something we're running every day?

Those are the key things to link back to, and I think those would be great places to start.

Richie Cotton: I do like the idea of finding out what the North Star is, what business metrics you're going to be able to impact, and then linking all your work back to that. Wonderful. Dan, do you have anything to add? Have you got any tips for data scientists?

Dan Hannah: Yeah, this answer may be a little focused on my corner of the world, which I would broadly call AI for science, an increasingly popular term, but I think some of these notions do generalize. So I'll share some thoughts.

In my world there are, qualitatively, two ways a data scientist has an impact. One is what I would broadly term optimization: using data-driven approaches to do something that is already possible in theory, but doing it in a way that is much faster, much cheaper, or both.

So you have optimization as one category. Then there's also what I would call net new capabilities: problems you cannot solve, even in the abstract, without heaps of data. I can quickly explain two examples that are relevant to SES.

One is predicting the voltage at which a molecule will break down. Batteries have a voltage, and that voltage changes as you charge and discharge the battery. One of the things you worry about is: how far can I push it before the electrolyte inside the battery starts breaking down?

It turns out that's something you can predict, to some extent, with a physical simulation that is fairly expensive in the scheme of things but reasonably accurate under a certain set of assumptions. So when I talked earlier about surrogate models, one of the things you can do is put in the work to run this calculation a whole bunch of times.

That takes a long time, because these are expensive calculations, but then you can train a model that does it very quickly. In that case you've moved the needle: whereas before you could tear through a set of candidates or a chemical space in weeks, you can now do that in hours, and that definitely changes the way you work.
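
Here is a minimal sketch of that surrogate-model pattern, using a generic scikit-learn regressor. The descriptor and label arrays are random stand-ins for real simulation inputs and outputs, not SES's actual method.

```python
# Surrogate-model pattern: run the expensive physics simulation on a
# modest set of molecules, then train a fast model to stand in for it.
# The arrays here are random stand-ins for real descriptors and labels.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(5_000, 64)  # molecular descriptors (hypothetical)
y = np.random.rand(5_000)      # simulated breakdown voltages (hypothetical)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)
print("held-out R^2:", surrogate.score(X_test, y_test))

# The trained surrogate can now screen millions of candidates in minutes,
# instead of weeks of physics simulations.
candidates = np.random.rand(1_000_000, 64)
predicted_voltages = surrogate.predict(candidates)
```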

So I don't want to downplay optimization. It can be extremely important, particularly at large scale: a 1% cost savings per row on a hundred billion rows is a lot of money saved. There's huge value in speeding up things that are already happening. Now take a totally different example: melting point.

Again, both are physical properties, but melting point, which I referenced earlier, is actually extremely complex, and it would be very challenging to build a comprehensive, purely physics-driven framework that could get you accurate melting point predictions across a broad range of molecules.

But melting point is something you can go into the lab and measure: you heat the material up until it melts, and you write down when it melted. It takes time, but it's possible; it's a very experimentally attainable quantity, even though the physics that underpins accurate prediction of melting points is, from a practical screening standpoint, in some sense hopelessly complicated.

If you have enough data, you can train a machine learning model that can actually do this pretty well. Now that's a net new capability. Before, the only way I could get the melting point was to go into the lab and measure it. But now I've allowed, say, a neural network to work within this incredibly complex network of interactions and discover features about these molecules.

They may not be human-interpretable, but that's the beauty of data-driven approaches: they don't need to be. It discovered some way to find meaning in that numerical space, some set of features that was predictive of melting point. So now you have a net new capability.

So in one case you're doing something faster; in the other, it's something you literally couldn't do before. And I think a lot of data science efforts, spanning from machine learning and AI to things more in the realm of analytics, usually fall into one of these two camps, right?

Even analytics is often about making a business-level decision, but faster. It's about making a decision you could have arrived at with lots of explicit research and examination, but analytics allows you to generate summaries that get you to that point faster.

So as a data scientist, when you're thinking about how to have an impact and how to justify it, it's useful to place your work into one of those two categories; that can help you frame the way you explain the value of what you're doing. And I totally agree with Nick's comment about a North Star.

That's a super important piece of this.

Richie Cotton: Just to finish, I'm always in need of new people to follow. Whose work do you recommend I look into? Nick, do you want to go first?

Nick Becker: Yeah, there's so much good stuff going on right now, in the research space but also in the applied space, and depending on the audience, or the day, or what you're looking for, different angles are appealing.

One thing I would shout out is all of the work going on around bringing new graph analytics into the world of RAPIDS, which we haven't talked about today; maybe that's a topic for a future episode. But the world of graph analytics is part of this data-science-meets-AI world.

And there's exciting work coming out of a line of research from the Leskovec group at Stanford around graph neural networks. This is the group behind PyTorch Geometric, and they're doing the work around graph neural networks.

I think it's an incredibly exciting area with implications for all sorts of things, because we spent a lot of time here talking about molecules and structured data and latent spaces, and it turns out graphs and graph neural networks are ways we can represent and capture the connections between data, concepts, ideas, and entities.

So we haven't talked about it at all today, and I'll leave it there, but I think the work going on with graph neural networks is particularly exciting.
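
For readers curious what a graph neural network looks like in code, here is a minimal sketch with PyTorch Geometric, the library Nick mentions. The toy graph and feature sizes are arbitrary.

```python
# A single graph neural network layer with PyTorch Geometric.
# The three-node graph and feature dimensions are arbitrary toys.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Edges as (source, target) pairs; this is a small undirected path graph.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 8)              # 3 nodes, 8 features each
graph = Data(x=x, edge_index=edge_index)

# One graph convolution aggregates each node's neighborhood features.
conv = GCNConv(in_channels=8, out_channels=16)
node_embeddings = conv(graph.x, graph.edge_index)
print(node_embeddings.shape)       # torch.Size([3, 16])
```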

Richie Cotton: I'm right there with you; it's a very exciting field, I feel. Yeah, it's a whole separate episode.

Nick Becker: Yeah, absolutely.

Richie Cotton: Wonderful. Alright, Dan, who are your recommendations for?

Dan Hannah: I think one of the coolest things here is that advances in machine learning have caught up to a point where they're now meaningfully useful for chemistry and materials science. That's been true for a number of years now, but not necessarily true, say, 20 or 30 years ago.

So what people are coming to realize is the value and importance of data. The thing I'm super excited about is seeing more and more releases of large, chemically diverse, chemically meaningful datasets. So rather than focusing on models or techniques, and there are plenty of exciting developments there,

I want to plug datasets. I'll start with a second of shameless self-promotion: SES has seen so much internal value in the dataset aspect of this, and in data exploration and how it can augment the work of human scientists, that we're actually launching a product called Molecular Universe, intended to facilitate exactly

this kind of chemical exploration. So I'm obviously very excited about that. Beyond that, you can look to places like Lawrence Berkeley National Lab, which runs something called the Materials Project. At SES we're focused on molecules and liquids;

the Materials Project is endeavoring to do something notionally similar but for solid materials. It's essentially a giant catalog of possible solid materials and their properties. Then you can go to Facebook AI Research, which maintains datasets of catalysts: OC20 and OC22.

These are datasets fundamentally related to how molecules, and the reactions they undergo, can be sped up by something called a catalyst. In a lot of cases these are computational datasets, and I think as more of them come out, the types of modeling they enable are going to be transformative.

Even beyond that, what's also really exciting is that you're seeing more and more people package up experimental materials science datasets and release those. For example, I believe Samsung has released a dataset on battery failures; there was a lot of publicity about phones catching on fire and things like that.

Some of these battery failures, as we know all too well, can be quite spectacular. These huge datasets of actual battery cycling, cycling data for phone batteries and things like that, are going to be extremely valuable in developing models that are more holistic. My focus and my team's focus is really on the molecule, but there's an incredibly diverse set of modeling problems out there as you go up to the cell level, the pack level, and the car level, and a lot of those models require data that's even harder to get, because what we've seen in the community time and time again is that, unfortunately, there are very few substitutes for field data. That's running the phone battery in the course of normal operations at realistic temperatures.

It's driving the car around and collecting that data. What we're seeing is that people understand how valuable that is, and thankfully, in many cases, they're making it available. So in addition to the approaches and the methods, for materials science and chemistry specifically, you're seeing both companies and universities releasing incredible datasets that are valuable in their own right, but that also serve as great pre-training datasets to teach a model about something you could then fine-tune for a specific application.

So I'm very excited about where the data's going.
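
For readers who want to explore the datasets Dan mentions, here is a minimal sketch of querying the Materials Project with its mp-api client. You would need your own free API key, and the filter values below are arbitrary examples.

```python
# Query the Materials Project (materialsproject.org) for lithium-containing
# materials using the mp-api client. Requires a free API key; the filters
# below are arbitrary examples.
from mp_api.client import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    docs = mpr.materials.summary.search(
        elements=["Li"],            # must contain lithium
        band_gap=(1.0, 3.0),        # band gap between 1 and 3 eV
        fields=["material_id", "formula_pretty", "band_gap"],
    )

for doc in docs[:5]:
    print(doc.material_id, doc.formula_pretty, doc.band_gap)
```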

Richie Cotton: Yeah, certainly. Lots of great projects there for the audience. Analysis of battery fires around the world, that'd be very fun to do, I think. Excellent. Alright, thank you so much for your time, Dan and Nick.

Nick Becker: Thanks for having us.
