The Past, Present, and Future of the Data Science Notebook

Jodie Burchell discusses notebooks and the challenges facing data science today.

Apr 3, 2023

Guest

Jodie Burchell

Dr. Jodie Burchell is the developer advocate in data science at JetBrains. She completed a PhD in clinical psychology and a postdoc in biostatistics before leaving academia for a data science career. She has worked for 7 years as a data scientist in both Australia and Germany, developing a range of products including recommendation systems, analysis platforms, search engine improvements, and audience profiling. She has held a broad range of responsibilities in her career, doing everything from data analytics to maintaining machine learning solutions in production. She is a long-time content creator in data science, across conference and user group presentations, books, webinars, and posts on both her own and JetBrains' blogs.

Host

Adel Nehme

Key Takeaways

Data Science tools will further evolve to cater to people from both research and engineering backgrounds, and the finer details that accompany the way they approach data science.

Generative AI tools will become more prevalent in the future of data science, automating processes and allowing people to focus more on problem-solving, and less on coding.

To ensure effective communication between data and engineering, data leaders need to help data teams focus on the business problem, while also taking into account the constraints and requirements engineering teams might have when it comes to production.

Key Quotes

In 2023 I think we're gonna be increasingly moving towards the idea of standing on the shoulders of giants, taking these large language models that have already been trained, and fine-tuning them for our own purposes. This leads back into this ethical conversation because we are now taking models that we don't necessarily understand the training data for in some cases, and using them.

I think it's gonna be a year where increasingly we have conversations around things like data ownership, whether the production of the data was ethical, whether the data is biased, and whether these tools should have even been made and released at all.

I'm a researcher. I will always be a researcher. I will never be an engineer. And for people like us, the tools that we need are like, something that reduces the barriers to do the research. So I actually think for people like me, these all in one managed solutions are going to become more prevalent.

Whereas then you've got that part of the ecosystem who are data scientists with more interest in engineering, maybe an engineering background, maybe they are machine learning engineers and they're doing production. They're gonna need engineering tooling. And I think for them you're gonna see increasing integration and support of sophisticated notebook tooling, which then integrates with something that then helps them move to production.

Links From The Show

DataCamp Workspace: An In-Browser Notebook IDE

JetBrains' Datalore

Nick Cave on ChatGPT song lyrics imitating his style

GitHub Copilot

The Past, Present, And Future of The Data Science Notebook

How to Use Jupyter Notebooks: The Ultimate Guide

Transcript

Adel Nehme: Hello, welcome to DataFramed. I'm Adel, Data Evangelist and Educator at DataCamp, and DataFramed is a weekly podcast in which we explore how individuals and organizations can succeed with data. Today we're speaking to Jodie Burchell, developer advocate at JetBrains. If you've been in the data space, even if it's been for a short while, you probably know that Jupyter notebooks are a pretty important aspect of the data science ecosystem.

The concept of literate programming, or the idea of programming in a document, was first introduced in 1984 by Donald Knuth, and notebooks are now the de facto tool for doing data science work. Heck, even DataCamp has DataCamp Workspace, which is a cloud-based managed notebook environment aimed at users of all types to bridge the gap from learning to application and help teams collaborate at scale.

JetBrains has Datalore, a notebook-based IDE aimed at data teams. So I wanted to speak to Jodie about the state of the Jupyter Notebook today, its past, and more importantly, where it's going.

Throughout the episode, we speak about the notebook landscape today for data professionals and the challenges teams have that led to the innovations we see in the tooling space today, what large language models and ChatGPT mean for notebooks and the data profession in general, the nature of tooling fragmentation in the data space, how you become a developer advocate, and a lot more. If you enjoyed this episode, make sure to subscribe to the show, and now on to today's episode. Jodie, it's great to have you on the show.

Jodie Burchell: Yeah, I'm super excited to be here, so thanks so much for having me.

Adel Nehme: Thank you so much for coming on. I'm excited to speak with you about the state of the data science notebook, your work at JetBrains as a developer advocate, the tooling needs of modern data teams, and more. But before that, can you please give us a bit of background about yourself and what got you into this space?

Jodie Burchell: Yeah. So I have maybe in some ways a kind of typical background, like I'm an ex-academic, but I didn't do a PhD in physics or machine learning. My PhD was actually in clinical psychology. So the way that I kind of went from psychology to data science is, in Australia, the way that we teach psychology is as a behavioral science.

So you have to do like a ton of statistics and research methodology units, and most people hate them. I loved them. So I basically completely fell in love with statistics, which is not really something you hear that often. And when I finished my PhD I took on a postdoc in biostatistics, and then when I realized, you know, I didn't want to be an academic, data science was actually just heating up.

So I had really good timing and, yeah, basically started my data science career. I've been doing this for about seven years and, yeah, in a few different areas. Mostly natural language processing, but I've worked across a whole bunch of industries, so it's been amazing, very interesting. And yeah, the latest move has been to developer advocacy.

Adel Nehme: That's really great, and we're definitely gonna unpack what it means to work in developer advocacy. You know, I work in developer advocacy in a lot of ways as well, so it's really nice to meet someone who's in the same boat as I am. So let's talk about that at the end of the episode.

But first I wanna get into the meat and bones of today's episode and talk about really the state of the data science notebook. I'd love to set the stage for today's conversation by really trying to understand where the data science notebook is today. In a lot of ways, the notebook interface has been around for decades. The concept of literate programming, or the idea of programming in a document, was first introduced in 1984, I think, by Donald Knuth. A lot of innovation has happened since, right? The Jupyter Notebook was released in the early 2010s, and today we see an explosion of notebook-based IDEs and cloud collaborative notebooks.

So I'd love to first understand from you, Jodie, how have you seen the Jupyter Notebook ecosystem evolve over the past years?

Jodie Burchell: Yeah, and it's actually a bit scary to say, I do remember when Jupyter Notebooks came out. I actually started my data science journey in R, which I guess is unsurprising cuz my background was so statistical. So I got my first taste of this sort of literate programming environment with R Markdown scripts, and they blew my mind.

Like, they really just felt so much more intuitive and cleaner to me for the research process than trying to use straight scripts, which always just felt like a compromise for me. So I think this is probably a pretty common feeling that a lot of academics or ex-researchers would have when they start using Jupyter Notebooks or R Markdown. Notebooks used to be basic, like really basic.

And it's really cool, I think, to have followed along the journey and seen how much richer the notebook ecosystem has gotten. Like, I think especially in the last, I would say, three to five years, things have really changed a lot.

I think really the biggest change I've seen, and it was a much, much needed change, was the move of notebooks from local to remote. So we've seen kind of an explosion of managed notebooks. We've seen JupyterLab, of course, which has been like a real revolution. But a lot of companies, including my company, JetBrains, have their own solution.

And of course, DataCamp has their own solution. Some of the most famous ones are AWS SageMaker, we've got the GCP and Azure managed notebooks, we've got Colab of course, and we've got Datalore, our product at JetBrains. We've got the ability of IDEs to now connect to remote servers really easily. So PyCharm and DataSpell, again JetBrains products, can do that.

And then of course, DataCamp has their own Workspace, so it's a lot. And I think this move to remote has really opened up a lot of possibilities for seamlessly developing more complex models straight from the notebook environment. You don't need to do anything special to run your script on, like, an EC2 instance or something, like we used to have to in the old days.

Yeah, and I think also, when you move to remote, it makes accessing increasingly bigger data easier. So if you basically connect to databases that are stored on AWS, and your notebook is hosted on AWS, it's much easier. I think another interesting thing is that notebooks have really become a direct launchpad into production.

This is a new area, but one of the first ones I saw was AWS SageMaker prediction endpoints. It's an interesting area. I'm not sure I'm gonna say I entirely agree with notebooks to production, but it's an interesting idea.

And then, yeah, I think it's even basic stuff that a lot of software engineers would take for granted that we never had in notebooks. Things like code completion, introspection, or debugging. So it's amazing what a notebook is now compared to when I started using them quite a long time ago.

Adel Nehme: That's really exciting, and there's a couple of things I wanna harp on here from what you said. You talked about when you saw the introduction of the IPython Notebook. I wasn't in the data space at the time. I'd love to understand how the rate of adoption was at the time, because what you mentioned here is that the magic of notebooks is, in a lot of ways, in the interface.

And this reminds me now of ChatGPT: the chat interface to a large language model is the reason why it was adopted so quickly. How did the notebook interface contribute to the rate of adoption of IPython at the time of release, from your memory?

Jodie Burchell: Yeah, so I can really only speak anecdotally, of course. But this was during my postdoc, so I remember a friend of mine sending me this and saying, hey, have you seen these IPython notebooks? They look super cool. And I'm like, oh, yeah, they look similar to what I'm using in R. And I do remember, like, I had friends who were particularly working in biomedical engineering and things like that.

They really, like, took to the notebooks, and it became this word-of-mouth thing where people were like, oh my God, have you seen this amazing new format? So yeah, it was, I think, a lot of positive feelings and adoption among the Python users that I knew.

Adel Nehme: Yeah, definitely the interface plays a big role. And you mentioned as well how notebooks have made it much easier to work on data science projects and to get to models much faster. One thing that we've seen as well, with the collaborative notebooks and cloud-based notebooks, is that the rate of adoption, and of going from learning data science to doing it, has also been much higher.

Like, the ability to just open up a Colab or any kind of cloud notebook environment and get in is really wonderful in the education sense as well. Because I was writing an article recently on how to install Python, and I'm still surprised at how relatively difficult it is to manage different Python installations and versions on your machine.

And if you're a total beginner, that's also a pretty difficult endeavor, right? Like, it will bog you down in your learning journey.

Jodie Burchell: Yeah, I think this is actually one of the most powerful things about these managed environments. So I have like a real, maybe, passion for reproducibility, probably coming from a psychology background where that was quite a problem. So it's a really exciting thing when you have a managed notebook.

And again, Datalore is the one that I'm most familiar with. It's this thing where you basically have a one-to-one relationship between a notebook and an environment, and because it's containerized (it's a Docker container under the hood), basically everything is reproducible. You start up your server again and all of the same environmental dependencies get reinstalled.

And for beginners, this is so much more approachable because they can get straight to all the fun stuff and they don't have to deal with all this overhead, which is not so fun.

Adel Nehme: Yeah, agreed. So in a lot of ways, the innovations that we talked about in the notebook space arise from the challenges encountered by data teams over the past few years. What do you think are some of these tooling challenges that data teams have faced that led to the rise of the modern IDE and the modern cloud collaborative notebook?

Jodie Burchell: For me, honestly, one of the biggest challenges, and I really saw this when I was going through my data science journey, is access to resources. As the size and complexity of models grows, especially with the increasing use of deep learning compared to just classic machine learning, you can't reasonably fine-tune something like BERT or GPT on your local machine.

It's just not gonna happen. So that, I think, is one major challenge that notebooks really needed to rise to. I also alluded to this, but data has grown as well. Like, the volume of data that we're expected to deal with is enormous. And I think increasingly data scientists have had to move away from the CSV on my local machine to working with these huge remote data sources, and databases are obviously a huge one.

So having good database tooling, and the ability to actually see what's in your database and even plan queries and things like that, can actually be a really important thing to add into your Jupyter Notebook. Spark, too. So we've actually seen a new kernel come out, with Sparkmagic notebooks.

I love them. I've used Spark a bit, and I can say setting up a Spark session yourself is not very fun. So having the notebook take care of it for you is wonderful. And then some other things are collaboration, because we're moving to a remote-first world. I think a lot of tech companies are seeing that a lot of people do not wanna come into the office all the time.

But then that means, if you've got a team of data scientists, you want 'em to work together on the same project, they're gonna need to collaborate. So, there's been changes to notebooks to deal with that. And then, yeah, finally the reproducibility thing that I already talked about. So plenty of changes.

Adel Nehme: Yeah, and this definitely speaks volumes about how fast this space is evolving as well. I think one major thing that we've seen the notebook play a role in, and really provide value for organizations on, which is a challenge that organizations have faced for a while, is the need for democratized insights.

Within DataCamp, notebooks, Workspace notebooks especially, have become more the de facto way to read data insights within the organization. Because people who have business analyst skills, for example, are like SQL-based people, right? They can use a SQL node, like SQL cells.

Visualizations are easily surfaceable and that report interface makes it easy to gain access to insights. Walk us through how the notebook as well has evolved to accommodate that need of democratizing data insights and being able to publish reports within the organization.

Jodie Burchell: This is such an interesting topic. So, in my past roles I've worked really closely with non-technical teams, sales or marketing. And it's a funny thing that non-technical people see code and their brains just instantly go, this is not for me. Like, it's not that they couldn't understand it, it's just that it's not part of the usual workflow.

So what I've actually found is you cannot present just a notebook to people in these roles. It's fine for people who are used to code, but people who are not do not really welcome it. So a nice thing that I've seen coming up with notebooks is changes that allow you to switch between, like, this development and presentation mode.

Like, a really basic one is in Jupyter you can just hide cells. So you can just hide the inputs and you can just show them the nice graph. And I think there's also been an increasing move towards pivoting reporting off notebooks. So again, I'm gonna spruik Datalore. I'm probably gonna be talking about my products a lot because I get very excited about them.

But Datalore has this ability to build, like, actual dashboards off notebooks. And for me, this is cool as well because one of the big challenges you have is, okay, you've done your analysis, you spent a lot of time on it. You just need some feedback from someone in the business. You don't wanna then spend like a day making a PowerPoint slide deck, presenting it to them, then going back to your analysis.

So it's much quicker being able to pivot straight off the notebook.
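
The "hide the inputs, show the graph" idea mentioned above can be sketched in a few lines. A notebook file is JSON, and the Jupyter notebook format lets each cell carry a `jupyter.source_hidden` flag in its metadata. The sketch below uses only the standard library; the notebook content and the function name `hide_code_inputs` are made up for illustration, and whether a given front end collapses the input depends on that front end honoring the flag.

```python
import copy
import json


def hide_code_inputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with every code cell's input marked hidden."""
    result = copy.deepcopy(notebook)
    for cell in result.get("cells", []):
        if cell.get("cell_type") == "code":
            # Display hints for a cell live under cell["metadata"]["jupyter"].
            jupyter_meta = cell.setdefault("metadata", {}).setdefault("jupyter", {})
            jupyter_meta["source_hidden"] = True
    return result


# A tiny hand-written notebook (hypothetical content) to demonstrate.
example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Retention report"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["plot_retention(df)"],
        },
    ],
}

presentation_notebook = hide_code_inputs(example_notebook)
print(json.dumps(presentation_notebook["cells"][1]["metadata"]))
```

The original notebook is left untouched, so the same file can serve as the development copy while the returned dict is saved as the presentation copy.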

Adel Nehme: Yeah, I completely agree. Like, GUI interfaces within the notebook as well, to be able to provide democratized insights within the organization, are so crucial. And it's such an interesting space and aspect within the modern notebook environment, because it does signal to a certain extent for me that there will be some portion of the Excel crowd, like the highly technical Excel crowd, that will maybe adopt notebooks down the line, especially as that GUI interface becomes stronger.

So, just switching here from aspects of the modern notebook to kind of the data tooling space more generally. Something that I've been following pretty closely is how the data tooling space over the past few years has become extremely fragmented, right? We've seen over the past few years the rise of new categories of tools and the rise of MLOps tooling. There seems to be extremely limited consensus within the data industry about what constitutes a de facto data stack.

So I would love to hear your thoughts on the state of fragmentation and data tooling today. Where do you think we'll be in a few years in comparison to today?

Jodie Burchell: Yeah, so this is actually something I thought about from a different angle. I thought about this a lot. So I came into data science from a non-engineering background, and for me, engineering tools are still not my tools because they don't work with the way that I think. And I feel like this fragmentation that you've observed, and that I've seen as well, is actually based on a fragmentation of the field.

So I think data science 10 years ago was a bunch of academic refugees, and they were all just working like research scientists, back when they were academics. And now we're seeing a lot more diversity in the background of data scientists, and you're seeing a lot more people with engineering backgrounds.

Like, it could be up to 50% based on surveys I've seen. So I think that means that the tooling needs and the tooling approaches, and even the place where they sit in the data science workflow, will differ. So you're always gonna have people like me who are, like, dyed-in-the-wool data scientists, like scientists.

I'm a researcher. I will always be a researcher. I will never be an engineer. And for people like us, the tools that we need are, like, something that reduces the barriers to do the research. So I actually think for people like me, these all-in-one managed solutions are going to become more prevalent.

Whereas then you've got that part of the ecosystem who are data scientists with more interest in engineering, maybe an engineering background, maybe they are machine learning engineers and they're doing production. They're gonna need engineering tooling. And I think for them you're gonna see increasing integration and support of sophisticated notebook tooling, which then integrates with something that then helps them move to production scripts and do all of the work around that.

So I don't know that we're gonna see consolidation. I think we're gonna see further siloing, if I'm being perfectly honest.

Adel Nehme: That's a very interesting insight. I never thought about it this way, so I'd love to deep dive into it a bit more. Do you think that, to a certain extent, there will be two different types of data teams, one predominantly made up of your type of data scientist, with tooling that fits your workflow, and then other data teams that are engineering-based? Or do you think that there will be a consolidation, at least in terms of the type of profiles on one data team, but siloed data stacks within the same organization?

Jodie Burchell: I'm not gonna say I have all the answers, but I've seen quite a lot of data teams. I've worked in quite a lot of different configurations, so everything from being in a siloed data science team to being the lone data scientist in a team of engineers. So basically, I think the conclusion that I've come to (and I've always worked in, like, medium to large institutions) is that once you start getting a certain level of complexity with your projects, it's impossible to have one person doing everything from research to not just productionizing it, but maintaining it in production, because that's where you start getting a lot of debt.

So in my mind, what actually makes the most sense in terms of ease of hiring, and also, I think, the happiness of the teams, is to hire specific data science teams who do your prototyping and your research, and hire engineers who will be responsible for your data pipelines and also your production, but have them work together on a project basis. I think they're called squads in some places, however you like it.

And you need to have like constant communication. You can't just go and plan a data science project without consulting engineering, cuz you have constraints, you have technical constraints about language and about latency and things like that. But I think honestly, the most sustainable model that makes everyone the happiest is this sort of team-based approach based on skills and expertise.

But then project-based, based on the actual thing you're trying to do.

Adel Nehme: That's great. And you mentioned communication here. What do you think constitutes effective communication between engineering teams and data scientists?

Jodie Burchell: Yeah, so in my kind of view, the data scientist, and it's probably gonna be one of the more senior data scientists on the team, needs to take in some ways a project management perspective. So they need to be a bit of a translator between the business requirements and the technical requirements. They won't do all of it, obviously.

Like, I can't speak for engineers. I don't have that expertise. But you need to be like, okay, you want to, I don't know, increase your customer retention by 30%. We need to work out if that's feasible from a business perspective, but assuming that's true, here is the approach that I would take.

Now let's go and talk to engineering. And let's get a list of all of the constraints they have, because that's gonna affect the model I can build that's gonna affect the data I have available. It's gonna affect the potential feasibility of the project that I thought, from a business perspective is feasible.

So at the beginning you have this back-and-forth communication, and you need to keep revisiting it because research projects evolve and you get more information. So you need to keep in constant communication with the business: oh, you know, this is gonna take three months longer than I thought. Do you wanna keep going?

And you need to keep in touch with engineering and be like, okay, this is the best I could come up with in terms of the predictions from this model. Is this fast enough? Things like that.

Adel Nehme: I love the level of depth that you go into here. We talked about the current state of the data science notebook, and we talked about the fragmentation of tooling. I'd also love to get from you where you think the data science notebook and the IDE are headed. So I'd love to understand, what do you think the notebook is going to look like in the future?

What do you think are standard features of notebooks and IDEs in the future?

Jodie Burchell: Yeah, so I think definitely remote-first is probably gonna be a standard, and I'm seeing this increasingly. Although I have seen the specs on those latest Macs, they have something like 96 gigs of RAM on a local machine. Like, it's insane. We were joking at work that we were like, oh, maybe we don't need remote after all.

I have my server in my office. I think also, look, I can't say whether it's gonna happen in terms of the culture of coding in data science, whether we're gonna move towards better coding standards. There's been a lot of discussion about this, but I don't know that data scientists see it as a priority.

Developers certainly do, and I think in any kind of project where you have data scientists that need to hand code over to engineers, definitely. I can't say if it's gonna be there, but it would be useful having more of these tools that help data scientists write better, more maintainable code. Even things like introspection.

Do you really want to write this Python function without declaring the type? Because it's not gonna be as readable. You need to put in a docstring, things like this. And then I think this conversation about the role of the notebook in ML workflows is gonna evolve, and I think we're gonna see more tooling around this.

So I think, potentially, having reproducible environments from the beginning for the notebook. So at least if you hand over to engineering, they know what your dependencies were, they know the Python version, things like that. Assistance to connect to remote resources.

And even integrations that make deployment easier. Things like AWS Toolkit give you a lot of extra benefits or extra tools that allow you to go from messing around in research to actually productionizing things. So I think these are the kind of big challenges coming.
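
As a concrete picture of the "declare the type, put in a docstring" point above, here is a minimal sketch of what a hand-over-ready function might look like. The function name `retention_rate` and its parameters are hypothetical and chosen only for illustration; they do not come from the conversation.

```python
def retention_rate(customers_at_start: int, customers_retained: int) -> float:
    """Return the share of starting customers who were retained.

    Args:
        customers_at_start: Number of customers at the start of the period.
        customers_retained: How many of those customers are still active at the end.

    Raises:
        ValueError: If the counts are negative, zero at the start, or inconsistent.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_retained <= customers_at_start:
        raise ValueError("customers_retained must be between 0 and customers_at_start")
    return customers_retained / customers_at_start


print(retention_rate(200, 150))
```

The type hints and docstring do not change what the function computes. They give an engineer, or an IDE's introspection tooling, enough information to use the function without reading its body.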

Adel Nehme: And you mentioned here the importance of code quality, debugging, introspection, like having maintainable code. One thing that we discussed behind the scenes was ChatGPT and the rise of large language models. Over the past few years we've seen autocomplete code tools like GitHub Copilot and, I think, OpenAI Codex.

How big of a change do you think these tools will introduce to the coding workflow? Do you think they'll be able to improve code quality and improve the workflow of data scientists? And what do you think this will mean? This is a big question, right? What do you think this will mean for data science in the next few years?

Jodie Burchell: The opinion I guess I've always had, and it's really only been strengthened over the years, is that the job of a data scientist or the job of a developer is not to code. It is to solve problems. And knowing what I know about large language models, really they're very sophisticated probability machines.

Like, they generate the likely next word in the sequence based on amazing amounts of data that they've seen. So while they may approach, I guess, human-looking reasoning, they can't problem solve. Not in the same way that we can. So I don't wanna speak for the next three years of large language models.

Like, I would say, even having been in NLP for a long time, 2022 knocked me on my butt. Like, I just did not expect what we saw. But you know, what I think is we've had IDEs for a long time. We've had tools like code completion and introspection for a long time. I think Copilot, it's a leap forward, but I still think Copilot and ChatGPT are extensions of this.

They will help us write better code, they'll help us understand our code better. They'll help us explain regexes, which frankly is one of the best things to come out of ChatGPT. But I don't think they're gonna, you know, replace all these things, even those things I was talking about, like you communicate with your business stakeholders, you manage the project. This all still needs to be done.

So yeah, I think it's just gonna make the coding part more efficient. And it'll free up time to maybe make projects more efficient, so that we don't spend as much time writing the code and we spend more time thinking creatively about what to do next.
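
To make the "probability machine" description above concrete, here is a deliberately tiny sketch of next-word prediction. It counts which word follows which in a made-up corpus and turns the counts into probabilities. This is a toy bigram model chosen for illustration: real large language models use neural networks over tokens and far more context, and the corpus and function names here are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict


def train_bigram_counts(text: str) -> Dict[str, Counter]:
    """Count, for each word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return dict(counts)


def next_word_probabilities(counts: Dict[str, Counter], word: str) -> Dict[str, float]:
    """Turn the follower counts for a word into probabilities (empty if unseen)."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {follower: count / total for follower, count in followers.items()}


# Made-up training text, purely for demonstration.
corpus = "the notebook runs the model and the notebook shows the plot"
counts = train_bigram_counts(corpus)
probabilities = next_word_probabilities(counts, "the")
most_likely = max(probabilities, key=probabilities.get)
print(probabilities, most_likely)
```

The model only reproduces patterns it has counted. Ask it about a word it has never seen and it has nothing to offer, which is a small-scale version of the distinction drawn above between generating likely text and solving a problem.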

Adel Nehme: So I completely agree with you, you know, inspired by ChatGPT and kind of what large language models have been able to do in the past two years. I do think they are only probability machines. I think there's a lot of people who are sounding the hype bells around, this is gonna replace problem solving.

I don't think it's gonna replace problem solving anytime soon given its current configuration. You never know what happens in the next few years. It's been an interesting space to follow over the past couple of years. I do think that it will be able to raise the standards for your average data scientist in terms of the coding workflow. It will be able to help them bridge the gap between data science and engineering in a lot of ways, as you mentioned, on code quality, best practices, all of these things. I think the only thing that the rise of large language models reinforces is that maybe coding prowess is not going to be as useful in the future as problem solving and conceptual knowledge about why you're making a given choice.

I'll give you an example. We were testing out this machine learning workflow with ChatGPT. You need to tell ChatGPT, I need a logistic regression for this problem. You need to choose what evaluation metric you need and tell it to code that. All of this is conceptual knowledge that requires you to really understand the space, to really understand why you're doing what you're doing, and to have solid business acumen, for example, when approaching a problem.

So that's where I agree with you quite fully here.
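
A small sketch of why the choice of evaluation metric in the example above takes conceptual knowledge. The labels below are made up: ten customers, one of whom churned, scored against a "model" that always predicts no churn. The helper functions are plain-Python illustrations and not any particular library's API.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the predictions caught."""
    actual_positives = sum(t == positive for t in y_true)
    if actual_positives == 0:
        return 0.0
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    return true_positives / actual_positives


# Hypothetical data: 1 means the customer churned, 0 means they stayed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # a "model" that never predicts churn

print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Accuracy comes out at 0.9 while recall is 0.0, so the same predictions look strong or useless depending on the metric. Deciding which one matters for the business problem is the part the person in the driver's seat still has to supply.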

Jodie Burchell: That's such a nice example as well, because I think it really shows that, like, if we go back to that idea of the business constraints, you may have reasons for choosing that logistic regression over something more complex. So there's layers of decision making there that, again, these models cannot replace.

So yeah, I really like that example.

Adel Nehme: So in a lot of ways, it's essentially an assistant that helps you become better and faster at what you do, but you're still in the driver's seat, right? And I think that we're gonna see an evolution of the data science role in that regard, but the conceptual knowledge and the ability to make educated decisions around what will be useful for the business will only become more important in the future.

So the second part of my question is, what do you think is gonna happen in the next few years when it comes to the data science workflow, especially with large language models? I know this is a big fuzzy area at the moment. We never know what's gonna happen in six months, let alone five years, in the NLP space, but I'd love to hear some of your predictions here.

Jodie Burchell: So I think my prediction is we are really gonna start automating a lot of this boring boilerplate work. So potentially what you're gonna see is a narrowing of the gap, as you said, between the engineering part and the data science part. It may even be a lot of this stuff that I've been talking about. This is really me just making wild guesses, but why not?

A lot of the code or workflows that are needed to take your notebooks into production code could potentially be suggested by these models. Something I know a lot of data scientists struggle with is, how do you structure a project?

Because we don't think in projects, we think in notebooks. So potentially these models could say, "Okay, I would suggest dividing up this code into these particular files. What I can do is reformat the code and put it in there," and then potentially bring in more robust coding structures, like moving away from just functions to classes where it's appropriate, reusing code, and things like that.

So it would be super cool if that was automated, because it would be amazing to hand that over to the engineers. One problem, though, is that you still need ownership over this code. I don't mean this in an ethical sense, which I do want to touch on, but more in the sense that when something's put into production, it's not a dormant thing, it's a living thing.

And when systems go down, someone needs to be responsible for it. So if you have a machine, like an algorithm, writing the code that is then your production code, who owns it? Who understands it? Who's responsible when it breaks? So yeah, I don't know whether, even if my little fantasy comes true, it would entirely solve the problem it's designed to.
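
To make the refactoring idea in this answer concrete, here is a minimal sketch of moving from loose notebook cells to a small reusable class. The class name `ZScoreScaler` and the sample numbers are hypothetical, invented for this illustration; it is not output from any particular tool or model.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Notebook style (shown as comments): loose steps that only work if the
# cells are run in the right order and the globals still hold the right data.
#
#   mu = mean(values)
#   sigma = pstdev(values)
#   scaled = [(v - mu) / sigma for v in values]


@dataclass
class ZScoreScaler:
    """Hypothetical reusable scaler: learn statistics once, apply them anywhere."""

    mean_: float = 0.0
    std_: float = 1.0

    def fit(self, values):
        """Learn the mean and standard deviation from training values."""
        self.mean_ = mean(values)
        # Fall back to 1.0 so constant input does not cause division by zero.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        """Scale values using the statistics learned in fit()."""
        return [(v - self.mean_) / self.std_ for v in values]


# Made-up numbers for illustration.
train = [2.0, 4.0, 6.0, 8.0]
scaler = ZScoreScaler().fit(train)

scaled_train = scaler.transform(train)
scaled_new = scaler.transform([5.0])  # new data reuses the training statistics

print(scaled_train)
print(scaled_new)
```

Keeping the learned statistics on the object is what lets the same transformation be applied to new data in production, and it gives whoever owns the code one obvious place to look when something breaks.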

Adel Nehme: That's a great point that you mentioned at the end there. Let's say we do scale the use of large language models in coding workflows. I do anticipate there needs to be some form of human-in-the-loop interaction that is doing the QA. But I anticipate that for large code bases that won't necessarily be very scalable, and it would require quite a re-imagination of what the coding workflow looks like.

And you mentioned the ethical thing here. Do you wanna maybe talk about also the ownership of the training data, for example, and what that means when code is being put into production?

Jodie Burchell: Yeah, so I think everyone would be familiar with the controversy around where the data to fine-tune GPT-3 for Codex, which then became Copilot, came from. It's very interesting for me to look at these sorts of things, because my background, again, was health sciences.

We needed to explicitly ask consent for every bit of data we used. And it's then crazy to see companies using all of this data and then maybe making money from it when they didn't ask for that permission. I think it's more complex than saying they shouldn't do it.

But this data doesn't come for free. It has been created by someone. It's the same for the language generation models, and I know Stability AI are going to let people opt out for Stable Diffusion 3. But I think it really needs to be thought about, not just in terms of the ethics of where you get the data, but also who owns the.

Adel Nehme: Yeah. It's an entirely fuzzy area of machine learning that I didn't anticipate was gonna be such a massive issue.

Think about it just in the arts, right? There are a couple of examples that I loved. One is Nick Cave, the Australian singer. Someone created a song in the style of Nick Cave, and they asked him, "What do you think of this song?" Nick Cave's lyrics come from a lot of emotional trauma and pain that he's been through.

You can check out his life story. And his response was pretty telling. He was like, this is essentially meaningless, right? Even if it does sound relatively well calibrated, if it's enough to confuse people, is this a Nick Cave song? Who owns this song? Because the original was in the style of Nick Cave. It's similar here: if you create an image in the style of a living artist or a living photographer today, who owns the image? Because the training data is owned by the artist. So you can take that example to code bases as well.

A lot of code bases are open source, for example, stuff that you find on GitHub. I'm not an expert in the legality of being able to take that data and translate it into outputs. But it's a big legal issue that the industry's gonna have to come to terms with in the next few years for it to become a very productive area.

Jodie Burchell: And I think maybe when we were scraping Wikipedia or Google Books for training data back in the early days of word embeddings, no one really anticipated this.

Adel Nehme: It's an interesting area. Very excited to see what happens in this space. Now, Jodie, before we close out, and we talked about it at the beginning, I'd love to talk to you about your experience being a developer advocate. What does it take to become a developer advocate?

In a lot of ways we share a similar role, so it's always nice to talk to fellow evangelists and advocates. Walk us through the process of having become a developer advocate, and what does that.

Jodie Burchell: Let's just say I fell into advocacy. I vaguely knew about it because my husband is a developer. He used to go to a lot of conferences, so we knew a lot of advocates. And a good friend of mine is an advocate, which is how I got into it, but I'll talk about that in a second. I'd been working as a data scientist for quite some time.

I think I had six years under my belt by that point, and I really liked it. But I used to be a psychologist, and in some ways I missed it. I really like the teaching aspect and the helping people. It sounds a bit woo-woo, but I like being able to make people who feel maybe a bit unwelcome in this space feel really welcomed and like they can do it.

So I actually ended up just talking to my friend about this role, and when it came open she was like, "Hey, you should totally apply." And I was like, "I don't know, I'm not sure if it's for me." And then, you know, she sold me on it. She told me that it's a lot of that stuff that you like: you get to teach people, you get to mentor people, you get to help them.

And when you create materials, it's explicitly to teach, not to sell the products. So yeah, that's how I got in. Did you want me to maybe explain what advocacy is, or maybe you could explain?

Adel Nehme: From my experience, advocacy differs from organization to organization and role to role. So maybe walk us through, in your experience, what advocacy entails in your role.

Jodie Burchell: At JetBrains, they're very strict when they hire advocates that we're not there just to sell the products. You will not get the job if you do not genuinely believe in the products. So that's a good start. So, because I really believe that these products can help people be better data scientists, my job is really just to help show people how they can be better data scientists by using these products.

But then also, the products aren't perfect. So I can take feedback from the community and be like, "Hey, if we do this, it's gonna make the product better." Or, in my own opinion, I can say, "I think this feature is really lacking," or "this feature's too complicated."

There's also all the materials around that. So the content creation, if you wanna call it that, is a secondary kind of thing. The primary goal is always to help people be better data scientists. And it's super nice, because I think when I started, it was overwhelming. I was scared of the command line. And so I really want to reach people who maybe feel like, "Oh God, I really like this stuff, but I just don't think I can do it."

I wanna maybe be a friendly face and say, "Hey, I'm not gonna be able to teach you everything, but I can support."

Adel Nehme: Yeah, that's really great. Contrasting it with my own experience of evangelism at DataCamp, and there's Richie as well, who co-hosts the podcast and is also an evangelist at DataCamp, I would say it's really predominantly about sharing the insights of the data community with the wider DataCamp audience, right?

Like, we also don't necessarily sit there trying to sell the product. It's just about being the interface between the community and the organization. So that's been a humongous privilege for me. I think my role is a bit more creation focused. We run our webinar program and our live training program, and we really think about how we can make as many magical experiences for the community as possible. Because in a lot of ways, our content team as well, the ones who interface with the wider community and create courses, also act as that interface between the learner and the rest of the organization.

So in a way DataCamp's position is a bit different. I get to be really creative and focus on the creation side of things, which is why I told you it's different from company to company. I talk to a lot of people in this space. Some people are more in the professional services and consulting space. They talk to customers and help them adopt the product better. So it's really different from one organization to another.

So it's interesting to see what it looks like at JetBrains.

Jodie Burchell: Actually, can I quickly tell you a very cute story about DataCamp?

Adel Nehme: Yep.

Jodie Burchell: So I really genuinely love DataCamp courses. I think they're such a nice starting place for beginners. I have a friend who's been trying to learn Python for data analysis for two years, and she kept getting distracted.

She went off and started watching this CS fundamentals course, and I'm like, "No, no, no, you don't need that." And then she found DataCamp, and she's so happy because she finally feels like she's making progress, learning what she needs to, and she's really excited about it. So yeah, now she's learning pandas and she's really enjoying it.

Adel Nehme: That warms my heart. Whenever we're reminded of these learner stories, it reminds us why we got into this space in the beginning. I did a data science master's. I wasn't very passionate about necessarily just sitting there coding.

I was really excited about the education side of things, so it was a perfect match for me to stay at DataCamp. But yeah, we see stories like the one you mentioned on a daily basis on LinkedIn, where people tag you for your courses, stuff like that. It's really heartwarming. And it's great that your friend found value. One of the main reasons why I liked DataCamp from the get-go as a learner, before I joined the company, is that there was no analysis paralysis when I got onto the platform. If you go to any other education technology platform you have, like, Introduction to Python and you get 10,000 results, whereas I just want to know where to get started.

Jodie Burchell: Yes, you have the tracks.

Adel Nehme: Hundred percent. So what would be your advice for someone looking to get into this type of job?

Jodie Burchell: Yeah, so I think honestly you need to have a particular type of personality. Adel and I have already alluded to it, but if you're the kind of person who wants to come into work every day and know exactly what you're gonna be doing for the next two weeks, and have people just leave you alone with a project that you know very deeply, there is another job for you.

And obviously you need to be people oriented. I think if you're a curious person who likes to learn continuously, you like reaching out to others, and, I dunno, you like doing new things all the time, I think it could be a really cool role. And you just need to bring a lot of passion to it as well.

Adel Nehme: I completely agree. The ability to juggle different projects at the same time, being able to talk to customers, create content, being a people person, loving to speak to people, is very important. I think one of my main weaknesses, for example, is I'm not very present on social media. It's because I'm always spending time on podcasts, talking to people, doing webinars, and stuff like that. But yeah, this is something that I think is par for the course when it comes to the evangelism role. Finally, Jodie, as we wrap up today's episode, do you have any predictions for data science in 2023?

Jodie Burchell: I have a few. I think we've already alluded to it, but models got big, and they got really big last year. I think last year was probably the point where we realized that most of the cutting-edge models we're getting value from cannot be trained by individuals or even small companies.

So I think we're gonna be increasingly moving towards the idea of standing on the shoulders of giants, taking these large language models that have already been trained, or these image generation models or whatever, and fine-tuning them for our own purposes. This leads back into this ethical conversation, because we are now taking models that we don't necessarily understand the training data for, in some cases, and using them.

I think it's gonna be a year where increasingly we have conversations around things like data ownership, whether the production of the data was ethical, whether the data is biased, and, you know, with some of these models, should they even have been made and released at all? Galactica was a very notable one last year.

I think these conversations are gonna heat up more this year, and I hope they do. And then I have a very boring prediction: I think in most companies it's just gonna be classic machine learning, business as usual. There's all this sexy stuff happening, but I think most of the money's still coming from random forest or XGBoost or linear regression, even.

Adel Nehme: Boring AI for the win.

Jodie Burchell: Boring AI for the win, and I'm there for it. I'll always love you, legit linear regression.

Adel Nehme: No, I completely agree. Linear regressions, logistic regressions, these simple interpretable machine learning models are what's gonna drive the majority of the value in organizations today. I think we'll see a lot more productivity tools emerging from large language models, but most organizations are not gonna have the ability to train and deploy their own large language models at.

Jodie Burchell: No, especially when you need to use APIs, like it's expensive.

Adel Nehme: Yeah, definitely agree. So it's gonna be an interesting year for data science. Jodie, before we wrap up, any final call to action before we end today's episode?

Jodie Burchell: Yeah, I would just encourage you to come check out our tools. I've mentioned them all, but I'll just name them again. We have Datalore, which is more for collaborative data science and managed solutions. We have DataSpell, which is our IDE dedicated to data science. And then within PyCharm, we have really rich scientific computing and data science capabilities.

So yeah, I'd encourage you to download them. We have a community edition of both PyCharm and Datalore, although a lot of the scientific computing is not in community PyCharm. And yeah, just give them a go. Reach out to me on Twitter and give me some feedback, whether you like it or you don't like it, and I'd love to hear what.

Adel Nehme: Awesome. And do make sure to check out DataCamp Workspace. Thank you so much, Jodie, for coming on the.

Jodie Burchell: Yeah. Thank you so much for having me. It's been a really fun chat.

Adel Nehme: Likewise.
