Brian Granger is an associate professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. His research focuses on building open-source tools for interactive computing, data science, and data visualization. Brian is a co-founder of Project Jupyter, a leader of IPython, co-founder of the Altair project for statistical visualization, and an active contributor to a number of other open-source projects focused on data science in Python. He is an advisory board member of NumFOCUS and a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship. Along with other leaders of Project Jupyter, he is a winner of the 2017 ACM Software System Award.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Hugo: Hi there, Brian, and welcome to DataFramed.
Brian: Hi, Hugo. Thanks so much for having me on today.
Hugo: It's such a pleasure to have you on the show. And we're here today to talk about Project Jupyter, about interactive computing, and in fact, you sent me a great slide deck today of yours that you've been giving recently. Something we're going to be focusing in on is actually a slide that you have there. And I'm just going to quote this before we get started. You wrote, "We are entering an era where large, complex organizations need to scale interactive computing with data to their entire organization in a manner that is collaborative, secure, and human centered." Now these are all touch points we're going to be speaking about during this conversation. But before we get into all of this and before we get into the conversation about Jupyter, Jupyter Notebooks, JupyterLab, and all of these things, I'd like to know a bit about you. So first maybe you could tell me a bit about what you're known for in the data community.
Brian: Yeah. Not a problem. So I've been a physics professor at Cal Poly for close to 15 years. I've been involved in a number of open source projects in the scientific computing and data science space. In the early years I was involved in SymPy, which is a symbolic computer algebra library for Python. And then also IPython, the de facto interactive shell for Python. And then in the more recent years I was one of the co-founders of Project Jupyter. And in the last few years I've also co-founded Altair with ...
Hugo: And speaking of Altair, you're actually currently working on an Altair course for DataCamp, right?
Brian: Yes. You're being a little bit optimistic about your verb tense there. It's been a little bit stalled with all the different activities we have going on in the Jupyter world, but yeah, I think I'm around maybe two-thirds to three-quarters done with the DataCamp course for Altair.
Hugo: And so as a project lead for Project Jupyter, I'm wondering what type of skills come into play there. Because I know you have a very strong background, you're a physicist. You have a lot of data analytic skills. A lot of design and engineering and entrepreneurial skills presumably come into this role as well. So I'm just wondering what type of things you need in order to do this job?
Brian: Yeah. It certainly has evolved over the years. In the sort of early days of IPython and Jupyter, we were spending most of our time doing software engineering. There was a very small amount of design work, UI/UX design work. When it's only a handful of people, in principle there's organizational work and community work to be done, but it's at a very small scale that is in the background relative to the software engineering. As Jupyter has grown though, I would say the demand for more time and effort on the organizational and community side, as well as the design aspects of the project, has really increased. One of the challenges in working on open source is that projects like Jupyter or Altair tend to attract really top-notch developers and software engineers. And so that aspect of the project tends to be reasonably well staffed. That doesn't mean that we all have as much time to put into the projects on the software engineering side as we would like. However, as these projects get big, there's nothing in particular that attracts top-notch UI/UX designers, for example, to Jupyter. That continues to be a challenge for us and other open-source projects in terms of how do we build design into the process and figure out how to engage designers in the development of the projects.
Hugo: So in terms of, I mean you're speaking about a number of things here that include design but also hiring, structuring an organization, I know that you think a lot about getting funding for the project. You're talking about community development, which these are all things we think about at DataCamp a lot as well. So it sounds somewhat similar in several ways to running a company.
Brian: It probably is. I've never run a company but when I talk to other people who are in different roles leading companies, there's a lot of overlap there. And our business model doesn't involve selling things to people in the traditional sense, but most certainly we have customers. And our interaction with those customers is very similar to that of a company that has paying customers, in that we exist in a very dynamic, fast-paced part of the economy. And it's the type of thing where, if Jupyter were to sort of relax and begin to coast, there are hundreds of other open-source projects and for-profit companies building products that would quickly put Jupyter in a position of becoming outdated. And so there's a lot of thinking we do and work that we do around looking ahead, our three to five year growth map, where we see data science, machine learning, and interactive computing going and how do we build the resources to tackle those ambitious things on those time frames but also build a sustainable community along the way.
How did you get interested in data science?
Hugo: So how did you get interested in or involved in data science, as opposed to being a physics professor and researcher? How did you get into data science initially?
Brian: Yeah, that's a great question. So we began working on interactive computing as part of IPython. I was a classmate of Fernando Perez back in grad school at the University of Colorado, and Fernando created IPython in the early two thousands. And at the time the world of interactive computing was really something that was done in the scientific world, in academic research, in education. I'm sure there were some companies at the time that were doing little bits of it here or there, but it wasn't something that was pervasive like it is today. And so as we started to build IPython and Jupyter in the 2000s, initially we felt like we had, what we imagined was a very grand vision that everyone in the world of academic research and scientific computing would be using Python and the tools that we and many other people were building. What we didn't see is that the whole world was about to discover data and that really sort of opened up a whole new audience and set of users to open-source data science tools and scientific computing tools that we never imagined.
Brian: And so, honestly, my own journey is more that we were doing what we had always done in terms of scientific computing and then woke up to realize that we were sort of right in the middle of the data science community that was forming both in the academic research side but also on the commercial industry side as well.
Hugo: I want to delve into a bit more about the general nature of Project Jupyter in a minute. But before that, I'd like to speak a bit more about interactive computing. So I'm going to quote Project Jupyter. Project Jupyter states it "exists to develop open source software, open standards and services for interactive computing, across dozens of programming languages." And I'm wondering what a general working definition of interactive computing is and why is it important?
Brian: Yeah. This is a great question and I think in the history of computer science, interactive computing has not even really been a thing that's acknowledged in terms of a topic worthy of study and something that is worth really thinking about carefully and clarifying. And it's something that we've been doing over the years and really the Jupyter architecture is an expression of our thinking about interactive computing. And I'd say that the core idea of interactive computing is that there is a computer program that's running where there's a human in the loop. As the program runs, that human is both writing and running code on the fly but then looking at the output of the result of running that code and making decisions about what code to write and run subsequently. And so there's this sort of interactive mode of going back and forth between the human authorship of the code and then the computer running it and the human interacting with the result in an iterative manner.
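The write, run, look, decide loop Brian describes can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how IPython or a Jupyter kernel is actually implemented: the helper name `run_snippet` and the example snippets are hypothetical, and the "human" here is simply the script choosing the next snippet after seeing the previous output.

```python
import contextlib
import io


def run_snippet(source, namespace):
    """Run one snippet in a shared namespace and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)  # state persists in `namespace` between snippets
    return buffer.getvalue()


# One shared namespace plays the role of the running session.
namespace = {}

# Step 1: the human writes and runs some code.
first_output = run_snippet("values = [3, 1, 2]\nprint(sorted(values))", namespace)

# Step 2: the human looks at the output, then decides what to run next.
# The earlier variable `values` is still available because state is kept.
second_output = run_snippet("print(max(values))", namespace)
```

The point of the sketch is the shared namespace: each snippet can build on the results of the previous one, which is what lets a person iterate on code and data step by step.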
Hugo: And this is ideal for so many aspects of the scientific research process, right? From exploratory data analysis to writing code embedded with in-line results and images and text and that type of stuff?
Brian: Absolutely. This is something that we really think a lot about, and that is that when humans are working with code and data, eventually, at some point, for their work to be meaningful and impactful, the code and data need to be embedded into what we think of as a narrative or story around the code and data that enables humans to interact with it, make decisions based on it, understand it. And it's really that human application towards decision making. It really makes a difference to have a human in the loop when you're working with data.
What is IPython?
Hugo: And for all our listeners out there who may not know what IPython is and how it differs from Python, would you mind spelling that out for them?
Brian: Yeah. So Python's the programming language. IPython stands for interactive Python and it originally was a terminal-based interactive shell for working with Python interactively. It had, and continues to have, a lot of nice features that you want when you're working interactively, such as nice tab completion, easy integration with the system shell, in-line help, the features you'd expect from a rich interactive shell. And the relationship between IPython and Jupyter today is that originally when we built the Notebook we called it the IPython Notebook because it only worked with Python. And then over the years we realized that the same user interface and architecture would also work with other programming languages. Core developers of IPython then sort of spawned another project, namely Project Jupyter, that's the home for the language independent aspects of the architecture. And IPython continues to exist today as the main way that people are using Python within Project Jupyter. And so it continues to be a project that we're still working on.
What is Project Jupyter?
Hugo: So could you give us a high level overview of what Project Jupyter is and what it entails?
Brian: Yeah. So you've read a good summary of Project Jupyter in that it's focused around open-source software, open standards and services for interactive computing. And I think a lot of people are familiar with some of the software projects that we have created, namely the Jupyter Notebook, and I'm sure we'll get to talk more about that here in this conversation. But underneath the Jupyter Notebook is a set of open standards for interactive computing. I think when we think about Jupyter and its impact, it's really those open standards that are at the core of that. And one way to think about it is, it's a similar situation as to the modern internet where, yes, there's individual web browsers and websites, but underneath all of that there's a set of open standards, namely HTTP, TCP/IP, HTML, that enable all of those things to work together. The open standards that Jupyter has built for interactive computing serve a similar role as those other protocols do in the context of the broader internet.
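Brian does not name the individual standards here, so as an assumption, take the notebook document format as one example: a notebook file is a JSON document that any tool can read and write. The sketch below builds a minimal version-4 style notebook using only the standard library. The cell contents are made up for illustration, and real tools typically use the `nbformat` library rather than hand-written dictionaries.

```python
import json

# A minimal notebook document: top-level format version, metadata, and cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A tiny example notebook"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not yet executed
            "outputs": [],            # outputs are stored alongside the code
            "source": ["print(1 + 1)"],
        },
    ],
}

# Because the format is plain JSON, any program can round-trip it.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)

# For example, a separate tool could pick out just the code cells.
code_cells = [cell for cell in restored["cells"] if cell["cell_type"] == "code"]
```

This is the sense in which an open format lets different user interfaces and batch tools work with the same documents without depending on any one application.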
Hugo: So I want to find out a bit about the scale and reach of Project Jupyter, but I want to preface this by saying... it's a story both you and I know but for our listeners, recently I attended a talk you gave at JupyterCon here in New York City. And before you started speaking, you asked people to put up their hand if 10 or less people in their organization used some aspect of Project Jupyter, then asked people to put their hand up if 50 or less, a hundred or less, 500 or less, a thousand or less and so on. And people put their hands up at every point. And then you asked: is there anybody in an organization which has over 10 thousand people using some aspect of Project Jupyter, and a number of people put their hands up. And that was a really large aha moment for me, thinking about the scale and reach of the project as a whole. So with that as a kind of intro, I'm wondering what you can tell us about the scale and reach of the project?
Brian: Yeah. So this is something that's been really fun to be a part of over the last few years, to see the usage of Jupyter literally explode and take off in ways that we never imagined. And there's a number of different ways of thinking about this. Being an open-source project, we don't have an accurate, precise way of tracking how many users we have. Our users obtain and install Jupyter through a number of different means and that does make it a challenge. One nice thing that we're watching is the number of notebooks on GitHub. And this can be obtained by querying the GitHub APIs. We have an open-source project where we're tracking that over time. And as of this summer the total number of public notebooks is on the order of two and a half million. And from talking with the GitHub folks that we know, it looks like there's roughly another similar amount of private Jupyter notebooks that are not visible to the world.
Brian: And so the interesting thing there, obviously the absolute number currently is interesting. I think what's more telling is that over time we're seeing an exponential increase in the number of notebooks and the doubling period right now is around nine or 10 months. So that really points to very strong current numbers as well as growth. It's difficult to put a number on the total number of people that are using Jupyter worldwide. Some of what makes it challenging is that most of our staff is in the US and Europe and yet we know from our Google Analytics traffic that Asia right now is one of the most popular continents that's using Jupyter. And so we don't have many contacts with people there. We don't know how Jupyter is being used but we see a very strong signal that it's being used heavily.
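To make the growth figures concrete, here is a small worked calculation. It assumes clean exponential growth, N(t) = N0 * 2^(t / T), starting from the roughly 2.5 million public notebooks and the 9 to 10 month doubling period quoted above. The function names are hypothetical, and the projection is arithmetic on the quoted numbers, not a forecast.

```python
def projected_count(current, months_ahead, doubling_months):
    """Exponential growth: the count doubles every `doubling_months` months."""
    return current * 2 ** (months_ahead / doubling_months)


def monthly_growth_rate(doubling_months):
    """Equivalent month-over-month growth rate for a given doubling period."""
    return 2 ** (1 / doubling_months) - 1


current_public = 2_500_000  # approximate figure quoted in the interview

for doubling in (9, 10):
    in_one_year = projected_count(current_public, 12, doubling)
    rate = monthly_growth_rate(doubling)
    print(
        f"doubling every {doubling} months: "
        f"{rate:.1%} per month, about {in_one_year:,.0f} notebooks in 12 months"
    )
```

A doubling period of 9 to 10 months corresponds to growth of roughly 7 to 8 percent per month, which is why the absolute count matters less than the trend.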
Hugo: And how about in terms of contributions and amount of developers working on the project?
Brian: Yeah. So along with the usage there's definitely been an increase in the number of contributors. I think that the total number of contributors is somewhere over 500 and it's a fairly large-scale open-source project. We have over a hundred different repositories on GitHub spread across a number of different orgs. The core team right now, the core maintainers of the project, many of whom work mostly full time on the project, is around 25 people. The Jupyter Steering Council is a key part of the leadership of the project and I think there are currently 15 steering council members. A number of new people joined the steering council this summer, which is why I don't remember the precise number.
Brian: And one thing that I want to emphasize with this is, what is the right narrative to have about the different contributions of people to Project Jupyter? I want to sort of make an analogy to Hollywood in terms of, if Jupyter were a movie, what type of movie would it be? And I think it's important to note that it would not be a movie where there is a single superhero who comes and saves the day. So sort of like a Superman narrative really doesn't fit the reality of how Jupyter has been built. A movie that I think would be a better analogy to how Jupyter is built would be something like Infinity War, where you have a bunch of different superheroes, all very diverse in skills and strengths, contributing to the overall project.
Brian: I think it's really important to note that, yes, I'm the one that's here talking to you today but I am one among many people who have done absolutely amazing work on the project.
How to get involved with the project?
Hugo: And for our listeners out there who would like to get involved in perhaps contributing to the project, what are good ways to get involved and what are, I hesitate to say bad ways to get involved, but what are less good ways to get involved?
Brian: Yeah. So this is one thing that I talked about at JupyterCon in terms of, and in that context it was more thinking about what are healthy and productive ways for large companies to engage with open source. So for individuals, I would say one of the best ways would be to find a part of the project that you're interested in and then come on to GitHub and begin to interact with us. A lot of our popular GitHub repos have issues that are tagged for first-time contributors. And so we're working hard to try to make the project a welcoming place for new contributors. We welcome people to come and talk to us there.
Brian: We also have chat rooms that are public, on Gitter.im. This is an online, web-based chat platform that is integrated with GitHub. And so, for example, the Jupyter Notebook, JupyterLab, JupyterHub, Jupyter Widgets, all have chat rooms on Gitter. And both the core contributors as well as the broader community hang out in those contexts. So that's a great way for people to get involved.
How can organizations contribute to the project?
Hugo: And how about organizations that want to contribute to the project, Brian?
Brian: I think in this case it's really helpful to have a good mental model of how open-source projects work and how contributions function. My favorite mental model is actually from Brett Cannon, who's one of the core Python devs and works at Microsoft. In a Tweet he said something like, "Submitting a pull request to an open-source project is like giving someone a puppy. And that is, you have to understand that the person accepting the pull request is essentially agreeing to care for that puppy for the rest of its life." And one of the patterns that we see in organizations, companies that want to contribute to open source, is that they're interested in particular features and so they have their employees contribute to open source in a way that generates a lot of large new pull requests for those new features.
Brian: And I think this is where the puppy mental model really helps. And that is, oftentimes the core maintainers of open source projects are completely overwhelmed with just the maintenance of the project. That could include bug fixes, releasing, issue triage and managing issues, but also reviewing other contributors' pull requests. And so one of the most helpful things that we're trying to cultivate is basically a balanced perspective of contributions that includes not just submitting pull requests that have new features but also includes reviewing other people's pull requests, helping other users with particular issues, and even fixing bugs.
Brian: One really nice thing that GitHub has done recently is in their contributor user interface, they have a new user interface for expressing someone's contributions to a particular GitHub repository and there's sort of an X, Y coordinate system and four directions around that. And it shows someone's contributions to, I think it's code review, pull requests, issues, and there's one other. From that you can get a perspective on how balanced someone's contributions are to an open-source project.
Brian: And so a simple way of putting it is, encouraging people to have a balanced way of contributing to open source. Now we also want to specifically address first time or new contributors. And there I think the idea again is balance but in a way where the core contributors and existing people working on the project can help new contributors to come along and begin to contribute in different ways. And so even for new contributors, those contributions don't necessarily have to be pull requests. Even checking out existing pull requests, just testing them locally, is really, really helpful for open-source projects.
Hugo: And it's incredible that GitHub now has the feature you discussed which kind of facilitates just figuring out this balance, right?
Brian: Oh, absolutely. I was thrilled to see them release that and I think it's happened in the last month. Off hand I don't even remember exactly what they're calling it.
What are the main uses of Jupyter Notebooks for data science and related work?
Hugo: As we've been discussing, the scale and reach of Project Jupyter is massive. So I'm sure there are so many different uses of notebooks and the Project in general. But I'm wondering, to your mind, what the main uses of Jupyter Notebooks for data science and related work are?
Brian: Yeah. In terms of numbers of people using Jupyter for a particular purpose, I would say interactive computing with data, so data science, machine learning, AI, is one of the most popular ways that Jupyter's being used. Both by practitioners, so people who are working in the industry on data science and machine learning teams, but also in educational contexts. So within universities, with online programs, with boot camps, you have instructors and students doing those activities around data science and machine learning but in an educational context. I think that really captures the bigger picture of Jupyter's usage.
Hugo: Yeah. And in fact, we use them at DataCamp for our projects infrastructure, which, as you know, we teach a lot of skills in our courses. In our projects we teach kind of end-to-end data science workflows using Jupyter Notebooks.
Brian: That project style workflow is something that I've seen when I have taught data science at Cal Poly, my university, in that oftentimes it's helpful to start with students in a very highly scripted manner where the exercises are very small scale and focus on a particular aspect of a particular concept. And then eventually transition to more open ended project based work. I know in those courses, towards the end of the quarter when the students have an opportunity to do sort of end-to-end data science that's a little more open ended, the learning really increases a lot and the students get a lot out of it. So I'm thrilled to see that DataCamp has that type of experience as well.
Hugo: And of course, we see notebooks pop up everywhere. From in the slide deck that we discussed earlier, you have a great slide on the Large Synoptic Survey Telescope. On top of that of course the gravitational waves were discovered by the LIGO project. And they've actually published all of their Jupyter notebooks. So this is in basic scientific research, right? I mean there's a lot of stuff happening at Netflix with Notebooks now. So it's across the board, right?
Brian: Yeah. And this is really a pattern that we've seen emerge in the last two years, and that is the transition from ad hoc usage by individuals in organizations to official organization-wide deployments at scale. And so we're starting to see a lot more organizations adopt Jupyter in a similar way to LIGO or LSST or Netflix where it is officially deployed and maintained by the organization and many, many users are using Jupyter on a regular basis. Some of the larger deployments that we're aware of are many thousands, or even on the order of 10 thousand or more people. So the scale is definitely getting large in these organizations.
Hugo: I'm going to say two things that I think are facts, and correct me if I'm wrong. Netflix runs over a hundred thousand automated notebook jobs a day. And at least two, either contributors or core contributors to Project Jupyter, work full time at Netflix as well.
Brian: Yes, absolutely. So Kyle Kelley and M Pacer are on the Notebook team, I don't know if that's exactly the name of their team, but they're one of the tools teams at Netflix. They work both with us on some of the core Project Jupyter projects but also they have a number of other open-source projects that work with the different Jupyter protocols. One of those is nteract, which is another user interface for working with Jupyter Notebooks that has a focus on simplicity and on personas where individuals do want to work some with code but they're not living and breathing in code all the time. Business analysts would be a great example of the type of persona that nteract is targeting. And then, as you mentioned, Netflix has really innovated in the area of using notebooks in a programmatic way, running them as batch jobs every day. And I think your number of around a hundred thousand batch jobs that are notebooks a day sounds about right from what I remember.
Brian: And then they have a number of open-source projects out to help with those types of workflows. One of those is Papermill, the other is Commuter. And I think one of the things I love about what's going on at Netflix, and I think this really comes from the leadership of Kyle Kelley there, is a deep understanding of the value of Jupyter's open protocols. And that is sort of a recognition that the different software projects that we've built on top of those protocols are sort of like a Lego set that you get. You bring it home from the store and there's a default instruction set to build something out of the box. But then realizing that the same pieces can be reassembled in different ways to build whatever your organization needs. I love how that thinking has really sort of seeped into all the different ways that Netflix is working with data. And I think they're doing really interesting things as a result.
Hugo: And the Notebook team at Netflix actually published a really interesting blog post article recently, which we'll link to in the show notes along with a lot of other things that we're talking about.
Brian: Yeah. And they also gave a number of talks at JupyterCon and those talks will be posted on the JupyterCon YouTube channel here in the coming month I think.
What is the most surprising use of a Notebook you've seen?
Hugo: Okay. Fantastic. So when we have 2.5 million public notebooks on GitHub, I'm sure there's a lot of surprising stuff happening out there. This may be a bit of a curve ball, but I'm wondering if you've seen any uses of Jupyter Notebooks that have surprised you, where you've been like, "Oh, wow, that's interesting." So what is the most surprising use of a Notebook you've seen?
Brian: Yeah. I mean one of the fun things about working on Project Jupyter is to follow all the amazing things that our users are doing. And I think seeing the impact that Jupyter's having in the world of scientific research is something that we're really proud of. So to see large-scale science such as LIGO and Virgo winning a Nobel Prize in physics and as part of that publishing Jupyter Notebooks that anyone in the world can use to completely reproduce their analysis all the way from the raw data to the final publication-ready visualizations. That makes us really proud. I don't know that surprise is the right word to use there. Some of that is that that's the community that we came out of and so we've always worked really hard to make sure that Jupyter was useful for those usage cases.
Brian: In terms of surprise, the most surprising or shocking usage of Jupyter was by Cambridge Analytica and SCL Elections to build machine-learning models to manipulate the 2016 elections.
Hugo: Right. And I do think surprising is one word there. Shocking is another word. And I actually remember the first... I saw a Tweet, and I think it was Wes, it was Wes McKinney who tweeted words to that effect. And we saw a screenshot of a Jupyter Notebook with a pandas DataFrame and some scikit-learn fit and predict calls or something like that. And that was a moment where I also stepped back and really thought, all these tools can be used for all types of purposes, right?
Brian: So Cambridge Analytica, all of their web presence and GitHub presence is gone. SCL Elections, which worked with them, hasn't taken their stuff down, or hasn't taken all of their stuff down from GitHub. And so there's a project called JupyterStream. You can tell that the people working at SCL Elections were typical data scientists who were excited to use these tools to do data science. And the thing that's scary is if you look in the demo subdirectory in notebooks, there is a notebook there and it's very clear the type of things they were doing. Now in this particular case, it's nothing particularly sensitive. It looks like they're tracking voter registration counts by calendar week and working with a pandas DataFrame with that. But we were certainly ... again, I'm not quite sure what the right word is, surprised doesn't quite capture it.
Brian: I think it was really a moment of waking up for us and realizing that having an open-source project with a very free and liberal open-source license is very similar to free speech, in that a liberally licensed open-source project can and will be used by just about anyone, and that includes people doing really good things, but also people doing really evil things.
Hugo: It is creepy how this repo also says Jupyter to the rescue, exclamation point.
Brian: It's really a trip. I mean that was the original interview that Christopher Wylie did. So he was one of the data scientists at Cambridge Analytica. And in that first interview that came out, I think it was in The Guardian, he used the phrase “build models”. And immediately I thought, "Wait, hold on a second. This is a data scientist. They're talking about building models. What they're really saying there is import scikit-learn in a Jupyter Notebook." And initially it was sort of like, "Yeah, they might have used it. Or maybe they used RStudio." But then over time it's become very clear that they certainly were using Jupyter at some point.
Where should Notebooks not be used?
Hugo: So this is a nice segue into my next question, which that may seem like an answer to, but I suppose it isn't necessarily. My next question is, adoption of Notebooks, as we've discussed, has been huge and I'm wondering if there are places you see Notebooks used where they shouldn't be used?
Brian: Yeah. I'm not quite sure I would phrase it as where they shouldn't be used. But certainly I think there's a little bit of an effect where the Notebook is a hammer, and so everything starts to look like a nail. In particular the type of workflow where Notebooks begin to be rather painful is when exploratory data science and machine learning becomes more software engineering and more about data engineering. And in those usage cases, it's not a fantastic software engineering environment. It's not designed for that purpose. Now this is something we're hearing from our users, that right now there's sort of a very steep incline between working interactively in a notebook and software engineering. And as someone moves across that transition, at some point today they get to the point where really the right thing for them to do is stop using Jupyter and open up their favorite IDE and start to do traditional software engineering. And that can be rather painful in that most of the IDEs that people love are not web-based.
Brian: And so if anyone's working with significant amounts of data and running in the Cloud, those IDEs may not even really be a great option. And so we are working a lot to improve the experience at that boundary between interactive data science and software engineering. We don't envision that Jupyter's ever going to replace full-blown IDEs, but it's really at that boundary where we're seeing a lot of user pain currently where notebooks themselves start to be not the best tool.
Hugo: Yeah. And that dovetails nicely into my next question, which is around the fact that there are several common criticisms of Notebooks, such as that they may encourage bad software engineering practices. And I suppose most famously recently, JupyterCon accepted Joel Grus' talk, I Don't Like Notebooks, to be presented at JupyterCon. I'm just wondering what you consider the most important or relevant or valuable or insightful criticisms that can help move the project forward?
Brian: Yeah. So I really appreciate the talk that Joel gave at JupyterCon. It was a really well received talk and we want to hear things like this. It's really important for us. Some of the criticisms that he had about Project Jupyter are in the category of things that we can fix. So existing user experiences or features that we offer or don't offer or could improve. And on most of the things he brought up in that category, I think the whole core Jupyter team is more or less on the same page.
Brian: The other aspect that he was bringing up gets more to the heart of interactive computing with Jupyter Notebooks. I think it's helpful to bring those things up as it really forces us to clarify the value proposition of that type of workflow in a notebook compared to just traditional software engineering. And so I think the discussion that has emerged out of that has been really helpful and something that is helping us to clarify, when should you use Jupyter Notebooks or why would you use them and why should you not use them in some circumstances.
What is JupyterLab?
Hugo: Absolutely. So we've discussed Notebooks, of course, but something I'm really excited about, and I know it's something you're incredibly excited about, is the next generation user interface for Project Jupyter, which is JupyterLab! So maybe you can tell us what JupyterLab is and why working data scientists would find it useful?
Brian: Yeah. JupyterLab is definitely something that I and many other people in the core team are excited about. JupyterLab is a next generation user interface for Project Jupyter. We've been working on JupyterLab for around four years now and just in the last month it left beta, so it is ready for production.
Hugo: And congratulations.
Brian: Thank you very much. We're really pleased to get through that hurdle. It's still not at a 1.0 release because some of the developer oriented extension APIs are still stabilizing. One of the big things we heard from users of the classic Notebook is that people wanted the ability to customize and extend and embed aspects of the Notebook with other applications. And the original code base in the classic Notebook just wasn't designed in a way that made that easy. So one of the core design ideas in JupyterLab is that everything in JupyterLab is an extension and all those extensions are npm packages. And the idea there is that the core Jupyter team can't build everything for everyone and a lot of different individuals in organizations will come along and add new things to JupyterLab. Those extension APIs, which are the public developer oriented APIs of JupyterLab, enable those extensions to be built and we're still in the process of stabilizing some of those APIs.
Brian: But I want to emphasize that from a user's perspective, for people who are using Jupyter on a daily basis, JupyterLab is fully stable and production ready and in many ways, at this point, I would say it's a better user experience and more stable than the classic Notebook.
Hugo: Great. And what type of features do you have in JupyterLab that you don't get in the classic Notebook?
Brian: One of them is the ability to work with multiple different activities or building blocks for interactive computing at the same time. So in the classic Notebook, each notebook or terminal or text editor worked in a separate browser tab. And that made it very difficult for us to integrate those different activities with each other. So an example of how that integration would work in JupyterLab is, if you have multiple notebooks open side by side, you can just drag a cell between those two notebooks. Another example would be if you have a markdown file open, you can right click on the markdown file and open a live markdown preview and then also open a code console attached to that markdown file and start running code in any of the different languages that Jupyter supports, in a manner that's more similar to an experience like RStudio. So having the different building blocks, places to type code, outputs, terminals, notebooks, integrated in different ways to support some of these other workflows that come up.
Hugo: And also a CSV viewer, right?
Brian: Yes. So another big idea, design idea in JupyterLab, is the idea of more direct manipulation of user interfaces. And so in many cases, writing code is the most effective way of interacting with data. However, there's many situations where writing code is a bit painful. And a great example of that is, if you have a new CSV file, you don't know what's in it, and you simply want to look at it. Of course you can open up a notebook, import Pandas and start to look at the CSV file. But in many cases, more direct modes of interaction are highly productive and useful. So JupyterLab's file system access is based around the idea of the possibility of multiple viewers and editors for a given file type.
Brian: And so for example, for a CSV file, you can open it in a text editor and edit it as a plain text CSV file, or you can open it in this new grid or sort of tabular view of it we have, and that viewer's the default for CSV files. So you can just double click on a CSV in JupyterLab and it will immediately open in a form that looks like a table.
Hugo: And I recall from one demonstration that it supports wildly large CSV files as well, right?
Brian: Yeah. So one of our core contributors, Chris Colbert, spent a lot of time building a well designed data model and a viewer on top of that. So the data model for this grid viewer does not assume that all of the data is loaded into memory. So it has an API that allows you to request data from a model on an as needed basis. And where that's used is in the view that sits on top of that: if you have a really large CSV or tabular data model, the view is only going to request the portions of the data that are visible to the user at a given time. And so for example, right now, in some of the demos that we're doing, you can just double click on a CSV file that has over a million rows. Those files are big enough that they don't open successfully in Microsoft Excel on the same laptop. And they open just fine in JupyterLab, and the viewer or the renderer that Chris wrote uses Canvas, so it's a very high performance tabular data viewer.
Brian: And to keep ourselves honest, we've tested it with synthetic data sets. So these are not concrete data sets, they're generated on the fly but they have a trillion rows and a trillion columns. And the tabular dataset viewer works really well and you can scroll through the dataset just fine.
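Editor's note: the idea Brian describes here, a data model that hands out cells on request while the view asks only for what is on screen, can be illustrated with a minimal Python sketch. This is not the JupyterLab data grid code. The class name SyntheticGridModel, its methods, and the cell format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SyntheticGridModel:
    """Hypothetical data model: cell values are computed on demand, never stored."""

    row_count: int
    column_count: int

    def cell(self, row: int, column: int) -> str:
        # Generate the value on the fly instead of reading it from memory.
        if not (0 <= row < self.row_count and 0 <= column < self.column_count):
            raise IndexError("cell outside the model")
        return f"r{row}c{column}"

    def window(
        self, first_row: int, first_column: int, n_rows: int, n_columns: int
    ) -> List[List[str]]:
        # Return only the cells a view would currently show, clipped to the
        # edges of the model.
        last_row = min(first_row + n_rows, self.row_count)
        last_column = min(first_column + n_columns, self.column_count)
        return [
            [self.cell(r, c) for c in range(first_column, last_column)]
            for r in range(first_row, last_row)
        ]


# A trillion rows by a trillion columns, as in the synthetic test Brian mentions.
model = SyntheticGridModel(row_count=10**12, column_count=10**12)

# A view near the bottom of the data set asks for 20 rows by 4 columns;
# only 10 rows exist below the starting row, so the window is clipped.
visible = model.window(
    first_row=999_999_999_990, first_column=5, n_rows=20, n_columns=4
)
print(len(visible), len(visible[0]))
print(visible[0])
```

The cost of each request depends on the size of the visible window, not on the size of the data set, which is why scrolling can stay responsive even when the model is far too large to hold in memory.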
Brian: And I think another side effect of direct interaction with data is that when you make it easy for users to interact with data in those ways, they're going to do that, right? So if you can double click on a CSV file, a user's going to find, "Wow, that's useful," and they're going to do that. And you have to spend a lot of time making sure the underlying architecture doesn't let them do things that are going to have adverse side effects. We're trying to build an architecture that has both good user experience but also can deal with the realities of large data.
Hugo: And there are several other features that we could discuss, but I'm just going to pick one which I think is very attractive and fantastic, which is the ability to collaboratively work on Jupyter Notebooks with colleagues and collaborators.
Brian: Yes. So this is something that we've been working on for a while now. Our first take on this was a JupyterLab extension that a postdoc at UC Berkeley, Ian Rose, wrote. And this provided integration with Google Drive and the Google Realtime APIs, which enable multiple people to open a single notebook and collaborate in real time on that notebook and see the other people working and editing the notebook at the same time.
Brian: And then in the last year and a half, we've started a new effort to build a real time data model and data store for JupyterLab for two reasons. One is that the Google Realtime API has been discontinued. And then the other is that we've heard very clearly from our users that there's many organizations for whom sending all their data to Google APIs is a no go. And so it's become really important for us to have a high performance, really well designed real time data storage. We've been working on that for the last 18 months. Again, Chris Colbert, who did the data grids, is the person working on that.
Hugo: Great. And listeners out there, this has been kind of a whirlwind introduction to a bunch of features in JupyterLab. I'd urge you to go and play with it yourself and check out some of the demos online as well if you haven't yet.
Brian: And I want to clarify, the version of JupyterLab that's out today, does not yet have the real time collaboration.
Hugo: Okay, that's right.
Brian: Still not quite released yet.
Hugo: So we've discussed IPython, we've discussed Jupyter Notebooks, we've discussed JupyterLab. What else exists in the Jupyter ecosystem? Could you give us just a brief rundown of a couple of the other things?
Brian: Yeah, absolutely. Probably the biggest other constellation of projects is the JupyterHub project. JupyterHub is its own organization on GitHub and there's a number of different separate repos and projects there. And JupyterHub provides basically the ability for organizations to deploy Jupyter at scale to many users. With the patterns of adoption that we're seeing right now, that usage case is really, really important. As a result of that, JupyterHub has seen both a lot of interest from people contributing and also organizations using it.
What are the challenges the project is facing?
Hugo: We discussed earlier the talk you recently gave at JupyterCon, in which you stated that, "Project Jupyter is undergoing a phase transition from having individual users in organizations to having large scale institutional adoption." I'm wondering what unique challenges the project is now facing due to this transition?
Brian: Yeah. So there's both organizational and technical challenges we're facing. On the organizational side, I would say the big challenge is that we're seeing an increasing number of organizations coming to us and wanting to interact with us rather than just individuals in those organizations. And that really changes the type of people you're talking to in the organizations. So in many cases in the past, it may have been data scientists or machine learning researchers. And increasingly it's managers, project managers, and other decision makers who are thinking about the broader data strategy at the organizations.
Brian: From a technical perspective, it brings out a lot of new usage cases, in particular in JupyterHub, to address the needs of large organizations. Some examples of those are security, security is a really important thing for large organizations, particularly when there's sensitive data in the mix. Another aspect of that is that in these large organizations there are typically a wide range of different skill sets, responsibilities, roles, access permissions, and priorities of the people working with Jupyter. And so it's not necessarily just people who are living and breathing code all day long, but a lot of other people in the organization working with data that don't necessarily want to look at code all the time, or even most of the time. And so there's a lot of work we're doing thinking about, how would Jupyter need to evolve to address those usage cases?
Call to Action
Hugo: Absolutely. So Brian, as my last question, I'm wondering if you have a final call to action for all our listeners out there who may have used Jupyter Notebooks, may not have, but may be interested in doing so?
Brian: Yeah. So I think there's a couple different calls to action. One is for individuals to engage with open source. If you're a data scientist or someone doing machine learning at a company or a student learning about these tools and techniques, engage with the open-source projects. Find an open-source project you're interested in, understand more about the project, maybe help with documentation. A lot of what we've found is that innovation happens when diverse groups of people come together and talk to each other and work towards common goals. And so the more people we have joining the projects and contributing and helping us think about these things, the better off and more healthy the open-source projects will be, but also the users of those projects will be better served.
Brian: And a second call to action would be for people working in large organizations that are using open source tools in this space. I think it's important to note that many of the open-source projects, in particular those that are community driven, like Jupyter and many of the other NumFOCUS projects, continue to struggle with long term sustainability. And there are many core contributors to these projects that continue to lack long term funding and the ability to focus on working on the projects. So if you're in an organization using these tools, I would really encourage you to talk to the people in the organization, to think about and understand how you can support the core contributors and the broader sustainability, both in the sense of community but also in particular the financial sustainability of these efforts. That would be really, really helpful.
Hugo: And I'll add one other final call to action there, which is, I started using JupyterLab all the time instead of Notebooks mid last year I think. I would urge anyone out there who still uses the classic Notebook to jump into JupyterLab. I think you have no reason not to these days. It's such a wonderful place to use Notebooks among many other things.
Brian: Yes. That's a great point, Hugo. Even for the part of my job where I get to use the Jupyter Notebook, I transitioned, in particular in teaching and some research, to using JupyterLab back in January and it's worked really, really well in this context. At this point I'm using JupyterLab basically all the time. And so I echo what you said and I appreciate the kind words.
Hugo: Fantastic. I'm glad you agree. And, Brian, thank you so much for coming on the show. I always love our conversations and it's been an absolute pleasure formalizing this and putting it out there.
Brian: Yes. And thank you so much, Hugo, for working on this podcast. I know a lot of people really appreciate it and thanks for having me on.