3 Things I learned at JupyterCon
Project Jupyter may be best known for the Jupyter notebook but, as we'll see, there are many other exciting developments surrounding the project. If you haven't experienced Jupyter notebooks for interactive, reproducible data science analysis, computation and communication, you can check them out here. The Project itself says it best:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
As we'll see, the Jupyter ecosystem contains more than notebooks. Project Jupyter is also more than kernels and browser-based front-ends. Peter Wang, CTO and co-founder of Anaconda, put it well when he said that Jupyter is a substrate for innovation. I recently attended the inaugural JupyterCon in New York City. This was the first conference dedicated to the Jupyter ecosystem, which has grown enormously over the past few years, and I was really excited to attend for the tech, for the developments, for the open source and for the community. These are some of my thoughts inspired by the conference.
1. The Python community has a strong, huge vision for the future of Data Science
The JupyterCon keynotes focused on and gave perspective to some of the most pressing challenges of our time. Whether you're a budding or a well-seasoned Data Scientist, the points touched upon in these talks will be relevant to the work you do today and what you'll be doing in the future: the sustainability of open source projects, the necessity of reproducible data science, the education landscape in Data Science and the vision that our community leaders have for the future of interactive, reproducible computing.
Fernando Pérez, creator of IPython, co-lead of Project Jupyter and Professor at UC Berkeley, spoke about the long-term sustainability of the project. He discussed the motivations behind IPython and Jupyter, among many other projects that are now firmly placed in the scientific computing Python stack. These projects provide computational environments that mirror and reflect the scientific tasks at hand, along with associated conceptual and cognitive processes, the ability to rapidly load data, explore it, visualize data and tell stories. On top of this is another layer of motivation, an ethical one to do with access to tools and the possibility of collaboration. For example, if Fernando worked primarily with proprietary tools that require expensive licenses, he would not necessarily be able to work with his former advisor in Colombia. Moreover, he made a strong point about the pitfalls of using closed source tools: "If science is about opening up the black box of nature, we shouldn't be doing science with tools that we are not legally allowed to open up and understand."
The goal is to build human-centered tools for interactive computation and science; tools to help you think and reason through complex problems, to make possible human-driven computation, exploration and communication.
Moving forward, Fernando suggested that low-level protocols should not be where forking occurs and that protocols and formats need to be agreed on. They are the trunk of the infrastructure. There must be competition and evolution, but this should happen in the leaves, as we need to agree on the foundations.
So what's happening in the branches and leaves of the ecosystem? Among many other projects, there is JupyterLab, an extensible environment for interactive and reproducible computing; nbdime, which provides tools for diffing and merging Jupyter notebooks; JupyterHub, which provides multi-user hubs for notebooks; and Binder, which opens notebooks in executable environments.
He also discussed the challenge of sustainability, the role of funding and the need for organizations such as NumFOCUS to steward open source projects and to keep in mind their growth, health and sustainability.
Peter Wang, co-founder and CTO of Anaconda, stepped back to look at what similarities we can draw between Jupyter and Anaconda and made clear that, in both these cases, we're at an inflection point. The user bases have undergone a shift from the innovator and early adopter crowd to mainstream users. The latter are people who may not issue pull requests on Saturday night, but use the tool to get their job done. Meeting these new needs, Peter reflected, will effectively change our job as a community, analogous to moving from a band playing gigs in a garage to playing in a stadium. Peter essentially said that we need to have this conversation as a community because, at these critical points, sustainability is essential. Projects need to be rigorously documented, and we need to emphasize tutorials, workshops and growing the relevant teams by adding new developers.
As there is a substantial amount of money flowing in from public and private interests, it is paramount to organize it, which is one reason that organizations such as NumFOCUS are so important.
Peter went on to correct a common misconception that Jupyter and Anaconda are just tools and stated that they are in fact substrates for innovation. They form common frameworks for atomic computational tasks that allow us all to establish a lingua franca through which a network of values emerges between innovators, creators and consumers: in essence, a marketplace for Data Science.
Nadia Eghbal, working in Open Source Programs at GitHub, talked about Where money meets open source, framing the sustainability conversation in terms of funding.
She first posed a hypothetical question to the open source community, "If you have money, how would you spend it?"
To answer this, Nadia asked another question, "Why do people contribute?" The most common answers are:
- "I want to solve a problem" (especially early on in a project)
- "I want to build my resume" (a public resume of sorts)
- "I feel like I belong here." (stay for the community!)
- "It's fun for me!"
Note that, for any individual, these incentives can change over time: for example, a user can become a contributor, who can in turn become a maintainer. And depending on what stage a project is at, money may be best spent in different ways. For example,
- At the beginning of a project, there may be start-up costs;
- Then your money may be best spent evangelizing your project to get users, sponsoring community events and conferences;
- At some point, lowering the barrier to contribution will be important via in-person sprints, workshops and maintainer meet-ups.
The role of funding in open source software development is an important, unsolved question, and answering it is key to the success of open source.
Lorena Barba, Associate Professor of Mechanical and Aerospace Engineering at the George Washington University, spoke about the role of Design in scientific reproducibility. She asked "Why do we care about computational reproducibility?" and stated that we care because we're using computation as a way to create new knowledge about the world and to make scientific discoveries. And science demands reproducibility! For a project such as Jupyter to remain sustainable and relevant, it needs to speak to the needs of its users and reproducibility is one of these.
Lorena questioned how interactivity can promote reproducibility. For example, GUIs are not well suited to reproducibility, as a sequence of point-and-click steps is not easy to document. But if we view science as a conversation in which we all have goals, a shared language and, hopefully, agreement, then an interactive tool such as Jupyter can be viewed as an interface defined for reproducibility, a shared language to increase agreement and establish trust.
Wes McKinney, creator of pandas, stepped back and stated that the challenge to Jupyter is larger than Python and concerns the general problem of interactive computing and reproducible research. A major issue is how to load data, manipulate it, transform it and report on it in a reproducible way when we have what are effectively Data Science language silos (Wes described the landscape as almost tribal): pandas built a lot of cool things, but it is Python-dependent; in R and Python, you can do the same analysis (for example, a group_by), but the implementation under the hood is totally different.
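To make the group_by example concrete, here is a minimal sketch of what the operation does conceptually (split the rows by a key, apply an aggregation, combine the results). It is written in plain Python and is not how pandas or R implement it; the function name `group_mean` and the toy data are invented for illustration.

```python
from collections import defaultdict


def group_mean(rows, key, value):
    """Split rows by `key`, then average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        # Split: collect the values that share the same key.
        groups[row[key]].append(row[value])
    # Apply and combine: one mean per group.
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Hypothetical toy data, invented for illustration only.
rows = [
    {"species": "cat", "weight": 4.0},
    {"species": "dog", "weight": 10.0},
    {"species": "cat", "weight": 6.0},
    {"species": "dog", "weight": 20.0},
]

result = group_mean(rows, "species", "weight")
print(result)
```

The analysis is the same whichever language expresses it, which is the point Wes was making: the user-facing idea is shared, while each ecosystem has its own engine underneath.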
Wes' vision includes a shared data science runtime to make these silos smaller, just as Jupyter makes silos smaller on the front-end. But how?
- The ability to have a DataFrame in-memory format that's portable across environments and has zero-copy interchange (no cost for moving a DataFrame from R to Python, for example);
- To share DataFrames between ecosystems without overhead;
- High-performance data access;
- Flexible computation engine.
The goal of the Apache Arrow project is to create a language-agnostic data frame format with zero-copy interchange. The MVP is Feather, realized by Wes and Hadley Wickham last year.
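As a rough analogy for what "zero-copy" means, the sketch below uses only Python's built-in buffer protocol: two names refer to the same memory, so handing data over costs nothing. This is not Arrow's API and it stays inside one Python process; it only illustrates the idea that Arrow aims to extend across languages through a shared in-memory layout. The variable names are invented for illustration.

```python
import array

# A column of float64 values held in one contiguous buffer.
column = array.array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview exposes the same memory without copying it.
view = memoryview(column)
window = view[1:3]  # slicing a memoryview is also copy-free

# A write through the view is visible in the original buffer,
# which shows that both names refer to the same memory.
window[0] = 20.0

# By contrast, tolist() builds an independent copy.
copied = column.tolist()
copied[0] = -1.0

print(column.tolist())  # the write through the view shows up here
print(copied)           # the change to the copy does not affect `column`
```

With a copy, the cost of moving data grows with its size; with a shared buffer, the consumer just reads the memory that is already there.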
Jeremy Freeman, Computational Biologist at the Chan Zuckerberg Initiative, works at the intersection of open source and open science. His vision is firmly centered on practical ways to make scientific research and progress faster, more efficient, effective, scalable and collaborative. In a word, to accelerate scientific progress through software and computational tools. The biggest challenges he has identified and is working on are:
- Enabling analysis (how do you analyze data in real time using notebooks when your workflow requires so many tools?)
- Building collaboration, e.g. the Human Cell Atlas, the goal of which is to systematically characterize all cells in the human body. This involves a data coordination platform for hundreds of labs and re-imagining what a modern, cloud-based, extensible and highly modular version of large-scale collaborative scientific efforts would look like!
- Sharing knowledge. Jeremy made the salient point that science shares knowledge in a pretty old-fashioned way: static documents that contain neither code nor data, many of which are behind paywalls. Possibilities for the future involve Jupyter notebooks and Binder for interactive scientific research.
As Data Science is growing and expanding into all types of industries, one of the bottlenecks is the lack of working Data Scientists. To solve this, we here at DataCamp believe that data science education is essential. We will also see it become more and more important for all modern citizens to become data literate and develop data fluency.
In his keynote discussed above, Fernando Pérez also touched upon the current pivotal role of education and how tech is changing the face of education and how we can leverage this. For example, the textbook for Berkeley's Foundations of Data Science course contains interactive notebooks. He also spoke of Data Science as consisting of essential skills for modern citizens.
Rachel Thomas, mathematician and co-founder of online Deep Learning school fast.ai, spoke about the fast.ai curriculum, which uses Jupyter notebooks to teach Deep Learning to tens of thousands of students worldwide. This is a forward-thinking education model, in which students from all backgrounds can get up and running with modern Deep Learning techniques with minimal background. The motto of fast.ai is "If you can code, you can do deep learning". It's free, there are no advanced math prerequisites and it is all taught in Jupyter notebooks. The course uses a lot of data from Kaggle, which ensures both good data sources and good benchmarks, and it makes the techniques relevant through applications in image analysis and Natural Language Processing. Students get started on a single GPU on a cloud instance right away. Rachel even said at one point in her keynote "It's not by listening or by watching that you learn, it's by doing", which is aligned with our motto here at DataCamp, "Learn by Doing"!
Demba Ba, Assistant Professor of Electrical Engineering and Bioengineering at Harvard University, is working to empower his students and to democratize computational education. Citing Fernando Pérez' "Data Science is a critical skill for a citizen of the modern world to learn", Demba's goal is to bridge the gap between Electrical Engineering and Computer Science by creating educational content that integrates theory and computation, along with a seamless coding interface that lets students focus on learning the content. His courses focus on vertical integration of tools, in which students collect data on themselves, upload it to the cloud, process it in notebooks, take the output and make real-time decisions in an Internet-of-Things kind of way. His courses leverage modern tech by hosting all course notebooks on Amazon Web Services, and the emphasis is on data relevant to the students, question forming, data gathering and analysis. Demba stated that "In the future, facility with data manipulation is going to be part of literacy" and that data-centered teaching will necessarily start popping up more and more in other fields, such as government and journalism.
2. JupyterLab is the future of interactive, open data science
I attended both a workshop and a talk on JupyterLab and it's all very exciting news. An added bonus was seeing how excited core JupyterLab contributors Brian Granger, Chris Colbert, Jason Grout and Ian Rose were about it. What is JupyterLab? Recall that Jupyter notebooks provide interactive, exploratory and reproducible computational environments. JupyterLab aims to provide a one-stop shop in which you, as a user, can combine all the building blocks that you require in your Data Science workflow:
- file browser
- text and markdown editors
- bash terminal
- .csv viewer
Not only can you configure any number of the building blocks listed above however you wish, but they can also interact with one another. You can, for example, drag and drop cells from one notebook into another. You can view a live markdown preview in JupyterLab that updates in real time. You can also attach a live console to a markdown document and thus execute code from the markdown directly in the console!
There is a plethora of exciting developments, including a .csv viewer (which allows you to scroll in real time through a file of 1 trillion rows by 1 trillion columns), an interactive browser for JSON files and git integration (yes!).
The most exciting development for me is the ability to collaboratively develop notebooks and hence collaborate on code, data science communication and computational environments. To be clear, there is now real-time collaboration on Jupyter notebooks where you and I can work remotely on the same notebook (technical note: we'll be running different kernels, but this may not always be the case), discuss our work in a chat window and drag and drop code, text and equations from the chat to the notebook.
You can also build your own extensions to JupyterLab (this is how the git integration was built, as I understand it), which will be a huge win for the ecosystem and JupyterLab as it evolves. I encourage you to find out more by watching this demonstration from PyData Seattle 2017.
3. Data journalism is an unsolved challenge that we can all contribute to
Karlijn Willems, Data Science Journalist at DataCamp, presented on the role of Project Jupyter in enhancing data journalistic practices. Karlijn focused on several challenges that data journalism faces, the most important of which are the development of
- A data journalism workflow
- Reproducible data journalism
- Data journalism authoring standards.
I learned a great deal from Karlijn's talk. I knew about recent data journalism works such as The Tennis Racket and The Panama Papers, but I wasn't aware that data journalism goes back at least as far as 1821, when the Guardian published its first issue, which included its first statistical table.
Karlijn discussed several differing approaches to developing a standard data journalism workflow that could draw on such diverse fields as design thinking and narrative theory. Design thinking is essentially applying an iterative approach to finding your product (or question, in this case) and narrative theory informs specifically how we tell stories, which we all do, whether we are research scientists, journalists, data scientists or data journalists. These approaches, combined with the scientific process and methodologies of open source software development, provide fertile ground for development of a data journalism workflow. And all of us who have learnt from our mistakes from these disparate fields can contribute to help define an emerging field.
Reproducible data journalism: well, reproducible anything is a huge issue these days. If I read a piece of data journalism on FiveThirtyEight, how can I verify that its results, workflow and methods are correct? Or that the data was actually what it was claimed to be? As the scientific community is reeling from a reproducibility crisis, we're at a critical point where data journalism can avoid such crises by developing community standards, such as providing the code used to generate analyses. Reproducible data journalism is also critical in an age dominated by clickbait and fake news, an age in which many people don't know what or whom to believe. Jupyter notebooks, for example, are a great way to show such code, and hosting them on GitHub also allows anybody to view previous versions of the code and analysis. Questions that we need to consider:
- Is my data reproducible (what if I scraped it on a given day?)?
- Is my computational environment reproducible?
- Is my code reproducible?
- Is it all openly reproducible?
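One small, practical step towards answering the first two questions is to record what the data looked like and where the analysis ran at the moment it was done. The sketch below is only an illustration under stated assumptions: the helper names `build_manifest` and `matches`, the example URL and the sample data are all hypothetical, and a real project would also pin package versions.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def build_manifest(data: bytes, source: str) -> dict:
    """Describe a data snapshot and the environment it was analyzed in."""
    return {
        "source": source,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # A fingerprint of the exact bytes that were analyzed.
        "sha256": hashlib.sha256(data).hexdigest(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


def matches(manifest: dict, data: bytes) -> bool:
    """Check whether `data` is byte-for-byte what the manifest describes."""
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


# Hypothetical scraped data and URL, invented for illustration only.
snapshot = b"name,score\nalice,1\nbob,2\n"
manifest = build_manifest(snapshot, "https://example.com/scores.csv")
print(json.dumps(manifest, indent=2))
```

Publishing a file like this next to the notebook lets a reader check whether the data they download today is still the data that was scraped on the day of the analysis.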
Karlijn, in her talk, put me onto a piece called The Need for Openness in Data Journalism by Brian C. Keegan, which describes in detail these challenges (and more) facing data journalism, along with an unsuccessful attempt to reproduce a FiveThirtyEight article on gender biases in Hollywood.
One of the most interesting points Karlijn raised, which may seem a bit unsexy but is essential, is the development of data journalism authoring standards. The tech is currently way ahead of the press. There are several ways to turn computational notebooks and documents into websites (see Pelican, Jekyll and Hugo, for example) but the question remains: if I have produced a data journalism piece in a Jupyter notebook or as R Markdown, how do I submit it to the Upshot or FiveThirtyEight? For the record, this isn't something that scientific publishing as a whole has figured out yet.
All of this provides a fertile ground for a conversation between journalists, research scientists, designers, open source software developers and data scientists to explore the possible futures of data journalism.