This year at PyCon 2017, two keynotes were given by people from the scientific Python community: Dr Jake VanderPlas spoke on why scientists have embraced Python and why more of them should; Dr Kathryn Huff spoke on how the Python community can help the scientific community and why it should. Together, these two talks present a coherent view of how scientific research and Python programming complement each other, why many working scientists today would be best served by adopting Python, and how and why the Python community can effectively aid scientific research.
The Unexpected Effectiveness of Python in Scientific Computing
Jake VanderPlas is an astronomer at the eScience Institute at the University of Washington, Seattle. He is also active in the larger scientific Python community, having contributed to SciPy, scikit-learn and Altair, among other Python packages.
Most people envisage astronomers at the bases of giant telescopes, peering into lenses. Jake, however, assured us that most astronomers have never looked through a telescope in their professional lives; rather, they do most of their data gathering through database queries. But databases are really only the tip of the iceberg when it comes to astronomers' computational skills, which are necessitated by the tasks that they face, and Jake gave several illuminating examples. Take the following visualization of the TRAPPIST-1 exoplanetary system, for example:
Now it's tempting to think that this is an actual photorealistic image, but it is in fact an artist's interpretation, not the original data. The existence of the planet was inferred from its eclipsing of the star, using statistical modeling of such a system on the following and subsequent data (in the form of an image, data from Ethan Kruse):
How was all of this inference and statistical analysis done? In Python, of course, via an 'incredibly intricate statistical modeling of the system'. Moreover, finding these sorts of systems, Jake told us, 'comes down to writing statistical code in an intricate data analysis pipeline'.
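To give a flavor of what inferring a planet from an eclipse means, here is a toy sketch using only the standard library. It is not the analysis Jake described: the light curve is synthetic, the noise level and transit depth are made-up assumptions, and the code is told where the dip is rather than having to find it. It only illustrates the basic idea that a planet passing in front of its star dims the star by roughly the square of the ratio of their radii.

```python
import random
import statistics

random.seed(0)  # make the synthetic data reproducible

# All numbers below are illustrative assumptions, not real observations.
TRUE_DEPTH = 0.01                       # fractional dimming during the transit
N_POINTS = 200                          # number of brightness measurements
TRANSIT_START, TRANSIT_END = 90, 110    # assumed known transit window

# Build a synthetic light curve: flat brightness, a box-shaped dip, plus noise.
flux = []
for i in range(N_POINTS):
    in_window = TRANSIT_START <= i < TRANSIT_END
    baseline = 1.0 - (TRUE_DEPTH if in_window else 0.0)
    flux.append(baseline + random.gauss(0.0, 0.001))

# Estimate the depth as the difference between the typical brightness
# outside and inside the transit window (medians are robust to outliers).
in_transit = flux[TRANSIT_START:TRANSIT_END]
out_of_transit = flux[:TRANSIT_START] + flux[TRANSIT_END:]
estimated_depth = statistics.median(out_of_transit) - statistics.median(in_transit)

# Transit depth is approximately (planet radius / star radius) squared.
radius_ratio = estimated_depth ** 0.5
print(f"estimated depth: {estimated_depth:.4f}, radius ratio: {radius_ratio:.3f}")
```

A real pipeline also has to search for the period and timing of the dips and model the noise properly, which is where the 'intricate' part comes in.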
This is one of several scientific projects in astronomy that Jake showed us that use the Python programming language in their data processing and analytic pipelines. It is worth noting that these projects are not merely written in an open source programming language: they all have their code on GitHub. Examples include Kepler and JWST, the latter of which aims to detect gases in the atmospheres of planets around other stars, sniffing out their chemical composition in search of chemical signatures of other life-forms. These projects are in Python, hosted on GitHub and use Jupyter notebooks, among other tools.
Could this be anecdotal, that is, a handful of projects in Python that are exceptions rather than the norm? Well, not in astronomy. Jake had done an analysis that demonstrated that publications using Python have gone up steadily:
Check out that Python Hockey Stick! So we've ascertained that scientists use Python more and more, but why do they use Python?
The Argument for Python in Scientific Computing
Jake asked 'Why is Python such an effective tool in science?'
1. Interoperability with other languages.
Python as glue: historically, many working scientists have had awkward, complex and essentially ridiculous data pipelines to get them from experiment and/or data to communicable results, involving database querying, command line foo and specialized or proprietary software, through to data visualization tools. Jake VanderPlas quoted Dave Beazley, partly because Dave said it well:
'Scientists... work with a wide variety of systems ranging from simulation codes, data analysis packages, databases, visualization tools, and home-grown software, each of which presents the user with a different set of interfaces and file formats. As a result, a scientist may spend a considerable amount of time simply trying to get all these components to work together in some manner...'
And Python can interoperate with most of the tools you would ever want to use in scientific research: it binds things together, it's a glue. As Jake said, Python is used to 'glue together the hodge-podge of tools that people are working with; high-level Python syntax wraps these low-level C/Fortran libraries.'
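As a minimal sketch of the glue idea, the snippet below drives an external program, parses its text output and hands a summary on as JSON, all from one short script. The 'simulation' is a hypothetical stand-in (a tiny Python one-liner run as a separate process so that the example is self-contained); in a real pipeline that step would more likely be a compiled C or Fortran code, and the column names here are made up for illustration.

```python
import csv
import io
import json
import subprocess
import sys

# Hypothetical stand-in for a legacy simulation code: a separate process
# that prints CSV to standard output. A real pipeline would call a binary.
SIMULATION_SOURCE = (
    "print('time,flux')\n"
    "for t in range(5):\n"
    "    print(f'{t},{1.0 - 0.01 * t:.2f}')\n"
)


def run_simulation():
    """Run the external program and return its raw text output."""
    result = subprocess.run(
        [sys.executable, "-c", SIMULATION_SOURCE],
        capture_output=True,
        text=True,
        check=True,  # raise an error if the external program fails
    )
    return result.stdout


def parse_rows(raw_text):
    """Turn the program's CSV output into a list of typed records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{"time": int(r["time"]), "flux": float(r["flux"])} for r in reader]


rows = parse_rows(run_simulation())

# Hand the result on in a format the next tool in the chain can read.
summary = {"n_points": len(rows), "min_flux": min(r["flux"] for r in rows)}
print(json.dumps(summary))
```

Each step (running the tool, parsing its format, re-exporting) is a few lines, which is much of what makes Python attractive for stitching mismatched tools together.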
2. "Batteries Included" + third-party modules
Another reason that research scientists have flocked to Python (and why those that haven't "should") is that
- it has a lot of "batteries included" and
- for those that are not, there's a giant Scientific Python (SciPy) ecosystem available for use as third-party modules.
The "batteries included" allow you to scrape the web, launch web servers, work with file systems and databases, read JSON and much more!
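For instance, the following sketch touches three of those batteries (the file system, JSON and a database) without installing anything beyond Python itself. The measurements and table layout are invented purely for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Made-up measurements, purely for illustration.
measurements = [
    {"star": "A", "flux": 0.98},
    {"star": "A", "flux": 1.02},
    {"star": "B", "flux": 0.50},
]

# File system + JSON: write the data to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = Path(tmp_dir) / "measurements.json"
    data_file.write_text(json.dumps(measurements))
    loaded = json.loads(data_file.read_text())

# Database: load the records into an in-memory SQLite table and query it.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE flux (star TEXT, value REAL)")
connection.executemany("INSERT INTO flux VALUES (:star, :flux)", loaded)
rows = connection.execute(
    "SELECT star, COUNT(*) FROM flux GROUP BY star ORDER BY star"
).fetchall()
connection.close()

print(rows)  # number of measurements per star
```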
The "third-party" modules are where things get really exciting, though. Examples include, but definitely are not limited to:
- NumPy for array-computing;
- IPython and Jupyter for an interactive development environment on top of Python, as well as a means to write scientific documents and to communicate and collaborate on research/results;
- Numba and Dask for scaling and distributed computing;
- Pandas for DataFrames;
- Matplotlib and Bokeh (for example) for data visualization (there's an entire ecosystem of dataviz libraries in Python itself);
- Scikit-learn for Machine Learning;
- NetworkX for creating, manipulating and studying complex networks;
- Scikit-image for image processing;
- PyMC for Markov Chain Monte Carlo;
- The list goes on.
As Jake said, "if you have a problem to solve, you can most likely find a library to help and it's probably on GitHub!"
3. Simplicity and Dynamic Nature
Python is relatively easy to write, especially compared with languages such as C. As Jake said, 'In Python, particularly when coming from, for example, C, you write what you want to happen and it happens; it is kind of like writing executable pseudo-code'.
He also quoted Perry Greenfield, who said 'Python is a language that is very powerful for developers, but is also accessible to Astronomers. Getting those two classes of people using the same tools, I think, provides a huge benefit that is not always noticed or mentioned.'
Due to tools like IPython and Jupyter notebooks, for example, there is a wonderful opportunity for dynamic coding and exploratory, iterative data analysis on the fly. This results in a low barrier to entry because scientific coding is itself nonlinear and exploratory. Jake pointed out that some may complain or feel that these tools are slow, but for the purposes of scientific research, it is the speed of development that is key, not the speed of execution, rendering these tools ideal.
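To illustrate the 'executable pseudo-code' point, here is a small sketch with made-up readings and an arbitrary cutoff: keep the plausible measurements, then average them. The code reads almost exactly like that sentence.

```python
# Made-up instrument readings; 12.7 is an obvious glitch.
measurements = [4.1, 3.9, 12.7, 4.0, 4.2]

# Keep only the readings below an (assumed) plausibility cutoff of 10.
plausible = [m for m in measurements if m < 10]

# Average what is left.
average = sum(plausible) / len(plausible)
print(f"average of {len(plausible)} plausible readings: {average:.2f}")
```

In a notebook, each of these lines can be run and inspected on its own, which is the exploratory style described above.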
4. Open ethos well-suited to science
Open source software has an ethos well-suited to the required openness of scientific research (in fact, perhaps more so than much scientific research). We'll get to this in more detail when discussing Katy Huff's keynote, but science does not merely occur in the papers published: science occurs in the result, the instructions/protocol and the ability to reproduce the result, and this is exactly what open source software development seeks to provide. Jake provided the following quote from Buckheit and Donoho (1995, paraphrasing Jon Claerbout):
'An article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.' (Buckheit and Donoho, 1995)
What open source software and the ethos of the Python (scientific and otherwise) ecosystem provide is a framework that will hopefully help us out of the scientific reproducibility crisis: 'solving reproducibility is about open science', and the collaborative, peer-reviewed, GitHub-publishable nature of scientific research in the Python ecosystem is key to this openness in science. If I want to check out how the gravitational waves (that Einstein predicted) were detected, a discovery announced in 2016, all I need to do is go to the LIGO repository. Or if you're one of an increasing number of astronomers who use the Astropy package for their research, you can find the code and the community that produces it online; the fact that it is so open is a direct result of its community modeling the open source and open science ethos that they found in the greater Python community. Check out the table for what Jake (adapted from Perry Greenfield) feels that the astronomy community has learnt from the Python community:
Do It For Science
"I'm here to harness your power for my own devices" Dr. Kathryn Huff told the PyCon community and she went on to provide a compelling argument as to how and why the Python programming community at large can help science save the world. Her keynote provided a detailed and empassioned argument as to why open source software development, in general, and Python, in particular, has proved so effective in Scientific Computing and scientific research and practice as a whole. It went a step further and showed a multitude of places in which the Python programming and development community at large could use their Python skills to help the scientific community and to save the world. If you think water is a serious issue for humanity, or health or energy, Kathryn had projects and GitHub repositories with open issues at the ready that you could jump into to help. She gave dozens of examples of open projects and packages with issues that you could contribute to.
Kathryn (Katy) is a nuclear engineer and a long-time contributor to the SciPy community. She is involved in the Journal of Open Source Software and devotes a lot of her time to The Hacker Within and to Software Carpentry and Data Carpentry, which help scientists get better at using computers to make discoveries, improving both the reproducibility of their work and the ease with which they use computers.
Katy's mission is to improve the way that nuclear energy is produced; and what's her tool of choice? Python! Before delving into the manifold ways in which the Python community can help the scientific research community, Katy provided a strong argument as to why this is actually necessary. Katy quoted R. K. Merton (1945) in describing science as requiring rigorous, structured, community scrutiny (a great working definition of what organized skepticism would look like); these are all inarguably qualities of the open source Python community.
Katy brought the importance of this home when she enumerated the basic tenets of science and made clear that Open Source Software development has done a better job than any community of keeping to these tenets since the Pythagorean Era (6th Century BC). These basic tenets of science are:
- peer review
Computers "should" help scientists to do the following:
What is, then, the challenge, in getting more scientists to use open source software such as Python so that computers can help them?
- Scientists aren't trained to use computers effectively;
Thus Katy's call to arms to the Python community:
"I would like to involve you".
And such rallying is another reason for working research scientists to use Python: there is a huge community of open source programmers at their disposal. Katy, as stated, is involved in another approach, which is to educate working scientists in using computers for their work. There are several such programs. I would urge all working scientists to check out Software Carpentry and Data Carpentry to see if they may be able to help your group, department or university (answer: if you compute, they most likely can).
So how does Dr Katy Huff suggest programmers help? In four simple steps:
- Be curious;
- Pick a project;
- Contribute to science;
- Save the World.
Why Python For Research?
- The Python ecosystem contains a mature and still-evolving assemblage of computational tools for working scientists.
- Scientists are flocking to Python in larger numbers due to Python's
  - Interoperability with other languages;
  - "Batteries included" and third-party modules for nearly everything you would want to do;
  - Simplicity and dynamic nature;
  - Open ethos, which is well-suited to science.
- The Scientific Python community is growing and the libraries are constantly being developed; the community is open;
- The Open Source development community keeps to the basic tenets of scientific research more so than any other community; these tenets are:
  - peer review
- There is a wider community of Python programmers who are not scientists and want to contribute to the impact of Python on science and the world and issues that they believe in; to twist the words of Perry Greenfield ever so slightly,
'Python is a language that is very powerful for developers, but is also accessible to [scientists]. Getting those two classes of people using the same tools, I think, provides a huge benefit that is not always noticed or mentioned'.
If you have any thoughts, responses and/or ruminations, feel free to reach out to me on Twitter: @hugobowne.