
Data Science and Ecology (Transcript)

The intersection of data science and ecology and the adoption of techniques such as machine learning in academic research.

Here is a link to the podcast.

Introducing Christie Bahlai

Hugo: Hi there, Christie, and welcome to DataFramed.

Christie: Thanks for having me, Hugo.

Hugo: It's a real pleasure to have you on the show. We're here today to talk about data science, ecology, open science, what the data deluge really means for working scientists and data scientists. Before we get into all of that, I'd love to find out a bit about you. Can you tell us what you do?

Christie: I am an Assistant Professor at Kent State University. I am coming up to my one year professor-versary, I guess.

Hugo: Congratulations.

Christie: Thank you very much. I started August 20th last year, but before that, I was a post-doc at Michigan State and a fellow at Mozilla. I'm interested in long term patterns in ecosystems, so how ecosystems are behaving and how their processes are unfolding over time. I mainly look at this through the lens of insect ecology, particularly their population dynamics. The insects that I'm most interested in are the ones that do things in the ecosystem. I do a lot of work with predators, so insects that are involved in pest control, and pollinators, so insects like bees that pollinate plants and help with food production.

Christie: I mostly work in human managed systems, so agriculture, park systems, managed forests, urban landscapes. Anywhere where humans are interacting with the environment and need to get things back out of the environment.

What is ecology?

Hugo: Fantastic. Just for clarity for me, ecology, the general discipline, is about the study of these types of ecosystems?

Christie: Ecology is the study of any ecosystem. Anywhere there is something living, you can study ecology. Ecology is, of course, not just about the living part of the system. It's also about the environment that the living things are existing in. There's a big part of ecology that's devoted to biogeochemistry, for instance. That is the chemical processes in the environment that are affecting the organisms, and that sort of thing. I'm mostly into organismal ecology, so I focus on the living things.

Hugo: Great. When you said you were interested in long term patterns, this type of looking at patterns, pattern recognition, identifying patterns, this sounds ripe for all the tools that data science has developed in the past decades.

Christie: It really is. I accidentally stumbled into long term ecology. I ended up with a post-doc at the long term ecological research site at Michigan State University, the Kellogg Biological Station, in southwestern Michigan. I hadn't even considered that it was possible to do ecology this way before I arrived there. Suddenly, I was looking at the data they had been amassing since the late '80s at this one site where they had been intensively sampling the landscape. It just clicked with me. It wasn't just by going out into an ecosystem that we could study it; we could also study it from the signature it had left in the data we'd been collecting about it over time. I thought that was really exciting. My background, I actually started out in physics. My undergrad was in physics, and it seemed like a good idea at the time. I got my first lab job in a biology lab, and I switched over; I became a field ecologist in earnest. Working in long term ecology allowed me to apply a lot of the math and modeling that I'd learned in physics to biological systems.

How did you become interested in data science?

Hugo: Great. What happened then in your career in order to get you really interested in data science?

Christie: A big part of it was the fact that I was the biologist who had the physics background, and so people would always ask me for help with their models and their math and their stats. People would send me data sets and say, "Hey, could you help me run an analysis on this?" I started realizing that the human element in data in general was problematic. In physics, most of the data that you get is created by machines, and so it's automatically machine readable. When you have biologists creating data, you have the human element and everyone's hacking out their own way of handling the data. They open a spreadsheet and they design their own way of entering the data. That's how they think of the data. They think of the spreadsheet more as a lab notebook than as a way of recording data for posterity, for future applications.

Hugo: There's not even necessarily a system of best practices around this-

Christie: Yeah, exactly.

Hugo: Data entry task, right?

Christie: Yeah. Especially with biologists. Biologists are taught to think in the ecosystem, and then they're set loose on their data and told, "Okay, do this." They don't really build that bridge. I noticed that there's a vast amount of data in biology, and it's very inconsistently kept. I saw this as an opportunity, because think of how much more we could learn about ecosystems if we could really effectively compare the notes of biologists. Bring together all their data in a meaningful way, rather than stacking a pile of notebooks together. Making it a synthesis, rather than a collection.
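One way to picture the spreadsheet problem Christie describes is the difference between notebook-style "wide" records and machine-readable "long" records, one observation per row. A minimal Python sketch (the species names and counts are purely hypothetical, not from the episode):

```python
# Hypothetical field counts kept two ways.
# "Notebook style": one column per year. Easy for a human to scan,
# hard for a machine to combine with anyone else's spreadsheet.
wide = {
    "ladybeetle": {"2016": 41, "2017": 37},
    "lacewing":   {"2016": 12, "2017": 19},
}

# "Long/tidy style": one observation per row, with explicit variable
# names. This stacks trivially with other people's records.
long_rows = [
    {"species": species, "year": int(year), "count": count}
    for species, counts in wide.items()
    for year, count in counts.items()
]

for row in sorted(long_rows, key=lambda r: (r["species"], r["year"])):
    print(row)
```

The point of the long form is exactly the "synthesis rather than collection" Christie mentions: once every lab's data has the same row shape, concatenating data sets is a one-line operation instead of a manual untangling of each spreadsheet's layout.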

Hugo: Exactly. Then my recollection is, you started your blog. Then you started working with Data Carpentry and Software Carpentry, right?

Christie: Yes, exactly. I noticed that there were some trends in how people were handling data, and I admit, some of them got me a little bit annoyed and made me feel heated about it, so I decided this was an opportunity to create a blog where I could teach biologists how to better handle their data, so that computers can read it better and so that other people can read it better. At the time, I was operating on my own. I had no idea that there was a community. It was actually Data Carpentry and Software Carpentry that found me. Greg Wilson reached out and said, "Hey, I think we've got some ideas in common," when he started reading my blog. That's how I got pulled into the community. Tracy Teal, who is Executive Director of the Carpentries now, was at Michigan State at the time while I was there, and so she took me for coffee and explained what this stuff was. I went, "Whoa. There is a community around this." I had no idea.

Christie: I attended my first hackathon, where we developed lesson material for Data Carpentry, and the rest is history. I just started really embracing the community and started doing a lot more outreach, got involved with Mozilla, and now I am as much a data scientist as an ecologist.

Hugo: Fantastic. My recollection is that Tracy Teal also has worked in ecology. Is that right?

Christie: Yes. A kind of funny thing, we were both partially appointed to a project at Michigan State called The Great Lakes Bioenergy Research Project, where we were looking at ecosystem responses to biomass crops that could be used in bioenergy production. She was more on the microbial end of things, and I was looking at the insect population. We were in different labs, so we didn't know each other.

The End of Theory

Hugo: Just on a side note, I actually taught my first Data Carpentry workshop last weekend at Cold Spring Harbor Labs with Jason Williams, which was really exciting. It was cloud computing for genomics in R and some machine learning in R. That was a real treat. Moving on: a decade ago, Wired published a relatively provocative article by Chris Anderson titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." In your mind, is this title correct, or will it be correct?

Christie: I love this article because I can use it to provoke so much conversation with people. It's really nice because it was 10 years ago that this was published, and so we know that this has not come true. There are still theoreticians out there, so theory isn't dead yet. The scientific method is going strong, but the data deluge is changing how we look at science. The scientific method as classically taught comes from this mindset of data scarcity. From a small set of observations, can we observe the patterns that are predicted by theory? Data science turns this on its head. From vast data sources, can we use patterns to infer theory and mechanism, or predict the behavior of a system?

Christie: From a classical standpoint, from a theoretical standpoint, data science is almost heretical. It is going at things completely backwards, but when you think about it, if you're trying to model something based on a small sliver of what can be known about the system, this is sort of what classical science is based on. You sample a population. When your sample is basically the population, you don't have to guess what the patterns of the whole population are. You can say, "This is what the patterns of the whole population are, and now let's go back and see if we can find the drivers."

Hugo: For sure. Essentially, what's happening there is that we've been, due to cultural reasons essentially, we believe or we utilize this hypothetico-deductive model of science, which essentially is formulating a hypothesis that's hopefully falsifiable, and then using experiments to test whether it looks true or not, right?

Christie: Yeah.

Hugo: Culturally, we've accepted that method, or been forced to use that method due to the scarcity of data. I think as a culture, we've almost adopted the form of belief that this is the only way of doing science, right?

Christie: Yeah, absolutely. It's kind of funny, I think that the hypothetico-deductive model has actually affected how we write about science, how we perform science, to the point where we're almost trying to shoehorn our way back into the hypothetico-deductive model even when our discoveries are not hypothetico-deductively derived. For example, a lot of graduate student work is really, really exploratory. The students don't necessarily have a hypothesis; they are just trying to figure out how the system works. They go out and they make observations, and then there's this pressure on the students to do a statistical test and show that there's some sort of statistical difference. Then you see students applying these post-hoc hypotheses, but they write it in their papers as if they had the hypothesis at the beginning, because their advisors will write in the margins of their papers, "Where is your hypothesis?" The thing is, they didn't really have one, and that's okay. They went out and they discovered. They observed a trend, and they don't necessarily have a hypothesis about it, but they are developing hypotheses. It's more of a hypothesis generating exercise that they're undergoing.

Hugo: The fact that their advisor will write in the margin of their draft, "Frame the story in this way," is due to the way publication and getting your paper accepted by journals is incentivized as well, right?

Christie: Exactly. There is also this cultural element of scientists need to be right. Everybody needs to be right. Am I right? Sorry.

Hugo: Definitely.

Christie: There is the "Oh, well, I set out to support this hypothesis, and look, my data supports that hypothesis." In a lot of cases, people are going back and changing their hypotheses to fit the data just because it fits the model of publication better, which I think is kind of silly, really. We're not necessarily learning more by doing that; maybe we should just accept that it was purely exploratory work.

Hugo: For sure. I love that you framed it in terms of we've been ultimately shoehorned by the hypothetico-deductive model, or scientific method itself, by trying to fit everything that we do into these constraints. We're not telling the entire truth of our process and the story. I think this is particularly applicable to data science because as you've said, data science, machine learning, these are essentially very pattern-based processes and methodologies, as opposed to the hypothetico-deductive model. I think that's probably why there's a lot of pushback from the scientific community on data science as a whole, because it doesn't conform to what the scientific community has been doing due to scarcity of data for centuries.

Christie: Yeah, exactly. Scientists are creatures of habit, and they, I'm going to go there.

Hugo: Let's do it.

Christie: Scientists-

Hugo: I'm ready.

Christie: Are elitist, and so they think that the way that they were taught and the way that they are approaching problems, and that they have learned to succeed, is the only way to be successful, and the only way to approach things. I think that is a symptom of the lack of diversity in science, to be perfectly honest. When you have one way of approaching a problem that works for the group of people who are in power, they reinforce this way of thinking. There's a lot of benefit to approaching problems from a diversity of perspectives. I think that data science is a literal diversifying of how we're thinking about scientific problems.

Differences in Research Methods

Hugo: Absolutely. Can you speak a bit more to the tension between traditional research methods and approaches, and those that are emerging now? Perhaps with particular attention to ecology.

Christie: I'll give you a little bit of history on ecology, because the thing is, ecology is a really new science. The ESA, the Ecological Society of America, was founded in 1915. The British Ecological Society was founded in 1913; it bills itself as the oldest ecological society in the world. Each of them started publishing journals soon after: Journal of Ecology, from the British Ecological Society, in 1913, and Ecology, from the ESA, in 1920. This is when ecology was becoming a thing. It was becoming its own science. Just to give you an example, there's no Nobel Prize in ecology, because it really wasn't perceived as a science when the Nobel was being formed as an award for great science.

Christie: This is a new thing, and almost all of the culture of ecology has evolved in the 20th century. There was also a little bit of marginalization of ecology, from the perspective of the quote unquote "hard sciences," during that time. As late as the 1960s, you'd hear the famous quote attributed to Ernest Rutherford that "All science is either physics or stamp collecting." That gave ecology an inferiority complex. It's like, "No, we're not stamp collectors. We're a real science. We're studying patterns and processes." It became this cultural thing that, to prove itself, ecology had to be quantitative to a fault.

Christie: We embraced statistics. We embraced theory. We wanted to apply an equation to everything. In the time after the second World War ... Ecology slowed down a bit during the Great Depression and the wars. Then, we see this huge expansion of the world university systems after the second World War. Suddenly, there's this massive financial investment in university systems, and ecology is a science, and it has a place. This is the cultural element around when ecology was becoming a recognized science. At the same time, the hypothetico-deductive model was the predominant method for scientific inquiry, and Fisherian statistics, the null hypothesis significance test, was the canon of the time for statistics.

Christie: We get this conflation of the hypothetico-deductive model and hypothesis testing statistics: the concept of the statistical hypothesis, where you reject a null to support your statistical hypothesis. The null would be that there are no differences between two groups, for instance. If you see no differences, you accept the null; if there are differences between the two groups, you'd reject the null. This idea of that as a hypothesis test, and the hypothetico-deductive model, I think got really linked in the minds of a lot of ecologists.

Hugo: This is something that has stuck around even until now, and had a huge impact on 20th century science. Perhaps the hypothetico-deductive model and Fisherian statistics, combined, are two of the things that have led to the reproducibility crisis we're in now, right?

Christie: Yeah, absolutely. You can't publish a paper very easily if you have no results, that is, no statistically significant results. It's become this cultural thing where a graduate student throws away data when they don't see a pattern at the .05 level of statistical significance, which is sort of the canon. The information is still there, and there are probably other ways to detect the patterns. It's almost funny at this point. I was recently reviewing a paper where they had a statistical test between every two numbers they presented, to the point where it was meaningless. It's like, we have a significantly different number of apples and oranges. Why are we comparing them? You're just counting apples and oranges. Why does it matter if they're statistically significantly different? It's kind of frustrating. It's almost a smoke and mirrors attempt to make the science look more quantitative than it is.

Hugo: It almost sounds like a parody in some ways, right?

Christie: Yeah. I was saying to a friend, it's like, "Do we have a statistically significant different number of stairs in our houses, or do we just have a different number of stairs in our houses?"
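A related point behind the joke is that statistical significance and practical meaning can come apart. A small stdlib-only sketch (all numbers hypothetical, not from the episode): with large enough samples, even a biologically trivial difference in means comes out "significant" under a simple permutation test.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def perm_test_pvalue(a, b, n_perm=500):
    """Two-sided permutation test on the difference in means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_perm

# Two "populations" whose true means differ by a trivial 0.2 units,
# sampled heavily (2000 observations each).
a = [random.gauss(10.0, 1.0) for _ in range(2000)]
b = [random.gauss(10.2, 1.0) for _ in range(2000)]

p = perm_test_pvalue(a, b)
print(f"difference in means: {sum(b)/len(b) - sum(a)/len(a):.3f}, p = {p:.4f}")
```

The test will report a tiny p-value, yet the effect itself is negligible, which is the "quantitative to a fault" trap: the p-value answers "is there any difference at all given this much data?", not "is the difference worth caring about?".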

Hugo: What this really means, as we were hinting at earlier, is cultural. I think taking this cultural perspective is actually very telling, because it makes it directly clear why data science, and people who are interested in exploratory, pattern-driven research, are marginalized, right?

Christie: Mm-hmm (affirmative). Absolutely. I've had "Where is your hypothesis?" written in the margins of more than one of my manuscripts. It's really telling that we have this really rigid view of how we can interpret data. I worry that scientists will get left behind, because other organizations, other agencies, other aspects of society are fully embracing data science. For instance, we see that data science is highly effective at figuring out people on the Internet, citizens on the Internet: their political affiliations, their shopping preferences, everything. Data science is what led to Facebook advertising to me a sofa that I really like. This started happening not long after I got my assistant professor job. Facebook must know that my income went up slightly, and I bought a house, and suddenly it was ads for this sofa that I really, really liked, again and again and again.

Christie: Facebook knows something about me that lets it target advertising. I am, I would say, a complex system. They don't necessarily need a hypothesis to say, "Christie will probably like this sofa." They just know that people like me probably will like that sofa, and if they show it to me again and again and again, they might wear me down and I'll buy it.

Hugo: Exactly. They've identified that pattern using a scientific model, which is not necessarily mechanistic. They don't know why, or anything along those lines. It doesn't explain the reason, which science historically has been very interested in, but it can let them know that this pattern exists in order to target you.

Christie: Exactly. Why couldn't scientists apply this to systems where we need to know an action, but we don't necessarily need to know a mechanism? A lot of conservation biology is like that. We need an answer before we can fully understand the system. If we're planting trees, we need to know what is the best practice for planting those trees that it's going to make them survive in a depleted landscape. We need to know the patterns and be able to capitalize on the information that we have in a lot of restored systems. If there's an algal bloom in Lake Erie, this is something that is constantly on our mind here in the Cleveland area. We need to know what to do to act, rather than necessarily the mechanism. The mechanism is excellent and important too, but if we need to keep the tap water clean, we need to act right away. We can do that with pattern-driven reasoning.

Hugo: The statement essentially is that all of these pattern-based techniques are part of science. Science isn't necessarily the exclusive realm of mechanistic modeling.

Christie: Exactly. It's all science. It's all ways of knowing, and all ways of knowing how to better our world and how to push the human condition forward, if you will. I tend to take a pretty applied view of science, in that I don't think science is necessarily just an intellectual pursuit. I think it is a social and political and human pursuit.

What can data science solve in ecology?

Hugo: Absolutely. That's part of the reason you're interested in a lot of what happens culturally, in terms of the methods that we all use: it allows you to think about the human impact. I know you're a serious proponent of open science and the effects that can have on this bettering of the world. We'll get to that in a second, but before that, I'd like to pivot back to ecology briefly and hear your thoughts on the biggest challenges in ecology today that data science can help solve.

Christie: That is a hard one, because there's just so much going on. I think that climate change is a huge multifaceted problem. Climate change is not just global warming, and that's the thing. Understanding how it's going to play out in different ecosystems is this incredible multilayered problem: we essentially have to layer lots and lots of data sources to understand how patterns play out in very different ecosystems. You can't just build a model, say "Turn up the heat," and get a reasonable prediction. There are so many layers. Humidity is affected. The soil quality plays into how organisms respond to heat. There are all these different layers, and every landscape is different. I think that data science is really crucial to preparing for the impacts of climate change.

Hugo: One of the things that you're speaking to there is the amount of data we have and the techniques, in terms of actually scaling our techniques to deal with the amount of data. Also, the heterogeneity in the data, that we have data sources that are incredibly different types, and figuring out how to join and combine them in sensible ways.

Christie: Absolutely. For instance, I have colleagues who are working on monarch butterfly populations. You might have a survey of monarch larvae that's done in Illinois, and you might have information on monarch migration at Point Pelee in Ontario, and then you could have remote sensing data for sensing how prairie patches are doing where there are milkweed hosts. The remote sensing data available at their stopover sites in Texas, and then information coming out of Mexico. You get this incredible multi layered, multidimensional data set that doesn't necessarily easily mix together. We need advanced techniques that bring it together.

Hugo: What about the idea of scaling and aggregating results in the literature? I'm not an ecologist, but if I went and tried to find out a bunch about ecology, there would be all types of different results from different labs, from different people in each lab. How are people actually using data science to get a sense of the story as a whole?

Christie: This is actually something that I'm planning on working on: capitalizing on some long term data. I've got a grant in review about this right now. The classic ecological study is a three year study. I actually did a search and found that, most commonly, ecology is done on about a three year time scale. There are two year studies and four year studies, but the reason for that unit of time is that it's about as long as a PhD student can get in the field and collect meaningful data. You have your three to seven year PhD programs, and the longer ones are usually because there was a disaster of a field season one of the years. The best PhD programs tend to have about three years of data associated with them.

Christie: When you think about ecological processes, three years is just a snapshot. For instance, in southwest Michigan, the fireflies tend to follow a six year population cycle. What happens if a student studies them for the three years the population is going up, or the three years it's going down? They'd reach really dramatically different conclusions. My plan is to take long term data sets and reanalyze them as if they were three year studies, and see if we can find any characteristics of those three year studies that would tell us how often they're wrong, how often they're bucking the longer trend in the system.
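Christie's reanalysis idea can be sketched on a toy series (all numbers hypothetical, not real firefly data): a population with a flat long-term mean and a six-year cycle, chopped into every possible three-year "study". Windows that happen to start on the upswing conclude the population is rising; windows on the downswing conclude it is crashing; the long-term trend is essentially flat.

```python
import math

# Hypothetical firefly-style series: a flat long-term mean of 100 plus
# a six-year cycle of amplitude 30, sampled annually for 24 years.
years = list(range(1990, 2014))
counts = [100 + 30 * math.sin(2 * math.pi * (y - 1990) / 6) for y in years]

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Reanalyze the long-term record as every possible three-year "PhD study"
# and record the trend each short study would report.
three_year_trends = [
    (years[i], slope(years[i:i + 3], counts[i:i + 3]))
    for i in range(len(years) - 2)
]

increasing = sum(1 for _, s in three_year_trends if s > 0)
decreasing = sum(1 for _, s in three_year_trends if s < 0)
print(f"{increasing} windows say 'increasing', {decreasing} say 'decreasing'")
print(f"long-term slope: {slope(years, counts):.2f} per year")
```

On this toy series, a substantial fraction of the three-year windows confidently report a trend in each direction, even though the true long-term trend is near zero, which is exactly the "bucking the longer trend" failure mode the proposed reanalysis would try to characterize in real data sets.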

Open Science

Hugo: I want to now move to something we've been circling around a lot, which is open science. I know that you're a serious proponent of open science. I'm wondering generally, what does open science mean to you?

Christie: Open science is this really multilayered, multifaceted thing, but I like to distill it into something very simple. It's the movement to put as much of the scientific process as possible into the hands of the interested parties. For instance, a member of the public might be interested in a particular scientific paper, and if it's not open access, which is a facet of open science, they wouldn't be able to access it unless they had access to a university library, so taxpayer funded science is actually behind a paywall. They'd have to pay $39 to get access to it, or some number similar to that.

Christie: There's also the fact that science has traditionally been published in units of the paper, rather than the whole process that went into the paper. The paper is the end product of the science, a summary document at the end that says, "Wrapping this all up, this is what we found." The thing is, with science, there's so much involved in the process that we haven't traditionally been able to publish it with the paper, so our raw data sets weren't included. We now use analysis code to do a lot of our analysis in the more quantitative sciences. We can make that available too, now that we have web technology. There are all of these layers that you can have that essentially make it easier for other scientists and other people to replicate science, easier for them to understand science, and easier for them to access aspects of the science through its process.

Christie: It's got lots of layers. There's the whole openness and inclusion aspect of things, where we essentially say that traditionally, science has been a rich, white, European descent man's game. We want to bring more people in, and we have to specifically address barriers to how people of the world access science. Then there's policy: essentially, making the rules state that we have to make our publicly funded science available.

Christie: Then there's a huge aspect associated with technology. Without the web, open science would be a lot slower. I wouldn't say it wouldn't be possible, but it would be a lot slower. Through the web, we can essentially access information of almost any kind from anywhere. There's a lot of technology development to foster different parts of science through the open framework.

Christie: Then there's open education, which is making educational materials, making learning materials available to people all over the world. That's open textbooks, open course materials, and that sort of thing.

Hugo: I think framing it with these points is really interesting, and to me, it seems like a no brainer in a lot of ways that this is definitely the direction we need to be moving in. I think people might be surprised how much pushback there is from within the scientific community towards open science. We spoke about the generational nature of this: the people on the National Academy of Sciences maybe didn't need open science to be successful, so there's perhaps pushback there. There's pushback in a lot of other places. I'm just wondering, with respect to pushback and otherwise, what are the biggest challenges facing open science in your mind?

Christie: I'd say that it's almost all a cultural thing, just as you were saying. The very same elements that look at data science with skepticism often regard open science with the same sort of skepticism, and for the same reasons. We are in a state of late stage capitalism, and resources are scarce, and we have to compete for them. This is very true in science, where we have decreasing rates of grant funding for our work. I'd say that American science is a high stakes, high rewards game, so our job security and our grant money are deeply connected to this idea of being the best scientist, science-ing the hardest, as I like to say. You actually create a tragedy of the commons: the idea is that open science places its practitioners at a disadvantage in this environment. If you have a system that essentially rewards the most selfish, most self-preserving practices in science, then spending time on a more communal practice actually puts you at a disadvantage, through opportunity cost and also through the alienation of others. People are distrustful of others who don't share their value system. I've had some side eyes when I say, "Can we publish a pre-print on our work? Can we aim for an open access journal?" Because people don't necessarily trust those practices. People don't want to share their data; they're worried about their data being used by others. They don't want to share their code because they're worried about it being used by others without proper credit, or having their code used against them: someone finds an error in their code, and then they're exposed as a fraud as a scientist, rather than treated as a human who made a mistake.

Hugo: Just quickly, people outside academic research may not realize how real a thing scooping is, right?

Christie: Yeah.

Hugo: Literally, people will be protective of their results so that other working scientists don't scoop them and publish that result before them.

Christie: Yeah. I'd say that the real risk of scooping does vary with field, and it's not as dramatic as people frame it, but it's part of this oral tradition in science. I remember grad students taking me aside early on in my career and telling me to guard my ideas, guard my data, because if someone scooped me, I'd be up a creek.

Hugo: The cost is so high.

Christie: Exactly.

Hugo: The risk may not be high, but the cost is high.

Christie: Exactly. Especially for young researchers: having an experiment fail, for a Masters student or an early PhD student, can be career altering or ending. For an established scientist, it's less risky. However, it's the established scientists that are pushing these ideals down to the less established scientists. This is where I turn this on its head a little bit, because I think that open science can also be its own worst enemy at times. With every new movement, there are, shall we say, enthusiastic adopters who become really enamored by the purity of the ideology. It can, in turn, become competitive even within open science to be the most open, essentially. It's like a purity test. They get so focused on being the most open ever that they alienate people who are just trying to learn to adopt open practices, because when someone takes one step, they're essentially told, "You're not open enough." Or when people feel that it would be too much of a risk to them, they just don't listen to those concerns. They're not necessarily helping the people that they're trying to help.

Christie: For example, I see a lot of people pushing technological solutions to problems that people outside the open science community don't see as problems. One of the common traits of people in the open science community is that they tend to be very pro-technology. Especially in biology, you have people who are a little bit more technology averse, and so when you propose a technological solution to a problem that people don't perceive as a problem, you get pushback of, "You're trying to push me to do something I'm uncomfortable with," or "You're trying to waste my time." This is something that the open science community needs to do a little bit of soul searching on, to see how we can better initiate new members.

Hugo: As you've been talking about all these aspects of open science and data science in academic research, there are so many resonances with what happens in industry. I was wondering what similarities you see between data science in industry and academia, particularly with respect to openness in science, kindness in science, and the lack thereof.

Christie: I think that both data science and open science are movements that can make things better and can benefit from each other. For instance, data science really can't exist without open science, because data science relies so heavily on data integration, and where's that data going to come from, if not from people being open with their data? I think we need to be really mindful about how we approach both data science and open science, to make them kind, to make them positive humanist pursuits that aren't used for evil. I used the example of the sofa earlier on. That's an effective use of data science, but is it an ethical one? That's open to interpretation.

Christie: Open science is a movement that's designed for accessibility and inclusion, but it's subject to the same cliquishness and exclusionary practices, and micro and macro aggressions, that greater society is subject to. Data science has been applied in all sorts of ways, so we won't go into it too much, but it's been used to manipulate people into voting a certain way and buying sofas they don't need. That's not in the public good, and it's not kind. At the core of both data science and open science is that we need good people taking leadership roles in both fields, and advancing both in humanist, kind, inclusive directions.

Hugo: The example of the sofa is interesting, because a lot of it is due to exposure. We've said the word sofa so many times in the past 40 minutes that I feel like I need a new one, essentially.

Christie: I know, right? This has worn me down. I didn't even want to get rid of my old sofa. It's fine. I have young children. I should probably wait until they stop spilling things before I get a new sofa.

Hugo: Or you could just keep getting new sofas.

Christie: That is a better long term plan. Thank you, late stage capitalism.

Future of Data Science

Hugo: Thank you very much. What does the future of data science look like to you?

Christie: That's a really hard one. I don't honestly know. I feel like we're going to be able to make better decisions, but I think we're at a crossroads in society about how things are going to be regulated. The whole question of Facebook trying to become kind has been of interest to me, because they are probably one of the greatest engines of data science known to humanity, and they are trying to change their practices, at least outwardly, or at least they're using very effective methods to try and convince me that they're changing their practices.

Christie: I honestly don't know. I feel like we can advance science dramatically using data science techniques, but I also feel that it's going to come down to the humans who take up the practices.

Favorite Data Science Techniques

Hugo: Agreed. We haven't talked a lot about technical stuff. We've stepped back quite a bit. I'm just wondering what one of your favorite data science-y techniques and/or methodologies is.

Christie: My very favorite tool is OpenRefine. OpenRefine came out of Google, and it is a tool for cleaning dirty, messy data in ways that are so powerful and also so accessible. What you do is load your data, your spreadsheet, directly into it, and it examines the data. It looks for commonalities. It examines it for typos. You can essentially find problems with your data that would otherwise be needle-in-a-haystack issues, problems you wouldn't find out about until way down the line, if ever. It's one of those tools that I use in teaching a lot, because not only does it help you fix the problems, it helps you teach the students what the problems really are.

Christie: The example that I use in my class is, imagine someone did a survey of bees in a crop. They collected all the specimens, went back and I.D.-ed them, and then had their assistant enter what species each specimen was. You've got thousands of bees being entered in the spreadsheet, and your assistant is going to make errors, but how are you going to find those errors in your spreadsheet to find out how many species you have? OpenRefine locates things that are likely typos, and it brings them up to you and says-

Hugo: Oh, cool.

Christie: "Is this an issue? Might these be the same species?" You could do this before through more command-line-y, higher-order processes, but OpenRefine provides a really graphical interface that just allows you to see where the problems are. It allows you to highlight to students, "This would be a problem if we were counting the number of species in this sample. If we've spelled our European honey bee, Apis mellifera, several different ways, it would count those several different spellings all as different species. How do we resolve that?"

Hugo: That sounds like a really cool tool. I haven't checked out OpenRefine, but I'll make sure to.

Christie: It's so powerful. I love it.
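[Editor's note: the species-name cleanup Christie describes can be sketched in a few lines of Python. This is a simplified analogue of OpenRefine's "fingerprint" key-collision clustering, not its actual implementation: each value is normalized (lowercased, punctuation stripped, tokens de-duplicated and sorted), and values that collide on the same fingerprint are flagged as likely variants of one another.]

```python
import re
from collections import defaultdict

def fingerprint(value):
    # Lowercase, strip punctuation, then de-duplicate and sort tokens.
    # "Apis mellifera.", "apis Mellifera" -> same key: "apis mellifera"
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    # Group distinct spellings by fingerprint; groups with more than
    # one spelling are candidate typos to review by hand.
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

species = ["Apis mellifera", "apis Mellifera", "Apis mellifera.", "Bombus impatiens"]
print(cluster(species))  # → [['Apis mellifera', 'Apis mellifera.', 'apis Mellifera']]
```

Note that this only catches case and punctuation variants; OpenRefine also offers nearest-neighbor clustering methods that catch genuine misspellings.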

Call to Action

Hugo: Christie, my final question is, do you have a final call to action for our listeners out there?

Christie: Call to action. Wow. Hmm. Open science and data science are microcosms. They're a product of the society that we're living in. They're not inherently forces of evil or good, but we live in a society where things are not necessarily kind right now. Our economy is not kind. Late stage capitalism is not kind. Individualism is not kind. These cultural elements shape how powerful new approaches are applied. We need data science leaders and open science leaders, like we need leaders in every part of society, fighting for the common good. We can use information to empower and motivate people and bring about better lives for our fellow humans. I think that's beautiful, so let's use these tools for good.

Hugo: I couldn't agree more. Christie, it's been such a pleasure having you on the show.

Christie: Thank you so much, Hugo. It's been great chatting with you.
