Machine Learning & Data Science at GitHub

What is the role of data science in product development at GitHub, what does it mean to use computation to build products to solve real-life decision-making, practical challenges and what does building data products at GitHub actua

Jul 2, 2018

Guest

Omoju Miller

Omoju Miller is a Senior Machine Learning Data Scientist with GitHub. She has over a decade of experience in computational intelligence. She has a Ph.D. from UC Berkeley. Apart from her work in AI, she has co-led the non-profit investment in Computer Science Education for Google and served as a volunteer advisor to the Obama administration’s White House Presidential Innovation Fellows. She is considered one of the folks to watch, as part of Bloomberg’s Beta Future Founders program. She is a member of the World Economic Forum Expert Network in AI.

Host

Hugo Bowne-Anderson

Transcript

Hugo: Hi, there, Omoju, and welcome to DataFramed.

Omoju: Thank you, Hugo. I'm excited.

What are you thinking about in regards to data science?

Hugo: I'm super excited. It's great to have you here to talk about data science at GitHub. But before we get there, I want to find out a bit about you, and I want to talk about how you got into data science, what you do at GitHub, but I'd like to take a slightly tangential approach to finding about you first by just asking you what you're thinking about at the moment with respect to data science, or what keeps you up at night, or what really is exciting you?

Omoju: The thing I've been thinking about a lot is the term artificial intelligence and the fact that it is such a misnomer because the work that we do is not necessarily artificial intelligence. Most of us in industry don't work on A.I. We work on massive mathematical problems that are basically variants of some kind of linear algebra. And that's what we do. We are applied mathematicians.
So I've been thinking a lot about that, and then using the right kind of terms, like maybe we're doing things like augmenting human intelligence, or building like data-intensive platforms and things like that, like figuring out a word that more represents the work that we actually do so people don't come off with the idea that we're building some kind of completely autonomous being with some conscience and sentience because that's not what we're even doing at all. And we're not even anywhere close to that.

Hugo: Agreed. And I think part of the big challenge is that artificial intelligence is a term that has existed for decades, if not longer, in the cultural consciousness from science fiction, for example. So you watch Blade Runner, and that isn't what we're thinking about or what we're doing.

Omoju: Yeah, it's not, it's not at all. There's an essay I read recently by Michael Jordan from Berkeley, the statistician. And it was talking about how we're not even anywhere close to that. I forgot the name of the essay, but the essay was in Medium, and it's a very fascinating, interesting read. And I like it a lot because it really focuses on the major problems that we have ahead of us and the problems that we have to solve, problems around infrastructure. Like, any kind of machine learning approach often needs robust, solid infrastructure that can scale with it as the data scales.
So focusing, targeting on those kinds of problems and where the low-hanging fruit in those areas are, those are the kinds of things that I've been thinking about lately, and that fascinates me.

Hugo: Yeah. Fantastic. So it isn't necessarily about self-driving cars and-

Omoju: The self-driving cars thing, I get recruited very often, recruiter emails about self-driving cars and autonomous vehicles. Personally I am not interested in working on those kinds of data sets. It's not going to get me out of bed in the morning. I don't care about that kind of data. I care about data sets that I can reason from a human perspective about. I like that kind of, that I can use my intuition. I can basically leverage the ability of being a human to solve the problem even faster, so data sets that I know that this data set represents some kind of a snapshot of human activity. Those are the kind of things I care about.
And with regards to self-driving cars, unless they're fully autonomous, I think they're actually quite dangerous because they lull you into a false sense of security. It's going to be very, very difficult if you've gotten, if you have a car that's semi-autonomous and you've been driving it for three or four years, and you never have to engage and do anything, when that instant does happen, it's going to be very difficult for you to like, oh, remember to pay attention because you've just learned to trust this vehicle. And those split seconds, those are the things that actually create lots of danger and I think makes it harder for the public to actually embrace new technology.

Hugo: And I think a certain amount of education, and dialogue around these issues, and data scientists getting in the public eye and having this conversation with the public is incredibly important here, right?

Omoju: Absolutely. Yes. And then having them actually understand what we do for a living, what our work entails, and the kinds of problems that we're solving, and the kinds of problems that we can solve, questions around data privacy, basically a data education so the public has a very good understanding that if they give you data, what does that mean? What is that data being used for, things like that.

Hugo: Where would that data education take place? I mean, essentially in the end maybe we'd want it in primary school, right? But-

Omoju: Yeah, I think in primary school. So younger children are much more savvy. I have a middle schooler, and he's very, very savvy around data, and data sharing, and issues of Internet privacy, and all that kind of stuff. So they are on it. They are very, very savvy. They know exactly what's going on. It's the other people who have not grown up as digital natives who have the bigger chasm to cross in understanding what does this actually mean.
And then beyond all of that, there's so many things now that we don't even have rules and regulations for, or even have mental models of how to think about it because it's just so far out there that we've never really thought about it. Have you ever thought about who has access to your genetic material? Because it's never been like a real thing. So it's like but now it's a real thing.

Hugo: Very much so.

Omoju: Who owns that data? Do I own that data? These are the complex questions now.

Hugo: Yep. And I think you raise an interesting point about children and younger children being a lot more savvy. I think I remember a while ago you were on Hanselminutes, and you were talking about your child using the term "physics engine" and knowing what that meant at a very early age because they've been introduced to that, right?

Omoju: Yeah. I was like, "Whoa, physics engine?" I didn't even know what that was for a while. I was like, "Ah." I mean, I don't play video games, but, yeah, like, they understand intrinsically what a physics engine is and what the physics engine is going to help you do. And the way they learn is so interesting because they've just grown up in the age of YouTube, so knowledge for them is not something that you have to go and acquire. The most important thing for them is figuring out the right sets of questions to ask because they just have the assumption that the knowledge is there: "I just need to figure out what's the right query to pull that knowledge up, and then I can apply it."

Hugo: And of course now that we have widespread education, in a lot of respects, it can get a lot better. But we have a lot of education available online. As you say, it isn't about necessarily learning everything at school or having to move to a different city, state, or country to go to the university and sit in the lecture hall. And we've got totally different models of learning that are evolving right now.

Omoju: Absolutely. And the same exact thing happens for us too. Machine learning is so ... There's so much happening in ML and in ML research that it's so hard to just keep abreast of everything that's going on in the space. So you find yourself watching videos, reading papers, and trying to build your own versions of those models. And it's basically the same exact kind of thing. You just have to put yourself in the same mindset: the knowledge exists. What's the right question to ask? How can I replicate this experiment? What do I need? Those kinds of things. It's the same exact thing. You have to be in the mind of a child to do the kind of work that we do and do it very well.

Ethics

Hugo: Yep. Something we've been circling around and something you mentioned explicitly was data science ethics. Data scientists are not necessarily trained to be ethical in their jobs. They're not necessarily incentivized to be ethical at work. Do you think this is a problem? And do you think we need to, all data scientists need to be ethically-minded at work in some sense?

Omoju: Yes, I absolutely think that all data scientists need to be trained around ethics because the kinds of things we're doing today involve real people. And the repercussions, if it goes well, are amazing. We change people's lives. The repercussions, when it goes poorly, are devastating. So we need to have that ethical understanding, we need to have that training, and we need to have something akin to a Hippocratic oath. And we need to actually have like proper licenses so that if you actually do something unethical, perhaps you have some kind of penalty, or disbarment, or some kind of recourse, something to say this is not what we want to do as an industry, and then figure out ways to remediate people who go off the rails and do things because people just aren't trained and they don't know. And it's so far removed, when you're working with large data sets, it's easy to just forget that those data sets eventually are going to be used by people. And you can be so distant from it that you completely forget.

Hugo: Absolutely, and we need to be very rigorous in a lot of respects. For example, if you have a machine learning model, I think one of the famous ones is to predict recidivism rate, and you take out the feature ethnicity. You're like, "Okay, well, now we're not including ethnicity." But if you include post code, then you're actually capturing a lot of that information already.

Omoju: Exactly. It's kind of correlated, even asking yourself those kinds of questions and understanding the application of the mathematics. It's not just, yes, you have correlations. Like just things like that, knowing that zip code is a good proxy for ethnicity, just simple things like that, and then just asking yourself those questions.
I think as data science becomes more mainstream at the university level, I hope that there'll be a track that goes along with it, which is the ethical ramifications of data. So in the undergraduate program at Berkeley, they actually have classes on the social implications of computing. In one of the classes, that's a thing. And I would hope that that continues along the track so students can get exposure to it, not just once but every time they're in a class. What is the social implication, positive and negative, of the technology I'm working on right now?
What is the best thing that could happen if my technology is successful? How can my technology be weaponized? What are the safeguards I can put in place to prevent that when such a thing occurs? And how do I educate people on what the importance of that weakness or loophole in my technology is?

Regulations

Hugo: I think the education system's a great place for this. I think having a general reading list of people who aren't necessarily working data scientists, who are theorists or social scientists who think a lot about this stuff. I mean, Cathy O'Neil's Weapons of Math Destruction is a great book. A lot of the work that comes out of NYU's Data & Society group is really fantastic. But of course we see rules and legislation emerging in different places. In Europe we have GDPR now. I'm just wondering in your mind where do these movements come from, or is it a confluence? I mean, do we want the movements to stem from data scientists from the ground up thinking about these or directed downwards from legislators?

Omoju: Oh, I would actually hope that we are the ones ahead of the curve, that it's not just the regulators, I mean regulation from governments. And the regulation has its role. And it's a good thing. The biggest challenge is oftentimes policymakers might not have the knowledge necessary to help craft the regulation. So perhaps they do it in tandem with the people with the expertise. And so maybe you know how it is in academia where somebody might go work with the NSF for like a semester or two, we actually should have something akin to that where you take a leave of absence of six months or something like that, and you go work within policy of the government of the country you're working in.
And it's just part of what the normal core work is. Maybe every four years you work in policy for six months, and it's part of like the requirements to have your license as a data scientist or something like that.

Black Box Models

Hugo: For sure. This is really interesting though. This really promotes a dialogue between legislators, working data scientists and stakeholders in the community. What do you think the role of black box models and their counterpart, interpretable models is in having this conversation across a variety of stakeholders?

Omoju: There's some problems that black box models are very good at solving. However, dependent on context, if it's going to be something that can really harm people, there has to be a way to investigate how we'd come up with the solution. And then sometimes maybe you don't necessarily throw away the black box model entirely. Maybe the black box model helps you come up with a set of features that you use.
I think the best approach would be to have an ensemble of different kinds. Like, the black box model can help you triage, and then eventually on top of all that have a decision tree or some kind of hybrid of the two. But deep learning has given us so much.

Hugo: It’s powerful.

Omoju: We can't just throw it away.

Anthropology

Hugo: Agreed. Something we've been circling around is how data science can inform us about humans. And I know that a deep evolving interest of yours is anthropology and the uses of data science for the human. So I'm just wondering how you're thinking about this at the moment.

Omoju: I think it's so fascinating because human beings, we think we have an idea of who we are, and we have an idea of our own behaviors and our own biases. And we often don't. Human memory is so fuzzy. The kinds of things you can actually do, sometimes when you think about all the stuff from cognitive science, how you can literally delude yourself into believing something that is completely false by firing the same set of neurons and creating this strong association.
But if we can actually use data to help you understand your own habits and your own patterns, then you can actually use it to optimize yourself in a certain way. And the thing about it that is very interesting to me is that often I will find people who are in science, maybe they’re machine learning engineers and things like that, and we're very good at the craft. But we never actually apply it to decision-making for ourselves. We literally just do the work, and we don't apply the same rigor to our own decision-making.
So I'm one of those people that would like to believe I do more of that, and so it's very, very, very interesting to have a good reckoning of what your strengths are and what your weaknesses, and having data to actually give you the evidence that that is in fact the case and that is in fact true. And the question is what do you do with that knowledge? In my mind, you use that knowledge to make better decisions.

Hugo: Absolutely. I love it, and I've actually started doing something along these lines recently. I'll tell you very briefly, I've started doing stand-up comedy, open mics recently in New York where I live. And I've realized that stand-up comedy, I can apply the principles of Bayesian inference to it. So when I write a joke, I've got a prior belief of how funny it is. And then I have to tell that joke maybe 10 times in different settings to get 10 different data points on how funny it actually is for an audience. And I update my prior with respect to the likelihood generated by this data. And then I have some posterior belief of how funny this joke actually is.
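Hugo's joke-testing loop can be sketched as a Beta-Binomial update. This is only an illustration under stated assumptions: each telling is reduced to a binary outcome (the joke landed or it did not), the prior is an arbitrary Beta(2, 2), and the ten outcomes below are made up rather than taken from the conversation.

```python
from fractions import Fraction

def update_beta(alpha, beta, outcomes):
    """Return posterior Beta parameters after observing binary outcomes."""
    hits = sum(outcomes)
    misses = len(outcomes) - hits
    return alpha + hits, beta + misses

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact as a fraction."""
    return Fraction(alpha, alpha + beta)

# Assumed prior belief: Beta(2, 2), i.e. "probably lands about half the time".
prior = (2, 2)

# Hypothetical data: 10 tellings, 1 = the audience laughed, 0 = it did not.
tellings = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

posterior = update_beta(*prior, tellings)
print("posterior parameters:", posterior)
print("posterior mean:", float(beta_mean(*posterior)))
```

With seven laughs out of ten, the assumed prior mean of 1/2 moves to a posterior mean of 9/14. A stronger prior (larger alpha and beta) would move less on the same ten data points.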

Omoju: Exactly, things like that, and just realizing that as a human being, literally everything you do is an opportunity to experience and optimize some kind of data algorithm.

Hugo: Yeah, absolutely.

Omoju: Because your whole life is just a chain of decisions from moment to moment.

Hugo: And as you say, before decision-making, though, there's a process of pattern recognition, classification, and understanding of your own behaviors as well.

Omoju: Yeah. And this, it's fascinating to me because the lack of understanding of this is sometimes the reason why we have so many societal ills. And it's a tricky thing to like walk around because we don't even understand how the human mind works. And the understanding that we do have, we don't apply it very well. The human mind is optimized for pattern recognition. If it wasn’t, we would not be able to function adequately.
And it's also like the first fit model. Like, the first fit is the best fit. You never question the second fit. If it looks like a chair, I'm going to assume it's a chair. You never question, is it actually in fact a chair? Because you have to triage so much information so fast and make decisions. But there are certain times where you want like friction and you want to reconsider the decision you just made. Is it actually a chair or is it a piece of art? And then come up with the heuristic of why you believe it's something you should sit on versus something that you should look at.

How did you get into data science?

Hugo: That's a great example. So how did you get into data science originally?

Omoju: This is one that's a bit tricky. I think it goes back to how did I get into computer science. And I got into computer science because of the Internet. I was just fascinated by communication technology, the fact that you could just know whatever you wanted to know. It just seemed like the world was at your fingertips. And then as I continued down that path, I realized, I mean, I did my master's years ago, 2002. And it was all about neural networks and all that kind of stuff, but this was before things like scikit-learn, NumPy, Pandas.
So writing those algorithms in C++ was not a pleasant experience, and then on top of that, we didn't have tons of data. So it was almost like we had all this knowledge, and it was ... We couldn't use it. The power was not there. So I kind of abandoned it and went from symbolic logic. Well, then I realized years later, oh, now we actually have the cell phones, and we have tons of data. And now is the time you can actually use all this knowledge, all that knowledge you acquired years ago, and you can apply it today, and there are all these frameworks and tooling to help do this thing faster. So I just decided that the time was right.
And so during my dissertation at Berkeley in computer science education, I collected all this data during the data analysis, and I just ... I was like, "Oh, I really, really love this because it's fascinating. It's just so fascinating." You can just keep on knowing more and more and more things about people through the things that you can find in their data. And it seemed to me like the most interesting and fun thing to do. And I was just like, "Yeah, I guess this is what I'm going to do for the rest of my life." I'm actually very happy that it is a thing now.

Machine Learning

Hugo: And you're not merely a data scientist, but you specialize in machine learning. You're a machine learning data scientist.

Omoju: Yeah.

Hugo: What does that actually mean, and what's even the distinction there?

Omoju: The biggest distinction that I see is that data science is such a broad umbrella. Data science encompasses decision support, so within organizations, analytics, often gathering data to support a decision, versus shipping data products. So building data products, something like a learning control system for an autonomous vehicle is a data product. It's not necessarily doing decision support within an organization. So it's just different uses of the same tool set. But on the machine learning route, what ends up happening is you probably have vastly larger amounts of data, and the kinds of approaches are very, very ... They go further. I think that's the biggest difference.

GitHub

Hugo: I want to know what you actually do at GitHub. But before that I want to know what your colleagues think that you do.

Omoju: I think people outside of machine learning often don't know what we do. They ... Magic. And it's not magic. We don't do magic. They don't have a clue. They actually don't understand. They know there's data. Something happens. Decisions get made somehow. So they literally don't have a clue.

Hugo: Is that dangerous?

Omoju: It's very, very dangerous because people think it's magic, or they think that it's a knowledge that is so far out there that the gap between them and the attainment and understanding of that knowledge is too wide. And that is a very dangerous thing because it's not true.

Hugo: And are there, for example, product managers that have more of an understanding than other people within the organization? Or-

Omoju: Yes, we do. We have some amazing product managers. We have one in particular that supports our team, and he knows so much about machine learning because he's a voracious reader. But he is very rare, very, very rare.

What do you do for GitHub?

Hugo: So now knowing what your colleagues don't know about what you do, what do you actually do at GitHub?

Omoju: What I do at GitHub is I build data models, often deep learning models on GitHub data to help GitHub probably build things like a recommendation engine so we can recommend repositories to people. Do things around understanding security vulnerabilities in code, do things around figuring out what topic a repository is really speaking about. So those are the kinds of things I do. We build data products. So basically getting data, building a data pipeline, coming up with a hypothesis, building a model. If it's a predictive model, see how good you are at making the kind of predictions. If everything goes well, put it in an API and serve that API so that other engineers can use it and use it to support the GitHub platform.

Online Experiments

Hugo: So this basically machine learning side of things, when you're making product changes or anything along those lines, do you also do, or does somebody at GitHub do online experiments to decide the direction of these changes?

Omoju: We are beginning to start to have a team that's going to start doing that. Right now, machine learning is very nascent to us. Machine learning is around a year or a year plus. So we are at the beginnings of doing all of those kinds of things. We are not ready yet for full-on online experimentation quite yet. But we do a little bit of that.

Hugo: Really exciting times.

Omoju: Yes, very exciting times.

Hugo: I've heard you describe what you do, and we're speaking about this already, but specifically you use computation to build products to solve real-life decision-making practical challenges. And I'm wondering what practical business challenges does GitHub face that data science can help to solve?

Omoju: One of the things is a lot of open source is on GitHub. Open source often times is looking for contributors, people to contribute to the open source repository. And the maintainers sometimes get overwhelmed. So one very simple thing is perhaps there's an open-source library you really like and you use it all the time. And then you realize, "Oh my God, this thing is missing from this library," maybe something like Pandas. And then you just automatically go to the Pandas repo, and you open an issue: "Oh, I see that when I do this and that, this x, y, z happened. Wouldn't it be great if you could fix this for me," so on and so forth. Open an issue, leave the issue in the repo, right?

Hugo: Mm-hmm (affirmative).

Omoju: There might be other issues that have basically said the same thing. So when the maintainer comes to that repo, triaging all those issues becomes a very, very big challenge so that the challenge of triaging, and finding duplicates, and what has been opened, and what has been closed, what's most relevant, all that kind of stuff, just doing all that decision-making around that is a major problem. And that's a problem that can be solved with machine learning.
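The duplicate-finding part of issue triage can be illustrated with a very small text-similarity sketch. This is not GitHub's method, which the conversation does not describe; it assumes a plain Jaccard overlap of word sets, an arbitrary 0.5 threshold, and three made-up issue titles. The names `tokens`, `jaccard`, and `likely_duplicates` are hypothetical helpers.

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercase word set for a piece of issue text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def likely_duplicates(issues, threshold=0.5):
    """Return (id, id, score) for issue pairs whose similarity meets the threshold."""
    bags = {issue_id: tokens(text) for issue_id, text in issues.items()}
    pairs = []
    for (id_a, bag_a), (id_b, bag_b) in combinations(bags.items(), 2):
        score = jaccard(bag_a, bag_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs

# Made-up issue titles, keyed by a hypothetical issue number.
issues = {
    101: "read_csv crashes on empty file",
    102: "Crash in read_csv on empty file",
    103: "Add dark mode to the documentation site",
}

print(likely_duplicates(issues))
```

Word overlap misses paraphrases ("crashes" and "crash" count as different words here), which is one reason a learned text representation would likely do better on real issues.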

Hugo: That's fantastic. So I'm just going to stop you for one second and kind of reiterate what you've said, that this is actually one of the biggest challenges in open-source software development. There's a huge rate of burnout in developers, who were doing a lot of this in their own time after working their full-time jobs, taking time out from their family to do this. And it's a huge challenge faced by the community at large. How do we even think about package maintenance, especially with the expectations that come from the user community?

Omoju: Yes. We, the users, we are so ... We want everything now. So I saw it recently, and I was literally aghast. I was like, "Oh my God." So I think it was like matplotlib, NumPy, and maybe Pandas or something, those three SciPy packages for data science pretty much are maintained by 15 people.

Hugo: Yeah. I saw that as well on Twitter.

Omoju: I was like, "What?" It was like three of them, five people each. I was like, "What?"

Omoju: "Five people? This is insane." And the number of people that use those packages, orders of magnitude, millions of people. Five people?

Hugo: Yeah. I was just going to say hopefully those five people are never in the same room at the same time.

Omoju: Exactly. So it's a machine learning problem if as a maintainer you can come to GitHub, and I've already triaged all the issues for you to let you know, all right, maybe you have like 10 contributors. The 10 contributors are available right now. And I know what their skill sets are. I could say contributor number A will be great for issue number B and match them because all I know is this contributor can close these kinds of issues very, very fast, so I can just match them automatically for you. I can basically get rid of all the dead issues like the things that we won't fix because it's just out of our control, or it's not on our roadmap.
I could come up with all the duplicate ones. I can find other issues on the platform that are from other repos that look like your issue that have been closed and figure out what the solution was. So these are all the things that can just help you. Maybe then those five people then become like an army of 300 because they are powered by machine learning.
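The contributor-matching idea, pairing an issue with whoever tends to close that kind of issue fastest, can be sketched with a simple lookup over past close times. This is an illustrative heuristic rather than a learned model, and everything in it is assumed: the contributor names, the labels, the hours, and the helpers `median_close_hours` and `suggest_assignee`.

```python
from statistics import median

# Hypothetical history: (contributor, issue label, hours taken to close).
history = [
    ("ada", "bug", 4), ("ada", "bug", 6), ("ada", "docs", 30),
    ("lin", "docs", 2), ("lin", "docs", 3), ("lin", "bug", 20),
]

def median_close_hours(history):
    """Map (contributor, label) to the median hours they took to close such issues."""
    grouped = {}
    for who, label, hours in history:
        grouped.setdefault((who, label), []).append(hours)
    return {key: median(values) for key, values in grouped.items()}

def suggest_assignee(label, available, speed):
    """Pick the available contributor with the lowest median close time for a label."""
    candidates = [
        (speed[(who, label)], who) for who in available if (who, label) in speed
    ]
    return min(candidates)[1] if candidates else None

speed = median_close_hours(history)
open_issues = {7: "bug", 8: "docs", 9: "performance"}  # issue number -> label
suggestions = {
    number: suggest_assignee(label, ["ada", "lin"], speed)
    for number, label in open_issues.items()
}
print(suggestions)
```

Issue 9 gets no suggestion because nobody in the made-up history has closed a "performance" issue; a real system would need some fallback for labels it has never seen.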

Hugo: That's fantastic, and I actually, this speaks of automating certain things, which people do constantly and takes up a lot of time. When I spoke with Jake VanderPlas on the podcast, I asked him how he got involved with scikit-learn. And he'd written I'm not sure whether it was PCA, or he'd written something to help him in his astronomy research. And he'd emailed the SciPy users mailing list or something like that. And eventually someone from scikit-learn, I think it was Gaël, said, "Hey, why don't we put this in scikit-learn?" And that's how Jake got involved in scikit-learn. That took a period of time and a lot of human hours to figure that out, whereas what you're speaking to is developing machine learning, automated systems that do that work for us.

Omoju: Exactly, that do that kind of work for you, that just take the pain out of it. What's the point? You have a repo. You need contributors. You are a maintainer. What can we automate away to help make the two, three hours that you're committing to this every four or five days as meaningful as possible? So you're not spending the hours just like, "Oh my God, all those issues," triaging the issues. Triaging issues is a big deal. Machine learning can triage the issues for you.

Hugo: So you also think a lot about building data products, and you build data products ... Is this the type of thing that you'd consider a data product at some point?

Omoju: Oh, yes, absolutely. Everything I've just told you is a bunch of APIs.

Hugo: Okay. Great. What other types of challenges do you think about, or is GitHub interested in, that machine learning can help with?

Omoju: One of the things that I am so excited about is actually gaining deep understanding of computational competencies, like actually understanding the kinds of computation that we realize in code, understanding the complexity of a code base, and building things towards just doing ... getting to the point of doing like automated code review, giving you some kind of thing that maybe you can plug into Atom that can go through your code and tell you all the kinds of things. Like imagine a linter on steroids. Instead of just finding all the PEP 8 errors, it can tell you, "Oh, I see it. It was in this pattern over and over again."

Hugo: "Let's refactor."

Omoju: Refactoring it for you, like highlighting things. Imagine if you can actually look at ... Imagine when you're entering a new code base. A new code base can be so large to just like chew. Where do you even start reading this code base from? What if there's a visualization of that code base as a tree, with different paths you can go through? And that tree can then expand and contract based on the kinds of things you need to go down on.
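The "linter on steroids" idea, noticing the same pattern written over and over, can be hinted at with Python's standard `ast` module. The sketch below is an assumption-laden toy, not a description of any GitHub tool: it blanks out variable names, attribute names, and constants, then groups top-level functions whose bodies have an identical shape. The names `_Normalize` and `repeated_shapes` and the sample source are hypothetical.

```python
import ast
from collections import defaultdict

class _Normalize(ast.NodeTransformer):
    """Blank out names, attribute names, and constants so only code shape remains."""

    def visit_Name(self, node):
        return ast.Name(id="_", ctx=node.ctx)

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = "_"
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = "_"
        return node

    def visit_Constant(self, node):
        return ast.Constant(value=0)

def repeated_shapes(source):
    """Group top-level functions whose normalized bodies are structurally identical."""
    groups = defaultdict(list)
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            body = ast.Module(body=node.body, type_ignores=[])
            shape = ast.dump(_Normalize().visit(body))
            groups[shape].append(node.name)
    return [names for names in groups.values() if len(names) > 1]

sample = '''
def total_price(items):
    result = 0
    for item in items:
        result += item.price
    return result

def total_weight(parcels):
    acc = 0
    for p in parcels:
        acc += p.weight
    return acc

def greet(name):
    return "hello " + name
'''

print(repeated_shapes(sample))
```

The two summing functions are reported as one group, a candidate for a single refactored helper, while `greet` is left alone. Exact shape matching is deliberately strict; near-duplicates with an extra statement would need a fuzzier comparison.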

Hugo: Awesome.

Omoju: Just like tools to visualize knowledge, computational knowledge so you can get what you need to get out of it very, very fast.

Hugo: Yeah. And something you mentioned in passing earlier was detecting security vulnerabilities in code also, right?

Omoju: Yes, detecting security vulnerability in code, it's a major thing. What if we can go through your code and find the security vulnerabilities? Number one, we already have a service like this that will alert you if we find certain vulnerabilities in your code so you know what it is. Then the next part will be what if we can tell you what to do to patch that security vulnerability? Like, we found it. Now here's the patch.
The next level is what if we can actually write the automated pull request for you? That we've found it, we've written a solution for you. All you’ve got to do is hit "merge." These are the kinds of things that will save so many human hours, and more importantly have such a major implication on society. Lots and lots of data breaches have so many deleterious effects. All of the people that have been affected by the what's the ... Is it Experian breach? All these kinds of things.
If you can automatically see that, oh my God, I've seen that you've committed an AWS key on a public repo, just alert you before you even like push it, like, "Ah, stop it. There's a violation here," things like that.
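The "stop before you push" check for a committed AWS key can be sketched as a pattern scan over text about to be committed. This is a minimal illustration and not how GitHub's own scanning works, which the conversation does not detail. It assumes the commonly seen access key ID format (the prefix `AKIA` followed by 16 uppercase letters or digits) and uses the placeholder key from AWS documentation as sample input; `find_aws_key_ids` is a hypothetical helper.

```python
import re

# Assumed format: "AKIA" followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

def find_aws_key_ids(text):
    """Return (line_number, key_id) pairs for every match in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

# Sample staged content; the key is the documentation placeholder, not a real one.
staged = 'region = "us-east-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'

for lineno, key_id in find_aws_key_ids(staged):
    print(f"line {lineno}: possible AWS access key ID {key_id}")
```

A check like this could run in a pre-commit hook so the warning arrives before anything leaves the machine. A single regular expression will miss other credential formats and can flag harmless strings, so it is a first filter at best.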

Hugo: That's fantastic.

Omoju: Simple things that just make you sleep easier at night.

Hugo: And I know a lot of people, myself included, use GitHub really as a way to share code, to collaborate on code, and to discuss code. But I know that in your mind, GitHub is a far larger ecosystem than this. And I know that in particular you also view it as a social network of sorts.

Omoju: Oh, yes. I use the social coding. I use the social coding aspects in GitHub. When I was on campus at Berkeley, it's easy to collaborate with people or to know what they're thinking about. Even if you're not collaborating with them, like your buddies, what are they thinking about? It's easy for you to do that because you're also in the same lab.
When everybody graduates, and goes on, and does whatever it is that they want to do, by following them on GitHub, I still have a way of knowing what they're still interested in. So I can go and look at my friends, and I follow my friends in GitHub. And I'm seeing the repositories that they're starting. Or I'm seeing this person is interested in crypto. I'm seeing this person is now interested in this version of a new JS. Or I'm seeing this person that was totally into front-end design is now starring things that have to do with machine learning.
So you're like, "Oh my God. They are now into machine learning." You can hit them up and be like, "So what are you thinking about doing?" So it's like an opportunity for you to keep in touch computationally of what people's interests are.

Hugo: And to see people, what people are committing to, and even, as you said earlier, let's say you make a contribution to scikit-learn, for example, you may get a message saying, "Hey, if you made that commit, this is something similar that's raised in an issue on statsmodels that you might want to have a look at."

Omoju: Absolutely. Exactly. Through it sometimes I look at my dashboard, and I know who's going to what talk. I see they're already preparing their talk. I mean, I can see everything that they're doing, and it's public, it's there, and it's so interesting. And I also use it as a form of discovery because there are so many open-source projects, too much for you to consume that your friends can almost serve like a sort of triage to figure out what is hot, and what are they sharing in common, and things like that.

How do people use GitHub who aren't technical and not writing code?

Hugo: So you've also described GitHub as a platform for work, and not just technical work. And I'm really interested in this. You've told me that everyone who works within GitHub, even non-technical people, use it for communication. And you've described it as a “collaborative work environment centered around humans”. How do people use it who aren't technical and not writing code, for example?

Omoju: So I will start off with my use of it, my non-technical use of it. So I keep a blog. I have not updated my blog in a while. And my blog is in Markdown. So when I want to write a new thing and I'm just writing ideas, I literally like write it in Markdown and commit it to GitHub. It's like I just have like a ... Basically it's my own version of how I write to text. Even if it's not Markdown, it's like regular English. I'm just writing it there. I just commit it to GitHub. It's just my own version of I guess Microsoft Word because I don't like Microsoft Word. So I just write in whatever free text editor. And I just commit it to GitHub, things like that.

Hugo: And it pushes directly to your blog.

Omoju: Yeah. Well, yeah, it pushes the blog because I'm using Jekyll and all that, but even just putting my ideas down and just keeping all those ideas together. So I use GitHub, I used the GitHub platform to help manage writing my dissertation. So I'm writing this dissertation. I'm like, I wrote the dissertation using LaTeX. I'm just like committing everything to the GitHub repo. And that just gives me peace of mind just in case something I wrote a while back that I decided I no longer want, I can roll back to that commit. I can go all the way back and be like, "Oh, 10 months ago what was I thinking? Oh, there it is. This was the state of the dissertation. This is what's going on." And I like to do that.
And then also I like to look at some of the insight to see my velocity so I can no longer deceive myself. If I'm working on something and I'm not working on it, it's very evident I did not work on it or I worked on it. So that's just from a perspective. But things like I know people use GitHub for coordinating work. People use GitHub for crisis response. So they will open a repo, and the repo may be around how are we going to coordinate all of our resources together so that when a crisis is ... Let's say a hurricane is coming. Once the hurricane passes, how do we collect all the things we need and all that kind of stuff?
So we can open a repo and use that repo to track all those kinds of things. You can track it using issues. So just search for this issue, and it's going to have the thread of all the information of this is what we're thinking. So it's basically a way for us to capture our thoughts on paper. And we can always go back to that issue and see this is everything we wanted to do, these are the things that we did. And they can go back and see where did we go wrong? We can almost go and do like your ... You can debrief and go back to all those issues. So you can see everything, and the conversations, and the people who participated, and what they were thinking about.
So opening issues is one of the ways that GitHub uses GitHub non-technically, just coordinating because GitHub is also distributed. So certain conversations are just better kept in an issue so we have the record of the entire thing. And there's a repo for everything. There's a repo for the finance team. There's a repo for the legal team. There's a repo for the machine learning group. So it's not just technical groups. Everybody has a repo. Everybody has like opened an issue in the repo. If you want to do this, we'll have a repo. Open an issue in our repo, and we'll get back to you, things like that.

Hugo: That's awesome, and I think that the fact that history is always preserved in GitHub speaks ... Like, you close an issue, but it's still always there, and the fact that you-

Omoju: Yeah, you close-

Hugo: ... have versioning around everything means that you have a history of everything, if needed.

Omoju: Literally you have a history of everything. So when somebody gets onboarded and they, "Okay, so you're here to do x, y, z," you can actually see the history that has happened before you got here, see where conversations went, and then you can take it from there. So it's a collaborative work platform. It just so happens that the first major use case was around computation.

Hugo: And I've actually, I can't quite remember the details, I've definitely used infrastructures and plug-ins that allow you to visualize issues. Maybe one's called Zendesk, which allows you to track issues and drag and drop them into particular places to see which stage they're at.

Omoju: Oh, yes. We actually have our own products, like a project board that you can just use within GitHub itself, and you can create different verticals, and you can drag and drop issues. And things can be closed automatically. You can have your Slack integration, and all that kind of stuff.

Hugo: That's really cool. And do you see this becoming more and more a broader use of GitHub as we move into the future?

Omoju: Absolutely, yes. I hope so because having used it in this way, it's ... I started using GitHub this way before I actually got to GitHub.

Hugo: Yeah, that's really cool.

Omoju: It just seems like a natural way to do things, because it's like several products on the same platform. You don't have to keep jumping from one thing to the next. It's the same that can do all that for you.

Accessibility of Git

Hugo: So I want to talk about accessibility of Git for a second, and what I mean by that is I've taught Git in various places. And I know that you're very interested in education. One thing I've always found, Git is incredibly difficult to teach for a number of reasons. GitHub has made it a lot easier, I think, but one example is I try to motivate Git and tell people that versioning is important. Then I tell them that, "Oh, we're using Python, so let's use Jupyter Notebooks," right?

Omoju: Yeah.

Hugo: So it's an inherent challenge and barrier to entry for Git. So how do you see this progressing as we move into the future?

Omoju: As we move into the future, we have a lot more to do to actually abstract away a lot of the pain points in Git and create ... I think one of our killer opportunities is the GitHub Desktop app. I use the GitHub Desktop app probably 90% of the time.

Hugo: As opposed to using your Unix shell, or ...?

Omoju: Yeah, as opposed to using terminal because the app, I make less mistakes because there are so many things that are evident to me. I automatically know what branch I am on. It is right there. I don't have to worry what branch I was on. The branch may not be there. I can see that. I can see my diffs visually. I can see the difference. And I like going back into the history in the GitHub Desktop and actually going through.
So it helps make it slightly easier, and as that app becomes better and better, itself will become the learning tool, and then also GitHub, we just launched something recently, the GitHub Learning Lab, which is a bot that is on the GitHub site itself that will teach you how to do all these things. So we're beginning to build all these learning tools to help people cross over that chasm of Git. It's not all the way there yet, but we are nowhere near where we were three, four years ago.

Hugo: Great, and has the Learning Lab gone live in some form?

Omoju: Oh, it's live. It's already out there. People are using it. They've already built several classes, and they are using it.

Hugo: That's great. We'll definitely link to the Learning Lab in the show notes.

Omoju: Yes. Absolutely.

GitHub for Desktop

Hugo: So you spoke to a really interesting principle in terms of saying GitHub for desktop, you can use it visually. And I think thinking in terms of design principles of products, essentially if we're a lot of the time forcing people to use terminal to interact with such systems, we're losing a huge portion of the population, right, people who actually prefer to do things visually.

Omoju: Yes, people who prefer to do things visually, and then also one of the easier ways to do things, one of the benefits of doing things visually versus by typing or reading is that pictures are not necessarily as language-dependent. So if an arrow is pointing to the right and it's saying something like go from left to right, that arrow saying left to right or the symbol of a bathroom or something at an airport, it is a universal. It is not language-dependent. And because we're sitting in San Francisco and we're often English speakers, we are so tied to the English scripts that we don't leverage all the other things you can do without language, without using text-based language. So I often would rather default to a visual medium, what is the visual way of doing this thing, before even thinking about anything text-based.

Scratch for Data Science

Hugo: Interesting. That actually made me think of something. I had Greg Wilson on, a colleague of mine whom you know and who is well known for his original work with Software Carpentry and Data Carpentry, but he's very provocative in a lot of respects. And he made a statement that he thinks the future of data science in decades may be people using Scratch for data science.

Omoju: Yeah. Why not? Drag and drop.

Hugo: Exactly. So you think that's a viable future as opposed to us writing code in Emacs or whatever it may be?

Omoju: Yes because what is the problem you're trying to solve? Maybe you're trying to build a thing that can give you like ranking--like, that is the problem. You're trying to do prediction, for example. Do you honestly care what way you use to solve the problem of prediction? I don't think so. I think you care about solving prediction.
Tying yourself to one medium makes no sense. If there's a medium that is easier and faster and less error-prone, why not adapt to that medium? It has nothing to do with the solution to the problem; it is just the ... I thought it's just the way you're realizing the solution. We don't need to tie ourselves cheek to jowl to only one form of realization of a solution.

Hugo: I agree, and this is actually why I have such an allergic reaction to the language wars when aspiring data scientists, for example, say to me, "Oh, which is better, Python or R?"

Omoju: Who cares?

Hugo: And I immediately say, yeah, "Who cares? And what are you trying to do? Let's talk about the problem you're having. See where we go."

Omoju: Exactly. What is the problem we're trying to solve? What is the best tool for that problem? And then let's do that. And then in certain kind of environments, you have to ask yourself, how will this thing scale? Maybe one language is easy for you to just build a prototype, and you use the language that is easier for you to prototype. Maybe another language is better for you to actually build a full-scale application and production--use that. It doesn't matter.
I think we're eventually going to start moving more towards automated tooling and things that are like drag and drop and they are not as language dependent. And there'll always be an API that you can call and still connect back to Emacs if that is what you want to do.

Lowering the Barrier of Entry

Hugo: Great. And I think this also speaks to something that I know you're deeply invested in, which is lowering the barrier of entry to these types of things.

Omoju: Absolutely. What I am designing, who I am designing for, my perfect person I'm designing for is that person that has a solution in their head of like, "Oh my God, there's this problem, and I could see this solution. But, ugh, I can't code." There are so many people who fall into that bucket. What if with the assistance of intelligent agents powered by machine learning, we can help those people realize a prototype of their solution so they can at least get the proof of concept out there? And if the thing has legs, then it maybe can go raise money and then hire engineers to build a full-on computationally robust version of that prototype.

Hugo: What does the future of data science look like to you?

Omoju: I think it will still exist, but it's not going to exist in the way that we think it is now. It's going to be more of like a real discipline with so many branches. You have decision support. You have some versions with its optimization. You have ethics. You have things that are like linear algebra on steroids, like numerical computational methods. It will be all these kinds of things, but we'll have deeper understanding of what it is.
When you think of data science, it will be like, oh, I am in medicine, right? And then you say there will be people who will be like practitioners, who are like internal medicine practitioners, and then there will be people who will be like oral surgeons or people who will be like cardiologists. There will be all these specializations. And it'll be a product discipline with like a committee or boards of like saying, "Okay, this is what it means to be this kind of data scientist. More importantly, this is the kind of knowledge you need to be able to solve these kinds of problems. So if you're interested in this kind of problem, this is the path that you go down." Because right now we don't really have that.

Hugo: No, and we have a problem then with career paths and even job listings. You look at job listings, and it's like we want 10 years of experience with distributed computing and five years of statistical inference and these types of things. I think it promulgates this stereotype of the unicorn existing.

Omoju: Exactly.

Hugo: Specialization will be a restorative force there.

Omoju: Will be a restorative force because what do you need 10 years of distributed computing experience for? What are you trying to solve? And a lot of these things are over the top. Some of the things that they're doing, you probably can write an SQL script. You don't need learning for everything. And some of this stuff, you don't need this. And then some other things you're like, maybe you actually don't need a machine learning ... Maybe what you need more is a data engineer.
Some problems are just pure infrastructure problems. They don't require any kind of extraordinary numerical computational methods. It's more just infrastructure. And there are some things that I've, "Oh, this is truly the bleeding edge. You really need advanced computing, and you really need advanced numerical methods to understand the space of hypothesis." Just we don't even have a lot of that kind of stuff yet.

What's just one of your favorite data sciencey techniques or methodologies?

Hugo: So we haven't got too technical, and I don't want to, but before we end, I'd like to know what's just one of your favorite data sciencey techniques or methodologies, just something you love to do.

Omoju: You know what I love to do? I really like exploratory data analysis. I actually like that stuff a lot. I don't even get to do it as much as I would like to because I have things I need to build, but I just like worming through data.

Hugo: It's a lot of fun, isn't it?

Omoju: Yeah, and I like creating my pretty pictures.

Hugo: Yeah. That's playful.

Omoju: It's playful. That's the thing: It's actually just playful. It's playful.

Hugo: Yeah. It's like a first date. Or it's like the first few dates, right-

Omoju: Yeah.

Hugo: ... when you don't know someone, and, yeah, you're discovering things always.

Omoju: Yeah. I really like that because I think it's interesting. Nothing might come of it. It might just be something that nothing comes of it.

Hugo: For sure. And something you mentioned to me last time we spoke, and this is something I've been thinking about ever since actually, is that reading code is so important. So if you want to think about how ... Don't read the code you'll [inaudible 00:53:47] for scikit-learn's random forest, right? How much fun is that?

Omoju: Yeah. I really like doing that. I love doing that, and I like to read the ... So I would like to read the person's dissertation, like understand how to write in English, and then go see how they write in code.

Hugo: You actually said something to me that stuck with me. You said to me, "You don't only learn to write by writing, you learn to write by reading a lot as well."

Omoju: Exactly.

Hugo: And you can think about that in terms of code.

Omoju: Yeah. Try and read tons of code, and just go through it. Instead of thumbing through Instagram, read some code.

Hugo: I wonder if there's some sort of Instagram for code like some-

Omoju: Oh-

Hugo: ... product you could have-

Omoju: ... my God

Hugo: ... in GitHub, right?

Omoju: I know. We're totally getting nerdy. But that is so awesome. Like, one of the problems is understanding ... After understanding how the human mind works, and how the brain works, and how the human brain actually acquires computation, there are so many ways to game it and to become a master of certain kinds of things.

Hugo: And I just want to circle back to a comment you made that you love exploratory data analysis or EDA. I just want to say that's really heartening to hear from a machine learning data scientist, in particular because a lot of aspiring data scientists say to me, "What's the best model to use on data?" And I'm like, "What are you talking about? What does your data look like? Let's actually spend a couple of hours just looking at your data to get a feel for it, to understand its contours and its dimensions, right, before thinking about throwing blended extreme gradient boosting at it."

Omoju: Exactly. And what is the best model for your data? Often the simplest thing will do. You have lots and lots and lots and lots of data.

Call to Action

Hugo: So Omoju, do you have a final call to action for our listeners out there?

Omoju: Yes, one that is quite self-serving. GitHub is hiring. GitHub infrastructure is hiring. We are hiring in machine learning. So look at the career pages.

Hugo: And we'll put it in the show notes as well.

Omoju: Yeah, and the only other call to action that I have for people is to actually challenge yourself to learn. Like, this machine learning is not magic, it is basically mathematics, it's applied mathematics. And if you really want to understand it, take the time required. It might take you two years, it might take you three years. It's absolutely worth it because this is the future. And I want more people to have an understanding of what it is so they can ask stringent questions of us, and keep us ethically sound, and force us to actually use our knowledge to create solutions to the problems that they have because they will know that we are able to do x, y, and z. They know what our capabilities are. And they can hold our feet to the fire and say, "I want to fund a company that does x, y, and z. The fund is dedicated to this." So it forces us to go build the solutions.

Hugo: Thanks, Omoju. It's been such a pleasure having you on the show.

Omoju: It's been absolutely a pleasure, Hugo. Thank you for having me on the show.
