Louis is a researcher at Harvard Business School studying entrepreneurship. He learned to code on DataCamp and is making the transition to using R for his research.
Tell us about your background.
I studied Economics at NYU, then I worked in finance for two years for a French bank in Manhattan, then I decided to go back to grad school. I was interested in working for the UN. I had a background in statistics—I took AP in high school and had a really good teacher. So I have always been into these types of things. I didn't really code much though. I got into coding when I started grad school. I did one year at Columbia and one year at the London School of Economics. That was my introduction to econometrics and statistical programming. I started out on STATA, and I still use it for my professor's research. But I heard more and more about R.
I took a class during my second year of my Master's at the London School of Economics using R. It was a great class, but the professor said you didn't really need to know anything starting out—and I knew nothing—but by the end of it I still pretty much knew nothing. I learned a lot about the theory. The class had an interesting title, I think it was "nonparametric and regression analysis using R". Wonderful class, but I couldn't do the first problem set. It was way too much way too fast, that's why I was looking for another way to learn.
I did some research as to what the best way to learn to code was, and that's how I came across DataCamp. Nowadays I tell everybody, if they're trying to learn to code, to check out DataCamp. I'm a huge fan. Not only to learn coding, but also to learn things like machine learning. For example, another class I took for my Master's was machine learning, and again, I felt like I didn't really understand the concepts and the algorithms by the end of it. I did a couple of the machine learning courses on DataCamp, and not only did I learn how to code them, I learned the theory a lot better. And that is kind of the best part for me, because I've learned things outside of coding as well.
You had the statistics and the economics piece, but when did you realize that you had to learn to code?
It's a really funny story. Basically, I wanted to learn to code because I wanted to move to Paris. The way it worked out, there's this organization that is kind of like the UN, it's called the OECD. It is kind of like the think tank version of the UN. I had gone to Paris after finishing my undergrad, and I loved it. But I knew that they were really looking for people with good quantitative backgrounds, so that's why I wanted to get more into econometrics, and immediately I realized that if I actually wanted to put this to use, I had to learn to code, in some shape or form. So that was a big push for me. I guess it started out as something I felt I needed to do, but very quickly I thought, "I absolutely love this stuff." Now I code because I think it's useful, but more so because it is my favorite thing to do.
What is so exciting about the field of Data Science?
I think there are a lot of comparisons to art, in terms of creating things. My first blog post, that was probably the most creative thing I've ever done, and it is something that I can share really easily and it is useful, and it actually didn't take that much time to do. If I wanted to learn to paint, I would probably be doing it for the next 80 years. The professors I work with mostly use STATA, but I've used R on several projects directly related to my work and it's blown them away. They can't believe it. I've used it for scraping purposes and a couple packages. It has given me a huge boost in the eyes of my professors, who are really good with STATA, but who are more traditional economists, you could say.
What was the inspiration for your project (and blog post)?
I still get a kind of writer's block, in terms of trying to figure out exactly what I am going to apply all these skills I am building towards. I had this conversation with my professor yesterday. He does quantitative studies related to venture capital. In one, we are trying to find evidence of gender bias in venture capital funding. If we find something, it would have a lot of implications. So I've been exposed to project ideas from my professors. But I've definitely had that feeling of not knowing what to do next. Even though I am getting better, I am still trying to find outlets for using the different skills I'm learning and new ways to practice the things I learn in the courses.
One of my favorite bloggers did a course on DataCamp, David Robinson. He did case studies with UN data. I came across his blog, then separately, I found out he did a course for you guys, and I was like, "I am definitely doing this course." It happened to be on the UN, which was an organization I had worked for and something I am still really interested in, and that was kind of a eureka moment. The case studies, after you've got the basis for it, are a huge thing that I really like because of how applicable they are. The entire course was a case study, which I really liked. And then for my blog post, I had wanted to get into sentiment analysis but I didn't know what to apply it towards. People apply it to various Twitter feeds and stuff like that, that's what David did in one of his. I forget exactly how I got the idea. My family has a WhatsApp chat, and I saw it was very easy to export a text file, so that was it. My post has already been viewed in 83 countries, which is really exciting.
Publishing my first blog post has been a really collaborative thing. A lot of people have asked for help running the code, and it is crazy because I have done that a lot on other blogs. So I'm basically in the author/mentor position, and it feels really good.
What do you like about DataCamp?
I've thought a lot about what the best way to learn data science is. Everybody has an opinion. I would put DataCamp as my top resource I recommend to people. There is one other book I really like by Hadley Wickham, R for Data Science. It is wonderful, but the book format doesn't work for everybody. Nine times out of ten, the format doesn't really work for me. But the fact that I can just write code in the browser, and the hints...DataCamp is much more sustainable learning than if I am sitting at my desk with a book and my computer. That kind of learning brings back the days of high school or grade school where if I am stuck, I just feel lost. And I don't get that feeling using DataCamp, and that is the reason why I have stuck with it. Not to mention that there are so many different topics to learn about. DataCamp just really clicked for me.
I've learned a ton from the videos, in between exercises. They do an amazing job of explaining the essentials and not going into too much detail so I can always look up more detail if I want to. For so many things, I like having the macro perspective but also digging down into the micro perspective. If I don't have the macro, the micro isn't going to make any sense. I've learned a type of intuition about data because concepts are explained so well in the videos and then reinforced by the exercises.
I also love that you can pick and choose. The fact that you have a course on Bayesian inference is super awesome. That might be the key for me to keep coming back. You can only expand so much at the beginner level, it is just going to be exhausted at some point. But at the intermediate and advanced levels, there are just a million things you can do. And the fact that you have something on geostatistics, that's awesome as well, and is definitely going to keep me around.
What are some examples of ways you've applied skills you've learned on DataCamp?
For example, I didn't know anything about the dplyr package. Coming from STATA, vectors in base R were tricky for me. But dplyr made everything suddenly doable and I could transfer a lot more knowledge using that. Before I learned dplyr through DataCamp, I really couldn't use R for my daily research purposes. As a beginner, things would just take too long, and I couldn't justify a few hours doing something when I could maybe do it in 30 minutes in STATA.
One of my professors gave me the task of cleaning up some names. We needed some way to predict their gender, because we didn't have a gender variable. So I uploaded the names, and I had taken the text manipulation course by Charlotte Wickham, and it made cleaning the dataset ten times easier. And I was able to manipulate it with dplyr and then I found an R package within ten minutes of searching: it's called gender, and it does exactly what I was looking for. It takes a vector of first names and returns a confidence score and a gender based off that. My professor gave me a week to do it, and within a few hours, I had it done, and he was amazed. It wasn't even that it was that sophisticated, but I knew where to go and I knew what to do and it was all thanks to R and DataCamp.
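Editor's note: the workflow described above (clean the name strings, then look up a gender label with a confidence score for each first name) can be sketched roughly as follows. This is a minimal Python analogue written for illustration, not the R code from the project and not the API of the gender package. The function names, the honorific list, and the reference counts in `REFERENCE_COUNTS` are all hypothetical placeholders; a real analysis would rely on a proper historical name dataset.

```python
import re
import unicodedata

# Hypothetical reference counts (male, female) per first name.
# Placeholder values for illustration only, not real statistics.
REFERENCE_COUNTS = {
    "louis": (95, 5),
    "charlotte": (1, 99),
    "jordan": (60, 40),
}


def clean_first_name(raw):
    """Extract a lowercase, accent-free first name from a messy name string."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove a leading honorific such as "Dr." or "Mrs" (assumed list).
    text = re.sub(r"^\s*(mr|mrs|ms|dr|prof)\.?\s+", "", text, flags=re.IGNORECASE)
    # Replace anything that is not a letter, space, apostrophe, or hyphen.
    text = re.sub(r"[^A-Za-z\s'-]", " ", text)
    tokens = text.split()
    # Assumes the first remaining token is the given name.
    return tokens[0].lower() if tokens else ""


def predict_gender(first_name):
    """Return (label, confidence), or (None, 0.0) for names not in the table."""
    counts = REFERENCE_COUNTS.get(first_name)
    if counts is None:
        return None, 0.0
    male, female = counts
    total = male + female
    if male >= female:
        return "male", male / total
    return "female", female / total


if __name__ == "__main__":
    for raw in ["Dr. Louis Smith", "  CHARLOTTE, W.", "Zed"]:
        name = clean_first_name(raw)
        print(name, predict_gender(name))
```

The confidence here is simply the share of the majority label in the reference counts, so ambiguous names come back with a score near 0.5 and can be flagged for manual review. Names stored as "Last, First" would need an extra step, since this sketch takes the first token as the given name.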
How does DataCamp compare to other online learning platforms?
Apart from books, which I think are really hit or miss (except Hadley's book, which I really like and still go back to sometimes), I tried Codecademy for a few weeks, but it didn't stick. I think it was that I didn't like the structure of the courses. It just didn't click. But DataCamp did.
I've learned STATA in a classroom setting, which was fine, but there were two TAs teaching the course who were kind of learning themselves. They knew how to do a bit more than everyone else, but they also had 30 students they needed to attend to in an hour. A lot of people didn't understand how to set up their directory, so that was the majority of the course. With DataCamp, you don't even have that problem, you don't have to set directories, which is a non-trivial thing when you are learning programming. So that's awesome, and especially that you have some of the biggest names in data science.
Again, going back to this machine learning course I took in grad school. I was so interested in machine learning and so let down by the class, because the professor was a really nice guy, but just wasn't that great at teaching. We used a drop-down-based machine learning tool. I don't know why anyone would use it, and they have since switched over to R. I left the course and my intuition was really bad. I couldn't explain what was different about machine learning as compared to classical statistics. And literally in the first video of Machine Learning Toolbox, five of my questions were answered. Fundamental things that I didn't understand. So what I did, and what I do often, is I will start with the videos and I take notes off of those. When I started at HBS, one of my friends was a math major, very bright dude, and I talked to him a little bit about machine learning and he knew way more than me. I ran into him a few months ago and he was like, "Wow, you definitely know your stuff now." And that's thanks to DataCamp.
Once I find more time, the cool thing is that I will be able to recreate the experience completely with Python. I know very little about Python, but I am finding more and more reasons to learn it, so it will be double the experience. And then you guys have a course or two on SQL. So the fact that it is all in this format that I already know, it benefits me and I feel like it benefits most people. It is much less of an investment, much less of a risk, because everything I've done is going to transfer over, not only in terms of learning to code, but also in learning from the DataCamp format, so that's exciting as well.
What advice would you give someone just starting out with DataCamp?
You know, you're going to spend however many hours learning to code. For me, I just needed some inspiration to justify all the time I was spending. The analogy that learning programming is like learning a language makes so much sense. It's not that things are always really difficult, but they are really time intensive. So the best way to justify learning things that aren't so exciting—like searching for missing commas or parentheses—is to go out online and find a project that is really cool. And suddenly, that justifies all the time that you'll put into it.
You'll have to go through the beginner stuff. Intro to R was useful. To use another analogy, it is like trying to convince someone who wants to learn to paint to practice their brushstrokes for 12 hours a day. For most people it is never going to work. But all you have to do is see one amazing Monet painting, and then it is there and you can't stop doing it. That is why finding these blogs and stuff like that is important. R-Bloggers is another place where I find a lot of inspiration. Seeing what's out there, it justified everything for me.
Each exercise on DataCamp doesn't take that long. That's why I thought "hmm, I don't know how much I am getting out of this." But it actually makes a lot of sense because you are going to be drawing from these short exercises and then applying them in your own way, and that's where the intuition comes from. So the fact that they are short is probably the best thing because it keeps your attention. You don't get too bummed out when something isn't working.
Another thing I used to always hear when I was learning to code, and it was a bit of a catch-22, was "You should apply it to a project". But in order to apply what you've learned to a project that's meaningful, let's say at work, you have to know how to code already. It goes both ways. That's where the case studies really come into play.
What has helped your learning the most?
As far as the courses go, what has really worked for me are the case studies. The case studies are awesome because they help build intuition because you see yourself actually using the stuff for useful purposes. Also, I love the more advanced machine learning and statistics courses because those topics are really hot for data science. That's the key for me.
There was this blogger, and he was trying to start some type of service. It was almost like he would serve as a consultant to his students, and it was outrageously expensive, so I wasn't going to do that. That's another nice part about DataCamp, it is relatively affordable. He made a good point: part of becoming a data scientist is having things become second nature. If you have to plot something, having the ggplot commands memorized is so useful, and having that as second nature is the type of intuition I am looking for. I think there is a good amount of repetition built into DataCamp to help build that intuition. Again, I think the case studies are awesome because I see the relevance to my day-to-day work and I see how they can be applied to things that I am already doing.
I found something really cool on your blog last week. It was an article about scraping H-1B visa applications. I do research related to entrepreneurship, and one of my professors has multiple papers about H-1B visas, so that was cool. He was super impressed. I didn't tell him that I basically followed a blog post, but he was impressed anyway.
You're a confident STATA programmer. How did you make the switch to R?
On a macro level, I have been trying to switch over from STATA to R. The more I talked to people, they said you just have to make the dive. I couldn't do it though, because I couldn't justify spending that much time to learn R, and I would always get locked up and next thing you know I'd be bothering this dude at the research support center almost every day. Slowly but surely, I became proficient. So that's the best thing: I no longer feel like I have to rely on STATA, or I have to rely on whatever. I can actually use R now. Even bootcamps can only go so far—until you are using it for your job, you'll never really get off the ground. But that was not my experience at all—even though my transition was more gradual, I was still able to make meaningful progress. I will definitely keep using DataCamp, it is my main learning tool for R. It is my main resource.