Chris Volinsky is Assistant Vice President for Big Data Research at AT&T Labs in Bedminster, N.J. Chris got his PhD in Statistics from the University of Washington in 1997. He joined AT&T in 1997 and became Director of the Statistics Research Department in 2004. His research at AT&T focuses on large scale data mining: recommender systems, social networks, statistical computation, and anomaly detection. In 2009, Chris was a member of the 7-person, 4-country team BellKor's Pragmatic Chaos that won the $1M Netflix Prize, an open competition for improving Netflix' online recommendation system. He enjoys using data to dig into interesting scientific problems and providing insight into the answers through data exploration, visualization, and statistical analysis.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Hugo: Hi there, Chris, and welcome to DataFramed.
Chris: Hi, Hugo. Nice to be here.
Hugo: It's great to have you on the show. I'm really excited today to be talking about how data science and statistics are impacting telecommunications networks and your work at AT&T. But, before we get into that, I'd like to know a bit about you. So, my first question is, what are you known for in the data science community?
Chris: I don't know how well known I am, actually. But, I've been here at AT&T in the data science research group for 20 years. And I've been leading the group for five years. I think we're one of the few remaining industrial research groups that still put some energy into publishing and going to conferences, in addition to providing value to AT&T. I like to think that our group is known for being a good, solid research group and doing some really good data science. I guess the one little bit of fame that I have is I was part of the team that won the Netflix Prize, which was back in 2009. I think that had some impact in the data science field, and I'm really proud of that experience.
Hugo: When you say industrial data science research group, what do you mean by industrial?
Chris: Well, we're run by a for-profit company. It's not an academic or government research institution. At AT&T, we have the heritage of Bell Labs. And even though a lot has changed since the days of Bell Labs when companies would have large scale research firms, that were like little mini universities, because that's what Bell Labs was lik...
Data Science Career
Hugo: How did you get into data science originally?
Chris: Well, you know, my history is in statistics. So, I was going to school in statistics at the University of Washington. I was kind of taking the regular old route of going to get my PhD, and assuming I was going to be in academia, in some field of Bayesian statistics, which is what I was studying at the time. One summer I got the opportunity to do an internship, here in New Jersey, at AT&T. I didn't know much about what the group was doing, but it was close to my family. I thought, why not take that experience and see what it's like? The first day I was on the job in my internship, I went into my mentor's office. He said to me, that... It stuck with me forever. He said, "90% of what you learn in graduate school is completely irrelevant to what we do here." I just thought... I didn't understand what he meant. I thought that was kind of amazing, what do you mean by that? He said, "Because of the scale of the data that we're looking at..." This was back in the late 90's. This was before Google and Facebook, before big data was a thing. But, at the time, AT&T had some of the largest data sets in the world, just by nature of the telecommunications business. He said to me, "With the scale of the data we look at here, a lot of the things that you're learning either don't scale up, or you can't apply them at this scale. You know, we have to invent new technologies, and we have to perhaps create ad hoc things that haven't been invented before. But, the things that we're doing are impacting customers, and impacting the business in a daily fashion." I spent that summer working on fraud detection problems, and churn analysis. I found it terribly exciting to be looking at large scale data sets that were based on real customers' interactions with the network. So, that worked itself into a job when I got out of grad school, and I've been here ever since. At the time, we just called it large scale statistics. Or, that was when the KDD conference was just starting. So, we used to call it knowledge discovery in databases. And then, that turned into what we now call data science, and big data, and machine learning. The name seemed to change every couple of years. But, for the most part, I've been working in large scale analytics for the last 20 years.
Hugo: There's a lot of interesting stuff in there. Something you spoke to that I find really compelling is the gap between what happens in education, and what happens in jobs, in business, in industry. Do you still see a large gap between education and what you need people to do when you hire them?
Chris: Yes, I do. In fact, I think a lot about it. Particularly in my field of statistics. I think that some universities have figured out that they need to change their way of teaching in order to equip students for the workplace. But, many statistics departments, I think, are still teaching in kind of an old school mathematical way, teaching asymptotic theory with small data sets, in a way that is great for academic thought and bringing forth new methods, but not necessarily training people for the jobs that are out there today. I think computer science is doing a better job of it. In some places there are machine learning departments or even physics departments or other quantitative departments that are training people in data science fields. But, I think that's why we see a large growth in machine learning and data science masters programs out there right now. And I think it's because some of the traditional academic departments are not adjusting well enough to train people for the jobs of today.
Hugo: That's really training people to analyze data, which may not happen in statistics departments.
Chris: That's right. Or not analyze real data. You know, in statistics departments there's a lot of... I'm speaking generally now. Obviously, there's a big variety in what departments do. Often, things are taught with data sets that have been used forever. If I had a dime for every time someone looked at the Iris data set, I'd be a rich man now. There's a lot of really interesting and fascinating large scale data sets that can be used to build courses around. I think the departments that get it are starting to use some of those.
Data Science in Telecommunications: AT&T
Hugo: So, what are some of the biggest challenges facing telecommunications networks today in your line of work?
Chris: You know, the challenge of running a telecommunications network at scale is huge. So, at AT&T we have over 140 million subscribers on our mobile service. Trying to make sure that each one of those customers has a great experience is really, really important to us. Managing a large, complex network entails a lot of interesting data analysis that keeps us busy. In particular, we're trying to change our network from a hardware based network, like it's been forever, to a software defined network. AT&T has led the charge in the industry to build out a software defined network that can be nimble and serve our customers for years to come. The reason that's interesting from a data science standpoint is that once you change your network from hardware to software, then you can do a lot of things at the edge of the network, with analytics that you couldn't do before. In the old days, if I built some anomaly detection model to look for, say, cyber security issues and I needed to make a change, I might have to go and install a new piece of software on some switch out in the network. Now, with a software defined network, I can sit at my desk and make an update to a machine learning model and have that implemented within minutes. That's a really exciting change for us. And the ability to have adaptive machine learning models that can learn from the data, adapt as they see data, and change as the data changes is much greater now with the software defined network.
Hugo: And do you run experiments around these types of tests at all?
Chris: Yeah. Once everything is running in software, it's much easier to do things like A/B testing, and just constantly test and learn so that you can make those models as good as they can be. And learn from the great machine learning algorithms that are out there.
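As a rough illustration of the kind of A/B comparison Chris mentions, the sketch below compares a success rate (say, sessions without a reported problem) under a current model and a candidate model using a two-proportion z-test. The function name, the metric, and the counts are all hypothetical; the interview does not say which test or metric AT&T actually uses.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions.

    A is the control group, B is the variant. Uses the pooled standard error,
    which is the usual choice when testing the null of "no difference".
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1,000 sessions under each model version.
z, p = two_proportion_z_test(success_a=100, n_a=1000, success_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}, lift = {150 / 1000 - 100 / 1000:.3f}")
```

Reporting the lift alongside the p-value keeps the size of the improvement visible, not just whether it is detectable.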
Hugo: You said that 140 million customers, was that right?
Hugo: What about customer care?
Chris: Customer care is a huge area where we do a lot of work. There are hundreds of thousands of touch points that we have with our customers on a daily basis. Whether it's, you know, someone walking into a store, or someone calling one of our customer care lines, or someone engaging with us on a customer chat, online. It's a challenge to make sure that we can serve these customers as quickly as possible, and get their issues resolved as quickly as possible. So, one example of a really neat problem where we used machine learning to aid our customers is in our broadband service. So, for our customers who have internet with AT&T, when they call in with a customer care issue, often problems can be solved by a simple reboot of the modem. You know, where the service is coming into the house. We ran an analysis where we looked for predictive signs, based on the diagnostics that we see in the network, that a particular customer might have one of these problems that we could fix by rebooting their modem remotely. So, we ended up getting to the point where we could predict it well enough that we would do a predictive, or proactive, reboot of their equipment. We could show that in those cases... first of all, we would solve their problem before they even bothered to call in, or might not even have known that they had a problem in the first place. And we can reduce customer care calls by 15% to 20%, and we can reduce dispatches by 15% to 20%. So, by implementing a really good predictive model, we can both improve the customer's experience, and impact the bottom line for AT&T. That was a really good example of a win-win.
Hugo: It sounds like machine learning is playing a huge role in what you guys are doing.
Chris: Yes. Absolutely. Although, you know, it's hard to draw the line between what's machine learning and what's statistics, and what's AI. I don't want to try to do that, other people have tried to do that. I always find they're pretty blurred lines. But, the idea of building a model that can learn from repeated, from additional data, or can take action and then learn from, and get better from understanding whether or not the action that it took had a benefit, I think those are definitely techniques that we use regularly within the telecommunications space.
Hugo: So, building models, running experiments. I presume you have a lot of geo data as well. Is there certain inference you do around different locations and that type of stuff?
Chris: Yeah. So, location data is something we work with a lot. There's a lot of location data from the mobile devices in our network. You know, one thing that we've been working on very much in recent years is the roll out of our 5G networks. So, 5G networks are going to roll out by 2020. One thing about the technology in 5G mobile networks is that you need more transmitters of the signal in order to get it to the customers. So, in order to roll out 5G, one thing that we need to do is augment our network with a lot of what are called small cells. So, these small cells are smaller antennas. They're smaller than your standard telecom pole that you see on the side of the road, maybe covered with pine branches. These are small cells that can go on top of buildings, or on top of roofs, or on part of buildings. Things like that. We need them to augment the signal that we have out there, so that we can serve all of our customers when we roll out 5G. Understanding where we need to put these small cells, in a cost efficient way that can give us the maximal coverage, is an opportunity to use spatial analytics and spatial statistics and other geospatial methods, as well as visualization methods, to help the business understand where to put these small cells.
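One simple way to think about "maximal coverage in a cost efficient way" is as a maximum coverage problem: given a budget of sites, repeatedly pick the site that covers the most not-yet-covered demand. The greedy sketch below is only an illustration of that idea; the site names, the demand-point ids, and the function itself are hypothetical, and the interview does not describe the method AT&T uses.

```python
def greedy_site_selection(coverage, budget):
    """Pick up to `budget` sites, each time taking the one with the largest gain.

    coverage: dict mapping a candidate site name to the set of demand-point
              ids that a small cell at that site would cover (hypothetical data).
    Returns (chosen_sites, covered_points).
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        remaining = [site for site in coverage if site not in chosen]
        if not remaining:
            break
        # Site that adds the most newly covered demand points.
        best = max(remaining, key=lambda site: len(coverage[site] - covered))
        if not coverage[best] - covered:
            break  # no remaining site improves coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical candidate sites and the demand points each would reach.
candidate_sites = {
    "roof_a": {1, 2, 3},
    "roof_b": {3, 4},
    "pole_c": {4, 5, 6, 7},
    "roof_d": {1, 7},
}
sites, points = greedy_site_selection(candidate_sites, budget=2)
print(sites, sorted(points))
```

Greedy selection is not guaranteed to be optimal, but it is a common baseline for coverage problems and is easy to explain to a business audience.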
Hugo: That really leads me to wonder, when you bring results to either business managers or people in product who make these decisions, how it's actually implemented. So, I suppose my question is, how is the statistics and data science team embedded in AT&T such that your research is made deliverable or productized?
Chris: It's a great question. Every project is unique. We sit in the labs, in a research arm. I don't have any production resources at my disposal. So, I have to work closely with our technology development teams or other teams that build production systems. Or, in the case of the network, work with the network organizations to make sure that the insights that we learn, and the models that we build, can get implemented into the business. I think that research to development process can be a really messy one. I think anyone in an industrial research organization will tell you that that's a real challenge, because the way that a labs team, or research team, might address a problem is very different than the way a production team might. You need to get those groups together as early as possible in the process, so that they can understand what we're doing, and we can understand how things might, ultimately, need to get into development. Because if we use a tool for our research that the production team isn't familiar with, there's very little path to get the insights that we get into development in the business.
Hugo: Yeah. So, your team works in research. There is a model, though, in which data scientists are embedded in other teams as well. Does that happen at AT&T at all?
Chris: Yeah, it does. I think what you're finding nowadays is that teams are hiring their own data scientists. We've also got a business unit at AT&T called the Chief Data Office, which has a whole bunch of data scientists that get dedicated to particular problems, with particular business units. In some cases, that's a great model for work to get done because you really need to have somebody embed themselves with a team in order to get that subject matter expertise to benefit the business.
Hugo: You've spoken to, as we said, a number of techniques whether it be working with geo location data, to building models, to experiments statistical inference. You've been doing these things for a significant amount of time now. Are you still learning new techniques and finding questions that you're not sure how to answer, so, you need to build new methodologies and techniques?
Chris: Yeah. If you're not always learning, right, then you're doing something wrong I think. Especially with the explosion of deep learning techniques, over the last, say, five years. Obviously, neural nets and deep learning have been around for a long time. But, certainly, I think they've come into the mainstream in the recent years. It's not a technique, to be honest, that I personally was familiar with, or even that my team had done a lot of. You make sure that you take whatever courses you have to, and you learn the technology, you read the right papers. Maybe you hire people who are more knowledgeable in that field, and you make sure that you're constantly learning and understanding what problems it can be relevant for, and understanding what problems it's not relevant for is probably the most important thing. You know, you don't want to just... Something like deep learning can be a hammer that you try and pretend everything is a nail for. But, you know, you need to understand what it's really suited to help you with, and what it's not.
Data Science and Big Data
Hugo: One of the things that's been a running thread through this conversation is the scale of the data. Whether it be 140 million customers or petabytes of data coming in every day, I want your help to demystify this concept of big data in some ways. It's a buzz term, it may contain some meaning, which I'd like your help to get out. But, it's something we hear a lot about. I'm just wondering what your take on, in inverted commas, big data is and what role it plays in your work.
Chris: Yes. I think big data is in the eye of the beholder, right? Big data for one person, or one team, might not be so big for someone else. If you have a researcher in a particular scientific field, and they don't have access to the scale of computers that I have access to at AT&T, then big data might mean something different. But, I think the way I define it is, if you can't do the analysis on the machine that you own, on a single machine, it's big. If it requires some kind of parallelized infrastructure like Hadoop or some other cluster system to store and analyze your data, then that's pretty big, because what it means is you have to alter how you analyze the data because of the size of it. Right? So, once you have to start amending how you would analyze or study the data based on how it's stored, then we're starting to get into the realm of big data.
Hugo: This is something you spoke to earlier, that when you started your internship, your boss said, "What you've learned at college doesn't scale to what we need to do here." I was just wondering, what type of methods was he referring to? That someone would learn at school, that just doesn't apply when they work with the size of data that you do.
Hypothesis Testing and P Values
Chris: Well, you know, in statistics you learn a lot about hypothesis testing, and estimation, and P values. When you get to large scale data, everything is significant. Your P values are always going to be small, and for almost every hypothesis you test, you're going to reject that null and it's going to be significant. That's not the right way to think about the problem, probably, because it's not very informative to just determine whether a certain parameter estimate is significantly different from zero, for instance. Since then, there have been a lot of new methods. But, with the things that you do like clustering, right? With clustering in the naive methods, you need to invert matrices and you need M squared operations. And you just can't do that with large data sets. So, you have to think differently. Sometimes, you might have to come up with an ad hoc method, some scoring method, whereby you don't necessarily know what the asymptotic properties of that method are, or you don't necessarily know how theoretically rigorous it is. But, you can justify some things based on simulations, or permutation tests, and hope for the best. And maybe try something that you don't know if it's theoretically valid, and do some A/B testing to see if it helps you, say, fight fraud, or reduce churn. And, take this iterative path of trying one thing, learning from it, and trying something else. As opposed to sitting down and understanding the math, and the asymptotic properties, before you get started.
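The point that "everything is significant" at scale can be made concrete with a small calculation. The sketch below takes a fixed, practically negligible shift in a mean (one hundredth of a standard deviation, an arbitrary value chosen for illustration) and computes the two-sided p-value of a z-test at two sample sizes. The function name and the numbers are hypothetical.

```python
import math


def z_test_p_value(effect, sd, n):
    """Two-sided p-value for testing mean == 0 when the observed mean is `effect`.

    The standard error shrinks like 1 / sqrt(n), so the same observed effect
    yields a larger z statistic (and a smaller p-value) as n grows.
    """
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


effect, sd = 0.01, 1.0  # a shift of 1% of a standard deviation
for n in (100, 10_000_000):
    print(f"n = {n:>10,}  effect = {effect}  p = {z_test_p_value(effect, sd, n):.3g}")
```

The effect size is identical in both rows; only the sample size changes. That is why, on very large data sets, the size of an effect tends to be more informative than whether it clears a significance threshold.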
Hugo: Could you give an example or two of where you would find this type of application?
Chris: Yeah. I think in a fraud detection application, you know, sometimes you just need to understand, try and understand where you're seeing anomalous behavior. And there's not necessarily any model that's going to point you towards finding anomalous behavior. And when you're looking at billions of records on a day, sometimes you just need to apply some ad hoc method to find something that looks weird. And then go explore the weirdness and see if it... Dive into it, go back to the raw data. And see if it looks like fraud.
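A minimal sketch of an "ad hoc method to find something that looks weird" is a robust outlier score: measure how far each value sits from the median, in units of the median absolute deviation (MAD). This is a generic technique swapped in for illustration, not the method used at AT&T; the daily counts, the 3.5 cutoff, and the function name are assumptions.

```python
import statistics


def robust_scores(values):
    """Return a modified z-score for each value, based on the median and MAD.

    The 0.6745 factor rescales the MAD so the score is comparable to an
    ordinary z-score when the data are roughly normal.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]  # no spread: nothing can be scored as unusual
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical daily call counts for one account; the last day looks odd.
daily_calls = [10, 12, 11, 9, 10, 11, 10, 250]
scores = robust_scores(daily_calls)
flagged = [i for i, s in enumerate(scores) if abs(s) > 3.5]  # assumed cutoff
print(flagged)
```

Flagged records are only candidates. As Chris says, the next step is to go back to the raw data and see whether the weirdness actually looks like fraud.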
Hugo: Something you also spoke to is that when you get big enough data, or enough data, anything will be statistically significant. This is actually a topic close to my heart. I previously worked in applied math, in the biological sciences. So, that's a field in which hypothesis testing, I personally think, is overused consistently. And the fact that people are looking for significance, as opposed to first reporting on the effect size, which I think you're also speaking to, is a big challenge. You want to know what the effect actually is. If it's statistically significant, you need to state that afterwards. But, you want to know what the effect actually is.
Chris: Yeah. Of course, in academia, there's so much that's driven by being able to publish. And in order to publish, you need that significance test. And in the field of statistics, there's a little bit of a revolution going on right now because there's a lot of discussion around is P equals 0.05 the right method to use, the right limit to use? Should we be using P values at all? I think those fields are starting to come around, but, there's a lot of institutional inertia around these... The old methods of significance testing. It's going to be hard to break.
Hugo: I don't want to get too technical, but I actually can't help myself half the time. You mentioned, at the start of our chat, that you're working in Bayesian statistics. I'm wondering if you see a future in which Bayesian statistics can help us actually out of this challenge, this P value challenge.
Chris: You know, I think the Bayesian framework of thought is very powerful. With the idea of having prior beliefs, and then updating them through data. Back when I was in school, there was a raging debate between Bayesians and frequentists. But, I think now, the Bayesian thought has permeated a lot of science. I think that Bayesian methods do help you think about large data problems, with the idea of a posterior distribution, and updating a prior through data. I think that it's certainly one method that helps us in the world of massive data sets.
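The idea of "a prior belief updated through data" has a particularly simple form for a rate or proportion: with a Beta prior and binomial data, the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts. The sketch below shows that update; the prior values and the counts are made up for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    Prior Beta(alpha, beta) + data (successes, failures)
    -> posterior Beta(alpha + successes, beta + failures).
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief: the rate is around 20% (Beta(2, 8)).
prior = (2, 8)
# Hypothetical data: 30 successes and 70 failures.
posterior = update_beta(*prior, successes=30, failures=70)
print(beta_mean(*prior), posterior, round(beta_mean(*posterior), 3))
```

With more data the posterior mean moves toward the observed rate and the prior matters less, which is one reason this style of updating fits naturally with data that keeps arriving.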
Hugo: Do you use any Bayesian inference or Bayesian thinking in your work at AT&T?
Chris: Currently, I don't. I mean, there's some Bayesian projects sprinkled around that I know about. But, I don't currently use anything. I can't say that I'm working on any Bayesian projects right now.
Hugo: We all know how much data is being generated daily at the moment. I'd just like to have a conversation about the ethical considerations, to your mind, around the concentration of data about all of us, you, me, every other civilian, in the hands of big businesses.
Hugo: And as we see this proliferation of data, we can see that it could actually be used to tell us a lot of things about society. I mean, social network data, we see all types of trends emerging. How can the data big businesses have be used for this type of social research, to your mind?
Chris: Yes. We had done some research years ago on using location data to try and understand how people move through cities. This was research where we were using location data in a completely anonymous and aggregated form. The goal was to provide services to cities and municipalities, so that they can make their cities more green, and more sustainable, by understanding how to make the flow of people and vehicles through their towns more efficient. By using data from a telecom network, we could understand how long people were spending in commutes, and how long they were stuck at red lights, and how long they were stuck in traffic. Also, where the growth areas were, in terms of where businesses and residences were opening in a particular municipality. And that would help them develop their five and ten year plans so that they can make their cities more efficient.
Hugo: So, you mentioned something there, which I just want to home in on. You said data that's totally anonymized. What does that mean?
Chris: For instance, if you're trying to understand the flow of commutes from one town to another. Right? If you're doing that analysis, there's no identifiers that matter on the phone. We don't need to know who owns the phone, we don't need to know the household, person, phone number, any account identifiers. We just need to know that there is a device that went from Morristown, New Jersey to New York City on a daily basis. As a researcher, I don't need to have access to any of that private information, so I don't get access to that private information. I can just use the aggregate information that doesn't have identifiers on it. So, companies like AT&T have very strict processes to give access to sensitive data for research purposes. Very similar to what you would see in an IRB at a university.
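To make the idea of aggregate, identifier-free flow data concrete, the sketch below turns trip records into origin-destination counts. Device ids are used only to count distinct devices and never appear in the output. The record layout, the function name, and the rule of suppressing flows with very few devices are assumptions added for illustration; the interview does not describe AT&T's actual process.

```python
from collections import defaultdict


def aggregate_flows(trips, min_devices=2):
    """Count distinct devices per (origin, destination) pair.

    trips: iterable of (device_id, origin, destination) tuples (hypothetical layout).
    min_devices: assumed suppression threshold; flows seen from fewer devices
                 are dropped so that small groups are not exposed.
    The returned dict contains place names and counts only, no device ids.
    """
    devices_per_flow = defaultdict(set)
    for device_id, origin, destination in trips:
        devices_per_flow[(origin, destination)].add(device_id)
    return {
        flow: len(devices)
        for flow, devices in devices_per_flow.items()
        if len(devices) >= min_devices
    }


# Hypothetical trip records.
trips = [
    ("dev-a", "Morristown", "New York"),
    ("dev-b", "Morristown", "New York"),
    ("dev-a", "Morristown", "New York"),  # same device on another day
    ("dev-c", "Newark", "Hoboken"),
]
print(aggregate_flows(trips))
```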
The Netflix Prize and Machine Learning Competitions
Hugo: So, you're also known for winning the Netflix Prize.
Chris: Yeah. It's kind of an old story by now, but it is a good one. I have to admit.
Hugo: Can you just give me the elevator pitch on this competition?
Chris: Sure. So, back in... I think it started in 2006, Netflix was a much smaller company then than it is now. They realized that giving people good recommendations was going to be crucial to them growing as a business. So, they did something very clever, and very novel. They released a whole bunch of data on their customers, again, anonymized data on their customers. How their customers rated certain movies, and they released that to the public, and they had a competition to see who could build the best recommender system. They released a training data set, of about a hundred million ratings. Then, they had a hold out set of about two million ratings. And you had to use the training set to make predictions on the test set. And they knew what the answers were. So, if you could build a recommender system that predicted what the people in the test set, how they would rate those movies, and you could get below a certain error threshold, you'd win a million dollars. It was great fun. There were thousands of teams that participated from around the world, and industry, and academia. And people who were just hobbyists. The competition went on for three years. We had a team at AT&T that collaborated with a couple of other teams. We ended up taking home the grand prize.
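The competition setup Chris describes, predicting held-out ratings and scoring the error, can be sketched in a few lines. The sketch assumes the error measure is root mean squared error (RMSE); the ratings below are made up, and the function name is hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out ratings (1 to 5 stars) and a model's predictions for them.
actual_ratings = [4, 3, 5, 2, 4]
predicted_ratings = [3.8, 3.4, 4.5, 2.9, 4.1]
print(round(rmse(predicted_ratings, actual_ratings), 4))
```

In a setup like this, contestants never see the held-out answers; they submit predictions and only learn the resulting score.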
Hugo: The type of data science or statistics or modeling that you do in a competition like this just seems to me really distinct from the type of work you do at AT&T. Is that the case, or am I incorrect?
Chris: It's funny, I was going to say, earlier, when you asked about the challenges in telecommunications companies, telecommunication companies are a lot more than telecommunications these days. Right? So, at AT&T, we merged with DirecTV a couple years ago. Now, we're a TV provider as well. In fact, the largest in the country. We've also had our U-verse service, with a couple of million households, delivering TV for a couple of years now. So, recommender systems are actually really important to us here. So, back at the time of the Netflix prize, I think we realized that it was a technology that we wanted to be really good at in order to help us grow into the space of being a TV provider. So, at the time, it did make a lot of sense for us to compete in this competition. Now, the modeling that we did, I think that we learned a lot of really valuable things. And I think it helped to advance us and the other competitors. We helped to advance the field of recommender systems as a result of the prize. There was a fair amount of what we did during the competition that was very specific to that data set, and maybe didn't generalize quite so much. But, a lot of what we did learn by participating in the competition ended up being valuable to us as a TV provider.
Hugo: My understanding is that, like a lot of machine learning competitions, your model that won the competition was not implemented. Is that right?
Chris: No, that's not exactly right. The competition had a few stages. I'd say in the first year of it, we were really developing new models, and doing some novel things to analyze this data set, in different ways than had been done before and in ways that ended up being very valuable to Netflix, and did get implemented in their system. The later stages of the competition were really trying to squeeze blood from a stone. One of the most interesting things that came out of the competition was the power of ensemble methods. So, by the end of the competition, we were combining and averaging together literally hundreds of different models in a huge ensemble that was weighted by some complex neural net, in order to just get 0.001 off the root mean squared error. Those methods, those massive ensembles, are not really appropriate to use in a production system. Those kind of last minute methods that we used to actually win the competition, and kind of bolt ahead in the last day, those weren't very valuable to Netflix. But, the things we did in the early days, some of the work on matrix factorization and nearest neighbor methods, were very valuable to them and did get implemented in their system.
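The power of ensembles comes from averaging models whose errors partly cancel. The toy sketch below blends two sets of predictions with equal weights and compares the RMSE of each model with the RMSE of the blend. It is a deliberately tiny stand-in: the ratings and predictions are invented, and a plain average replaces the neural-net weighting Chris describes.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def blend(predictions, weights):
    """Weighted average of several models' predictions, item by item."""
    total = sum(weights)
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(len(predictions[0]))
    ]


# Hypothetical true ratings and two models that err in different directions.
actual = [4, 3, 5, 2]
model_a = [4.5, 3.5, 4.5, 2.5]
model_b = [3.6, 2.8, 5.6, 1.4]

blended = blend([model_a, model_b], weights=[1, 1])
for name, preds in (("model_a", model_a), ("model_b", model_b), ("blend", blended)):
    print(name, round(rmse(preds, actual), 4))
```

The gain depends on the models making different mistakes. Averaging near-identical models helps very little, which is part of why squeezing out the last 0.001 required hundreds of them.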
Hugo: That's actually what I meant, the final solution, which wasn't used in the way that it was developed because it wasn't productize-able. You're speaking to the fact that the incentives of such competitions aren't always aligned with the incentives of what companies need to put into production.
Chris: That's right. I think the Netflix prize was the first of its type. I think if they were to do it over again, they would have had a time limit on it, as opposed to a threshold of the error. And you'll see that those types of competitions that are done now have learned from that. They have a competition that runs for three or six months. Kaggle competitions are typically on the scale of a couple of months. That way, you learn something, and you hopefully derive some value. And then you move on to the next problem.
Personalization and Recommendation
Hugo: So, recommender systems are everywhere now. I mean, even when we have different apps push things to our phone constantly, that hopefully we'll enjoy and like. A lot of the time it works, a lot of the time it doesn't. But, personalization is becoming more and more important. What does the future of this landscape look like for you?
Chris: I think there's still a long way to go in the space of personalization and recommendation. I mean, I still sit down on my couch and want to watch TV or watch some content. It's almost like there's still too much to choose from. I could go to Netflix and get their recommendation, or I could go to DirecTV and get their recommendations. Or, I could go to Amazon and get their recommendation. I want a service that's just going to tell me what I want to watch. I don't want to have to think about all of the different channels that I have to go to. I don't think that exists yet, a service that kind of goes across the top of all of these. I think there's an opportunity there for growth in the future. But, the other space that I think is technically very exciting is contextual recommendations. So, if we can try and understand where people are when they're consuming content, and use that in what we serve up to them in a recommendation. I took a 45-minute train ride into work today, and the type of thing that I would want to consume on that train ride is different than what I would watch if I'm sitting down on my couch on a Friday night, with time to kill. Or, what I want to watch when I'm on vacation is different than what I'd want to watch when I'm home. Or even in the morning versus the evening. So, I think these kinds of contextual recommendations, and learning how people consume content, are the next step for recommender systems and could really provide value.
Hugo: I love that idea of contextual recommendations. What type of techniques do you think could be used to develop these type of recommendations?
Chris: I think that you have to do some kind of maybe clustering of a person?s viewership just to understand what they watch in different contexts. You could also do it by engaging the user to help you understand by asking them questions, and obviously having them opt in to collecting certain data set. Ask them, what kind of mood they're in, and what they would want to watch in a particular case versus a different scenario. And hopefully, not ask them too many questions as to be annoying. But, just ask them enough so you can get some information. And feed that information into a recommender system and give them what they want to watch.
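A minimal sketch of the clustering idea Chris describes, not a method attributed to AT&T or Netflix. It assumes a hypothetical viewing log of (hour of day, minutes watched, genre), groups sessions into two contexts with a small hand-written k-means, and suggests the most-watched genre of whichever context a new session falls into. The data, the choice of features, and the function names are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical viewing log: (hour_of_day, minutes_watched, genre).
sessions = [
    (8, 20, "news"), (8, 25, "comedy"), (9, 22, "news"), (7, 18, "news"),
    (21, 110, "drama"), (22, 95, "drama"), (20, 120, "film"), (21, 100, "drama"),
]

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to centroids, then recompute the means."""
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
                if members else old
            )
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

# Features are unscaled here, so minutes watched dominates the distance.
points = [(hour, minutes) for hour, minutes, _ in sessions]
centroids, labels = kmeans(points, centroids=[points[0], points[4]])

# Most-watched genre within each discovered context.
top_genre = {}
for label in set(labels):
    genres = [s[2] for s, l in zip(sessions, labels) if l == label]
    top_genre[label] = Counter(genres).most_common(1)[0][0]

def recommend(hour, minutes_available):
    """Suggest the top genre of the context closest to the new session."""
    return top_genre[nearest((hour, minutes_available), centroids)]

print(recommend(8, 30))   # short morning session
print(recommend(21, 90))  # long evening session
```

In practice the features would need scaling, and richer context (location, device, day of week) would likely matter more than these two columns.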
Hugo: Yeah. I like that type of user onboarding idea. I signed up to Netflix a long time ago. I can't remember, but I can imagine that it would make a lot of sense when you sign up for a service such as that, they ask you a few questions at the very start. Like three to five questions that help calibrate the service to you personally.
Chris: Yeah. That's right. With Netflix, in the beginning, they didn't even break things up among the different people on the account, and they tried to figure it out. But, ultimately, they realized if you have different accounts for different people in the household or on the account, they'll do a better job. There's always a balance between asking the consumer for information and trying to figure it out. But, we shouldn't shy away from asking the consumer for information. You want to keep it short and snappy. But, it can be extremely valuable.
Hugo: Sure. So, for Netflix in particular, my wife and I each have our individual logins on Netflix, and then our joint one where we watch shows together. Those three seem to do a pretty good job to cover us all.
Chris: That's a great example.
Data Visualization and Storytelling
Hugo: We've talked about a lot of different techniques and methodologies for data science and statistical research. What are some of your favorites? What are the things you enjoy getting your hands dirty with the most?
Chris: I'm always amazed at the... This is going to be kind of a boring answer. But, I'm always amazed at the power of some of the old school techniques. Good old fashioned linear regression is still a really powerful, interpretable, tried-and-true technique. It's not always appropriate, but often works well. Decision trees are another old school technique, and I'm always amazed at how well they work. But, you know, one thing I always find really powerful is a well-done, well-thought-out data visualization. And, you know, I'm a big fan of the type of data visualization that I see in media companies. In particular, the New York Times does a great job with data visualization. They can present a complex data set in a way that doesn't even require modeling, or analysis necessarily. But, if it's broken up the right way, it tells a story. It allows you to learn from the data. In particular, I like the way they use the web to create interactive visualizations. You can play with it, and understand the data by poking around a little bit. I'm always very impressed with a really well thought out visualization, where multiple variables are presented to you using shapes, and colors, and appropriate visualization techniques. When I see a really well done visual display of data like that, that gets me pretty excited.
Hugo: For sure. And you're right, the Times has a lot of interesting stuff. I remember during the last primaries, they showed what was reported in terms of who was ahead in the primaries. But then they showed it varying, doing some sort of re-sampling. They showed it varying with respect to the sample size, interactively, and it wasn't actually clear-cut. Which I found very interesting.
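The effect Hugo describes, an apparent lead that is less clear-cut in small samples, can be illustrated with a short simulation. This is only a sketch under stated assumptions, not the Times' method: it assumes a hypothetical two-candidate race with a true 52/48 split, draws many simulated polls at several sample sizes, and counts how often the true leader fails to come out ahead.

```python
import random

def share_of_polls_with_wrong_leader(true_support, sample_size, trials, rng):
    """Fraction of simulated polls in which the true leader is tied or behind."""
    wrong = 0
    for _ in range(trials):
        # Each respondent backs the leader with probability true_support.
        votes = sum(rng.random() < true_support for _ in range(sample_size))
        if votes * 2 <= sample_size:
            wrong += 1
    return wrong / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (50, 500, 2000):
    frac = share_of_polls_with_wrong_leader(0.52, n, trials=500, rng=rng)
    print(f"sample size {n:>4}: leader not ahead in {frac:.0%} of polls")
```

With a small sample the trailing candidate appears level or ahead in a large share of polls, and that share shrinks as the sample grows, which is the uncertainty an interactive graphic can make visible.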
Chris: Just this past weekend, I was playing around with my family. They have one thing where they ask you questions about how you say certain phrases, like whether you say soda versus pop. Or whether you say garage sale versus rummage sale, and they ask you 20 questions. Then, they can kind of figure out what part of the US you're from. It was quite accurate for the people that did it in my household. But, you think about the text mining that goes on behind that. There's probably some sophisticated data set, data collection, and some text mining models that go in behind it. But, the way it's presented is very easy to understand, easy to digest. It's fun. And it ends up giving accurate results. I think that's a great example of a "modern analysis" that doesn't require deep learning. It just requires good statistical thought, and data collection, and understanding.
Hugo: It speaks to the fact that as data scientists and statisticians, we're also storytellers.
Chris: Exactly. That's why we got into this, is to tell stories with data. That's what makes it interesting.
Hugo: So, something we haven't touched on is what you actually use in terms of languages, and libraries, and technologies. What do you like to visualize with?
Chris: Well, we're mostly an R shop. So, we use a lot of the R tools: certainly ggplot, the tidyverse, and other packages that are out there for R. I like to use plotly for some simple interaction and visualization. Here at AT&T, we developed an open source tool called Nanocubes because we saw a need for very large scale geospatial analytics. With Nanocubes you can visualize billions of points on a screen because the data is stored in a computationally efficient manner. So, sometimes we use things out there, sometimes we develop our own as part of our research. But, we'll try anything.
The Future of Data Science
Hugo: Awesome. So, we've discussed a lot about what modern data science looks like today. What does the future of data science look like to you?
Chris: Well, I think... I've been really fortunate to be in a field that has kind of taken off during my professional life, you know. Nobody used the phrase data scientist back when I started; it was kind of a niche or fringe field. But, now it's pretty mainstream. You see data science everywhere. Every company has data scientists now. People who are going to school in all kinds of different fields feel like it's necessary to take a data science course. There are great resources out there. I'm a big fan of the Coursera courses taught by the Johns Hopkins faculty, for helping to train people in data science. So, what's ended up happening is that there was a great democratization of data science. You don't have to be a PhD in statistics anymore to do really good data analytics. You can learn it from taking courses online, and using your laptop. And you can go and collect your own data, and do great analytics out there. And there are some really fantastic blogs and Twitter feeds out there from people who are doing great things with data. So, I see this lowering of the bar for data scientists. I think we'll be seeing a lot of fantastic work coming not necessarily from AT&T and other big corporate labs, but from average people in their basements doing data science.
Hugo: I wonder what else we can do from this side of the data science divide because I know people who are journalists, for example, who think to get into data journalism they need to learn about hierarchical linear regressions. Right? Or they need to have done six years of computer science. So, what can we do as educators and practitioners to help bridge that gap?
Chris: I think it's a balance. Right? For instance, I think statistics classes help. I think having some statistical thought helps: understanding the uncertainty in data, and how data is generated. Maybe a little bit of distributional thought is good. You don't have to take it to a huge level. But, I think a combination of understanding some of the basics and some of the theory, along with just immersing yourself in data and trying to solve problems with data, is the best way to get people started in this industry, and to help them contribute to the great data science out there.
Hugo: As we move from modern data science, what's happening today, into the big future of data science, what's a final call to action for all of us?
Chris: I think we all really need to push for openness of data. I think there's an incredible opportunity if governments are open with data, given all of the data science talent that's out there right now. I think there's an incredible opportunity for regular citizens. And they use this phrase, citizen science, or citizen data science. There's an incredible opportunity for people to help make society better, and to make society greener. And to help understand how societies work if government data is made available and it's made... And then people are free to use their skills to analyze it. And find great insights. I think it's really exciting to think about all of the great insights that we could be exposed to, if the data is exposed to everyone and people are given the opportunity to analyze it.
Hugo: Yeah. And I think, a good example of that is, we spend a lot of time in New York and the fact that the city makes all its subway data open. Civilians can go in and probe, and use the MTA API to check out how the subway is working and these types of things.
Chris: That's great. I think that extends to all realms of government data, whether it's economic data, or financial data, or health data, of course with the right privacy protections. I think the type of efficiencies that we can extract out of that data will be tremendous.
Hugo: Chris, it's been an absolute pleasure having you on the show.
Chris: Thank you so much, Hugo. I enjoyed the conversation.