Data Science at Doctors without Borders (Transcript)

Hugo speaks with Derek Johnson, an epidemiologist with Doctors without Borders. Derek leverages statistical methods, experimental design and data scientific techniques to investigate the barriers impeding people from accessing health care in Lahe Township, Myanmar.

Here is a link to the podcast.

Introducing Derek Johnson

Hugo: Hi there Derek, and welcome to DataFramed.

Derek: Hi. How are you?

Hugo: I'm very well. And it's great to have you on the show. How are you?

Derek: I'm doing great. I'm excited to be on the show today.

Hugo: This is super exciting. I'm in Sydney, Australia at the moment. Whereabouts are you?

Derek: I am speaking to you from Southern Myanmar in a place called Dawei.

Hugo: And are you in an office there or whereabouts are you?

Derek: Currently in Dawei, I'm actually in an office. I'm in an office that's kind of attached to a health clinic here in Dawei. I work with Doctors Without Borders and we have been running an HIV clinic out of Dawei for the last 18 years. So that's kind of where I'm speaking to you from for the moment.

What are you doing for Doctors Without Borders?

Hugo: Okay, great. What else are you doing in Myanmar with Doctors Without Borders?

Derek: Currently we have two projects. We have the HIV clinic in Dawei which has been operational for about 18 years, but we have also recently opened up a project in Northern Myanmar in Nagaland in a town called Lahe, and by comparison, that's only been up and running for about a year now. We don't necessarily do a lot of HIV up there, but we're doing more of supporting the health infrastructure up there and doing more general health support. So very, very different than what we do in Dawei.

Hugo: So these are two very different projects that you're involved with. I'd love for you to abstract from that a little bit and let me know, or tell us a bit more about your general role at MSF. For listeners out there, we'll be referring to Doctors Without Borders as Doctors Without Borders, Médecins Sans Frontières, and MSF which is an acronym, interchangeably. Hopefully you can stick along with that.

What is your role at MSF?

Hugo: So Derek, yes. Tell us more generally about your role at MSF.

Derek: I'm actually currently the epidemiologist here. Basically, I help out with a lot of operational research and a lot of monitoring and evaluation. In Dawei, where I am right now, we're doing a bit of operational research on Hepatitis C. We are currently trying to treat people for Hepatitis C in Dawei. Recently, the price for treatment has dropped considerably from tens of thousands of dollars for treatment down to a few hundred dollars to treat people. And so here, we're trying to sort of scale up the treatment for Hepatitis C because it's more prevalent here than in other places in the world. In Lahe, I do operational research as well, but I also do a lot of monitoring and evaluation: for our doctors who go out to the villages in Lahe township, we're kind of just monitoring the routine data that they collect there. But we also have recently done a big baseline health survey that is representative township-wide and assesses the health of several different villages there.

How did you get into Data Science?

Hugo: Fantastic. And that's something that we're going to get into a lot later in this conversation. But before that, you said you worked as an epidemiologist which of course, involves a lot of statistics and thinking about data management, data provenance, all of these types of things. More generally, a lot of things which intersect with data science. And I'm wondering how you think about data science, how you got into it originally, and what your path has been to end up in Myanmar working for MSF.

Derek: I kind of have an interesting path to data science. Originally, I was actually a biochemistry major way, way, way, way back in undergrad decades ago. I guess I kind of realized it very early and ... I got lucky, and realized that working in a lab, while very challenging and super rewarding, is not very social. There were internships that I worked in where I used to run kind of like a mass spec machine. A mass spectrometer. And back in the day, these things were huge, like the size of a car. Basically, I would just be sitting in a basement and the only people I would see would be occasionally my supervisor, maybe one other person, and that would be for 8 to 10 hours a day. I was like, I can't do this. I have to do something a little more adventurous. And so I kind of got into public health, but I still wanted to do a lot of science. I still wanted to do a lot of research. And so epidemiology, it's a great mix. You take a lot of data, and you get to collect a lot of data, which is really fun, but you get to do that kind of in the field, so to speak. You don't necessarily have to be going to the Northern parts of remote areas to do it. You can do it in a clinic. You can do it with just the data that people collect. And then you analyze it, and then you look for patterns or outcomes, or how things are associated with one another. It's like puzzle work, mixed with being a little bit of Indiana Jones. It's great. I love it. I'm super, super lucky to be able to do this.

Hugo: For sure. And how did you actually get involved in MSF originally?

Derek: Initially, I was in grad school and I used to do a lot of HIV work, and STDs in general. I ended up getting very lucky and getting a chance to go to Malawi to look at ... Initially, I was looking at using an HIV drug called tenofovir as a first line HIV drug and for people who know about antiretrovirals and HIV drugs, that's how old I am. Tenofovir became a first line drug in 2010 in Malawi so that's a long time ago. So I just kind of got into that. I was like, hey, this is really great. I can do data science. I can do my own surveys. Collect my own data. Analyze it. Use a lot of statistics. But then get to travel and get to meet really interesting, really crazy people. It was really great. From there, when I finished school, I was thinking, well, what to do now? And so, I met somebody who actually worked on ... MSF has refugee boats in the Mediterranean, off of Italy, and he worked on one of the boats helping refugees. Basically, the boats that would come from Northern Africa with two, three, five hundred people, trying to get to Europe. MSF has these boats that kind of help. Basically, just bring people in to make sure that they're safe, they don't die, they don't drown. And he said it was the most intense experience he's ever had. And I was like, "I'll join MSF. I'll see if I can look at the data and things they collect." And it turns out that, because MSF has operations all over the world, they need a lot of help with data science because that knowledge that they gain helps formulate and helps shape their policies. Much in the same way that when you work in a lab, that you publish a paper, people are like, "Oh, this molecule is related to another molecule. We're going to go with this." And that helps shape science. You get to help shape health policy of sorts.

Hugo: For sure. And I love that you mentioned the data collection process is something that you're very passionate about and involved in, and that's something we'll get to, particularly with respect to the baseline health assessment you've been doing in Lahe township because I think a lot of working data scientists and a lot of our listeners, when they think about data science and data collection, they think clicks, and browser-based stuff, and stuff that's put in a database as opposed to getting there on the ground into remote areas and conducting the types of surveys you do. That's a little teaser for where we're headed. But before we get there, I think a lot of people have a sense of what Doctors Without Borders does, but I'd love a brief rundown from you on the work that Doctors Without Borders does from your position on the ground.

What does Doctors Without Borders do?

Derek: Doctors Without Borders basically is a kind of an impartial and neutral NGO that provides healthcare to anybody, particularly in humanitarian crises. Be that natural disasters like earthquakes or typhoons, or more like manmade disasters like in war zones. They provide humanitarian relief. And so what Doctors Without Borders does is they go to these areas and they basically provide humanitarian aid. Most of the projects for example that I'm a part of, tend to be very short term. Before I came to Myanmar, I was actually in a refugee camp in Northern Uganda doing actually a big household survey on mosquito nets, but in the refugee camp itself. So we were just living there and doing work there but it was short. I was only there for about four months. Compare that to a lot of development work, for example like Red Cross or USAID, and there's plenty of development NGOs, where they can be in an area for years, for decades. So that's kind of what MSF does, is provide humanitarian aid to places in crises.

What can data science do to solve problems?

Hugo: Great. Given that context around what is essentially a mission statement of MSF, what are the biggest challenges that the organization faces that data science and analysis and analytics can help to solve?

Derek: Basically, getting to know, I would say, in my experience, getting to know the context of where a health clinic works, and how do you take that knowledge and then inform the policies that bring your decisions forward. Because MSF works in so many different areas and so many different countries, it's critical that they know about the context that they're working in, and every place is different. When I was in Uganda, that is completely different than what I'm doing here now in Myanmar. The people who help decide what to do in terms of projects, and in terms of policies, they need as much information as they can get. And so as an epidemiologist, my job is to basically do monitoring and evaluation of some of these projects, and then putting together internal reports that people can read and then be like, oh, okay, these people that are taking Hepatitis C drugs, for example, maybe 25% of them don't complete their treatment. What are we going to do about it? Or in the case of like when I was in Uganda, we were looking at passing out mosquito nets to people in the refugee camp but the problem with that also is that a lot of times, people will tend to misuse the nets. A lot of times they'll use them for fishing, and I didn't know this. I actually didn't know this until I started working there. Until we started collecting the data to look at this. The nets are really strong, and they're just perfect for catching fish. And they're already in a little basket type shape and you just throw it in the water and then they're actually really good for fishing, which is a total misuse of the nets. The survey that we did there, helped us form these educational campaigns. So we would actually go to different parts of the camp and be like, "Hey guys. Don't go fishing with this. This is more for mosquitoes." And so it really helped get people to understand to use the nets better, and that's the purpose of the data.
So you get the data, and that actually helps shape what everybody is going to be doing. So it's cool in the sense that if you stick around a project long enough, you actually get to see your results be translated into action which is something that I know when I used to work in the States, sometimes you can publish a paper, and you just never see the results. And then it gets really depressing when you realize that only like three people have cited your paper and you're just like, "Oh, okay." Like what's the point?

Hugo: Yeah. As I said before, it really seems like data science in this form is so different to what people think of when they think of tech data science for example. Maybe you could speak to the differences between the two that stand out most in your mind.

Derek: Yeah. It's different. It's funny because at its core, the statistics and study design are very, very similar and very much kind of the same. Which I think is very fascinating that you can take your survival analysis or your logistic regression or your cluster designs and you can apply it to schools in England or you can apply it to villages in Northern Myanmar. But at the same time, the data ... I'm getting used to finding messy data, and it's smaller data. Nowadays, people are used to having these gigabyte-size datasets with hundreds of thousands of observations. Sometimes, like here, a lot of the times you might only get a couple hundred people and so you're not actually working with big data. And when you have problems with your variables, like if you have a lot of missing values for example, you might have to start getting creative and do things like imputation, or you might have to drop a question in general, just be like, "Oh, it didn't quite work out." And so it's very different than using larger datasets.
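
As an aside on the imputation Derek mentions: the sketch below shows one of the simplest approaches, filling missing answers with the median of the observed ones. This is only an illustration under stated assumptions. The function name `impute_median` and the example values are hypothetical, and the transcript does not say which imputation method, if any, was used on these surveys.

```python
from statistics import median
from typing import Optional


def impute_median(values: list[Optional[float]]) -> list[float]:
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # With nothing observed there is no basis for a fill value.
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical household sizes with two missing answers.
household_sizes = [5, None, 7, 9, None]
completed = impute_median(household_sizes)
print(completed)
```

With only a couple hundred respondents, a single fill value like this shrinks the variance of the variable, which is one reason dropping a badly answered question can be the safer choice.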

Hugo: And even the data collection process is very different, right? In terms of what you do when you're running surveys, whether it be with pen and paper and then putting them into your spreadsheet or computer program or database later.

Derek: Yeah, collecting the data, I love it. That's the fun part. The best part of it is when you actually get to go to the field. For example in Lahe, we actually got to take motorcycles and go from village to village. It was kind of like off-roading through the mountains in Northern Myanmar to go from village to village, and then you do these household based surveys when you get there. And so you're right there to collect the data and see how it's collected and see what questions work, what questions don't. It gives you a better feel of where the data comes from. It's nice actually that you get to actually see the birth of your data. That sounds sappy but it's kind of like how your data is generated-

Hugo: No, that's a wonderful-

Derek: Yeah.

Hugo: Yeah.

Derek: Probably the most nerdy thing I've ever said.

Hugo: You understand the data provenance in that sense and you know what all your units are. It's not like you're being handed a CSV or pinging an API where you may not actually have the correct assumptions about your data.

Derek: Exactly, exactly. It's interesting. In MSF, I've done surveys where we use electronic means of data collection. I've used both Epi Info on a smart phone and a program called Dharma. It's a service that was actually started by an ex-MSF staff member. She worked in the Ebola outbreak. Not the one that's currently going on but the one a couple years ago, and was like, "We need better data collection tools. We need almost real-time information about what's going on because things in outbreaks like Ebola change so fast." She was actually an epidemiologist and she ended up just creating this program that allowed for, not only data collection, but it also projected just trends and statistics on the dashboard. It was really good. But at the same time also, like in Lahe, because our teams are out for a week at a time, and it was very rural and rugged, we resorted to paper based surveys which are horrible. I think from here on out I think I'm just going to have to go with the electronic data collection route. Because with paper based surveys, there are no checks and balances. So if you're interviewing somebody and they say their age is really 10 years old, but you put down 100, there's no little automatic check that will be like "beep," like, oh, that doesn't make any sense.
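
The automatic "beep" Derek describes is the kind of entry-time validation an electronic form can run. Here is a minimal sketch, assuming a hypothetical `check_age` function and illustrative thresholds (reject outside 0 to 120, ask for confirmation at 90 or above). It does not reflect how Epi Info or Dharma actually implement their checks.

```python
def check_age(age_years: int) -> str:
    """Classify an entered age as 'ok', 'confirm', or 'reject'.

    The thresholds are illustrative assumptions, not survey policy.
    """
    if age_years < 0 or age_years > 120:
        # Impossible value: refuse the entry outright.
        return "reject"
    if age_years >= 90:
        # Plausible but rare: ask the interviewer to confirm before saving.
        return "confirm"
    return "ok"


# The mistyped age from Derek's example: 10 entered as 100.
for entered in (10, 100, 130):
    print(entered, check_age(entered))
```

A soft "confirm" tier matters here because a mistyped 100 is still a possible age, so a hard range check alone would let it through.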

Hugo: No ability to test your data.

Derek: And then also, you have to enter the data which takes double the amount of time. So not only are you collecting the data, but then you have to get somebody to enter the data which takes forever. So electronic data, I'm glad MSF is adopting smartphone tech for entering and collecting data. It makes things a lot easier.

Hugo: For sure. But as you said, if you're in a region where you may not have access to electricity for a certain number of days, there are only so many battery packs you can carry, right?

Derek: Yeah, yeah. That's the problem. It was decided finally that because we're spending about a week in the field at a time ... So the Lahe survey went over the course of about two months, and people would go out for a week, come back for a day or two to rest, and then we would go out again for a week. But these are areas with no electricity, no cellphone coverage for the vast majority of it, until you start getting to the border of India. And then India has some pretty great cellphone coverage, or at least in that region. But you can't charge your battery pack or your phone and then, if you actually drop your phone, or something happens to your phone, you lose a whole week of data out of a two-month survey process. And it would be very hard to go back and redo it. So we decided to go with paper surveys and then we just had waterproof folders. So it has its pros and cons. I tend to be a little bit more tech oriented and I was like, "But we could've had solar chargers or we could've had four battery packs. We could've gone out for two days at a time." But in the end, we ended up doing paper.

Language Translation

Hugo: For sure. And you very kindly sent me through the study protocol for this baseline health assessment in Lahe township and there were so ... My eyes were stuck to this PDF while I was reading it. There were so many interesting things in there, particularly with respect to the data collection process. One thing that stood out to me was that it actually said that the local Naga dialect is a non-written language. So I presume you had translators there who were speaking Naga to the locals and then writing down the data in a different language or ... Can you just give me the rundown on that?

Derek: It's interesting and it actually comes up a lot when we do surveys in remote areas. Particularly areas with several different languages or different dialects. What ended up happening is we recruited ... we had about 34 people for this study between the drivers and the data collectors. About 12 data collectors and the rest were drivers because we had to carry supplies on motorbikes. We actually needed quite a few motorbikes to just carry tents and to carry food and to carry water and things. The data collectors themselves were actually recruited locally and they speak Burmese, they speak Myanmar, but when you get to parts of Nagaland, like you were saying, they speak Naga, Nagamese, which is like an umbrella term for a lot of local village languages, and a lot of these languages are not written down. The paper surveys themselves were translated into Burmese. The people we hired, they spoke Burmese but they also spoke their own village dialect. And so they'd be speaking in their own dialect but then translating it back to Burmese. Usually that causes a lot of problems in surveys, particularly if you're asking about complex things or behavioral questions. But this survey was a lot of health questions and so a lot of it was things like, have you been coughing for over two weeks? Yes, no. Is there blood in your cough? Yes or no. So we didn't have to worry too much about the translational differences but that does become a problem. That definitely does become quite a problem if you start asking more sensitive questions. In a lot of refugee camps, if you start asking people about why did they flee? Why did they run away? That's actually quite a different question and it's open ended. And that requires that you actually have the proper translator or people speaking the same language as the person you're interviewing otherwise your data gets to be a little funky and not representative of what you're doing.

Baseline Health Assessment Project

Hugo: Now I'd like to kind of step back a bit and talk about the baseline health assessment project in Lahe township as a whole. Maybe you can give us the rundown as to the motivation behind it and how it played out in practice.

Derek: Yeah.

Hugo: And what is it?

Derek: When MSF decides to open up a new project, they first do what's called an exploratory mission. They'll have a couple doctors and a couple logistics people. They'll go to an area and they'll do a quick assessment to see if there's any sort of glaring health needs. In Lahe for example, they went up there and they found that the access to healthcare was incredibly poor. Most people couldn't access a health clinic if they wanted to. And at the same time, there was a lot of just basic infectious diseases that would go untreated. There was this idea to, okay, now that we know that there's this general need, we need to be more specific. So usually what happens is, after an exploratory mission, they then do a baseline health assessment which is a much more formal scientific health assessment of the needs in a particular area. That's kind of where the epidemiology comes into play because you end up designing a survey to collect the data for that area. And then what type of information you want to collect. So the baseline health assessment had a couple different parts. There's your basic demographics. There's health seeking behavior like where do you go if you are sick? How often do you go to a doctor? But then there were also assessments on malnutrition for children. We assess the nutritional status of children under five years old and then we also tried to assess the vaccination status. And so there's a couple different components to this survey, which was pretty long, but it's a very cursory, very general overview. One of the things we found in this for example, is that there's a lot of respiratory illness. And because this is all self-reported, just things that they mention, we try to get to what do they think it is but people don't go to a hospital and there's just no way to really tell what it is unless you go to a doctor.
So we ask all these questions about it but in the end it all becomes still fairly basic information captured in a very specific kind of scientific way. For example, for this particular baseline survey, the Naga villages in Lahe, they're spread out and in the mountains so it's very clustered. So a village might be an hour and a half drive away from the next village but really, it would only be about 30 kilometers away. But it just takes you a long time because you're literally driving on a dirt path. The problem is, within the Lahe township, there's about 107 villages. How do you sample enough villages to represent Lahe township and then within those villages, how do you sample enough houses to make sure you're actually capturing the information in the village? So you end up with this two stage cluster design to your survey which is pretty neat actually. It's different than health surveys I've done before. In Uganda, it's a big refugee camp and it was split into ... Well, where we were in Northern Uganda, the refugee camp was split into six different parts, and we almost pretty much did sort of random sampling in each of the parts. So we didn't do as much of a clustered survey design. It was a little easier to do. That's kind of how the data collection process happens with that. It was kind of neat because it actually ... we mixed a little bit of old tech and new tech to this. The WHO, the World Health Organization, has recommendations for cluster designs, how to sample it, dating all the way back to the 70s where, basically when you get to a village and you're trying to randomly select households, you throw a pen up in the air and then wherever it lands, where that pen is pointing, that's the house you go to because they didn't have all these smartphones and all this tech.
Now what we did, it's kind of nuts, that you can actually get satellite imagery of different villages in Lahe township and so we put that into QGIS, drew the borders around the villages, and randomly dropped points into that picture. Then we're like, "Okay, that point is closer to this house. That's the house you're going to go to." And so instead of doing the pen method from back in the 70s, we actually did this GIS way of selecting the households within the village. So it was kind of cool actually. It was pretty neat. I got to work on some GIS work which was pretty good.
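
The household selection Derek describes (draw the village border, drop random points inside it, visit the house nearest each point) can be sketched in plain Python. This is an illustration only: the team used QGIS, and everything below is an assumption made for the example, including the function names, the square village border, the house coordinates, planar coordinates in meters, and the choice to skip a house once it has been selected.

```python
import math
import random

Point = tuple[float, float]


def point_in_polygon(p: Point, polygon: list[Point]) -> bool:
    """Ray-casting test: True if p lies inside the polygon."""
    x, y = p
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Count the edges that a horizontal ray from p towards +x crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_in_polygon(polygon: list[Point], rng: random.Random) -> Point:
    """Rejection sampling: draw from the bounding box until a point is inside."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    while True:
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(p, polygon):
            return p


def select_households(polygon: list[Point], houses: dict[str, Point],
                      n: int, rng: random.Random) -> list[str]:
    """Pick up to n distinct houses, each the nearest one to a random point."""
    remaining = dict(houses)
    chosen = []
    while len(chosen) < n and remaining:
        p = random_point_in_polygon(polygon, rng)
        nearest = min(remaining, key=lambda h: math.dist(p, remaining[h]))
        chosen.append(nearest)
        del remaining[nearest]  # assumption: never pick the same house twice
    return chosen


# Hypothetical village: a 100 m square border and five mapped houses.
village_border = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
houses = {"H1": (10.0, 10.0), "H2": (80.0, 20.0), "H3": (50.0, 50.0),
          "H4": (20.0, 90.0), "H5": (90.0, 85.0)}
chosen = select_households(village_border, houses, 3, random.Random(42))
print(chosen)
```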

Hugo: That's really cool. How many people are there in Lahe township and how many villages and how many people did you end up interviewing over what time scale? I've just asked you four questions actually so that's far too much, but just trying to get a general idea of quantities here.

Derek: Lahe township is one of three parts of the Nagaland area that's in Myanmar. The majority of Nagaland is actually over the Indian border but there's three sections within the Myanmar side and Lahe township is one of them. Lahe township itself, it's in the mountains and there's not a lot of people that live there. It's about 120,000 people. Lahe town is kind of the center of it and that's about 3,000 people. There's about 107 villages that are in the township and these villages will move every couple of years. Something we were told to do beforehand is that when we selected a village, we had to send a team out there to make sure that village was still there. We selected 30 villages to represent this township, and this check is called ground-truthing, where you just go out there and you're like, "Okay, does this village exist? Are there enough houses to do this survey?" It happened once where a village had relocated 10-15 kilometers away because they had water problems. The stream that was providing water to this village dried up a year or two ago and so everybody just sort of migrated, pretty much to find water. That's kind of the context of Lahe is that it's mountainous and it's sparsely populated. The baseline survey itself took about two months. We did it in 30 villages and we chose those villages based on population size, or as close to population size as we could get. So we had basically kind of the size of population in the village from a Myanmar census and then the chance of the village being selected was weighted on the size. For example, Lahe town itself was chosen twice because it's about 3,000 people whereas the next biggest town we went to had about 700 people. So we chose the 30 villages and then within each of those villages, we did 30 households.
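
Derek says each village's chance of selection was weighted by its size and that the largest town was chosen twice. One common way to get exactly that behavior is systematic sampling with probability proportional to size (PPS), sketched below. This is an assumption about the general technique, not a record of the team's procedure. The function name `pps_systematic` is hypothetical, and apart from the 3,000 and 700 figures from the interview, the village names and populations are made up.

```python
import random
from itertools import accumulate


def pps_systematic(populations: dict[str, int], n_clusters: int,
                   rng: random.Random) -> list[str]:
    """Systematic PPS: step through cumulative population at a fixed interval.

    A unit larger than the interval can be selected more than once.
    """
    names = list(populations)
    cumulative = list(accumulate(populations[name] for name in names))
    interval = cumulative[-1] / n_clusters
    start = rng.uniform(0, interval)  # random start within the first interval
    selected = []
    idx = 0
    for k in range(n_clusters):
        target = start + k * interval
        # Advance to the unit whose cumulative range contains the target.
        while idx < len(names) - 1 and cumulative[idx] <= target:
            idx += 1
        selected.append(names[idx])
    return selected


# Hypothetical sampling frame: only the 3,000 and 700 figures come from the interview.
villages = {"Town": 3000, "A": 700, "B": 500, "C": 400, "D": 600, "E": 800}
sample = pps_systematic(villages, 4, random.Random(0))
print(sample)
```

In this toy frame the interval is 1,500 people, so the 3,000-person town always spans two sampling steps and appears twice, which matches how a large town can end up contributing two clusters.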

Hugo: So that's around 900 you surveyed in total?

Derek: Yeah, about 900 houses. The average household size came out to be about seven people. The interquartile range was between five and nine people per household. There's a common practice of intergenerational living, like kind of joint families. So you'll have younger children with the mom and the dad and then you'll have the grandparents and sometimes you also have aunts and uncles that live in a house as well. Household sizes can actually be quite big. It ended up being a little over 5,000 people that we would represent. We didn't interview all 5,000. You just took one member of the household and then just asked questions about everybody to that one person. That way, not everybody had to be present.

Hugo: How do you choose which member of the household?

Derek: It actually came down to a little bit of cultural acceptance. Usually it was the father of the household or the male figure. A lot of times, though, people do a lot of agricultural work and so if it was in the middle of the day, a lot of people would already be out in the field. What ended up actually happening more often than not, was a lot of times we did get a lot of female household heads that would answer for the survey. We also got a lot of grandparents that answered for the survey. The only requirement really to be considered a head of household, was that you had to be over 18 and you had to be able to answer questions for everybody in the household. Those were the only inclusion criteria to be deemed the head person to answer the survey.

Hugo: How are the insights that are gained from this baseline health assessment turned into actionables and deliverables for MSF?

Derek: Currently, actually, because we found a lot of respiratory illness with this survey, we're actually helping to put in what's called a GeneXpert machine at Lahe township hospital and so it's to help test for tuberculosis. The old school way of testing for TB would be to hack up your lungs and then you spit your sputum onto a slide and then you stain it, and then you have to have a trained health professional to look at that slide and be like, "Oh, that's your mycobacteria. You have TB." Which is not very sensitive or specific. It's got a sensitivity and specificity in the 50s or 60s so it's pretty crap. But with this technology, the GeneXpert machine, it's highly accurate and it basically can give you test results in a couple hours. But it's expensive and it requires a bit of a constant electrical supply and we wouldn't have put it in unless we knew that there was a high amount of respiratory illness within the area.
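
For readers less familiar with the terms: sensitivity is the share of people who truly have the disease that a test flags as positive, and specificity is the share of people without the disease that it correctly clears. The short sketch below computes both from confusion-matrix counts. The counts are made up purely to land in the "50s or 60s" range Derek mentions and are not results from any study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly diseased people the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of disease-free people the test correctly reports as negative."""
    return true_neg / (true_neg + false_pos)


# Made-up counts for 100 people with TB and 100 people without it.
sens = sensitivity(true_pos=55, false_neg=45)
spec = specificity(true_neg=60, false_pos=40)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```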

Uganda Refugee Camp

Hugo: And is this type of actionable developed from the insights gained from the health assessment similar to other projects at MSF? For example, your work in Uganda at the refugee camp?

Derek: It's pretty similar. A lot of the operational research MSF does, it's all geared towards producing actionable results. One of the surveys we did was on mosquito net use because there was a lot of malaria at the time. And malaria is cyclical. So when the rainy season came about, that's when you would see spikes of malaria. We did this assessment on net use right before the rainy season and were like, "Hey, does everybody have bed nets? Or how do you use bed nets?" The results from that kind of led to mass mosquito net distribution, and also the educational campaign to make sure you use mosquito nets right. A couple months after the survey, we ended up actually passing out a whole bunch of mosquito nets. It ended up being close to 13, 14 thousand mosquito nets for the area. It's a great way to see your data in action, and it's great. That's probably the number one thing I like about doing epidemiology for MSF is that if you're lucky, and you stick around long enough, you get to see your results turn into something actionable.

What is the data scientist's place within organizations?

Hugo: So Derek, something I've been thinking a lot about recently is how we don't necessarily have good models for ... or we haven't settled on a global model for how data scientists, statisticians, statistical modelers, are embedded in organizational structures in businesses. So I'm wondering how data science is embedded in the organizational structure of MSF.

Derek: Data has always been there in forms of like internal reports, but the operational research aspect of it is actually somewhat new. And it's funny because you can kind of look at ... like in the early days, MSF never really published a lot of scientific data but then all of a sudden there actually kind of became two big data hubs in MSF. So there's Epicentre in Paris, which is a big repository of a lot of the data that MSF collects, and they also provide a lot of support when it comes to designing various studies, just in terms of study design. And then there's the Manson Unit in England which does the same thing. Epicentre tends to focus on a lot of the French speaking countries and the Manson Unit tends to take a lot of the English speaking countries. But these are two big data repositories so MSF has actually become quite serious in the last five to seven years. More like seven years, on how data is pretty much the best way to inform your policy decisions. It's hard to argue with the numbers. And before, especially with a lot of the work that MSF does, a lot of it can be a little controversial, and a lot of it can be quite risky. So when you have the hard science and the hard numbers in data there, it adds a lot of weight to your policy decisions. For example, with the Ebola crisis that happened a couple years ago, it was very important to have real-time data of where cases were clustering because that's where you would send your health workers. Not just your doctors but educational people. Like your health promoters to be like, "Oh, if somebody is bleeding, bring them to a doctor but don't touch the blood. It's a good way to prevent the transmission of Ebola." But you have to do that fairly fast because Ebola, the incubation time is only a handful of days and then the mortality rate of it, it actually would kill people in about two weeks time. So you had to act really fast.
And so the data that you collected was the best way to inform your decisions, otherwise you'll just be arguing over what you heard from people in the village at the time and it would take forever to do something.

Future of Data Science at MSF

Hugo: Something you mentioned earlier was that you see more and more abilities for tech to be used in the work MSF does and the type of surveys you've been doing. What else is there in the future of data science at MSF which isn't there or hasn't been discussed in this conversation?

Derek: Something that I kind of want to see used a little more, particularly when it comes to health promotion, and something that I'm kind of playing with a little bit, is network analysis. So social network analysis, particularly for health promotion, in addition to outbreak epidemiology. That's kind of what most people think about when they think about network analysis and infectious disease, but it's also a great way to find who in the community has the most influence, and who you really want to target with your health promotion behaviors. So if you're trying to get people to do something fun or something silly, like brush their teeth more, who do you really want to talk to the most? Do you want to talk to the children? Do you want to talk to the moms? Or maybe it's the grandparents that happen to have the most sway. By doing a network analysis, you can see who has the most connections and then you can see the strengths of those connections, and rather than target an entire village and be like, "Hey, everybody. Brush your teeth," and have a big rock concert on it, you can just target a handful of people knowing that they would spread the message to a good chunk of the community.
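As a rough illustration of the idea Derek describes here (who has the most connections, and how strong those connections are), the sketch below counts each person's ties and sums their tie strengths. The people, ties, and weights are invented for illustration, and the function names `centrality` and `top_influencers` are hypothetical; this is not a description of MSF's actual analysis.

```python
from collections import defaultdict

# Hypothetical contact data: (person_a, person_b, tie_strength).
# All names and weights are made up for illustration only.
ties = [
    ("grandmother", "mother", 3),
    ("grandmother", "father", 2),
    ("grandmother", "child_1", 3),
    ("grandmother", "child_2", 3),
    ("mother", "child_1", 2),
    ("mother", "neighbor", 2),
    ("father", "neighbor", 1),
    ("child_1", "child_2", 1),
]


def centrality(ties):
    """Return {person: (number_of_connections, total_tie_strength)}."""
    neighbors = defaultdict(set)
    strength = defaultdict(int)
    for a, b, weight in ties:
        # Ties are treated as undirected: each one counts for both people.
        neighbors[a].add(b)
        neighbors[b].add(a)
        strength[a] += weight
        strength[b] += weight
    return {person: (len(neighbors[person]), strength[person]) for person in neighbors}


def top_influencers(ties, k=2):
    """Rank people by connection count, breaking ties by total tie strength."""
    scores = centrality(ties)
    return sorted(scores, key=lambda person: scores[person], reverse=True)[:k]


if __name__ == "__main__":
    for person in top_influencers(ties, k=2):
        print(person, centrality(ties)[person])
```

In this toy village the grandmother comes out on top, which matches Derek's point that the people with the most sway are not always the ones you would guess. Real analyses typically use richer measures (for example betweenness or eigenvector centrality) from a dedicated network library.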

Hugo: So it almost hurts me to say this but what we're looking for are influencers, right? This is influencer culture.

Derek: Yeah, yeah. Exactly. Instead of taking the influencers of like fashion and stuff, you can kind of give it a little bit of a health bent and be like, "All right. Maybe if the cool guy were to brush his teeth in public, maybe he has the biggest influence." But it's getting to know who has this influence in a way that's really hard to do with traditional data collection methods. So with a lot of odds ratios and p-values, you can find the strength and magnitude of the association between one variable and another, but you can't really tell, okay, that's just the relation between those things, but what outside of it influences those factors as well?

Hugo: For sure. I like the idea that you're thinking about network theory and network analysis in this respect because as you were saying, network analysis is thought about a lot in terms of infectious epidemiology but we know that it has a huge role to play in even non-infectious epidemiology. I think one of the common examples is, people don't contract obesity from each other but if you have a network and you're connected to more people with obesity then you've got a higher likelihood of having it yourself at some point.

Derek: Oh, exactly. That's actually a really great example of it. Obesity and dietary habits are greatly influenced by your friends and network. I can definitely speak from experience. I used to smoke cigarettes for about six years, seven years, and all my friends smoked cigarettes. And it wasn't really until I moved from ... I grew up in Boston, but moved from Boston to Philadelphia and just got an entirely new friend circle where nobody smoked and everybody was healthy and I was like, "Huh, maybe I should quit smoking." But I mean, if I had never changed my friend circle, I'd still be smoking two packs a day.

What is your favorite data science technique?

Hugo: So, we haven't talked much about the technical stuff that we love so much. You did mention that for sampling you do a two stage cluster sampling methodology. We’ve also been discussing network analysis. But I'm wondering what's one of your favorite data sciencey techniques or methodologies. Just something you love to do when playing with data.

Derek: That's a great question. That's really good. I don't want to sound lame and kind of basic but logistic regression.

Hugo: Right on.

Derek: That's super simplistic but odds ratios are great because everybody understands them, they're easy to calculate, and you'd be surprised how much data you can get with a binary response. Like, do you like mangos? Yes, no. Like, do you use condoms? Yes, no. Are you an injection drug user? Did you go to a doctor last week? Yes, no. Anything that is kind of binary. And then by extension, you can do multivariate logistic regression. You can do it with different levels of categories as well to get different odds. But that works out really well actually, not only when you collect and analyze data but when you present it. So if data science is really going to lead the way in helping to change health policy in this case, you have to be able to communicate your results to people who might not be as like-minded. So while I didn't get into coding about R and all this data management and things, and it's great, and I love it, I know the project coordinator for Lahai, for example, doesn't like it at all. Unless you can put it in a graph, their attention span kind of wanes after like three minutes. And so, logistic regression is probably a great way to communicate a lot of results, in my experience.
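To make the "easy to calculate" point concrete, here is a minimal sketch that computes an odds ratio and an approximate 95% confidence interval from a 2x2 table of yes/no answers. For a single binary predictor, this is the same quantity a logistic regression reports as the exponentiated coefficient. The counts, the "remote village" exposure, and the function name `odds_ratio` are invented for illustration and are not survey results.

```python
import math


def odds_ratio(exposed_yes, exposed_no, unexposed_yes, unexposed_no, z=1.96):
    """Odds ratio with a Wald (Woolf) confidence interval from a 2x2 table.

    z=1.96 gives an approximate 95% interval. All four counts must be > 0.
    """
    estimate = (exposed_yes * unexposed_no) / (exposed_no * unexposed_yes)
    # Standard error of log(odds ratio).
    se = math.sqrt(
        1 / exposed_yes + 1 / exposed_no + 1 / unexposed_yes + 1 / unexposed_no
    )
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


if __name__ == "__main__":
    # Hypothetical question: "Did you go to a doctor last week?" (yes/no),
    # split by whether the respondent lives in a remote village (made-up counts).
    estimate, lower, upper = odds_ratio(
        exposed_yes=30, exposed_no=70,      # remote: 30 yes, 70 no
        unexposed_yes=60, unexposed_no=40,  # not remote: 60 yes, 40 no
    )
    print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these made-up counts the odds ratio is about 0.29, which can be reported in one sentence: the odds of having seen a doctor are roughly 70% lower among remote respondents.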

Hugo: I agree. I agree completely. I've said this time and time again on the podcast, but for people who are non-technical, you can show them that a 10% increase in this feature results in this probability in the outcome. And in that sense, it's interpretable which gives you massive gains.

Derek: Exactly. That, kind of what you just said, is a great way to just explain it. It's short, easy to understand. Just something like that. The odds of this risk factor increasing your risk of contracting HIV are such and such. However, kind of behind the scenes, is the way you sample, the way you collect that data to make sure your analysis is correct. And that gets to be quite complicated. For example, the two stage cluster design thing, using randomly assigned GPS points, that's somewhat complicated and it takes a little bit to do. But in the end, you try to distill it down to things that everybody can kind of digest.
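The two stage cluster design with randomly assigned GPS points that Derek mentions could be sketched roughly as follows. This is only a sketch under stated assumptions: stage one picks villages by simple random sampling, stage two draws random coordinates inside a rectangular bounding box for each chosen village, and the village names, coordinates, and the function name `two_stage_sample` are all hypothetical. The actual survey design may well differ, for example by weighting villages by population.

```python
import random

# Hypothetical sampling frame: village -> (lat_min, lat_max, lon_min, lon_max).
# Names and coordinates are invented for illustration only.
villages = {
    "village_a": (26.30, 26.32, 95.40, 95.42),
    "village_b": (26.35, 26.36, 95.45, 95.47),
    "village_c": (26.40, 26.43, 95.38, 95.40),
    "village_d": (26.28, 26.29, 95.50, 95.52),
}


def two_stage_sample(villages, n_clusters, points_per_cluster, seed=0):
    """Stage 1: randomly choose villages. Stage 2: random GPS points in each."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    chosen = rng.sample(sorted(villages), n_clusters)
    sample = {}
    for name in chosen:
        lat_min, lat_max, lon_min, lon_max = villages[name]
        # Uniform draws in latitude/longitude; a reasonable approximation
        # to uniform-by-area only because each box is very small.
        sample[name] = [
            (rng.uniform(lat_min, lat_max), rng.uniform(lon_min, lon_max))
            for _ in range(points_per_cluster)
        ]
    return sample


if __name__ == "__main__":
    for village, points in two_stage_sample(villages, 2, 3).items():
        print(village, [(round(lat, 4), round(lon, 4)) for lat, lon in points])
```

In the field, each generated point would then be mapped to the nearest household to interview, which is where most of the practical complexity Derek alludes to comes in.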

Call to Action

Hugo: So Derek, my last question for you is, do you have a final call to action for all our listeners out there?

Derek: Yeah, yeah. I guess I'm a little different than a lot of people that come on this podcast in the sense that I don't always use massive data sets and I'm not always crunching numbers in very clean data, but at the same time, I think people should get into data science and get into more field work out there. There's definitely a place for people to get out of the office, collect their own data, analyze it, and then actually have actionable results. Don't let the fear of sitting in a cubicle somewhere just pumping out code dissuade you from doing data science. Data science is for everybody.

Hugo: Fantastic, Derek. It's been such a pleasure having you on the show.

Derek: This is great. I had a real fun time. Really good.
