
Data-Driven Workforce Analytics with Ben Zweig, CEO at Revelio Labs

Richie and Ben explore why hiring is a broken two-sided market, why jobs are bundles of tasks (not skills), building universal taxonomies from billions of job postings, which data careers resist AI, when traditional NLP beats LLMs, and much more.
27 April 2026

Guest
Ben Zweig
LinkedIn

Ben Zweig is the CEO and Co-Founder of Revelio Labs, where he leads the development of a universal HR database built on over a billion public employment profiles and more than 5 billion job postings. He holds a PhD in Economics from the CUNY Graduate Center and teaches Data Science and The Future of Work at NYU Stern. Before founding Revelio Labs, he managed Workforce Analytics projects in the IBM Chief Analytics Office and worked as a data scientist at an emerging-markets hedge fund. He is the author of Job Architecture: Building a Workforce Intelligence Taxonomy.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.


Key Quotes

Getting hired is harder for a variety of reasons. That matching process used to be more costly in a way. If you were applying for a job, you'd write a cover letter, do all this and that. Now you could use AI to apply to a thousand jobs a minute, and there's not a lot of signal in who applies for a job. Employers feel bombarded by applications and they need to select their candidates using technology, so they can't really evaluate cover letters to the same degree. There's a lot of noise in this matching process now, which has made job search and match a lot more complicated in the last couple years. So it's tougher now than it really ever has been.

In terms of roles that are robust, it's tough. There are some sub-domains within data analysis and data science that seem a lot more robust to me. I think a lot of Bayesian statistics is very difficult to use AI for. It really comes down to the fact that there's a lot of judgment calls in how you design a model that works at scale and also works in very small subsamples. You have to really know the data. You have to really know how these populations interact — what is the real source of variance? This subdomain is something that I think is just not how I imagine AI systems will ever behave.

Key Takeaways

1

Stop designing roles around skills; design them around tasks. Skills are attributes of people — they don't change when your job changes. The real unit of a job is the bundle of work activities it contains, and that bundle drifts constantly. Define, review, and benchmark roles by tasks if you want an accurate picture of what your team is actually doing.

2

The entry-level analyst-to-AI-engineer pipeline has already flipped. Consulting firms are now hiring far more AI engineers and far fewer junior analysts. Hiring managers should update role definitions and interview loops for production-oriented AI engineering skills; junior candidates should invest in shipping, debugging, and system-stitching, not just model training.

3

Don't default to LLMs when classical NLP will do the job for a fraction of the cost. Topic modelling, regex cleaning, and traditional language detection remain indispensable at web scale — Ben's team uses them for translation and review-sentiment work because LLM inference would run into millions per month. Match the tool to the scale of the problem.
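As a flavor of how cheap the classical tools can be, language detection — one of the techniques mentioned above — can be approximated by counting stopword hits. A toy sketch (the word lists are illustrative, not a production lexicon):

```python
import re

# Toy language detector: score text by hits against tiny stopword lists.
# The word lists below are invented for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "a", "in"},
    "fr": {"le", "la", "et", "est", "de", "un", "dans"},
    "de": {"der", "die", "und", "ist", "von", "ein", "in"},
}

def detect_language(text):
    """Return the language whose stopword list matches the most words."""
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("The analyst is part of the data team"))
print(detect_language("Le candidat est dans la base de données"))
```

At billions of documents, a lookup like this costs effectively nothing per document, which is the scale argument in the takeaway.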

Links From The Show

Ben's book — Job Architecture: Building a Workforce Intelligence Taxonomy

Transcript

Richie Cotton: Hi Ben. Welcome to the show. 

Ben Zweig: Hey, happy to be here. Thanks for having me. 

Richie Cotton: Yeah, great to have you. Now I get two really common questions around getting a job. People who are trying to get jobs tell me it's really difficult to get hired, and then people who are doing the hiring tell me it's really difficult to hire good people.

Why are those things simultaneously true? 

Ben Zweig: One way to think about it is that hiring is a two-sided market. Let's say you are shopping for some fruit. You go to the grocery store, you pick out an apple, and you say, this looks like a good apple. You pick the apple; the apple does not have to pick you. But when you shop around for a job, that employer also has to pick you.

So there's this two-sided match that needs to happen, and so it's always gonna be more complicated. Same with selecting a university: you can't just pay money and go wherever you want. They also have to select you.

There's a lot of markets that are two-sided in nature. It adds some complexity, some friction, because you have to match based on some criteria. Now, what also makes that difficult is that some types of matching are fairly straightforward. If you're looking for a kidney or blood or whatever, you have to have the right blood type.

And if you have the right blood type, it's a match. It's binary: yes or no, a good match or not. But if you're selecting a university, let's say, you want the right specialization, you want the right location, you want the right culture, vibes, whatever. There's a lot to look into.

And with a job it's even more complex. There's a lot of dimensionality. You have to be the right fit with the company. The position has to align, the timing has to align. There's a lot of complexity to work through, so it's hard to find the right match. And there's a lot of discretion in interviewing and all that.

So I think it's always hard, just in principle. It's just a hard problem to match for and there's not a lot of information. Individuals don't really know what it's like at a company. Companies don't have a lot of visibility into how well someone will perform at that company, what they really want.

But I think it's getting harder, for a variety of reasons. That matching process used to be more costly in a way. If you were applying for a job, you'd write a cover letter, you'd do all this and that. Now you could use AI to apply to a thousand jobs a minute.

And there's not a lot of signal in who applies for a job. Employers feel bombarded by applications, and they need to select their candidates using technology, so they can't really evaluate cover letters to the same degree.

There's a lot of noise in this matching process now, which has made job search and match a lot more complicated in the last couple of years. So it's tougher now than it really ever has been.

Richie Cotton: Yeah. It's interesting you talk about matching, because it sounds a bit like dating, right? There's a whole website called match.com or whatever. So yeah, it's getting the right person to the right job.

Ben Zweig: Yeah. And it's hard to date. It's hard to find a romantic partner, probably even harder.

Richie Cotton: Okay. I guess the question then is how do you do better matching?

How do you make this process more efficient? 

Ben Zweig: I think it starts with having the right information, and then goes into having that information be categorized and ultimately evaluated. So in terms of having the right information, one thing I'd say is that job seekers don't really have a lot of information about what jobs are even out there, what they're like.

People select into their occupations based on their parents, their friends' parents, based on their network, which is very anecdotal. There's university career centers, which are close to worthless. They don't have a lot of information. It's basically like, how do you format your resume?

So I think job seekers don't have enough information, which is a problem because this is a very high-stakes, very important set of decisions that people are making. On the employer side, they also don't have a lot of information. You can analyze someone's LinkedIn profile, someone's background, but these tools are not very sophisticated.

Some of that is what we are doing at Revelio Labs. We're providing information to employers to do smarter talent intelligence, smarter sourcing for candidates, trying to get out there and help employers find candidates in a smart way.

But that's really only half the story. Job seekers also need a smarter way to do it. So having that information is important, but then there's also categorizing that information. Let's take the job seeker's perspective again. First of all, there's a lot of language that people use to describe jobs that is not so well understood.

You or I might know that when one person says lawyer and another person says attorney, those mean the same thing, because we know the deal. We might know that in some specific domains, but then there are other domains where it's actually really confusing. In engineering, sometimes one person says DevOps and another person says CI/CD, and you're like, are they really doing the same job?

Maybe in this context they're really the same. There are cases where you might have a lot of language that is synonymous. So you need to understand how to bucket people, how to make categories, so that you're actually categorizing similar jobs together.

The flip side of that, which is a little harder, is that sometimes you have a job title that could actually mean two different things in two different contexts. Maybe you have a product manager at Facebook, which is really an engineering lead. Maybe you have a product manager at Barclays, which is really a client success manager, and they use these titles in an uninformative, cavalier kind of way.

So I think in order to solve this problem, you really need to understand the content of each job. A job is definitionally, foundationally, a bundle of tasks, a bundle of work activities. And there's not a lot of good data on work activities that's standardized. Creating structure and taxonomies for work activities is a difficult thing to do and isn't widespread.

It's not easy to come by, but that really is the way to find out what jobs are the same job. So that's the challenge. It's a solvable problem, and something that I spend a lot of time thinking about. But really, I think that is the core of the problem: to get common taxonomies of work activities and of jobs, so that you can understand what you're looking at when you're looking at someone with a job title, with a certain amount of experience, or when you're looking at a job posting, and categorize that into things that you do understand.

And then once you have that categorization, I actually think the rest of it's easy, because then you can match with the right information. Search becomes easier, filtering becomes trivial, and the rest of the match process is just interviewing. Which isn't so easy, but it's about making sure there's the right vibe check, the right culture fit, and that's an easier problem once you're actually coming in with the right expectations, the right definitions.

Richie Cotton: Okay. Certainly what you're talking about, people saying words where it ought to be obvious but actually you don't know what's going on, is very close to my heart. As a Brit living in the US, it took me a really long time to work out what arugula was. In Britain we call it rocket. Arugula is rocket; they're the same thing. And very few people know this unless they've lived in both countries, so I understand that getting confused about words is a really common thing.

And then people ask me my job title. I'm a senior data evangelist. What is that? Very difficult to explain.

Ben Zweig: Yeah. 

Richie Cotton: "Podcaster" works a lot of the time. So I love the idea that job titles are meaningless in a lot of cases; you need to work out what they actually are. And you went deep there into the philosophy of what a job is: a bundle of tasks. Can you give me an example of how you define the boundaries of a job? If you're a data analyst, the skills aren't gonna be exactly the same from one data analyst to another.

How do you go from, these are some skills, to, this is a real job?

Ben Zweig: Yeah, it's a great question. So I would actually not start with skills. I would ignore skills in this question, because I think skills are really attributes of people. They're really not attributes of jobs.

I think the job can really be broken down into tasks, and the person can be broken down into a series of attributes, and skills are one set of attributes.

Richie Cotton: Cool. 

Ben Zweig: The reason why I don't like to think about skills and jobs as being in the same kind of space is that you can have a job transform. Jobs change all the time, when skills really don't change.

Someone's in a job, and the next day they say, hey, a client needs something. Who's gonna step in? Who's gonna do this work? We need someone to take over this project. It's gonna be different than what you're doing today. Who can step in?

And then you do it. It doesn't mean the skills have changed, but the job has changed. That happens a lot. I think it actually happens way more than people think. Very often someone comes into a job, and that job is really a collection of responsibilities, a collection of things. And then six months later, a year later, two years later, they're doing totally different things.

Even if someone has the data analyst title, maybe they find themselves morphed into something different, something new. But what are the borders? It's a really tough question. When you have a job, you have a collection of responsibilities.

So let's say it's some distribution of activities that are your responsibility. Maybe you have 20 different things, all with different weights, so it sums up to one, let's say. So you have that distribution, and no one's gonna have the same exact distribution that you do; it's gonna be close enough.

So basically, you look at this distribution of activities and you represent that distribution as a vector, let's say. Then you find out, okay, what are some other jobs, what are some other people with the same job title, where that distance is low?

And at some point you gotta make some hard cutoff to say, okay, we're gonna consider those to be a category, because it's a lot more convenient to think in terms of categories and taxonomies. That's how our human brain works. We are natural categorizers.

So we need to discretize this in some way. By definition, discretization is about throwing away information. We are throwing out variation at some point. Now, you could have a very granular taxonomy and say, all right, at some level, a data analyst and a BI engineer are different.

Because they are different. But if you're in a broader function, let's say you're in the finance department and you're trying to analyze the breakdown of roles, you don't really care about that fidelity. You just want, okay, data people here, and then even broader, you're just like, all right, technical people here.

So there's this hierarchical structure to how you categorize things. At that broader level, of course, you're losing a lot of information, but it's fit for purpose for whatever the downstream use case is for that practitioner. Someone running finances for the organization, trying to report things to investors, really doesn't care about the difference between BI and data analysis. But if you're in talent acquisition, let's say, and you have to recruit people to fill a certain opening, you have to understand a lot of the nuance between this role and a similar role.

So I think these boundaries are fuzzy, but we have to make a cutoff at some point. And where that cutoff gets made should be more a function of the application of these categories, rather than coming in and saying, okay, this is sufficiently distinct by some objective measure.

I think it's subjective in nature.
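The representation Ben describes — a job as a weighted distribution of tasks, compared by vector distance with a hard cutoff — can be sketched in a few lines. The task names, weights, and cutoff below are invented for illustration, not Revelio's actual taxonomy:

```python
from math import sqrt

# Each job is a distribution of task weights that sums to one.
# The tasks and weights here are made up for illustration.
jobs = {
    "data analyst": {"build dashboards": 0.4, "write SQL": 0.3,
                     "statistical analysis": 0.2, "stakeholder meetings": 0.1},
    "bi engineer": {"build dashboards": 0.5, "write SQL": 0.4,
                    "maintain pipelines": 0.1},
    "product manager": {"stakeholder meetings": 0.5, "write specs": 0.4,
                        "statistical analysis": 0.1},
}

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse task-weight vectors."""
    tasks = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in tasks)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm

# Jobs within the cutoff get bucketed into one category; where the
# cutoff sits depends on the downstream use case, as Ben notes.
CUTOFF = 0.5
d = cosine_distance(jobs["data analyst"], jobs["bi engineer"])
print(f"analyst vs BI engineer: distance {d:.2f}, same category: {d < CUTOFF}")
```

The analyst and BI engineer task mixes overlap heavily, so they land in one bucket at this cutoff, while the product manager does not; tightening the cutoff would split them, which is exactly the granularity-versus-convenience trade-off described above.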

Richie Cotton: Okay. This is amazing. I'm starting to imagine like different ways of clustering things down. So I think we can probably nerd out on this like very hard. 

Ben Zweig: Yeah. Let's do it. 

Richie Cotton: But maybe we'll save that for later. First of all, I want to talk a bit about the benefits of this.

Suppose you do have better data on what a job is. You've got these better-defined categories of jobs. Who benefits from this, and how do you achieve those benefits?

Ben Zweig: As an economy, as a society, there are basically two factors of production: there's labor and capital. That's a little bit of econ basics.

And the way we allocate capital is relatively scientific, pretty sophisticated because we have an entire sector of the economy, which is about the science of allocating capital, and that is the financial sector. We don't have a science for how we allocate labor. And it is very ad hoc, very anecdotal.

Obviously there are differences. Like we mentioned, labor markets are two-sided markets. Capital markets are not; they are one-sided markets, so it's a little simpler. But another major reason is that we have armies of people in the world, millions and millions of people, that categorize data manually for capital markets, and they are called accountants.

That is all they do. That is what they do. They are manual categorizers, and there's a bunch of rules. There's generally accepted accounting principles, there's the Financial Accounting Standards Board, there are governing bodies which dictate how certain things should be categorized so that people can make allocation decisions with respect to capital.

Now, in labor markets, we don't have those categorizers. I don't think we're gonna wake up some day and find millions of people saying, you're really this, you're really that. Although people have tried; there have been efforts to do that. Now we have LLMs, and they can do that.

So now it becomes possible to leapfrog where accounting has grown over the last hundred or so years. But yeah, who benefits? Ultimately, the end benefits are about allocation. It's about matching between firms and employees. That is really how labor gets deployed now.

That's a little too abstract, though. I think there are other, intermediate places where there's value. One function I would point to exists in every large organization; it's called people analytics. That is the analytics function for talent management, I'd say.

That may not be the proper definition, but I think it's true enough that people analytics is really about analyzing internal data to find out how to improve the operational performance of the workforce of a company. To do that, they have to figure out where things are going well, where things are not going well, and diagnose problems.

And at the core of all of that, it's about groups of people, groups of employees. Employees have to be grouped somehow. They wanna be able to say, all right, we have a very high attrition rate of engineers in our Denver office. That is something they want to be able to say.

So they need to know who's an engineer, who lives in Denver, who's in a certain business unit. There are categories, and these categories are very often not good. Things are categorized in weird, bizarre ways. I was involved in a project like this at IBM for years, and we were constantly struggling with categories that were redundant, or made no sense, or weren't well-defined. And then the downstream analysis would be suspect, because at its core the categories weren't correct.

Even more direct examples could be in compensation benchmarking. You wanna know what to pay people and what level they should be at. And in order to do that, you have to know: what are similar jobs? You have to define a job the way that the market defines a job, because you wanna benchmark yourself to the market. There are other groups in talent intelligence; we mentioned sourcing. You gotta find out who to hire, where there are enough people, what supply-demand ratios are, to proxy for market tightness.

There's so much analysis that happens within an organization that uses these categories, and it's a lot easier if the categories make sense. But then on the job seeker side, there's also a tremendous amount of benefit. You may wanna know, okay, if I start in a given occupation, where do I go from there?

What are the exit options? What are the follow-on occupations? It sounds like a simple problem, but it doesn't really work unless you have categories of occupations. Or they may wanna know, should I develop this certain skill? We mentioned skills are key attributes of people.

These are the inputs into completing these work activities. Is it worthwhile to invest in this input? That depends on what outputs it creates, and what the associated salaries or career trajectories are for those. So getting a handle on what these skills are, how they're bucketed, and what they lead to: these are questions that I think could produce a lot more sophistication in how people allocate their own time and how firms allocate their resources.

Richie Cotton: Absolutely. The way you say it, it just seemed like a really obvious thing, that if you want to pay people the right amount, you should have some way of comparing jobs.

And if you wanna know what's happening with your workforce, you should know what job everyone is doing. And it should be somehow comparable between different teams.

But so why is the state of data so bad then? 

Ben Zweig: I can only really speculate, but one historical bit of context is that accounting never really considered human capital as very important, because our accounting standards were developed in the age of railroads and industrial businesses, where the value of a business was mostly a function of its capital stock.

And there's been a lot of analysis more recently. Most notably, Baruch Lev wrote a book, The End of Accounting, and there's Capitalism Without Capital. The economy has become a lot more intangible, a lot more based on knowledge work and human capital, and the state of accounting, the state of categorization, has really not caught up with that. Hopefully that's starting to change, but it's slow.

There's another reason I think this has faltered, and that's due to the government taxonomy of occupations, skills, and activities, something called O*NET.

So O*NET is the US government's taxonomy of occupations and things surrounding occupations. And it is done by surveys: they get experts in employment to watch people and take notes with a pen and paper on what they do with their time.

And they collect about five observations a year on each occupation. It's really not digital. They have developed a taxonomy that is useful for very broad analysis. If you're analyzing macroeconomic trends, you'll be able to tell the difference between legal professionals and manufacturing workers and nurses, things that are very obvious.

And so it's used in the public sector. It is not used in businesses. There are approximately zero companies that use O*NET for their internal taxonomy, and it's really because it hasn't gone deep enough into the idiosyncrasies of each business. We did a recent project with Nvidia where we were helping them with their taxonomy of occupations and work activities.

And Nvidia doesn't care about the difference between doctors, nurses, therapists, social workers. That's not their domain; that's completely irrelevant to them. But there's one part of the space where they really want to know the difference between different types of hardware engineers. Hardware engineering is probably 60% of their workforce, and there are different types of systems within that.

And I don't even think hardware engineer is a category in O*NET. I think it's just engineer. So if they used that, they'd have no visibility into the distinctions that matter for their organization. Nvidia is a little bit of an extreme case, being such a niche type of company.

But really, every company has a lot of distinctiveness in who they hire and how many people they hire. So a universal taxonomy that doesn't go very deep is just not sufficient for them. I think that's a little bit of an unforced error. This could have existed if there had been more resourcing toward this taxonomy, more granularity, more adaptability.

So I think there are historical reasons why this hasn't been done. And I also think that the state of data in this domain is all text. When we think about work activities, we can get those from the responsibilities section on job postings, or from the bullet points on resumes, and that is free text.

People write whatever they want; there are typos, there's this and that. It's not like a dropdown where you say what you're doing, and it really was impossible to get a lot of value from that data until LLMs. We started analyzing it in 2017, 2018, and that was really right when word2vec and doc2vec were getting popular.

It wouldn't have been possible in 2013. If someone had tried this then, I think they would not have been able to.

Richie Cotton: Okay. Yeah, I love that the technology's finally come together to the point where we can actually build all this stuff at a reasonable level of effort.

'Cause I was thinking, you mentioned Nvidia and they've got very specialized roles. Even at DataCamp, we're a relatively small company, but we've got a lot of different roles that are variations on course creator. And if you go to O*NET, it's like teacher, content creator, something like that. And you're certainly...

Ben Zweig: Not gonna find data evangelists.

Richie Cotton: Absolutely not. No. Which is actually a problem when I have to put my job on US immigration forms. What's your job? Data something, not sure. Okay, so actually I'd love to talk more about data careers, 'cause you've got this wealth of data around what skills are being used in what jobs.

So what are the most common sort of data science or other skills that are in demand at the moment? 

Ben Zweig: Data as a field has progressed so quickly, faster than really any other field. When I started as a data scientist, this was 2012, my title was quantitative strategist. This was at a hedge fund.

Richie Cotton: I guess it was just in the quant category.

Ben Zweig: But it was really data science, and I retroactively relabeled it data scientist. Then I went to, I mean, it was advanced analytics consultant or whatever it was, and then later got retitled data scientist.

At the time, data scientists were really basically statisticians. I came from a social science background; I'd been an economist. I'm still an economist, you can't give that up as much as you'd like, it stays with you. I think a lot of data scientists were quantitative social scientists: running regressions, doing a lot of causal inference, really doing statistics, basically in memory. I had never touched SQL before getting to the corporate world, and obviously now I wouldn't look at a candidate if they didn't have experience with large data sets.

So data has gotten a lot larger. The history of the field has progressed quite a bit. There was a moment where it became a lot more about supervised learning. That has, I think, gotten much more commoditized. BI has taken over a lot of space.

But the big trend now is AI engineers, and I think of that as people that can vibe-code stuff. That is not so easy. If you believe what people post on LinkedIn and other places, it seems so obviously trivial, but it's quite complex.

You have to know how to work with different systems. You have to know how to stitch things together. You have to know how to debug, how to move things to production. There's a lot of MLOps stuff involved in being a kind of AI engineer, and having it work with data is also not so simple.

LLMs are really built on text. Having them analyze tabular data is not trivial out of the box. There's a lot to figure out, but there's a big demand for that. Consulting firms are now hiring lots and lots of AI engineers and very few entry-level consultants.

That is a huge shift, so I think they are leading the charge in this kind of role transition. I think that is what a lot of data analyst roles are morphing into.

Richie Cotton: Absolutely. Certainly AI engineer is the breakout career of the last year or so, at least, 'cause AI can finally generate code fairly reliably. So even without decades of coding experience, you can still make some progress, build some software, and make use of this technology. Related to that, maybe the most common question I ever get from audiences is: what kind of careers are most robust against being taken over by AI?

You mentioned junior consultants; it's a tough position to be in. Are there any areas where you think it's a little safer?

Ben Zweig: There are some things that technology is just not anywhere close to, that are completely unrelated. If someone's busing tables, there are no robotics that have that sort of nimble capability. But in the data world, it's a little hard to say. In terms of roles that are robust, it's tough. I would say there are some sub-domains within data analysis and data science that seem a lot more robust to me.

And technology could make fools of us all; this is as of whatever it is today. But I think a lot of Bayesian statistics is very difficult to use AI for. It really comes down to the fact that there are a lot of judgment calls in how you design a model that works at scale and also works in very small subsamples.

So I'll give you an example that, that we're personally working on. We analyze a lot of online profiles, online resumes and things like that. And when someone updates a transition, it's not always when they. When they made that change, people don't update transitions right away.

Sometimes they'll wait three months and then retroactively restate it. So we know there's some lags in how people report transitions. Fine. And we can observe that distribution by looking at multiple snapshots of data and that, there's basically like two sources of lag. There's user lag and there's like scraping lag and all that.

So you have to get a little crafty about how to combine these distributions in an informationally efficient way. And then you have to create a nowcasting model that gives you a prediction of the inflow and outflow rates you would see in the absence of this lag.

So you have to design this counterfactual. But then what makes it more complicated is that you can't just apply this overall distribution. You also have to make sure that it works for very tiny companies. Let's say you wanna analyze a 20-person company. You know they're gonna have super spiky inflow and outflow rates.

If you apply some blanket distribution, you're gonna create some wild, wacky scenarios and get some results that are very embarrassing. And the flip side of that is maybe there's some large company, maybe you wanna analyze Walmart, and they have their own idiosyncratic seasonality, or maybe some layoff announcement or something. So there are times when you want to be more general and times when you want to regularize toward a population.

And the balance of those things is not fixed; there's no right answer for how you wanna balance them. Especially since there is no ground truth. We don't know what the hiring rate of a company is this month. They don't report it ever. And even if they did, they don't wanna report it in retrospect.

So we have to make some call that balances different needs. That is something that takes a lot of, like, you have to really know the data. You have to really know how these populations interact, what the real source of variance is.

Yeah, it's just complicated. But I think this subdomain is just not how I imagine AI systems will ever behave.
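To make Ben's point concrete, here is a minimal, hypothetical sketch of the kind of regularization he describes: shrinking a small company's spiky observed hiring rate toward a population rate with a simple beta-binomial prior. The rates and the prior strength are invented for illustration; this is not Revelio Labs' actual model.

```python
# Hypothetical sketch: partial pooling of a monthly hiring rate.
# A beta prior with mean pop_rate and pseudo-count `strength` pulls
# small samples toward the population, while large companies stay
# close to their own observed rate.
def pooled_rate(hires, headcount, pop_rate, strength=50):
    alpha = pop_rate * strength + hires
    beta = (1 - pop_rate) * strength + (headcount - hires)
    return alpha / (alpha + beta)

# A 20-person company with 3 hires has a spiky 15% raw rate;
# the prior pulls the estimate back toward a 2% population rate.
small = pooled_rate(hires=3, headcount=20, pop_rate=0.02)

# A 20,000-person company with the same 15% raw rate barely moves.
large = pooled_rate(hires=3000, headcount=20000, pop_rate=0.02)
print(round(small, 3), round(large, 3))
```

The judgment call Ben mentions is exactly the choice of `strength`: how hard to regularize toward the population when there is no ground truth to validate against.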

Richie Cotton: So I have to say, when you started to answer, I was getting a bit nervous when you were like, it's tough to keep a job in data.

But actually, yeah, you mentioned Bayesian statistics. I feel like it's still one of those slightly underappreciated fields of statistics. Very important. And more generally, it sounds like these are soft skills you're talking about. You need to be able to make value judgements.

You need to be able to do critical thinking. You need to be able to translate between whatever your domain is, in this case people analytics, and some kind of statistical model, and then go back again. These all seem important skill sets to have.

Ben Zweig: Yeah, totally. Even just understanding how to communicate certain things: we're still gonna need human communicators, not just evangelists but also explainers of how a model is working, how it's thinking. Being able to do the diagnostics and trace, at every step of the model, whether it's behaving the way we expect it to behave is something that we constantly are struggling with and trying to do a better job in, and that's not gonna go away. I actually think, since LLMs, we're hiring more on the modeling side of data science. Yeah.

And more on the MLOps side too. I think the roles where AI is diminishing our labor demand are probably more roles in front-end work, and maybe scraping and things like that; more engineering where there's a lot of replicability.

So I think the distinction with data is that every data analysis problem is really different. You really have to study the domain, the problem, very intimately. But if you are creating a web app in React and you wanna have some widget that does whatever, that could look like every other dashboard on the internet.

And there's a lot of examples like that. Now, obviously there's still a lot of interesting, cool work to do on the front end, but much of it can be automated more easily.

Richie Cotton: Absolutely. Yeah. If you're doing the same task over and over again, you've got something repetitive, then that's a prime candidate for automation.

If you're doing these more boutique things, something more artisanal, which I guess all of this data science modeling work is, then that's not gonna be automated. Alright. And you mentioned you've been hiring for some roles, more like the modeling roles and MLOps kind of stuff.

In general, do you have advice for hiring managers? Do you need to start looking at different profiles?

Ben Zweig: For different profiles? So I think we've had a lot of success in hiring people from PhD programs, usually, for whatever reason, in either economics or physics.

I don't know why; those are the areas where we've had the most success. It doesn't have to be, and I've worked with a lot of great data scientists from other fields: political science, chemistry, whatever. I think the reason those domains are particularly strong is that they are areas where you can embed a lot of theory.

So I remember I interviewed one person with a degree in mathematics. He had done his PhD in mathematics, and I was asking him, oh, what was your dissertation about? And he said something about mixing liquids.

So, the extent to which, let's say there's an oil spill, there's some diffusion of that oil into the ocean. And if it gets mixed, if you stir it, then that diffusion happens faster. And he was trying to quantify how to manipulate diffusion rates.

And I said, that's really interesting, but how is that different from, let's say you had that same dissertation, but your degree was in physics and not mathematics? How would it be different? And he said it could be written by someone in physics, but what physicists would do is actually plug in the parameters. They would say, we know the density of salt water, and we know the density of oil, and we know the parameter for gravity.

We have all these parameters that we're just gonna plug in, and then it becomes an easier problem. But a mathematician would create a generalized solution and say, you could plug in anything. It's a more reusable, abstract thing that is generalized.

You could 

Richie Cotton: use it in space, on a different

Ben Zweig: planet, and you could use it on some alien planet that will never exist, yeah. You can extend it to all sorts of different environments. And I think very often, and economics is the same way, there are constraints on models based on what we know about the real world. You can't have certain things go negative, or whatever. There are ways we can manipulate models to make them a little bit more informationally efficient in an environment where we have small samples.

And I think that actually does describe the real world. In most cases, we have many petabytes of data, tons and tons of data, and yet we are still trying to squeeze out informational efficiency for small subsamples. I think Andrew Gelman said there's no such thing as big data.

There's only small data: once you drill down deeper, you realize your data sets are actually quite small. And I think the comfort with being able to customize a model for the way that you understand the world is something that comes in handy.

So I think those backgrounds are great. I also think master's in data science programs are excellent. I'm based in New York; there's one at Columbia, there's one at NYU. They're fantastic. People graduate from these programs and they know how to do everything.

They are great engineers. They know all the tools. They are generally really smart. And it's also a tough market for them. So as an employer, I think it's actually a good time to be hiring.

Richie Cotton: Absolutely. Yeah, I feel like more companies should be able to take advantage of this.

Particularly when you've got people who are very highly qualified: you give them a job, and if you've got highly qualified data scientists, they should be doing more than just building a dashboard. Take advantage of them. Super. Earlier on we were planning to nerd out later in the show.

I think it's time. Okay, you've got basically a whole internet's worth of job posting data, loads of LinkedIn profiles, details of people. How do you go from all that messy, web-scale data to "I've got a set of jobs in a taxonomy"?

Ben Zweig: So let's take job postings as an example.

There's a little over 5 billion job postings that we've scraped to date, about 90 million active at any moment. Step one is we have to scrape that, we have to collect it, and that has to happen every day. We have to get start dates, end dates and all that.

But then there are sections. Sometimes you get some metadata from these postings: that could be job title, geography, company name, post date, things like that. But then there are sections within the text, and it can vary from posting to posting. There could be a paragraph of information about the company, or a section with bullet points of skills or qualifications, a section on responsibilities.

Maybe a little paragraph about benefits, or whatever it is. So you have different sections of these postings, and of course we wanna categorize them. There are a few different categories that we create, which really comes from our theory of how jobs are structured.

One is occupation; that is one way to categorize jobs. Another is seniority, and by the way, a title actually contains two pieces of information: information about the occupation and information about seniority. Not all the time, by the way. Sometimes you have someone whose job title is just Director.

I hate when that happens, but sometimes they don't have any occupational information, just seniority information. Or sometimes you have occupation information and no seniority. Sometimes they just say Marketing, and you're like, is that a junior marketing analyst or the chief marketing officer?

We don't really always know, but despite that, usually a job title contains information about occupation and seniority. And then there's skills, and then there's work activities. So those are the ways we categorize the jobs. Of course, there's also categorization of things that aren't related to jobs; we have to categorize companies.

Someone says they work at Zoom; there are 11 companies in the world called Zoom. And if we just take the modal company, then everyone will get classified to Zoom Communications, and if there's a Zoom Consulting or a Zoom moving company, they'll have zero people, and that's no good.

We need to have some probabilistic categorization to companies. Then of course we need to categorize geography. Someone says they're hiring in the New York metropolitan area, or the broader New York area; that could mean a few different things.

Sometimes they'll just say Upper West Side, and then you have to know that's in Manhattan, and there's complexity there. Manhattan is a borough, which is a county within New York City, but very often counties are broader than cities, so sometimes there are many-to-many relationships.

So there's some complexity in how to categorize things there. But if we just think about the job space: we have made a call to start with activities and then cluster occupations once we already have activities, because we would like to define an occupation as a bundle of activities. And skills are actually also done after activities, because we'd like to use skills as an input into completing work activities.

But it really starts with activities at the center of it all. So, let's say from the job postings, we parse these different sections, and the responsibilities section is the corpus of activities. There are 5 billion postings; maybe 3 billion of those have activities sections in English.

We only train this on English so far. We train in English and infer in other languages, so there's always a difference between your training and your inference. But yeah, so we have, call it, 3 billion or so paragraphs. Each paragraph has, let's say, five or six sentences, and maybe each sentence has different chunks.

So let's say there are, call it, 15 to 20 billion sentences that describe an activity. Those then need to be embedded, so we need to have an embedding of those activities. We use a kind of BERT-based model to embed those activities. Choosing the dimensionality is, of course, a decision to make.

We also have to decide whether we wanna control for industry. There are a lot of decisions that have to be made to get that embedding. But once we have those embeddings, then we need to cluster those activities and label them. So that's a big challenge.
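A drastically simplified sketch of that clustering step: a real pipeline embeds billions of sentences with a BERT-style model and runs scalable (often agglomerative) clustering, but the shape of the computation can be shown with tiny hand-made 2-D vectors and a greedy centroid pass. Everything here, thresholds included, is invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each activity embedding to the closest existing centroid,
    or open a new cluster when nothing is similar enough."""
    centroids, labels = [], []
    for vec in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(vec, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(vec))
            labels.append(len(centroids) - 1)
        else:
            # Crude running update of the matched centroid.
            centroids[best] = [(x + y) / 2 for x, y in zip(centroids[best], vec)]
            labels.append(best)
    return labels

# Two near-duplicate "analyze data" activities and one unrelated activity.
acts = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]
print(greedy_cluster(acts))  # first two share a cluster, third is separate
```

At production scale the similarity threshold, centroid update rule, and linkage strategy are all design decisions of the kind Ben alludes to.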

Richie Cotton: Yeah.

Alright, no, that's interesting. So you've got a giant dataset, it's messy. Basically my sense is you clean it up, try to standardize it in some way, take out the duplicates, and then you've got some sort of smaller, more standardized dataset that you can then work with to feed into your model.

You're about to talk about modeling 

Ben Zweig: in some way. For us, the model is the categorization. We just happen to be in that space; our company is Revelio Labs, so we're a data company. We're not really trying to make prescriptive recommendations.

We're trying to deliver data that can be nice and neat and embedded into a traditional analytics workflow. So when we create a taxonomy, that kind of is the model, in a way. There are a lot of submodels within this, obviously: there's the embedding, there's the garbage detection, deduplication, cleaning, parsing, all that. Then there's the clustering, and that has to be agglomerative in some way. Then there's the labeling, which is a whole other can of worms, and that really wasn't even possible until generative AI.

And that's getting better as new models come out. But labeling is really tough. With job titles there are plenty of duplicates, so for a given cluster you could say, oh, what's the modal job title? And that's already a pretty good way to go. But for activities, there are 12 million sentences and no duplicates; no one writes these sentences in exactly the same way. Being able to synthesize all those and label them really requires some generative AI.
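The modal-title heuristic Ben mentions for labeling title clusters is simple to sketch; the titles below are invented examples, and the LLM synthesis step he describes for activity clusters is what would replace this when there are no duplicates to count.

```python
from collections import Counter

def modal_label(cluster_titles):
    """Label a cluster by its most common (modal) job title.

    This only works when titles repeat; activity sentences, which
    are almost never duplicated, need generative synthesis instead.
    """
    return Counter(cluster_titles).most_common(1)[0][0]

cluster = ["Data Scientist", "Data Scientist", "Senior Data Scientist", "ML Scientist"]
print(modal_label(cluster))  # Data Scientist
```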

Richie Cotton: Actually, I'm curious about that, because you are working primarily with text data, and it seems like generative AI and LLMs are the standard modern approach for dealing with text data.

Do you ever make use of any more, I guess, traditional natural language processing techniques? I'm thinking about topic modeling and named entity recognition, all these things that were really popular a decade ago. Are they still in use?

Ben Zweig: Yeah. So I would say with named entity recognition, not really, because LLMs are really good at it: if you put a paragraph into ChatGPT and say, do NER for me, it'll do a really good job.

So it's usually overkill, and it is expensive and slow, but it'll be so good. So I think dedicated NER is not really necessary if you're dealing with a small enough dataset. Topic modeling, less so; for topic modeling we actually do still use more traditional topic models.

We don't use LDA, which was the gold-standard topic model pre-LLMs, but now there are transformer-based topic models. They work the same way as LDA, functionally the same, but they're generally a little stronger and internally a little different.

But for those, yeah, we use those. I'll give you an example. We collect a lot of data about employee sentiment, and this could be from sites like Glassdoor, Fishbowl, Blind, et cetera, where people post employee reviews. And we wanna find out, what do people like about this company?

What do they not like? What are the topics that are emerging? So we need to extract topics from text. It's a perfect, textbook example of using LDA. For that we extracted the topics that typically show up in these reviews; we got 47 topics or something. So first we have to create those using topic models.

But then we have to infer topics for every new review that comes in, and those are the categories that we use. It's completely traditional. And of course, we use regex and cleaning, all the normal basic NLTK stuff.

But yeah, I think there's still value in traditional NLP sometimes.
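The inference step Ben describes, scoring each incoming review against a fixed set of previously extracted topics, can be sketched with a toy keyword-overlap scorer. The topic names and word lists are invented; a real system would use the learned topic-word distributions from an LDA or transformer-based topic model.

```python
# Invented topic vocabularies standing in for learned topics.
TOPICS = {
    "compensation": {"pay", "salary", "bonus", "equity"},
    "management": {"manager", "leadership", "promotion", "micromanage"},
    "culture": {"culture", "team", "coworkers", "friendly"},
}

def infer_topic(review):
    """Assign the topic whose vocabulary overlaps the review the most."""
    tokens = set(review.lower().split())
    scores = {name: len(tokens & words) for name, words in TOPICS.items()}
    return max(scores, key=scores.get)

print(infer_topic("great coworkers and a friendly team culture"))  # culture
```

The appeal of the traditional approach is exactly what Ben notes: once the topics exist, scoring every new review is cheap and doesn't require an LLM call.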

Richie Cotton: Okay, that's interesting. I guess once you do have these giant data sets, then working with LLMs is computationally expensive, and there's a dollar price attached to that. So having these older techniques that are a bit cheaper to run still has some value.

Ben Zweig: Yeah, I'll give you an example. For translation, we cannot translate the descriptions from postings using generative LLMs. It's just too expensive. They would do really well, but it would be prohibitively expensive.

It would cost us like a million dollars a month or whatever, some tremendous amount of money. And maybe someday; I'm sure some businesses are able to afford that. But for us, we have more traditional ways of detecting language and doing more substring translation.

You know, that's a constraint that we have to deal with just because LLMs are expensive.
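A cheaper pipeline of the kind Ben hints at might look like this: detect the language with a lightweight check, then translate at the phrase level through a cache so repeated boilerplate is only translated once. Both the detector and the glossary here are stubs invented for illustration; in production the cached call would hit a cheap machine-translation model instead.

```python
from functools import lru_cache

def detect_language(text):
    """Crude stand-in for a real language-ID model."""
    tokens = text.lower().split()
    return "de" if any(w in tokens for w in ("und", "der", "mit")) else "en"

@lru_cache(maxsize=100_000)
def translate_phrase(phrase, src):
    """Cached phrase translation, stubbed with a tiny glossary.

    The cache means a boilerplate phrase that appears in millions of
    postings is only translated once.
    """
    glossary = {"teamarbeit": "teamwork", "und": "and", "kommunikation": "communication"}
    return " ".join(glossary.get(w, w) for w in phrase.lower().split())

posting = "Teamarbeit und Kommunikation"
if detect_language(posting) != "en":
    print(translate_phrase(posting, "de"))  # teamwork and communication
```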

Richie Cotton: Absolutely. Alright. Before we wrap up I'd like to know what are you most excited about in the world of people analytics and labor analytics? 

Ben Zweig: It's a great question. I'm excited about all of it.

I think a lot about categorization, mostly because I just wrote a book about it, and it's hard not to obsess over something when you're deep in the weeds. But also I really do believe that the core of our difficulties in labor markets generally is a lack of the ability to categorize and standardize and compare.

Analytical HR, the part of HR that analyzes data, does not compare across companies. It's just not really done. There's no benchmarking, and there's no sense of how you compare to peer companies. And I think that's a little sad. Michael Porter, who's like the father of modern strategy, said that strategy, standing on one foot, is about differentiation, end of story. That is in a nutshell what it means to be strategic, and that's not being done in labor markets, which is unfortunate, but I think it could be done.

So I really do think we could see a world where labor markets are as sophisticated as capital markets. That's a very exciting world to me, so I wanna help build toward that. Lots of hard problems to work on, of course, but I think that's exciting, and I think we could see it in our lifetimes.

Richie Cotton: That's quite a compelling vision, really. Like the idea that we're optimizing for everyone getting their dream job. 

Ben Zweig: Yeah, exactly. There's so much human value in that. This is how everyone spends their time; what could be more important than that? And whatever cause someone cares about, whether it's politics or climate or this or that, is all downstream of people doing their jobs well and having good matches in how people work.

Yeah, I think it's helpful for everything.

Richie Cotton: Everyone gets a happier working life, we make more money. It sounds like a dream come true. Wonderful. And just finally, I always want more people to learn from. Whose work are you most excited about at the moment?

Ben Zweig: Great question. So I think there's a lot of work being done in the economics of AI: how will AI affect labor markets? And there are some really interesting thinkers in that space that I find very exciting. I'll make a plug for two more podcast recommendations.

One is called Justified Posteriors, hosted by Andrey Fradkin and Seth Benzell. Really wonderful researchers, and it's a great podcast talking about papers and how AI is affecting the macroeconomy; they have really interesting guests and talk about interesting things.

There's lots of interesting work done by David Autor and Daniel Rock on these types of things. They're great. And the second podcast I'd like to plug is my own, The Economics of Work. The goal is to find people who are thinking deeply about labor markets and how they operate.

Just to complain a little bit about labor economics: there's so much research on, oh, how can we more precisely estimate the returns to education? And I think there's such a big opportunity now to think more foundationally about the mysteries of the labor market.

How can we improve search and match, like we spoke about? How can we analyze the vulnerability of jobs to AI? How is remote work transforming the world? There's so much interesting stuff to think about. There are not so many people, but there are enough who really are thinking deeply about these big questions.

And I think it's exciting. It's the right time for that.

Richie Cotton: Absolutely. Economics has this bad reputation as the dismal science, so for people who are trying to make it a bit more fun, go for it. I like it. And the mysteries of the labor economy, that seems a useful thing to look at.

So making it a bit more exciting. 

Ben Zweig: Yeah, I think so. 

Richie Cotton: Wonderful. Alright, thank you so much for your time, Ben.

Ben Zweig: Thank you. Yeah, thanks so much for having me.
