Do students describe professors differently based on gender?
Language plays a crucial role in shaping our perceptions of and attitudes towards gender in the workplace, in classrooms, and in personal relationships. Studies have shown that gender bias in language can significantly affect the way people are perceived and treated.
For example, research has found that job advertisements that use masculine-coded language tend to attract more male applicants, while those that use feminine-coded language tend to attract more female applicants. Similarly, gendered language can perpetuate differences in the classroom.
In this project, we'll use scraped student reviews from ratemyprofessors.com to identify differences in the language commonly used to describe male vs. female professors, and explore subtleties in how classroom language can be gendered.
This excellent tool created by Ben Schmidt allows us to enter the words and phrases that we find in our analysis and explore them in more depth. We'll do this at the end.
Catalyst also does some incredible work on decoding gendered language.
1. Scraping the web for reviews of professors
Text data, especially gendered text data, is hard to come by. Web scraping can be a helpful data-collection tool when ready-made datasets are unavailable for this kind of work. We can write web scrapers to compile datasets of job descriptions, freelancer reviews, and, as in our use case, student reviews of professors.
ratemyprofessors.com provides a wonderful combination of qualitative and quantitative metrics that we can analyze.
Although the data on the website is not labeled by gender, we'll use the pronouns students use to label professors "Male" or "Female". Of course, this approach is imperfect, since it relies on students' use of pronouns. Professors with non-binary pronouns will also be under-represented in the data: very few reviews use them, so it's not trivial to write an algorithm to detect them. These are important questions in the world of gender analysis, though, so we encourage you to pick them up as extensions of this project!
Task 1a. What relevant packages do we need for web scraping and reading in data?
# Used to open urls
____
# Used to parse html
____
# Used to pause code intermittently so that our scraper is not blocked
____
# For data manipulation and analysis
____
# To access our data filenames so we can read them
____
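One plausible way to fill in these imports, assuming the scraper uses `urllib` to open URLs, `lxml` for HTML parsing, and `pandas` for tabular output (these package choices are our assumption; `lxml` and `pandas` are third-party and must be installed):

```python
# Used to open urls (standard library)
from urllib.request import urlopen

# Used to parse html (assumes the third-party lxml package is installed)
from lxml import etree

# Used to pause code intermittently so that our scraper is not blocked (standard library)
import time

# For data manipulation and analysis (assumes pandas is installed)
import pandas as pd

# To access our data filenames so we can read them (standard library)
import glob
```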
Task 1b. Which professors will we be looking at?
The web_scraping.ipynb
notebook provided in this workspace contains some selenium code that was used to find the ratemyprofessors.com URLs we'll be scraping in this notebook.
While that selenium code is beyond what we can cover today, we encourage you to explore it to understand how we generated this list of professors!
For now, we'll open the file profs_1244.txt
, read each professor's URL from its own line, and save the result as profs
.
with open(r'profs_1244.txt', 'r') as f:
profs = ____
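As a sketch of what the blank might look like, here is the same read-one-URL-per-line pattern applied to a hypothetical demo file (the filename and URLs below are made up for illustration):

```python
import os
import tempfile

# Write a hypothetical two-line file of professor URLs
demo_path = os.path.join(tempfile.gettempdir(), 'profs_demo.txt')
with open(demo_path, 'w') as f:
    f.write('https://www.ratemyprofessors.com/professor/1\n'
            'https://www.ratemyprofessors.com/professor/2\n')

# One plausible fill for the blank: one URL per line, newlines stripped
with open(demo_path, 'r') as f:
    profs = f.read().splitlines()

print(len(profs))  # 2
```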
Task 1c. How can we use urls to scrape relevant data about professors?
Each professor's page shows an overall rating and a series of individual student reviews.
The code below can be used to iterate through all or part of the list of URLs in profs
, and scrape each one for qualitative and quantitative data. You won't need to run through the whole list, though, because the data/
folder already contains the reviews of several professors that we have scraped for you! For each professor, we'll collect:
- The overall rating for the professor
- All the individual reviews written by students about the professor
- The "emotion" corresponding to each individual review: 😎 AWESOME, 😐 AVERAGE, or 😖 AWFUL
- A numerical "quality" rating corresponding to each individual review
We won't be using the "difficulty" ratings shown here.
# USE ONLY ONE OF THE FOLLOWING FOR STATEMENTS
# 1. Sample code to loop through the whole list of professors
# for s in range(40, len(profs), 10):
# 2. Sample code to loop through the first 10 professors
for s in ____:
    texts = ____  # Initialize an empty list
    print((s, s+10))  # Print which block of 10 professors we're on
    for url in ____:  # Iterate through this block of URLs
        ____  # Pause to prevent sending too many requests at once
        r = ____  # Open the URL
        htmlparser = ____  # Instantiate a parser to parse HTML
        tree = ____  # Parse the HTML returned by the URL
        text = ____('//*[@id="ratingsList"]/li[*]/div/div/div[3]/div[3]/text()')  # Extract reviews
        ratings = ____('//*[@id="root"]/div/div/div[3]/div[2]/div[1]/div[1]/div[1]/div/div[1]')  # Extract overall rating
        emotion = ____('//*[@id="ratingsList"]/li[*]/div/div/div[1]/div[1]/div[2]/text()')  # Extract emotion
        quality = ____('//*[@id="ratingsList"]/li[*]/div/div/div[2]/div[1]/div/div[2]')  # Extract quality
        texts.append((url,
                      text,
                      [i.text for i in ratings][0],
                      emotion,
                      [i.text for i in quality],
                      ))  # Append metrics to the list
        print()  # Print a new line for readability
    df = ____
    df.to_csv(f'df_{s}_to_{s+10}.csv')  # Write results to CSV in blocks of 10 professors
    ____  # Pause to prevent sending too many requests at once
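The XPath-extraction step at the heart of the loop can be illustrated on a tiny, made-up HTML fragment. The sketch below uses the standard library's `xml.etree` (which supports a limited XPath subset) as a stand-in for lxml, and the markup and class names are invented, far simpler than the real ratemyprofessors pages:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical stand-in for a page's review list
html = """
<ul id="ratingsList">
  <li><div class="review">Great lectures!</div><div class="quality">5.0</div></li>
  <li><div class="review">Hard exams.</div><div class="quality">3.0</div></li>
</ul>
"""

tree = ET.fromstring(html)  # Parse the fragment into an element tree

# Extract the text of matching elements, as the scraper's XPath queries do
reviews = [div.text for div in tree.findall("./li/div[@class='review']")]
quality = [div.text for div in tree.findall("./li/div[@class='quality']")]

print(reviews)  # ['Great lectures!', 'Hard exams.']
```

lxml's `tree.xpath(...)` works the same way but accepts full XPath expressions like the ones in the scraper above.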
2. Reading pre-scraped data
Task 2a. How can we read a directory of scraped professor reviews and concatenate them?
Since we have already scraped reviews from several professors for you, let's begin by concatenating all the files in the provided data
folder.
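One plausible way to concatenate a folder of scraped CSVs with glob and pandas. The folder, filenames, and contents below are made up for illustration (the scraper above writes files named like `df_0_to_10.csv`):

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway "data" folder holding two small scraped blocks
demo_dir = os.path.join(tempfile.gettempdir(), 'rmp_data_demo')
os.makedirs(demo_dir, exist_ok=True)
pd.DataFrame({'review': ["['Great lectures!']"], 'quality': ["['5.0']"]}) \
  .to_csv(os.path.join(demo_dir, 'df_0_to_10.csv'), index=False)
pd.DataFrame({'review': ["['Hard exams.']"], 'quality': ["['3.0']"]}) \
  .to_csv(os.path.join(demo_dir, 'df_10_to_20.csv'), index=False)

# Read every CSV in the folder and concatenate into one DataFrame
files = sorted(glob.glob(os.path.join(demo_dir, '*.csv')))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(len(df))  # 2
```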
Since review
, emotion
, and quality
are lists that were recorded in string form, we'll apply eval()
to turn each one back from a string into a list.
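A minimal sketch of that round trip, with made-up values:

```python
# A list that was saved to CSV comes back as its string representation
recorded = "['😎 AWESOME', '😐 AVERAGE']"

emotions = eval(recorded)  # Turns the string back into a real Python list

# ast.literal_eval is a safer alternative when strings come from untrusted sources
import ast
assert ast.literal_eval(recorded) == emotions

print(emotions[0])  # 😎 AWESOME
```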