How the #MeToo Movement Spread on Twitter

What can data science tell us about tweets with the #MeToo hashtag?

The following article contains material that may distress some readers. It is all based on data analyses of tweets containing the #MeToo hashtag in the past weeks. A link to the code used for the analyses can be found in the conclusion of this article. We welcome and encourage further analyses and dialog.

Word cloud of tweets from October 24th to November 7th
Word cloud of tweets from November 10th and November 11th


I have used the Twitter API to pull half a million recent tweets that contain the hashtag #MeToo. Given half a million tweets, it is impossible to give a summary of what they all contain. One way to give a sense of the most used words is a word cloud. Above there are two word clouds: the first is generated from tweets posted between October 24th and November 7th; the second is from tweets posted on November 10th and November 11th. Can you see any differences? Have a look and see what you can find.
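As a rough illustration of what sits behind a word cloud, the sketch below counts word frequencies across tweet texts using only the Python standard library. It is not the code used for this analysis: the function name `word_frequencies`, the tiny `STOP_WORDS` set and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "i", "rt", "metoo"}

def word_frequencies(tweet_texts, stop_words=STOP_WORDS):
    """Count how often each word appears across an iterable of tweet texts."""
    counts = Counter()
    for text in tweet_texts:
        # Lowercase, then drop URLs so link fragments do not pollute the counts.
        text = re.sub(r"https?://\S+", " ", text.lower())
        # Keep runs of letters, digits and underscores (this also keeps usernames).
        for word in re.findall(r"\w+", text):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Made-up sample tweets, for illustration only.
sample = [
    "RT @someone: #MeToo Weinstein allegations https://t.co/abc",
    "#MeToo the Weinstein story is not over",
]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

A word cloud renderer then draws each word at a size that grows with its count, which is why a name that appears in fewer tweets looks smaller.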

The most apparent difference to me is the change in names of alleged perpetrators: in the second word cloud, 'weinstein' is smaller and thus less represented in the tweets, 'billoreilly' is no longer present and new names, such as 'louisck' (who has admitted that 'These stories are true') and 'roymoore', have now appeared. Related terms such as 'republican' have also appeared (Roy Moore is a Republican). Many of the words that appear will ring true. Others, which include Twitter usernames such as 'aliceglass', are not so clear yet. In this post, I'll delve into these tweets and we'll see along the way why such words appear in the word clouds. If you have any thoughts, responses and/or ruminations, feel free to reach out to me on Twitter: @hugobowne.

You can extract many of the main ideas surrounding #MeToo from these word clouds. For more context, Wikipedia states that

"Me Too" (or "#MeToo", with local alternatives in other languages) spread virally as a two-word hashtag used on social media in October 2017 to denounce sexual assault and harassment, in the wake of sexual misconduct allegations against Harvey Weinstein. The phrase, long used in this sense by social activist Tarana Burke, was popularized by actress Alyssa Milano, who encouraged women to tweet it to publicize experiences to demonstrate the widespread nature of misogynistic behavior. Since then, millions of people have used the hashtag to come forward with their experiences, including many celebrities.

At the time of writing, Wikipedia also reported that

The phrase had been used more than 200,000 times by October 15, and tweeted more than 500,000 times by October 16. On Facebook, the hashtag had been used by more than 4.7 million people in 12 million posts during the first 24 hours. The platform reported that 45% of users in the United States had a friend who had posted using the term.

and that 'The European Parliament convened a session directly in response to the Me Too campaign, after it gave rise to allegations of abuse in Parliament and in the European Union's offices in Brussels.' #MeToo is a movement that has gained critical momentum over the past month. In this post, I'll look into how it has spread on Twitter.

A bird's eye view of two weeks with #MeToo on Twitter

For this analysis, I used the Twitter API to pull tweets containing #MeToo from October 24th until November 7th. This is two weeks' worth of tweets, starting about a week after the first #MeToo tweet. The Twitter API only lets you pull a subset of all tweets, so the absolute numbers here will not account for all of them, but you'll still be able to see the overall trend.

Let's first have a look at the number of tweets that occurred over time for the two weeks in question:

Whereas the majority of hashtags have a half-life of minutes or hours, #MeToo has now been present for weeks, as you can see from the above. It is a powerful enough movement to be manifesting itself not only online but also in marches and protests. Notice, in the figure above, that from October 23 through October 30 (that is, from one week after the first tweet tagged with #MeToo until two weeks after it), use of the tag showed no sign of a significant decline. The total number of tweets per day is pretty steady over that week. Only in its third week does it show a decrease. Also note the 24-hour periodicity: the number of tweets is consistently at its lowest between 10pm and midnight Eastern Time and at its peak around noon Eastern Time. This is consistent with the majority of tweets originating in North America.
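One way to check a daily rhythm like this is to bucket tweets by hour of day. The following is a minimal sketch, not the code behind the figure. It assumes timestamps in the format of the `created_at` field of Twitter's v1.1 API and, as a simplification, a fixed UTC-4 offset for Eastern Time (daylight saving time ended on November 5, 2017, which this ignores). The function name and sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: timestamps look like the v1.1 `created_at` field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"
# Simplification: fixed UTC-4 offset (Eastern Daylight Time).
EASTERN = timezone(timedelta(hours=-4))

def tweets_per_hour_of_day(created_at_values, tz=EASTERN):
    """Count tweets by local hour of day (0-23) in the given time zone."""
    counts = Counter()
    for value in created_at_values:
        moment = datetime.strptime(value, CREATED_AT_FORMAT).astimezone(tz)
        counts[moment.hour] += 1
    return counts

# Made-up timestamps, for illustration only.
sample = [
    "Tue Oct 24 16:05:00 +0000 2017",  # 12:05 in UTC-4
    "Tue Oct 24 16:40:00 +0000 2017",  # 12:40 in UTC-4
    "Wed Oct 25 03:10:00 +0000 2017",  # 23:10 in UTC-4, the previous day
]
by_hour = tweets_per_hour_of_day(sample)
print(sorted(by_hour.items()))
```

With real data, a consistent dip in the late-evening buckets and a peak around the midday bucket would reflect the periodicity described above.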

As stated above, there is a decrease in the number of tweets over the two weeks in question, but let's probe this a bit further by looking at how many tweets were original tweets and how many were retweets over the period:

We see from the above that the number of retweets consistently dominates the number of original tweets. Moreover, the decrease we noted in the total number of tweets over the two weeks is more pronounced in retweets than in original tweets. Looking at the number of original tweets by itself, you can see that there was a decrease, but not as much as you would think from the initial plot:

I noted above that the number of retweets was greater than the number of original tweets. In fact, 60% of tweets with #MeToo were retweets during this period:

The question then emerges: out of half a million tweets, how many original tweets were responsible for all of these retweets? There were 100 tweets that were retweeted more than 1,000 times each at the time of analysis (there may be more now, as retweets keep happening). They account for ~62,000 (13 percent) of the total tweets captured. Note that many of the retweets will not have been captured by our tweet search.

There were 1,000 tweets that were retweeted more than 100 times each and these accounted for over 25% of all tweets captured.
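Both kinds of figure (the share of retweets, and the share of captured tweets traceable to a small set of heavily retweeted originals) can be computed along the following lines. This is a sketch under stated assumptions, not the original analysis: it assumes each tweet is a dictionary in which retweets carry a `retweeted_status` entry holding the original tweet's `id`, as in Twitter's v1.1 API, and it counts retweets as seen in the captured sample rather than using Twitter's own retweet totals. The function names and sample data are hypothetical.

```python
from collections import Counter

def retweet_share(tweets):
    """Fraction of captured tweets that are retweets."""
    retweets = sum(1 for tweet in tweets if "retweeted_status" in tweet)
    return retweets / len(tweets)

def share_from_top_originals(tweets, min_captured_retweets):
    """Fraction of captured tweets that are retweets of originals
    retweeted more than `min_captured_retweets` times within the capture."""
    captured = Counter(
        tweet["retweeted_status"]["id"]
        for tweet in tweets
        if "retweeted_status" in tweet
    )
    from_top = sum(n for n in captured.values() if n > min_captured_retweets)
    return from_top / len(tweets)

# Made-up sample: two originals, five retweets of the first, one of the second.
sample = (
    [{"id": 1}, {"id": 2}]
    + [{"id": 100 + i, "retweeted_status": {"id": 1}} for i in range(5)]
    + [{"id": 200, "retweeted_status": {"id": 2}}]
)
print(retweet_share(sample))                                      # 6 of 8 tweets
print(share_from_top_originals(sample, min_captured_retweets=2))  # 5 of 8 tweets
```

Counting within the capture will understate the true totals, since many retweets are not returned by the tweet search.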

What were the top tweets? Let's check out the 5 tweets that had the most retweets at the time of analysis:

These were the most retweeted across the board at the time of analysis. What about the tweets that were retweeted most in the data collected? Let's check these out:

'ALICEGLASS' was a username in the initial word cloud. It is now apparent that her username was there because she was retweeted so much in the tweets collected. Alice Glass is a singer-songwriter who co-founded and fronted the electronic band Crystal Castles. She has since left the band, and you can read her statement here. The significant number of retweets is also why the words 'Crystal' and 'Castle' appear in the word cloud.

So we have other languages represented in numbers, which is interesting. What started as a movement in North America and in the English language has not only spread to other languages, but has done so in a significant way, as evidenced by the fact that Spanish, French and Korean tweets appear among the top 5 most retweeted in the two weeks analyzed. Let's now explore the distribution of tweets across languages a bit further.

The spread of #MeToo around the globe

First up, let's look at the languages represented in all the tweets containing #MeToo and check out how often they occur:

English dominates, followed by unidentified languages. Then we have French, Dutch, German, Swedish, Japanese, Spanish and Korean. As the counts span several orders of magnitude (that is, thousands, tens of thousands and hundreds of thousands), it makes sense to plot this figure with a logarithmic y-axis, which means that the visual distance between $10^3$ (one thousand) and $10^4$ (ten thousand) is the same as the visual distance between $10^4$ and $10^5$ (one hundred thousand):

We can now see that, although English does dominate, several other languages have tens of thousands of tweets each: French (32K), Dutch (23K) and Japanese (16K), to name a few. In fact, nearly 40% of tweets are in languages other than English or in unidentified languages (45K, which are often not English and simply not detected by the algorithm Twitter uses).
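To make the counting and the logarithmic scale concrete, here is a small sketch. It assumes each tweet is a dictionary with a `lang` field as in Twitter's v1.1 API, where `und` marks an undetermined language. The function names and the toy data are assumptions for illustration and do not reproduce the real counts.

```python
import math
from collections import Counter

def language_counts(tweets):
    """Count tweets per language code, treating a missing code as 'und'."""
    return Counter(tweet.get("lang", "und") for tweet in tweets)

def non_english_share(counts):
    """Fraction of tweets whose language code is anything other than 'en'."""
    total = sum(counts.values())
    return (total - counts["en"]) / total

# Toy data, for illustration only: 6 English, 2 French, 1 Dutch, 1 unidentified.
sample = [{"lang": "en"}] * 6 + [{"lang": "fr"}] * 2 + [{"lang": "nl"}, {"lang": "und"}]
counts = language_counts(sample)
print(counts.most_common())
print(non_english_share(counts))

# On a logarithmic axis a count c is drawn at height log10(c), so each
# tenfold increase covers the same visual distance.
gap_low = math.log10(10**4) - math.log10(10**3)
gap_high = math.log10(10**5) - math.log10(10**4)
print(gap_low, gap_high)
```

Most plotting libraries offer a logarithmic axis option that applies this transformation to the axis while keeping the tick labels in the original units.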

Let's now delve into how the use of different languages using the #MeToo hashtag varied over the two weeks:

There are at least three aspects of this figure that grab my attention and that are worth further investigation:

  • on October 26, there's a peak of unidentified languages when English is at a low;
  • on October 30, the number of French tweets shoots up to the number of English language tweets;
  • on November 8, the number of Dutch tweets spikes and gets close to the number of tweets in English.

Let's now investigate these.

The spike in unidentified language tweets

An explanation for the spike in unidentified language tweets is the retweeting of the following tweet in Catalan from Eva Piquer:

Twitter's algorithm couldn't identify the language of this tweet, Catalan, which was retweeted a whopping 4,876 times on Thursday October 26th.
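A spike like this can be traced back to its source by asking which original tweets were retweeted most on a given day within a given language bucket. The sketch below shows one way to do that. It assumes v1.1-style tweet dictionaries with `created_at`, `lang` and `retweeted_status` fields, groups by UTC date, and uses made-up ids and timestamps; the helper name `top_retweeted_on` is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumption: timestamps look like the v1.1 `created_at` field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def top_retweeted_on(tweets, day, lang, n=3):
    """Return (original tweet id, captured retweet count) pairs for the
    originals most retweeted on `day` (ISO date, UTC) among tweets tagged `lang`."""
    counts = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original is None or tweet.get("lang") != lang:
            continue
        posted = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        if posted.date().isoformat() == day:
            counts[original["id"]] += 1
    return counts.most_common(n)

# Made-up sample, for illustration only.
sample = [
    {"created_at": "Thu Oct 26 10:00:00 +0000 2017", "lang": "und", "retweeted_status": {"id": 11}},
    {"created_at": "Thu Oct 26 11:00:00 +0000 2017", "lang": "und", "retweeted_status": {"id": 11}},
    {"created_at": "Thu Oct 26 12:00:00 +0000 2017", "lang": "und", "retweeted_status": {"id": 12}},
    {"created_at": "Thu Oct 26 12:30:00 +0000 2017", "lang": "en", "retweeted_status": {"id": 13}},
    {"created_at": "Fri Oct 27 09:00:00 +0000 2017", "lang": "und", "retweeted_status": {"id": 12}},
]
top = top_retweeted_on(sample, "2017-10-26", "und")
print(top)
```

If a single id accounts for most of the day's count, as in the Catalan example, that one tweet explains the spike.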

The spike in French tweets

On October 30th, the spike in French language tweets was due to the retweeting of the following three tweets, which had 555, 347 and 327 retweets respectively:

Note that the first tweet above also contains the hashtag #Balancetonporc, which translates into English as 'expose your pig'. #balancetonporc and slight variants occur in 27% of French language #MeToo tweets and the hashtag is now considered to be a French analog of #MeToo. You can read more here.
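A share like this can be estimated with a case-insensitive pattern match over the French-language tweet texts. The following is a sketch only: which "slight variants" were counted in the analysis is not stated, so the pattern below is an assumption, and the function name and sample texts are made up for illustration.

```python
import re

# Assumed pattern: matches #balancetonporc in any capitalization, plus
# close variants such as a plural form. The actual variants counted in
# the analysis are not known.
VARIANT_PATTERN = re.compile(r"#balance\w*porc\w*", re.IGNORECASE)

def hashtag_share(texts, pattern=VARIANT_PATTERN):
    """Fraction of texts containing at least one match of the pattern."""
    matching = sum(1 for text in texts if pattern.search(text))
    return matching / len(texts)

# Made-up French-language sample tweets, for illustration only.
sample = [
    "#MeToo #BalanceTonPorc il faut en parler",
    "#metoo #balancetonporcs",
    "#MeToo merci pour vos témoignages",
    "#MeToo solidarité",
]
share = hashtag_share(sample)
print(share)
```

A looser pattern catches more misspellings but also risks matching unrelated hashtags, so it is worth inspecting a sample of matches by eye.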

What about the small spike in Dutch language tweets? I'll leave that as a challenge to the avid reader.

Conclusion

In this post, you saw that the #MeToo movement has gained sustained momentum and that, although it started in North America, its reach has spread around the globe. You saw through Twitter data visualization alone (word clouds of half a million tweets) that new revelations and allegations have occurred after the first usage of the hashtag, and it is reasonable to conclude that this is playing a part in its sustained momentum. You also saw that many of the tweets are retweets, which suggests that engagement is high, even for those not making original tweets themselves. In the coming days, I'll make available the code used to pull the tweets from Twitter and perform the above analysis. I warmly encourage you to see what else you can find in the data. This is no substitute for reading widely around the subject and discussing with people IRL.

If you have any thoughts, responses and/or ruminations, feel free to reach out to me on Twitter: @hugobowne.

You can find the code that was used for this analysis in this repository.