Text Mining in R: Are Pokémon GO Mentions Really Driving Up Stock Prices?
Introduction: What Bloomberg Terminal’s News Trends Feature (Doesn’t) Show
On a nondescript commute in late July, in the aftermath of the Brexit vote in the United Kingdom and amid the rise of the hugely popular game Pokémon GO, I was listening to Bloomberg radio on my way to work. Bloomberg had taken a brief detour from Brexit and the daily musings of the market to focus instead on technology stocks and on a Bloomberg Terminal feature called NT (News Trends), which plots news trends alongside stock prices by aggregating topic mentions across many news sources as a time series. From the description, that is all this feature of the $20,000-a-month service does. Admittedly, I haven’t been in front of a terminal since grad school, and I do not remember the NT feature.
The awkward radio broadcast covered a game the “kids” were playing called Pokémon GO and its relationship with Nintendo’s stock. As the conversation wore on, detailing how to use NT within a terminal and thereby illustrating the service’s value proposition, Hilary Clark begrudgingly stated that “the stock ticks up for Nintendo were driven… By the mentions of Pokémon GO” and later added, “the value out of this function [NT] is just to see what is driving up the stock price.”
Now, as a text miner, I take particular note of word choice. In communication theory, a message’s meaning lies with the destination, not the sender or channel. This means the word choice, tone, and medium have an impact on the audience’s comprehension of, and ascribed attitudes towards, the message. You probably already know this inherently, but it is important to note: choosing your words, tone, and channel correctly affects the message’s meaning; anyone who is married knows this firsthand :).
I am guessing you, a DataCamp blog reader, already identified a flaw in Clark’s word choice: as an authority (sender) speaking on an authoritative radio (channel) broadcast, she uses words like “driven” (message) to describe the NT service, which includes a correlation calculation. An unsophisticated consumer of this message may interpret a causal effect from her words and may even attempt to trade stocks using the NT service. However, as the sophisticated audience (destination) that we are, we own the meaning and therefore get to recreate what we think the NT service is and determine its value for ourselves. If it’s so valuable as to be part of a $20,000-a-month service, then we should be able to gain some novel insight using Google Trends and Yahoo’s stock service.
Let’s make our own News Trends feature for free and see how awesome a poor man’s version of the Bloomberg Terminal’s NT service could be!
Create Your Own R News Trends Feature
Load the Libraries
To start, load the libraries we will need to make the visuals. The quantmod library is a popular R package for importing stock market data. Next, gridExtra and grid are needed to arrange the two resulting visuals. Both ggplot2 and ggthemes are used to construct the time series: ggplot2 is a great grammar of graphics library, while ggthemes provides premade palettes for easy implementation. Lastly, the gtrendsR package provides an interface to get Google Trends data.
It could be that you still need to install some of these packages. In that case, use for example the install.packages("gtrendsR") command to install gtrendsR before loading it with library().
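For reference, a minimal setup sketch is shown below; nothing here is specific to the analysis beyond the package names listed above.

# Install any missing packages first, for example:
# install.packages("gtrendsR")

library(quantmod)   # import stock market data
library(gridExtra)  # arrange the two visuals
library(grid)       # low-level grid utilities used alongside gridExtra
library(ggplot2)    # grammar of graphics plotting
library(ggthemes)   # premade themes, e.g. the Google Docs theme
library(gtrendsR)   # interface to Google Trends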
Assemble the Google Trends Data
The next step is to assemble the data that will be plotted. In the code below, change the usr string to your own Gmail account. Then change psw to your password. These two objects are passed to gconnect() so that your R console can programmatically connect to the Google Trends service. Next, create the poke object by calling gtrends() with a search pattern and dates. After a few moments, a list is returned containing the trends information among other data. In the list, each week receives an indexed score between 0 and 100. You can call plot() directly on this list to create the figure.
Now check out the code, and make sure to set up your Gmail variables in order to plot the graph!
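A sketch of that step is shown below. The credentials are placeholders, the object name poke matches how it is used later in this post, and the exact date arguments (January through August 2016, chosen to line up with the stock data below) are assumptions; argument names can also differ between gtrendsR versions.

# Placeholder Gmail credentials: replace with your own
usr <- "your.name@gmail.com"
psw <- "your_password"

# Connect to the Google Trends service
gconnect(usr, psw)

# Request weekly trend scores for the search term "pokemon go"
# (date range assumed; argument names may differ by gtrendsR version)
poke <- gtrends("pokemon go",
                start_date = as.Date("2016-01-01"),
                end_date   = as.Date("2016-08-31"))

# Plot the returned list directly to see the trend line
plot(poke)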
gconnect() can sometimes fail for account security reasons or due to 2-step verification on your account. This will not, however, prevent you from finishing the tutorial. Keep going!
Assemble the Nintendo Stock Price Data
Next, you need to assemble the Nintendo stock price. Nintendo is traded on the Japanese stock market and is referenced by number, not name. Using setSymbolLookup(), pass in the stock index for Nintendo. Don’t forget to also add yahooj, because the data is collected from Yahoo Japan. In other words, you should specify that the stock data for YJ7974.T, or Nintendo, should be downloaded from Yahoo Japan. The setSymbolLookup() function is used to create a reference table of one or more ticker symbols.

The next function call, getSymbols(), actually retrieves the information from Yahoo Japan. The returned object is of the extensible time series (xts) class. An xts object is similar to a data frame, but row names are dates, not strings. In this case, the object contains dates from 2007 to the present along with opening, closing, high, low and adjusted prices for each day. Once again, you can plot() the returned object to create this second graph:
# Specify which stock data to download and from where
setSymbolLookup(YJ7974.T='yahooj')

# Load YJ7974.T from Yahoo Japan
getSymbols('YJ7974.T')

# Plot the Yahoo data
plot(YJ7974.T)
Since the raw data has a temporal mismatch, you need to first subset to dates in this year and then adjust the xts daily numbers to reconcile with the weekly trends data. Adjusting an xts object is slightly different than adjusting a data frame. Placing 2016-01/2016-8 within the index brackets subsets the time series to dates between January 2016 and August 2016. The xts class has unique grouping functions. In this case, apply.weekly() is passed the xts object along with the function, mean, to be applied to each group.
# Subset the time series
nintendo.xts <- YJ7974.T['2016-01/2016-8']

# Get the weekly average of the time series
nintendo.xts <- apply.weekly(nintendo.xts, mean)
If needed, the code below changes the xts object to a data frame so you can use more common functions like subset().
# Change the `xts` object to a data frame
nintendo <- data.frame(date=index(nintendo.xts), coredata(nintendo.xts))

# Subset the data frame to only include data from August 2016
nintendo.aug.16 <- subset(nintendo, format.Date(date, "%y")=="16" & format.Date(date, "%m")=="08")
Calling str() on the poke object, you can identify the list element containing the trend information. The poke$trend element is a traditional data frame. Within the data frame, the weekly trends vector is called pokemon.go. This vector is extracted into a new object, poke.trends.
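A sketch of that extraction, using the poke object returned earlier:

# Inspect the structure of the Google Trends result
str(poke)

# poke$trend is a regular data frame; pull out the weekly scores,
# stored in a column named after the search term
poke.trends <- poke$trend$pokemon.go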
Make the Data Frame for the Visuals
Now you can assemble the data into a succinct data frame to be used in the visuals. The poorman.nt object is a data frame with three vectors. The first references the weekly average closing price, the next contains the weekly Pokémon GO score, and lastly week.2016 comprises a sequence of numbers between 1 and the number of rows in the nintendo object. You could append week start dates, but a simple sequence makes for a more appealing visual.
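A minimal sketch of that assembly is shown below. The closing-price column name YJ7974.T.Close is an assumption about what getSymbols() returns, the nintendo.close, poke.trends and week.2016 names simply mirror how the objects are referenced in this post, and the two weekly series are assumed to cover the same weeks (trim one if they do not).

# Combine weekly average closing price, trend score and a week index
poorman.nt <- data.frame(
  nintendo.close = nintendo$YJ7974.T.Close,  # assumed column name from getSymbols()
  poke.trends    = poke.trends,
  week.2016      = 1:nrow(nintendo)
)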
Compute the Correlation
The radio broadcast mentioned a correlation of .95. Using the code below, we use cor() to compute the correlation between the weekly closing prices and poke.trends, and we get pretty close!
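Using the columns from the sketch above, the calculation is one line:

# Correlation between the weekly Pokémon GO trend scores and
# Nintendo's weekly average closing price
cor(poorman.nt$poke.trends, poorman.nt$nintendo.close)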
Use ggplot to Make the Line Charts
The final step! Let’s make two line charts from the poorman.nt data. Using ggplot(), pass in the data and specify the X and Y axes along with a group of 1 representing individual weeks. The next layer is the geometric line, geom_line(), with some aesthetics. The next layer applies the Google Docs theme. In order to align the charts, another theme() call is added to remove aspects of the Y axis. Finally, a title is added using ggtitle().
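A sketch of that first chart follows; the object name nintendo.plot, the line color and the title are assumptions:

# Nintendo weekly average closing price
nintendo.plot <- ggplot(poorman.nt, aes(x = week.2016, y = nintendo.close, group = 1)) +
  geom_line(color = "red", size = 1) +       # the geometric line layer
  theme_gdocs() +                            # Google Docs theme from ggthemes
  theme(axis.title.y = element_blank(),      # remove Y-axis pieces so the
        axis.text.y  = element_blank(),      # stacked charts align cleanly
        axis.ticks.y = element_blank()) +
  ggtitle("Nintendo Weekly Average Closing Price")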
The same process is repeated for the trends data. The Y axis is changed to poke.trends, along with the line’s color and title.
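Under the same assumptions, the trends chart might look like this:

# Pokémon GO weekly Google Trends score, built the same way
poke.plot <- ggplot(poorman.nt, aes(x = week.2016, y = poke.trends, group = 1)) +
  geom_line(color = "blue", size = 1) +
  theme_gdocs() +
  theme(axis.title.y = element_blank(),
        axis.text.y  = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Pokémon GO Google Trends Score")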
Stack the Visuals
Lastly, pass each ggplot object into grid.arrange() to stack the visuals. This makes it easier for your audience to consume the figure below.
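Assuming the two plot objects from the sketches above, stacking them is a single call:

# Stack the two charts vertically so the peaks can be compared week by week
grid.arrange(nintendo.plot, poke.plot, ncol = 1)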
It’s clear that the data is correlated. Admittedly, you can even see the news peak happening one week ahead of the stock price peak. Still, market timing based on this type of analysis is fraught with challenges. The periodicity of this visual (I never got to see Bloomberg’s NT for real) is in weeks, but the market moves more dynamically. Additionally, the lines do not behave the same after the peak. It is subtle, but the lines diverge. So to trade on this information, you would have to time the market peak of the news trend line.
Congrats! You have now successfully built your own poor man’s News Trends feature in R!
At best, NT analysis on a Bloomberg Terminal suffers from simultaneity bias. At worst, news stories represent post-dictors, not predictors, for a model or qualitative trading strategy. By the time Nintendo’s cautious words appear on a Bloomberg Terminal, the market has already absorbed them and reacted accordingly. So pairing a stock price with an NT analysis is little more than a curiosity… Only the earliest of investors can possibly trade on news feeds, and those left to manually create and mentally process a graphic will be left in the dust. How would I ever use a delayed news feed to trade in the future?! I expected more sophistication from financial professionals, as I am sure most text miners and data scientists would have recognized the issues with correlation being sold as causation and a temporal aspect being ignored altogether.
Word choice matters, and not only for journalists. Words matter for text mining and quantitative analysis. Words, and the depth of the modeled problem, especially matter if they are used as features in machine learning. I can easily give the reporters the benefit of the doubt that they did not mean to imply on live radio that correlation and causation are the same. If they were, we should all stop eating ice cream: as ice cream sales increase, so do drownings… There must be a causal effect.
Does this case study spark your interest in text mining with R? Head over to Ted’s “Introduction To Text Mining: Bag of Words” course, in which you can learn much more about this advanced data science topic!