Hypothesis Testing Tesla Case Study: EPS vs. Stakeholder Engagement
The dataset eps.csv was assembled from several sources. The Twitter Sentiment data comes from Kaggle (Invalid URL). It also contained stock data, which we have left out of this analysis; instead we use Earnings Per Share (EPS) data obtained from MacroTrends (Invalid URL). The Google News and Search trends are from Google Trends (Invalid URL).
For this project, the financial and stakeholder engagement data above were joined to create engagement_data.
The columns in the new dataset are:
| Column | Description |
|---|---|
| date | The date |
| EPS | The Earnings Per Share at the time |
| Twitter Sentiment | Normalised (0-1) score of the feelings toward the Tesla brand |
| Google News Trends | Normalised (0-100) score of the popularity of the Tesla brand in news articles |
| Google Search Trends | Normalised (0-100) score of the popularity of the Tesla brand in search history |
The hypothesis that we want to test is:
Higher stakeholder engagement (in this case, Twitter Sentiment and Google Trends) means increased financial returns (in this case, EPS).
Basic Analysis
First, we will see if there is any correlation between the variables in the dataset. If there is a strong correlation, we can fit the data to a predictive model.
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
# Load the data
engagement_data = pd.read_csv('eps_data.csv', parse_dates=['date'], index_col='date')
# Convert the date index to a DatetimeIndex, inferring its frequency
engagement_data.index = pd.DatetimeIndex(engagement_data.index.values, freq=engagement_data.index.inferred_freq)
# Sort the DataFrame chronologically
engagement_data = engagement_data.sort_index()
# Check for missing values and handle if necessary
print(engagement_data.isnull().sum())
# Perform stationarity tests and difference the data if necessary
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value}')
adf_test(engagement_data['EPS'])
# First-difference EPS (the difference is applied regardless of the ADF result above)
engagement_data['EPS_diff'] = engagement_data['EPS'].diff()
# Drop the NaN row introduced by differencing
engagement_data.dropna(inplace=True)
# Correlation analysis
correlation_matrix = engagement_data.corr()
print(correlation_matrix)
# Visualization
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
pair_grid = sns.pairplot(engagement_data)
# pairplot returns a grid of axes, so the title is set on the figure
pair_grid.figure.suptitle('Pair Plot', y=1.02)
plt.show()
The graphs above show mostly weak correlations between EPS and the stakeholder variables. The most promising link is between EPS and Google Search Trends, with a positive correlation of 0.67.
The correlation between EPS and Twitter Sentiment, which we were most interested in, was very weak at 0.086.
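A correlation coefficient by itself does not say whether the relationship could have arisen by chance. As a minimal sketch of how a coefficient can be turned into a formal test, the block below uses `scipy.stats.pearsonr` on synthetic stand-in data. The variable names, the sample size of 40 and the 0.05 significance level are all assumptions for illustration, not values from eps_data.csv.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data (NOT the Tesla dataset): 40 hypothetical periods
rng = np.random.default_rng(0)
search_trends = rng.uniform(0, 100, size=40)
eps = 0.02 * search_trends + rng.normal(0, 0.6, size=40)

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis that the true correlation is zero
r, p_value = pearsonr(search_trends, eps)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject H0: the correlation differs from zero at the 5% level")
else:
    print("Fail to reject H0: no significant correlation detected")
```

The p-value assumes independent observations. Time series such as EPS and search trends are usually autocorrelated, so a result like this is better treated as indicative than conclusive.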
Linear Regression (EPS vs. Google Search Trends)
Continuing the analysis, we model a linear regression of EPS on Google Search Trends and obtain an R-squared value of 0.39. In finance, an R-squared above 0.7 would generally be seen as showing a high level of correlation, whereas a measure below 0.4 would show a low correlation.
So, on further examination, the Google Search Trends data does not support our hypothesis.
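The correlation of 0.67 and the R-squared of 0.39 are related figures. For a regression with a single predictor and an intercept, the in-sample R-squared equals the square of the Pearson correlation, so 0.67 corresponds to roughly 0.45. The regression in this section scores R-squared on a held-out test set, which is one plausible reason the reported value is lower. The sketch below checks the identity on synthetic data; the data and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (NOT the Tesla dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=50)
y = 0.03 * x + rng.normal(0, 0.8, size=50)

# Pearson correlation between the predictor and the target
r = np.corrcoef(x, y)[0, 1]

# Fit on all points and score on the same points (in-sample R-squared)
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
r2_in_sample = r2_score(y, model.predict(X))

print(f"r squared       = {r ** 2:.6f}")
print(f"in-sample R^2   = {r2_in_sample:.6f}")
print(f"0.67 squared    = {0.67 ** 2:.3f}")
```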
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the data
engagement_data = pd.read_csv('eps_data.csv', parse_dates=['date'], index_col='date')
# Convert the date index to a DatetimeIndex, inferring its frequency
engagement_data.index = pd.DatetimeIndex(engagement_data.index.values, freq=engagement_data.index.inferred_freq)
# Sort the DataFrame chronologically
engagement_data = engagement_data.sort_index()
# Check for missing values and handle if necessary
print(engagement_data.isnull().sum())
# Drop rows with missing values
engagement_data.dropna(inplace=True)
# Search Trend Analysis
# Extract EPS and the engagement metric
X = engagement_data[['Google Search Trends']]  # Input feature, kept as a DataFrame because scikit-learn expects 2D input
y = engagement_data['EPS']  # Target variable (EPS)
# Split the data into training and testing sets
# Note: this is a random split; for time-ordered data a chronological split (shuffle=False) may be more appropriate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plot the regression line
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.title('Linear Regression: EPS vs. Google Search Trends')
plt.xlabel('Google Search Trends')
plt.ylabel('EPS')
plt.legend()
plt.show()
VAR (Vector Autoregression) Analysis
For the final analysis to test the hypothesis, we will use a VAR model, which is designed to handle multiple time series variables simultaneously. In the context of analysing the relationship between stakeholder engagement, EPS, and social/environmental impact, a VAR model allows us to incorporate all relevant variables into the analysis, capturing their interdependencies and dynamic interactions.
This approach allows us to capture any potential feedback effects between EPS and the stakeholder engagement metrics.
# Import necessary libraries
import pandas as pd
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error
# Load the data
engagement_data = pd.read_csv('eps_data.csv', parse_dates=['date'], index_col='date')
# Convert the date index to a DatetimeIndex, inferring its frequency
engagement_data.index = pd.DatetimeIndex(engagement_data.index.values, freq=engagement_data.index.inferred_freq)
# Sort the DataFrame chronologically
engagement_data = engagement_data.sort_index()
# Check for missing values and handle if necessary
print(engagement_data.isnull().sum())
# Perform stationarity tests and difference the data if necessary
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value}')
adf_test(engagement_data['EPS'])
# First-difference EPS (the difference is applied regardless of the ADF result above)
engagement_data['EPS_diff'] = engagement_data['EPS'].diff()
# Drop the NaN row introduced by differencing
engagement_data.dropna(inplace=True)
# Chronological train/test split. The original split is not shown, so the 80/20 ratio here is an assumption
split_point = int(len(engagement_data) * 0.8)
train_data = engagement_data.iloc[:split_point]
test_data = engagement_data.iloc[split_point:]
# Fit the VAR model on the differenced EPS and the engagement metrics. VAR models several time series
# jointly, which lets us capture potential feedback effects between EPS and the stakeholder engagement metrics.
model = VAR(train_data[['EPS_diff', 'Twitter Sentiment', 'Google News Trends', 'Google Search Trends']])
results = model.fit(3)
# Forecast
forecast_steps = len(test_data)
forecast = results.forecast(y=results.endog, steps=forecast_steps)
# Invert the differencing to recover EPS levels from the forecast differences
def invert_diff(original, diff_series):
    return original + diff_series.cumsum()
forecast = pd.DataFrame(forecast, index=test_data.index, columns=['EPS_diff_forecast', 'Twitter_forecast','News_forecast','Search_forecast'])
forecast['EPS_forecast'] = invert_diff(train_data['EPS'].iloc[-1], forecast['EPS_diff_forecast'])
# Evaluate model
mse = mean_squared_error(test_data['EPS'], forecast['EPS_forecast'])
rmse = sqrt(mse)
print('RMSE:', rmse)
# Plot actual vs forecasted EPS
plt.figure(figsize=(12, 6))
plt.plot(test_data.index, test_data['EPS'], label='Actual EPS', color='blue')
plt.plot(forecast.index, forecast['EPS_forecast'], label='Forecasted EPS', color='red')
plt.title('Actual vs Forecasted EPS')
plt.xlabel('Date')
plt.ylabel('EPS')
plt.legend()
plt.show()
Conclusion
Our original hypothesis was that higher stakeholder engagement (Twitter Sentiment and Google Trends) translates into increased financial returns for a company (EPS).
The correlation analysis found a link between EPS and Google Search Trends, but the other metrics had very low correlation with EPS, which suggests rejection of the hypothesis.
The linear regression between EPS and Google Search Trends produced a low R-squared as well, again suggesting rejection of the hypothesis.
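The VAR model, which considered all of the variables together and how they might interact, had an RMSE of 0.29. The ADF critical values are negative by construction and relate to the stationarity of the EPS series rather than to any association between the variables, so on their own these results offer at most limited support for the hypothesis.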
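An RMSE figure is easier to interpret when set against a simple baseline. As a hedged sketch, the block below compares a model forecast with a naive forecast that repeats the last observed training value. All numbers are hypothetical placeholders, not results from this analysis.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical EPS values (NOT taken from eps_data.csv)
train_eps = [0.10, 0.25, 0.40, 0.55]
test_eps = [0.60, 0.70, 0.85]
model_forecast = [0.58, 0.75, 0.80]  # placeholder for a model's output

# Naive baseline: carry the last training value forward
naive_forecast = [train_eps[-1]] * len(test_eps)

model_rmse = rmse(test_eps, model_forecast)
naive_rmse = rmse(test_eps, naive_forecast)
print(f"Model RMSE: {model_rmse:.4f}")
print(f"Naive RMSE: {naive_rmse:.4f}")
print("Model beats naive baseline:", model_rmse < naive_rmse)
```

If a model's RMSE is not clearly below the naive baseline, the engagement variables are adding little predictive value.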
In conclusion, more varied stakeholder metrics need to be analysed to determine whether there are any strong links between stakeholder engagement and EPS.