Skip to content

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • ๐Ÿ—บ๏ธ Explore: Which index has produced the highest average annual return?
  • ๐Ÿ“Š Visualize: Create a plot visualizing a 30 day moving average for an index of your choosing.
  • ๐Ÿ”Ž Analyze: Compare the volatilities of the indexes included in the dataset.

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working for an investment firm that is looking to invest in index funds. They have provided you with a dataset containing the returns of 13 different indexes. Your manager has asked you to make short-term forecasts for several of the most promising indexes to help them decide which would be a good fund to include. Your analysis should also include a discussion of the associated risks and volatility of each fund you focus on.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.

Stock Exchange Data

This dataset consists of stock exchange data since 1965 for several indexes. It contains the daily stock prices along with the volume traded each day.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd
stock = pd.read_csv("stock_data.csv", index_col=None)
print(stock.shape)
stock.head(12000)

Data Dictionary

ColumnExplanation
IndexTicker symbol for indexes
DateData of observation
OpenOpening price
HighHighest price during trading day
LowLowest price during trading day
CloseClose price
Adj CloseClose price adjusted for stock splits and dividends
VolumeNumber of shares traded during trading day
CloseUSDClose price in terms of USD

Source of dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
# For which index we have more data
index_counts = (stock['Index'].value_counts(normalize=True) * 100).sort_values(ascending=False).round(2)

# Plotting the data
plt.figure(figsize=(12, 6))
sns.barplot(x=index_counts.index, y=index_counts.values, palette="viridis")
plt.title('Percentage of Data Available for Each Index')
plt.xlabel('Index')
plt.ylabel('Percentage of Data (%)')
plt.xticks(rotation=45)
plt.show()
#stock_NYA.shape
#stock_NYA.info()
#stock_NYA.isna().sum()
import matplotlib.pyplot as plt

# Convert 'Date' column to datetime format
stock['Date'] = pd.to_datetime(stock['Date'])

# Extract year from 'Date' column
stock['Year'] = stock['Date'].dt.year

# Calculate annual return for each index
annual_return = stock.groupby(['Index', 'Year']).apply(
    lambda x: (x.sort_values('Date').iloc[-1]['CloseUSD'] / x.sort_values('Date').iloc[0]['CloseUSD']) - 1
).reset_index(name='Annual Return')

# Calculate average annual return for each index
average_annual_return = annual_return.groupby('Index')['Annual Return'].mean().reset_index()

# Find the index with the highest average annual return
highest_avg_annual_return_index = average_annual_return.loc[average_annual_return['Annual Return'].idxmax()]

# Sort the average annual return values
average_annual_return = average_annual_return.sort_values(by='Annual Return', ascending=False)

# Display the name of each index and the one with the highest average annual return
average_annual_return, highest_avg_annual_return_index

# Visualization
plt.figure(figsize=(10, 6))
plt.bar(average_annual_return['Index'], average_annual_return['Annual Return'], color='skyblue')
plt.xlabel('Index')
plt.ylabel('Average Annual Return')
plt.title('Average Annual Return by Index')
plt.xticks(rotation=45)
plt.show()

According to this Fig we can see that the index 399001.SZ increased by 14% each year from 1996 to 2010.

I will use The 30-day moving average (MA) is a commonly used statistical method to smooth out short-term fluctuations and highlight longer-term trends.

Analyzing the data for the index with highst avaliable data

#Substing that part of data related to NYA Index

stock_NYA = stock[stock['Index']== 'NYA']
stock_NYA
โ€Œ
โ€Œ
โ€Œ