
Baseball and Soccer dataset analysis using numpy

Hello there! There are two datasets to run calculations on, baseball players and soccer players, each loaded as its own set of arrays. This notebook is intended purely as practice with the NumPy package. It doesn't stick strictly to the course, so I'll be experimenting with more functions on my own from here on. I'm not sure how good I am at documenting my code, so all I can say is "Profitez de votre séjour" (enjoy your stay) :)

Exploration Objectives

Use the arrays imported in the first cell to explore the data and practice your skills!

Baseball Dataset

  • Weight of first 10 players.
  • Median weight of all players.
  • Number of teams and list them all.
  • Number of players above 6.4 ft height and teamwise list.
  • Names of Players over 250 lbs weight
  • Average, Max and Min age of players
  • Positions and Number of players in the position

Soccer Dataset

  • Convert soccer_shooting from decimal to whole numbers.
  • Correlation between soccer_ratings and soccer_heights. Do taller players get higher ratings?
  • What is the average rating for attacking players ('A')?

Both Datasets Combined

  • Who is taller on average? Baseball players or soccer players? Keep in mind that baseball heights are stored in inches!

Importing all arrays from datasets ⏬

# Importing course packages; you can add more too!
import numpy as np
import math

# Import columns as numpy arrays 

#BASEBALL
baseball_names = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=0, skip_header=1, dtype=str
)
baseball_teams = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=1, skip_header=1, dtype=str
)
baseball_heights = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=3, skip_header=1
)
baseball_weights = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=4, skip_header=1
)
baseball_ages = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=5, skip_header=1
)
baseball_positions = np.genfromtxt(
    fname="baseball.csv", delimiter=",", usecols=2, skip_header=1, dtype=str
)

# SOCCER
soccer_names = np.genfromtxt(
    fname="soccer.csv",
    delimiter=",",
    usecols=1,
    skip_header=1,
    dtype=str,
    encoding="utf", 
)
soccer_ratings = np.genfromtxt(
    fname="soccer.csv",
    delimiter=",",
    usecols=2,
    skip_header=1,
    encoding="utf", 
)
soccer_positions = np.genfromtxt(
    fname="soccer.csv",
    delimiter=",",
    usecols=3,
    skip_header=1,
    encoding="utf", 
    dtype=str,
)
soccer_heights = np.genfromtxt(
    fname="soccer.csv",
    delimiter=",",
    usecols=4,
    skip_header=1,
    encoding="utf", 
)
soccer_shooting = np.genfromtxt(
    fname="soccer.csv",
    delimiter=",",
    usecols=8,
    skip_header=1,
    encoding="utf", 
)
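
To see what usecols, skip_header and dtype do in the calls above, here is a minimal sketch that reads from an in-memory string instead of baseball.csv. The three rows, the player names and the load_column helper are made up for illustration; only the column order (name, team, position, height, weight, age) follows the imports above.

```python
import io
import numpy as np

# Hypothetical three-row CSV standing in for baseball.csv (values are made up)
csv_text = """Name,Team,Position,Height,Weight,Age
Player_A,AAA,Catcher,74,180,22.99
Player_B,AAA,Pitcher,77,,34.69
Player_C,BBB,Catcher,72,210,30.78
"""

def load_column(col, **kwargs):
    # Hypothetical helper: a fresh StringIO per call, like reopening the file
    return np.genfromtxt(
        io.StringIO(csv_text), delimiter=",", usecols=col, skip_header=1, **kwargs
    )

names = load_column(0, dtype=str)  # text column, so dtype=str is needed
weights = load_column(4)           # numeric column, default float dtype

print(names)
print(weights)  # the empty weight field is read as nan
```

Note that genfromtxt fills a missing numeric field with nan rather than raising an error, which matters for any statistic computed later.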

Variables Available

Baseball Dataset Arrays

This dataset consists of data of baseball players such as Name, Team, Height, Weight, Age and Positions.

  • baseball_names
  • baseball_teams
  • baseball_heights
  • baseball_weights
  • baseball_ages
  • baseball_positions

Soccer Dataset Arrays

This dataset consists of data on soccer players such as Name, Rating, Position, Height and Dominant Foot, plus other details such as rare, shooting, passing, dribbling, defending, heading, diving, handling, kicking, reflexes, speed and positioning scores.

  • soccer_names
  • soccer_ratings
  • soccer_positions
  • soccer_heights
  • soccer_shooting

⚾ BB - Weight of first 10 players.

We'll do this by building a 2-D array of player names and weights, which means combining two 1-D arrays into one 2-D array. Because a NumPy array holds a single dtype, the weights end up stored as strings in the combined array. NumPy offers four functions that could be considered for this:

  • stack() - used this time
  • dstack()
  • concatenate() - not ideal for the scenario
  • column_stack()
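
To make the difference between these four functions concrete, here is a small sketch on made-up data. The names and weights are hypothetical, and the weights are cast to strings explicitly so the example does not depend on how NumPy promotes mixed text and number inputs.

```python
import numpy as np

# Hypothetical sample data (not taken from baseball.csv)
names = np.array(["Player_A", "Player_B", "Player_C"])
weights = np.array([180.0, 215.0, 210.0])

# Cast explicitly so both inputs share a string dtype
weights_str = weights.astype(str)

stacked = np.stack((names, weights_str), axis=1)      # pairs each name with its weight
col_stacked = np.column_stack((names, weights_str))   # same result for 1-D inputs
d_stacked = np.dstack((names, weights_str))           # adds a leading axis of length 1
joined = np.concatenate((names, weights_str))         # just one longer 1-D array

print(stacked.shape)      # (3, 2)
print(col_stacked.shape)  # (3, 2)
print(d_stacked.shape)    # (1, 3, 2)
print(joined.shape)       # (6,)
```

concatenate() only appends one array to the end of the other, which is why it is not ideal here, while stack() with axis=1 and column_stack() both give one row per player.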

Variables Created

  • b_name_weight - 2d array of name and weights.
# Creating a 2D Array of Name and weight
b_name_weight = np.stack((baseball_names,baseball_weights), axis=1)

print(b_name_weight[:10])

⚾ BB - Median weight of all players.

This is done simply by applying np.median() to the weights array.

print("Median weight of baseball players is "+str(np.median(baseball_weights))+" lbs")
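
One caveat worth sketching: if the weight column has an empty field, genfromtxt reads it as nan, and np.median() then returns nan. Whether baseball.csv actually contains missing weights is an assumption here; the array below is made up to show the difference between np.median() and np.nanmedian().

```python
import numpy as np

# Hypothetical weights with one missing value (not the real dataset)
weights = np.array([180.0, 215.0, np.nan, 210.0, 188.0])

plain_median = np.median(weights)    # nan propagates through the median
safe_median = np.nanmedian(weights)  # ignores nan: median of 180, 188, 210, 215

print(plain_median)  # nan
print(safe_median)   # 199.0
```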

⚾ BB - Number of teams and list them all.

This can be done with np.unique(), which picks out the unique elements of an array and returns them sorted. The return_index parameter also returns the index of each team's first occurrence, which we store in indices. We then apply argsort() to indices and store the result in sorted_indices, which is used to build b_unique_teams: the unique team names in the order they first appear in the data.

Variables Created

  • b_unique_teams
# Obtain unique team names in the order they appear
unique_teams, indices = np.unique(baseball_teams, return_index=True)
sorted_indices = np.argsort(indices)

b_unique_teams = unique_teams[sorted_indices]
count=0

# Traversing unique teams array to print
for teams in b_unique_teams:
    count = count+1
    print(teams, end ="  ")
    if(count%10==0): print("\n")

# Total Number of teams
print("Total Number of teams present = " + str(len(b_unique_teams)))
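
The return_index and argsort() steps are easier to follow on a tiny array. This is a sketch with made-up team codes, not the real team list.

```python
import numpy as np

# Hypothetical team column (made-up codes)
teams = np.array(["NYY", "BOS", "NYY", "ANA", "BOS", "ANA", "NYY"])

# np.unique returns the values sorted, plus the index of each first occurrence
unique_sorted, first_idx = np.unique(teams, return_index=True)
print(unique_sorted)  # ['ANA' 'BOS' 'NYY']
print(first_idx)      # [3 1 0]

# argsort of the first-occurrence indices restores the order of appearance
order = np.argsort(first_idx)
in_order = unique_sorted[order]
print(in_order)       # ['NYY' 'BOS' 'ANA']

# Equivalent shortcut: index the original array with the sorted first indices
shortcut = teams[np.sort(first_idx)]
```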

⚾ BB - Number of players above 6.4 ft height and teamwise list.

First, we convert the array of heights from inches to feet. Note that 6.4 ft here means decimal feet (76.8 inches), not 6 ft 4 in. Then we create a dictionary, team_count, that maps each team to its number of players above 6.4 ft, skipping teams that have none. Finally, we print team_count by iterating over its entries.

Variables Created

  • b_height_ft - array of heights in ft
  • team_count - dictionary of teams having players above 6.4ft
# Converting inches to ft
b_height_ft = baseball_heights / 12

# Create a dictionary to store the count of players above 6.4ft for each team
team_count = {}

# Loop through each team and count the number of players above 6.4ft
for team in b_unique_teams:
    val = len(b_height_ft[(baseball_teams == team) & (b_height_ft > 6.4)])
    if val>0:
        team_count[team] = val

# Print the teams with players above 6.4ft
print("Teams with players above 6.4ft:")
for team, count in team_count.items():
    print(team + ": " + str(count), end="  ")

print("\nNumber of Players above 6.4ft = " + str(len(b_height_ft[b_height_ft>6.4])))
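
As an alternative to looping over every team, the same counts can be sketched with a single boolean mask and np.unique(..., return_counts=True). The teams and heights below are made up, and this is one possible approach rather than the method used above; the resulting keys come out in alphabetical order instead of order of appearance.

```python
import numpy as np

# Hypothetical sample data (heights in inches, as in the baseball dataset)
teams = np.array(["AAA", "AAA", "BBB", "CCC", "BBB", "AAA"])
heights_in = np.array([78.0, 74.0, 77.0, 72.0, 80.0, 79.0])

heights_ft = heights_in / 12
tall = heights_ft > 6.4  # boolean mask, True for players above 6.4 ft (76.8 in)

# Count how often each team appears among the tall players
tall_teams, tall_counts = np.unique(teams[tall], return_counts=True)
team_count = dict(zip(tall_teams.tolist(), tall_counts.tolist()))

print(team_count)       # {'AAA': 2, 'BBB': 2}
print(int(tall.sum()))  # 4
```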
    

⚾ BB - Names of Players over 250 lbs weight.

We traverse the b_name_weight array and, for every player whose weight is above 250 lbs, print the name and weight.

# Using the existing 2-D array of names and weights (weights are stored as strings)
for name, weight in b_name_weight:
    # Convert the weight back to a number before comparing
    if float(weight) > 250:
        print(name + " : " + weight + " lbs\n")
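
The same filter can also be sketched with a boolean mask on the original numeric array, which avoids converting strings back to floats. The names and weights below are hypothetical stand-ins for baseball_names and baseball_weights.

```python
import numpy as np

# Hypothetical sample data (not taken from baseball.csv)
names = np.array(["Player_A", "Player_B", "Player_C", "Player_D"])
weights = np.array([180.0, 260.0, 250.0, 278.0])

# True only where the weight is strictly above 250 lbs
heavy = weights > 250

heavy_names = names[heavy]
heavy_weights = weights[heavy]

for name, weight in zip(heavy_names, heavy_weights):
    print(f"{name} : {weight} lbs")
```

A player at exactly 250 lbs is excluded, matching the strict comparison in the loop above.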

⚾ BB - Average, Max and Min age of players

The average can be calculated by taking the mean of the baseball_ages array.