World Cup 2022 Analysis
    1. 2022 World Cup

    The 2022 FIFA World Cup was the 22nd edition of the tournament. Argentina beat the reigning world champions, France, in the final to win their third World Cup. The World Cup is held once every four years, and the 2022 edition featured 32 teams from around the world who qualified through their respective regions.

    This was the second time the competition was held in Asia, with 64 matches played over 29 days in November and December. The 32 teams were divided into eight groups of four. Each team played the other three teams in its group once, and the top two teams from each group advanced to the single-elimination knockout stage.
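
    As a quick sanity check on the format described above, the match count can be rebuilt from the group and knockout structure. This is a minimal sketch in base R; the variable names are illustrative and not part of the original notebook.

```r
# Tournament format as described above: 32 teams in 8 groups of 4
n_groups <- 8
teams_per_group <- 4

# Round-robin: every pair of teams in a group meets once
matches_per_group <- choose(teams_per_group, 2)
group_matches <- n_groups * matches_per_group

# Knockout: the top two teams per group enter a single-elimination bracket.
# A bracket of k teams needs k - 1 matches; the third-place match adds one more.
knockout_teams <- n_groups * 2
knockout_matches <- (knockout_teams - 1) + 1

total_matches <- group_matches + knockout_matches
print(total_matches)
```

    The total agrees with the 64 matches stated above.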

    This project looks at which factors, if any, helped some teams go further in the tournament than others. Most of the data used in this notebook comes from a dataset on Kaggle. The FIFA ranking and group for each team also come from Kaggle, from a dataset that includes two graphics showing how the tournament plays out.

    2. Libraries and Datasets

    Here we load the packages used for the analysis and import the data. Unnecessary columns are removed from the tables, and two of the datasets are merged into one dataframe called world_cup_data to make them easier to work with throughout the analysis.

    # loading packages for analysis
    library("readr")
    library("dplyr")
    library("forcats")
    library("ggplot2")
    library("stringr")
    options(repr.plot.width = 9, repr.plot.height = 9)

    Data importing

    # importing dataset containing stats for each team at the world cup
    team_stats <- read_csv("datasets/kaggle/Squad Standard Stats.csv")
    
    # importing dataset containing an overall view of how each team performed
    final_standings_table <- read_csv("datasets/kaggle/Final League Table.csv")
    
    # importing dataset that contains the world ranking of each team and the group they were in for the tournament
    team_rankings <- read_csv("datasets/kaggle/2022_world_cup_groups.csv")

    Data cleaning and joining

    # keeping columns 1 to 4 plus Assists, which drops the other 21 columns from team_stats
    team_stats <- team_stats %>%
        select(1:4, Assists)
    
    # removing 2 columns from final_standings_table and sorting by team name
    final_standings_table <- final_standings_table %>%
        select(-Points, -`xG Difference per 90`) %>%
        arrange(Team)
    
    #renaming USA to United States and Korea Republic to South Korea for proper joining
    final_standings_table$Team <- str_replace(final_standings_table$Team, "USA", "United States")
    final_standings_table$Team <- str_replace(final_standings_table$Team, "Korea Republic", "South Korea")
    team_stats$Team <- str_replace(team_stats$Team, "Korea Republic", "South Korea")
    
    # reordering the levels of `Depth of the Campaign` from furthest stage reached (F) to earliest exit (GR)
    final_standings_table <- final_standings_table %>%
        mutate(`Depth of the Campaign` = fct_relevel(`Depth of the Campaign`, c("F", "3P", "QF", "R16", "GR")))
    
    #joining the dataframes together
    world_cup_data <- final_standings_table %>% inner_join(team_stats, by = "Team") 
    
    str(world_cup_data)
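
    Because inner_join() silently drops rows whose Team values don't match, it can help to check for unmatched names before and after renaming. The following is a minimal sketch using dplyr's anti_join() on small made-up tables; standings_toy, stats_toy and their values are hypothetical stand-ins, not the real Kaggle data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy tables; the real ones come from the CSV files loaded above
standings_toy <- tibble(Team = c("Argentina", "USA", "Korea Republic"),
                        `Goals For` = c(3, 2, 1))
stats_toy <- tibble(Team = c("Argentina", "United States", "South Korea"),
                    Assists = c(2, 1, 1))

# Rows of standings_toy that have no matching Team in stats_toy
unmatched_before <- anti_join(standings_toy, stats_toy, by = "Team")

# Apply the same kind of renaming used in the cleaning step
standings_toy <- standings_toy %>%
    mutate(Team = str_replace(Team, "USA", "United States"),
           Team = str_replace(Team, "Korea Republic", "South Korea"))

# After renaming, every team should find a match
unmatched_after <- anti_join(standings_toy, stats_toy, by = "Team")

print(unmatched_before$Team)
print(nrow(unmatched_after))
```

    An empty result from anti_join() means the join will keep every team.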

    3. Goals Scored

    Graph 3-1 shows the number of goals scored by each team in the tournament, colored by how far the team progressed. 'F' stands for the final (the 1st and 2nd place teams), '3P' for the third-place match (the 3rd and 4th place teams, who were knocked out in the semi-finals), 'QF' for the quarter-finals, 'R16' for the round of sixteen, and 'GR' for the group stage.

    ggplot(world_cup_data, aes(x = `Goals For`, y = reorder(Team, `Goals For`),
                               color = `Depth of the Campaign`)) +
        geom_point(size = 5) +
        geom_segment(aes(xend = 0, yend = Team), linewidth = 3) +
        labs(x = "Goals Scored", y = "Country", title = "Graph 3-1 Goals by Team") +
        geom_text(aes(label = `Goals For`), color = "white", size = 3) +
        scale_color_discrete("Length of Campaign") +
        theme(axis.text.y = element_text(size = 14))

    In general, the more goals a team scored, the further it went in the tournament, but there are some interesting exceptions. Spain didn't get past the first knockout round, yet scored more goals than both teams in the third-place match and a quarter-finalist. Germany scored the most goals of any team eliminated in the group stage. Belgium, one of the top-rated teams in the tournament, managed only a single goal in their campaign.
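
    The pattern read off the graph can also be summarised numerically by grouping on the campaign stage. This is a minimal sketch on made-up values: the toy tibble and its numbers are hypothetical and only mirror the column names of world_cup_data.

```r
library(dplyr)

# Hypothetical values for illustration only; not the real tournament data
toy <- tibble(
    Team = c("Team A", "Team B", "Team C", "Team D", "Team E", "Team F"),
    `Depth of the Campaign` = factor(c("F", "F", "QF", "R16", "GR", "GR"),
                                     levels = c("F", "3P", "QF", "R16", "GR")),
    `Goals For` = c(15, 16, 10, 9, 6, 1)
)

# One row per stage that actually occurs in the data, in factor-level order
goals_by_stage <- toy %>%
    group_by(`Depth of the Campaign`) %>%
    summarise(teams = n(),
              mean_goals = mean(`Goals For`),
              max_goals = max(`Goals For`))

print(goals_by_stage)
```

    Running the same summary on world_cup_data would show whether the mean goals per stage falls as the campaigns get shorter.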

    4. Knockout stage teams

    Next we find the teams that made it past the group stage. This can be done in two ways: by filtering for teams whose campaign wasn't "GR", or by filtering for teams that played more than three matches. Two new dataframes, teams_qualified and teams_not_qualified, are created for use later in the analysis.

    # finding qualified teams by standings table, returned as a data frame for use later
    teams_qualified <- world_cup_data %>%
        filter(`Depth of the Campaign` != "GR")%>%
        select(Team, `Depth of the Campaign`, `Goal Difference`)%>%
        arrange(`Depth of the Campaign`)
    
    # finding teams who didn't make the knockout stage, returned as a dataframe for use later
    teams_not_qualified <- world_cup_data %>%
        filter(`Depth of the Campaign` == "GR")%>%
        select(Team, `Depth of the Campaign`, `Goal Difference`)%>%
        arrange(`Depth of the Campaign`)
    
    # finding qualified teams by team stats, returned as a vector instead of a dataframe
    teams_qualified_by_matches <- world_cup_data %>%
        filter(`Matches Played` > 3) %>%
        pull(Team)
    
    print(teams_qualified_by_matches)
    print(teams_qualified, n = 16)
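
    Since the two filters are meant to identify the same teams, it is worth confirming that they agree. The following is a minimal sketch on a made-up table; toy and its team names are hypothetical, and only the column names follow world_cup_data.

```r
library(dplyr)

# Hypothetical toy table: a finalist plays 7 matches, a round-of-sixteen
# team plays 4, and a team eliminated in the group stage plays 3
toy <- tibble(
    Team = c("Team A", "Team B", "Team C", "Team D"),
    `Depth of the Campaign` = c("F", "R16", "GR", "GR"),
    `Matches Played` = c(7, 4, 3, 3)
)

by_stage <- toy %>%
    filter(`Depth of the Campaign` != "GR") %>%
    pull(Team)

by_matches <- toy %>%
    filter(`Matches Played` > 3) %>%
    pull(Team)

# Sorting both vectors first makes the comparison independent of row order
methods_agree <- identical(sort(by_stage), sort(by_matches))
print(methods_agree)
```

    The same comparison could be applied to teams_qualified$Team and teams_qualified_by_matches.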

    5. Team rankings

    Here we compare the FIFA rankings of the teams that did and didn't qualify for the knockout stages. We look at the four highest-ranked teams among those that didn't qualify and the four lowest-ranked teams among those that did. The lower the number, the higher the rank: a team with a FIFA ranking of 5 is expected to be better than a team with a FIFA ranking of 12.

    Lowest ranked teams that qualified

    # four lowest-ranked teams that qualified (the largest `FIFA Ranking` values)
    print("Lowest ranked teams of those in knockout stages")
    teams_qualified %>%
        inner_join(team_rankings, by = "Team") %>%
        slice_max(`FIFA Ranking`, n = 4)
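
    Because a lower number means a better ranking, "lowest ranked" corresponds to the largest values and "highest ranked" to the smallest. The sketch below illustrates this with dplyr's slice_max() and slice_min(), used here as an alternative to top_n(); toy_rank and its rankings are hypothetical values, not the real FIFA rankings.

```r
library(dplyr)

# Hypothetical rankings for illustration only
toy_rank <- tibble(
    Team = c("Team A", "Team B", "Team C", "Team D", "Team E"),
    `FIFA Ranking` = c(3, 12, 28, 41, 50)
)

# Lowest-ranked teams: the largest ranking numbers
lowest_ranked <- toy_rank %>%
    slice_max(`FIFA Ranking`, n = 2)

# Highest-ranked teams: the smallest ranking numbers
highest_ranked <- toy_rank %>%
    slice_min(`FIFA Ranking`, n = 2)

print(lowest_ranked)
print(highest_ranked)
```

    Both functions return their rows already sorted by the ranking column, and both keep tied rows by default.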