Switzerland Trade - Cleaning and Structuring

1. Data Source and Definitions

Data Source

UN Comtrade Database (https://comtradeplus.un.org/) is a rich source of information about world trade. There is data regarding chosen countries, trade flows, modes of transport and types of commodities during a selected period of time.

On the website, after creating an account, it is possible to download CSV files up to 100'000 records by choosing specific conditions of searching.

Definitions

I was interested in finding data about Switzerland's trading partners over the last 5 available years: 2020-2024.

Then, I limited the outcomes to the lowest level of aggregation of commodities (HS-2), which means wider and less detailed groups of products, but still detailed enough to draw specific insights.

HS, which stands for Harmonized System, are codes commonly used throughout the import and export process for the classification of goods (https://www.trade.gov/harmonized-system-hs-codes)

To clarify, this can be explained by an example:

HS-2 code no. 52: "Cotton"

HS-4 code no. 5204: "Cotton sewing thread, whether or not put up for retail sale"

HS-6 code no. 520411: "Cotton; sewing thread, containing 85% or more by weight of cotton, not put up for retail sale"

In order to understand physical flows, I also wanted to see modes of transport by which commodities arrive in and are sent from Switzerland.

DataFrameas

Example_of_HS_code

variable

SELECT
	id AS commodities_code,
	TRIM(SPLIT_PART(text,'- ',2)) AS commodities_name
FROM read_csv_auto('commodities.csv', delim=',', quote='"', escape='"')
WHERE id LIKE '52%'
LIMIT 11

2. First Questions and Ideas about Structuring the Data

Questions

First questions that popped out of my head while thinking about that dataset were:

What Swiss Economy consist of?

What are the types of most imported and exported goods?

With whom does Switzerland trade the most?

What transportation channels are the most popular?

How has it changed over the last 5 years?

Structure

The tables come with many columns, which brings a lot of information, but also causes some errors.

One of the problems was special characters in the names of the countries and in commodities' explanations or in their units.

To simplify the process of getting this information, I decided to create 2 dimension tables (name of countries and name of commodities with their codes) and 1 fact table (with the numbers for all the partners throughout 5 years).

It was possible to download the dimension tables separately, although they needed cleaning.

To create 1 fact table, I needed to download data for each year separately (as shown on the print-screen above), while it was the biggest chunk of data (ca. 60'000+ rows) which fitted in the limits. After that, all 5 tables could have been combined as they contained the same fields.

3. Data Cleaning and Structuring (SQL)

Fact Table

Because of the limited encoding options in DataLab, I first uploaded and reduced the tables by unnecessary columns in BigQuery for each Fact Table (each year) : ch_2020,ch_2021,ch_2022,ch_2023,ch_2024.

This allowed to get rid of problematic characters and simplify the tables.

One of the tables had a problem with a CSV export, as nulls were not correctly read.

A way out was to export the same table to XLSX format from the primary source (UN Comtrade), open it in Excel and then export it as CSV.

DataFrameas

query_in_bigquery

variable

-- query in BigQuery for each year

SELECT
  refYear,
  flowDesc,
  partnerCode,
  partnerISO,
  cmdCode,
  motCode,
  motDesc,
  primaryValue
FROM `[Project].trade_switzerland.2024`

After saving the reduced CSV files, I could work on them in Datalab.

First, I combined all of them and renamed the columns to connect easily with other dim tables:

DataFrameas

ch_2020_24

variable

--Combining and renaming

WITH ch_2020_24 AS (
	SELECT * FROM ch_2020.csv
	UNION ALL
	SELECT * FROM ch_2021.csv
	UNION ALL
	SELECT * FROM ch_2022.csv
	UNION ALL
	SELECT * FROM ch_2023.csv
	UNION ALL
	SELECT * FROM ch_2024.csv)

SELECT
  refYear AS trade_year,
  flowDesc AS trade_flow,
  partnerCode AS partner_code,
  partnerISO AS partner_iso,
  cmdCode AS commodities_code,
  motCode AS trans_code,
  motDesc AS trans_name,
  primaryValue AS trade_value
FROM ch_2020_24
	--without World aggregation rows and aggregated transportation ways
WHERE partnerISO <> 'W00' AND motCode <> 0
LIMIT 5;

Dimension tables

After receiving the Fact Table ready for analysis, I organised simple Dimension Tables with all the countries and codes, as well as all the HS-2 codes assigned to the group of commodities.

Dimension Table: countries

Checking the names of currently existing countries with special characters for smoother queries:

DataFrameas

Countries_with_special_characters

variable

SELECT *
FROM countries.csv
WHERE REGEXP_MATCHES(partnerDesc, '[^a-zA-Z ()\-.,\'']')
	AND entryExpiredDate IS NULL;

Creating table for further work with simplified names of a few countries (CASE), deleting the aggregated records (isGroup = 0) and selecting only currently existing countries (entryExpiredDate IS NULL):

DataFrameas

countries

variable

SELECT
	PartnerCode AS partner_code,
	PartnerCodeIsoAlpha3 AS partner_iso,
  CASE
    WHEN partnerDesc = 'Åland Islands ' THEN 'Aland Islands'
    WHEN partnerDesc = 'Côte d''Ivoire' THEN 'Ivory Coast'
    WHEN partnerDesc = 'Türkiye' THEN 'Turkiye'
    WHEN partnerDesc = 'Saint Barthélemy' THEN 'Saint Barthelemy'
    WHEN partnerDesc = 'Curaçao' THEN 'Curacao'
    ELSE partnerDesc END AS partner_name,
	partnerNote AS partner_note
FROM countries.csv
WHERE entryExpiredDate IS NULL
	AND isGroup = 0
LIMIT 5;

Dimension Table: commodities

Creating commodities table for further work with a reduced list of commodities to aggregation level HS2 (wider and less detailed groups of products), selecting only 2 columns (code and name):

Update: I needed to add the HS4 level for further analysis of a few groups of products in more detail.

DataFrameas

commodities

variable

SELECT
	id AS commodities_code,
	TRIM(SPLIT_PART(text,'- ',2)) AS commodities_name
FROM read_csv_auto('commodities.csv', delim=',', quote='"', escape='"')
WHERE aggrLevel = 2 OR aggrLevel = 4
LIMIT 5;

Appendix

It seems that biggest 'deficit-makers' are countries who sell precious metals and stones to Switzerland. As my level of detail is limited here, at this point I need to go deeper and download dataset which has more detail information regarding commodities_code = 71, which is also a group of goods with the biggest trading value.

For that I needed to download next 2 tables (because of website limits) with HS-4 numbers 7101-7118, and combine them:

‌
‌
‌

Switzerland Trade - Cleaning and Structuring

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}1. Data Source and Definitions

Data Source

Definitions

2. First Questions and Ideas about Structuring the Data

Questions

Structure

3. Data Cleaning and Structuring (SQL)

Fact Table

Dimension tables

Dimension Table: countries

Dimension Table: commodities

Appendix

1. Data Source and Definitions