1. Data Source and Definitions
Data Source
UN Comtrade Database (https://comtradeplus.un.org/) is a rich source of information about world trade. There is data regarding chosen countries, trade flows, modes of transport and types of commodities during a selected period of time.
On the website, after creating an account, it is possible to download CSV files up to 100'000 records by choosing specific conditions of searching.
Definitions
I was interested in finding data about Switzerland's trading partners over the last 5 available years: 2020-2024.
Then, I limited the outcomes to the lowest level of aggregation of commodities (HS-2), which means wider and less detailed groups of products, but still detailed enough to draw specific insights.
HS, which stands for Harmonized System, are codes commonly used throughout the import and export process for the classification of goods (https://www.trade.gov/harmonized-system-hs-codes)
To clarify, this can be explained by an example:
HS-2 code no. 52: "Cotton"
HS-4 code no. 5204: "Cotton sewing thread, whether or not put up for retail sale"
HS-6 code no. 520411: "Cotton; sewing thread, containing 85% or more by weight of cotton, not put up for retail sale"
In order to understand physical flows, I also wanted to see modes of transport by which commodities arrive in and are sent from Switzerland.
SELECT
id AS commodities_code,
TRIM(SPLIT_PART(text,'- ',2)) AS commodities_name
FROM read_csv_auto('commodities.csv', delim=',', quote='"', escape='"')
WHERE id LIKE '52%'
LIMIT 112. First Questions and Ideas about Structuring the Data
Questions
First questions that popped out of my head while thinking about that dataset were:
What Swiss Economy consist of?
What are the types of most imported and exported goods?
With whom does Switzerland trade the most?
What transportation channels are the most popular?
How has it changed over the last 5 years?
Structure
The tables come with many columns, which brings a lot of information, but also causes some errors.
One of the problems was special characters in the names of the countries and in commodities' explanations or in their units.
To simplify the process of getting this information, I decided to create 2 dimension tables (name of countries and name of commodities with their codes) and 1 fact table (with the numbers for all the partners throughout 5 years).
It was possible to download the dimension tables separately, although they needed cleaning.
To create 1 fact table, I needed to download data for each year separately (as shown on the print-screen above), while it was the biggest chunk of data (ca. 60'000+ rows) which fitted in the limits. After that, all 5 tables could have been combined as they contained the same fields.
3. Data Cleaning and Structuring (SQL)
Fact Table
Because of the limited encoding options in DataLab, I first uploaded and reduced the tables by unnecessary columns in BigQuery for each Fact Table (each year) : ch_2020,ch_2021,ch_2022,ch_2023,ch_2024.
This allowed to get rid of problematic characters and simplify the tables.
One of the tables had a problem with a CSV export, as nulls were not correctly read.
A way out was to export the same table to XLSX format from the primary source (UN Comtrade), open it in Excel and then export it as CSV.
-- query in BigQuery for each year
SELECT
refYear,
flowDesc,
partnerCode,
partnerISO,
cmdCode,
motCode,
motDesc,
primaryValue
FROM `[Project].trade_switzerland.2024`After saving the reduced CSV files, I could work on them in Datalab.
First, I combined all of them and renamed the columns to connect easily with other dim tables:
--Combining and renaming
WITH ch_2020_24 AS (
SELECT * FROM ch_2020.csv
UNION ALL
SELECT * FROM ch_2021.csv
UNION ALL
SELECT * FROM ch_2022.csv
UNION ALL
SELECT * FROM ch_2023.csv
UNION ALL
SELECT * FROM ch_2024.csv)
SELECT
refYear AS trade_year,
flowDesc AS trade_flow,
partnerCode AS partner_code,
partnerISO AS partner_iso,
cmdCode AS commodities_code,
motCode AS trans_code,
motDesc AS trans_name,
primaryValue AS trade_value
FROM ch_2020_24
--without World aggregation rows and aggregated transportation ways
WHERE partnerISO <> 'W00' AND motCode <> 0
LIMIT 5;Dimension tables
After receiving the Fact Table ready for analysis, I organised simple Dimension Tables with all the countries and codes, as well as all the HS-2 codes assigned to the group of commodities.
Dimension Table: countries
Checking the names of currently existing countries with special characters for smoother queries:
SELECT *
FROM countries.csv
WHERE REGEXP_MATCHES(partnerDesc, '[^a-zA-Z ()\-.,\'']')
AND entryExpiredDate IS NULL;Creating table for further work with simplified names of a few countries (CASE), deleting the aggregated records (isGroup = 0) and selecting only currently existing countries (entryExpiredDate IS NULL):
SELECT
PartnerCode AS partner_code,
PartnerCodeIsoAlpha3 AS partner_iso,
CASE
WHEN partnerDesc = 'Åland Islands ' THEN 'Aland Islands'
WHEN partnerDesc = 'Côte d''Ivoire' THEN 'Ivory Coast'
WHEN partnerDesc = 'Türkiye' THEN 'Turkiye'
WHEN partnerDesc = 'Saint Barthélemy' THEN 'Saint Barthelemy'
WHEN partnerDesc = 'Curaçao' THEN 'Curacao'
ELSE partnerDesc END AS partner_name,
partnerNote AS partner_note
FROM countries.csv
WHERE entryExpiredDate IS NULL
AND isGroup = 0
LIMIT 5;Dimension Table: commodities
Creating commodities table for further work with a reduced list of commodities to aggregation level HS2 (wider and less detailed groups of products), selecting only 2 columns (code and name):
Update: I needed to add the HS4 level for further analysis of a few groups of products in more detail.
SELECT
id AS commodities_code,
TRIM(SPLIT_PART(text,'- ',2)) AS commodities_name
FROM read_csv_auto('commodities.csv', delim=',', quote='"', escape='"')
WHERE aggrLevel = 2 OR aggrLevel = 4
LIMIT 5;Appendix
It seems that biggest 'deficit-makers' are countries who sell precious metals and stones to Switzerland. As my level of detail is limited here, at this point I need to go deeper and download dataset which has more detail information regarding commodities_code = 71, which is also a group of goods with the biggest trading value.
For that I needed to download next 2 tables (because of website limits) with HS-4 numbers 7101-7118, and combine them: