Saltar al contenido principal
InicioTutorialesIngeniería de datos

Snowflake Tutorial For Beginners: From Architecture to Running Databases

Learn the fundamentals of cloud data warehouse management using Snowflake. Snowflake is a cloud-based platform that offers significant benefits for companies wanting to extract as much insight from their data as quickly and efficiently as possible.
ene 2024  · 12 min leer

What is Snowflake?

If someone asked me to describe Snowflake in as few words as possible, I would choose these:

  • Data Warehouses
  • Large-Scale Data
  • Multi-Cloud
  • Separation
  • Scalable
  • Flexible
  • Simple

If they wanted me to elaborate, I would string together the words like this:

Snowflake is a massively popular cloud-based data warehouse management platform. It stands out from competitors due to its ability to handle large-scale data and workloads more rapidly and efficiently. Its superior performance comes from its unique architecture, which uses separate storage and compute layers, allowing it to be incredibly flexible and scalable. Additionally, it natively integrates with multiple cloud providers. Despite these advanced features, it remains simple to learn and implement.

If they request even more details, well, then I would write this tutorial. If you’re totally new to the subject, DataCamp’s Introduction to Snowflake course is an excellent place to start.

Why Use Snowflake?

Snowflake serves more than 8900 customers worldwide and processes 3.9 billion queries every day. That kind of usage statistics isn’t a coincidence by any means.

Below are the best benefits of Snowflake that have so much appeal:

1. Cloud-based architecture

Snowflake operates in the clouds, allowing companies to scale up and down resources based on demand without worrying about physical infrastructure (hardware). The platform also handles routine maintenance tasks such as software updates, hardware management, and performance tuning. This relieves the burden of maintenance overhead, allowing organizations to focus on what matters: deriving value from data.

2. Elasticity and scalability

Snowflake separates storage and compute layers, allowing users to scale their computing resources independently of their storage needs. This elasticity enables efficient handling of diverse workloads with optimal performance and without unnecessary costs.

3. Concurrency and performance

Snowflake easily handles high concurrency: multiple users can access and query the data without performance loss.

4. Data sharing

Snowflake’s security safeguards enable data sharing across other organizations, internal departments, external partners, customers, or other stakeholders. No need for complex data transfers.

5. Time travel

Snowflake uses a fancy term “Time Travel” for data versioning. Whenever a change is made to the database, Snowflake takes a snapshot. This allows users to access historical data at various points in time.

6. Cost efficiency

Snowflake offers a pay-as-you-go model due to its ability to scale resources dynamically. You will only pay for what you use.

All these benefits combined make Snowflake a highly desirable data warehouse management tool.

Now, let’s take a look at the underlying architecture of Snowflake that unlocks these features.

Become a Data Engineer

Become a data engineer through advanced Python learning
Start Learning for Free

What is a Data Warehouse?

Before we dive into Snowflake’s architecture, let’s review data warehouses to ensure we are all on the same page.

A data warehouse is a centralized repository that stores large amounts of structured and organized data from various sources for a company. Different personas (employees) in organizations use the data within to derive different insights.

For example, data analysts, in collaboration with the marketing team, may run an A/B test for a new marketing campaign using the sales table. HR specialists may query the employee information to track performance.

These are some of the examples of how companies globally use data warehouses to drive growth. But without proper implementation and management using tools like Snowflake, data warehouses remain as elaborate concepts.

You can learn more about the subject with our Data Warehousing course.

Snowflake Architecture

Snowflake’s unique architecture, designed for faster analytical queries, comes from its separation of the storage and compute layers. This distinction contributes to the benefits we’ve mentioned earlier.

Storage layer

In Snowflake, the storage layer is a critical component, storing data in an efficient and scalable manner. Here are some key features of this layer:

  1. Cloud-based: Snowflake seamlessly integrates with major cloud providers such as AWS, GCP, and Microsoft Azure.
  2. Columnar format: Snowflake stores data in a columnar format, optimized for analytical queries. Unlike the traditional row-based formats used by tools like Postgres, the columnar format is well-suited for data aggregation. In columnar storage, queries access only the specific columns they need, making it more efficient. On the other hand, row-based formats require accessing all rows in memory for simple operations like calculating averages.
  3. Micro-partitioning: Snowflake uses a technique called micro-partitioning that stores tables in memory in small chunks. Each chunk is typically immutable and only a few megabytes in size, which makes query optimization and execution much faster.
  4. Zero-copy cloning: Snowflake has a unique feature that allows it to create virtual clones of data. Cloning is instantaneous and doesn’t consume additional memory until changes are made to the new copy.
  5. Scale and elasticity: The storage layer scales horizontally, which means it can handle increasing data volumes by adding more servers to distribute the load. Also, this scaling happens independently of compute resources, which is ideal when you desire to store large volumes of data but analyze only a small fraction.

Now, let’s look at the compute layer.

Compute layer

As the name suggests, the compute layer is the engine that executes your queries. It works in conjunction with the storage layer to process the data and perform various computational tasks. Below are some more details about how this layer operates:

  1. Virtual warehouses: You can think of Virtual Warehouses as teams of computers (compute nodes) designed to handle query processing. Each member of the team handles a different part of the query, making execution impressively fast and parallel. Snowflake offers Virtual Warehouses in different sizes, and subsequently, at different prices (the sizes include XS, S, M, L, XL).
  2. Multi-cluster, multi-node architecture: The compute layer uses multiple clusters with multiple nodes for high concurrency, allowing several users to access and query the data simultaneously.
  3. Automatic query optimization: Snowflake’s system analyzes all queries and identifies patterns to optimize using historical data. Common optimizations include pruning unnecessary data, using metadata, and choosing the most efficient execution path.
  4. Results cache: The compute layer includes a cache that stores the results of frequently executed queries. When the same query is run again, the results are returned almost instantaneously.

These design principles of the compute layer all contribute to Snowflake’s ability to handle different and demanding workloads in the cloud.

Cloud services layer

The final layer is cloud services. As this layer integrates into every component of Snowflake’s architecture, there are many details on its operation. In addition to the features related to other layers, it has the following additional responsibilities:

  1. Security and access control: This layer enforces security measures, including authentication, authorization, and encryption. Administrators use Role-Based Access Control (RBAC) to define and manage user roles and permissions.
  2. Data sharing: This layer implements secure data sharing protocols across different accounts and even third-party organizations. Data consumers can access the data without the need for data movement, promoting collaboration and data monetization.
  3. Semi-structured data support: Another unique benefit of Snowflake is its ability to handle semi-structured data, such as JSON and Parquet, despite being a data warehouse management platform. It can easily query semi-structured data and integrate the results with existing tables. This flexibility is not seen in other RDBMS tools.

Now that we have a high-level picture of Snowflake’s architecture, let’s write some SQL on the platform.

Setting Up SnowflakeSQL

Snowflake has its own version of SQL called SnowflakeSQL. The difference between it and other SQL dialects is akin to the difference between English accents.

Many of the analytical queries you perform in dialects like PostgreSQL don’t change, but there are some discrepancies in DDL (Data Definition Language) commands.

Snowflake provides two interfaces to run SnowSQL:

  • Snowsight: A web interface for interacting with the platform.
  • SnowSQL: A CLI (Command Line Interface) client for managing and querying databases.

We will see how to set up both and run some queries!

Snowsight: Web interface

image3.png

To begin with Snowsight, navigate to the Snowflake homepage and select “Start for free.” Input your personal information and select any listed cloud provider. The choice doesn’t really matter, as the free trial includes $400 worth of credits for any of the options (you won’t be required to set up the cloud credentials yourself).

After verifying your email, you’ll be redirected to the Worksheets page. Worksheets are interactive, live-coding environments where you can write, execute, and view the results of your SQL queries.

image8.png

To run some queries, we need a database and a table (we won’t be using the sample data in Snowsight). The GIF below shows how you can create a new database named “test_db” and a table named “diamonds” using a local CSV file. You can download the CSV file by running the code in this GitHub gist in your terminal.

image6.gif

In the GIF, Snowsight informs us that there is a problem with one of the column names. Since the word “table” is a reserved keyword, I wrapped it in double quotes.

Afterward, you will be directed to a new worksheet where you can run any SQL query you want. As shown in the GIF, the worksheet interface is quite straightforward and highly functional. Take a few minutes to familiarize yourself with the panels, the buttons, and their respective locations.

SnowSQL: CLI

Nothing matches the excitement of managing and querying a full-fledged database from your terminal. That’s why SnowSQL exists!

However, to get it up and running, there are a few steps we need to follow, which is a typically slower process than getting started with Snowsight.

As a first step, download the SnowSQL installer from the Snowflake Developers Download page. Download the relevant file. As I am using WSL2, I will be choosing a Linux version:

image1.gif

In the terminal, I download the file using the copied link and execute it with bash:

$ curl -O https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.31-linux_x86_64.bash
$ bash snowsql-1.2.31-linux_x86_64.bash

For other platforms, you can follow the installation steps from this page of Snowflake docs.

Once installed successfully, you should get the following message:

image4.png

Note: On Unix-like systems, it’s important to make sure the snowsql command is available in all terminal sessions. To achieve this, you should add the /home/username/bin directory to your $PATH variable. You can do this by appending the following line to your .bashrc, .bash_profile, or .zshrc files: export PATH=/home/yourusername/bin:$PATH. Remember to replace yourusername with your actual username.

The message is prompting us to configure the account settings to connect to Snowflake. There are two ways to do this:

  1. Pass the account details interactively in the terminal.
  2. Configure the credentials in a global Snowflake configuration file.

Since it is more permanent and secure, we will proceed with the second option. For platform-specific instructions, read the Connecting through SnowSQL page of the docs. The instructions below are for Unix-like systems.

First of all, go to your email address and find the Welcome email from Snowflake. It contains your account name inside the login link: account-name.snowflakecomputing.com. Copy it.

image9.png

Next, open the ~/.snowsql/config file with a text editor such as VIM or VSCode. Under the connections section, uncomment the following three fields:

  • Account name
  • Username
  • Password

Replace the default values with the account name you copied and the username and password you provided during sign up. After you’ve done that, save and close the file.

Then, return to your terminal and enter snowsql. The client should automatically connect and provide you with an SQL editor that includes features such as code highlighting and tab completion. Here's what it should look like:

image5.png

Connecting to an existing database in SnowSQL

Right now, we aren’t connected to any databases. Let’s fix that by connecting to the test_db database we've created with Snowsight. First, check available databases with SHOW DATABASES:

$ SHOW DATABASES
$ USE DATABASE TEST_DB

Next, specify that you will be using the test_db database (case-insensitive) from now on. Then, you can run any SQL query on the tables of the connected database.

$ SELECT COUNT(*) FROM DIAMONDS

image2.png

Creating a new database and table in SnowSQL

If you are part of a large organization, there might be cases where the responsibility of creating a database and populating it with existing data rests on your shoulders. To practice for that scenario, let’s try uploading the Diamonds dataset as a table in SnowSQL inside a new database. Here are the steps you can follow:

1. Create a new database:

CREATE DATABASE IF NOT EXISTS new_db;

2. Use the database:

USE DATABASE new_db;

3. Create a file format for CSV:

CREATE OR REPLACE FILE FORMAT my_csv_format -- Can be named anything
 TYPE = CSV
 FIELD_DELIMITER = ','
 SKIP_HEADER = 1;  -- Assuming the first row is a header

We must manually define a file format and name it because Snowflake cannot infer the schema and structure of data files such as CSV, JSON, or XMLs. The file format we defined above is suitable for the diamonds.csv file that we have (it is comma-separated and includes a header).

4. Create an internal stage:

CREATE OR REPLACE STAGE my_local_files;

A stage in Snowflake is a storage area where you can upload your local files. These can be structured and semi-structured data files. Above, we are creating a stage named my_local_files.

5. Put the CSV file into the stage:

PUT file:///home/bexgboost/diamonds.csv @my_local_files;

6. Create the table:

CREATE TABLE diamonds (
 carat FLOAT,
 cut VARCHAR(255),
 color VARCHAR(255),
 clarity VARCHAR(255),
 depth FLOAT,
 table FLOAT,
 price INTEGER,
 x FLOAT,
 y FLOAT,
 z FLOAT
);

7. Load data from the stage into the table:

COPY INTO diamonds
 FROM @my_local_files/diamonds.csv
 FILE_FORMAT = my_csv_format;

8. Verify:

SELECT COUNT(*) FROM diamonds;

These steps will create a new Snowflake database, define a CSV file format, create a stage to store local files, upload a CSV file to the stage, create a new table, load the CSV data into the table, and finally verify the operation by counting the number of rows in the table.

If the result returns the row count, congratulations, you’ve successfully created a database and loaded local data into it with SnowSQL. Now, you can query the table any way you like.

Conclusion and further learning

Whew! We started off with some simple concepts, but towards the end, we really dove into the gnarly details. Well, that’s my idea of a decent tutorial.

You’ve probably guessed that there is much more to Snowflake than what we’ve covered. In fact, the Snowflake documentation includes quickstart guides that are actually 128 minutes long! But before you tackle those, I recommend getting your hands wet with some other resources. How about these:

Thank you for reading!

Get certified in your dream Data Engineer role

Our certification programs help you stand out and prove your skills are job-ready to potential employers.

Get Your Certification
Timeline mobile.png

Photo of Bex Tuychiev
Author
Bex Tuychiev
LinkedIn

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastıc style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the makıng. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn. 

Temas

Start Your Database Journey Today!

Course

Introduction to Snowflake

3 hr
16.1K
This course will take you from Snowflake's foundational architecture to mastering advanced SnowSQL techniques.
See DetailsRight Arrow
Start Course
Ver másRight Arrow
Relacionado

podcast

[AI and the Modern Data Stack] Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake

Richie and Sridhar explore Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, the challenges of enterprise search, management and the role of semantic layers in the effective use of AI, a look into Snowflakes products including Snowpilot and Cortex, advice for organizations looking to improve their data management, and much more.
Richie Cotton's photo

Richie Cotton

45 min

tutorial

Snowflake Snowpark: A Comprehensive Introduction

Take the first steps to master in-database machine learning using Snowflake Snowpark.
Bex Tuychiev's photo

Bex Tuychiev

19 min

tutorial

Snowflake vs AWS: Choosing the Right Cloud Data Warehouse Solution

Discover why Snowflake and AWS are the top cloud data warehouses. Compare their unique features, limitations, and pricing to find the best fit for your needs.
Gus Frazer's photo

Gus Frazer

13 min

tutorial

Snowflake Arctic Tutorial: Getting Started With Snowflake's LLM

Snowflake Arctic is a family of enterprise-grade language models designed to simplify the integration and deployment of AI within the Snowflake Data Cloud.
Zoumana Keita 's photo

Zoumana Keita

18 min

tutorial

Using Snowflake Time Travel: A Comprehensive Guide

Discover how to leverage Snowflake Time Travel for querying history, cloning tables, and restoring data with our in-depth guide on database recovery.
Bex Tuychiev's photo

Bex Tuychiev

9 min

code-along

Getting Started with Data Analysis in Snowflake using Python and SQL

In this code-along session, you will learn how to use Snowpark Python and SQL to perform data analysis in the Snowflake Data Cloud.
Vino Duraisamy's photo

Vino Duraisamy

See MoreSee More