How to Use dbt with Snowflake: A Comprehensive Guide

Learn how to integrate dbt with Snowflake for modern data workflows. Covers setup, transformations, security, CI/CD, and performance optimization.
Aug 26, 2025  · 15 min read

As modern data teams adopt increasingly complex analytical stacks, standard practices have emerged within data engineering workflows. dbt (Data Build Tool) is an open-source framework that brings software engineering best practices, such as version control, testing, and documentation, to analytics workflows.

The combination of dbt and Snowflake (a cloud-native data warehouse) has emerged as a cornerstone of analytics engineering workflows. This guide explores the foundational concepts, setup, and advanced implementation strategies for integrating dbt with Snowflake, helping data teams achieve more maintainable, secure, and performant data pipelines.

If you’re just starting out with these tools, I recommend you take our Snowflake Foundations skill track and Introduction to dbt course.

dbt-Snowflake: Foundational Concepts and Benefits

Let’s talk about how dbt and Snowflake work together. This section introduces dbt’s capabilities and highlights its compatibility with Snowflake and the benefits of integrating the two.

What is dbt (Data Build Tool)?

dbt (Data Build Tool) is an open-source framework that enables data teams to transform, test, and document data directly in their warehouse using SQL enhanced with Jinja templating. 

dbt is not a programming language itself; it acts as a compiler, taking modular SQL models and converting them into executable SQL queries that run in the warehouse (e.g., Snowflake). 

By incorporating version control, automated testing, and documentation, dbt brings proven software engineering practices to analytics engineering, making transformation pipelines more reliable, maintainable, and scalable as part of an ELT workflow.

Its core components include a compiler and a runner. Here are their functions:

  • Compiler: dbt users write SQL models enhanced with Jinja templating, which dbt compiles into raw SQL.
  • Runner: dbt executes the compiled SQL against the data warehouse, such as Snowflake, to get results.
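To make this concrete, here is a minimal sketch of a model and roughly what dbt compiles it into (model and table names are hypothetical):

-- models/stg_orders.sql: SQL enhanced with Jinja
select
    order_id,
    order_date,
    amount
from {{ ref('raw_orders') }}

-- When we invoke dbt run, the ref() call is compiled into a fully
-- qualified object name and executed in Snowflake, roughly as:
select
    order_id,
    order_date,
    amount
from analytics_db.transformations.raw_orders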

For dbt to function, a few underlying components help everything flow:

  • Models: The core select statements saved as .sql files.
  • Seeds: .csv files that dbt can load into your data warehouse as a reference for your models.
  • Sources: Raw data already in your warehouse that your models pull from.
  • Data tests: Tests, defined in YAML or as .sql files, that check your data complies with your rules.
  • Documentation: dbt can generate a documentation website that showcases your model lineage.

For more details on dbt, I recommend this excellent introduction to dbt course.

Overview of dbt and Snowflake integration

Using dbt’s Jinja-based templating combined with Snowflake features like Streams, we can get very creative with our transformation workflows.

For instance, maybe we want to transform only newly loaded data, or apply different transformations in our staging tables versus our prod tables so we can perform data health checks.
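As a rough sketch of the first scenario (object names are hypothetical), a Snowflake Stream can capture newly loaded rows so a transformation only touches what changed:

-- Track new rows landing in a raw table
CREATE STREAM raw_orders_stream ON TABLE raw.orders;

-- A downstream transformation can read only the inserted rows
SELECT order_id, amount
FROM raw_orders_stream
WHERE METADATA$ACTION = 'INSERT';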

With more companies adopting dbt, the two are quickly becoming a common pairing that gives data engineers more control and flexibility over their data pipelines.

Benefits of Using dbt

There are many benefits to dbt: a traditional software development methodology, automated testing, and productivity enhancements.

Code-driven development

We can treat our SQL more like traditional code, gaining modularity, testability, and version control. By taking advantage of Jinja templates and their Pythonic flow-control statements, we unlock some powerful functionality in dbt.

Automated testing

Using dbt’s built-in data_tests configuration, we can regularly run tests against our data without convoluted SQL pipelines. In fact, dbt already comes with some powerful baseline tests that we can implement easily in our configuration files.

Productivity enhancements

dbt includes features like a documentation generator and lineage visualization. New team members can get onboarded more quickly, and existing members gain a deeper understanding of data flows. This reduces ramp-up time and improves everyone’s overall understanding of the workflow.

Benefits of using Snowflake

There are many benefits to using Snowflake’s cloud-based platform. Its architecture separates storage and compute resources to minimize resource contention and control costs. It also supports zero-copy cloning, which allows users to create clones of databases without consuming additional storage.

Plus, it supports features such as dynamic tables and Streams, which enable incremental refresh logic. For more details on Snowflake as a platform, check out this article explaining how Snowflake functions.

Most importantly, these features are highly complementary to dbt. We can take advantage of incremental refresh to help dbt trigger transformations based on new data. We can also take advantage of Snowflake’s separation of compute and storage, which allows dbt to perform more computationally intense transformations without impacting the ability to load and store data.

All of this comes with Snowflake’s ability to scale rapidly and its pay-per-use cost model. As the complexity of our dbt scripts increases, we can easily size up our Snowflake compute warehouses to meet those demands. At the same time, we can use Snowflake’s monitoring to right-size our warehouses and use dbt to streamline our query performance.

Key synergies between dbt and Snowflake

When combined, Snowflake and dbt play off each other very well. Snowflake’s continual tuning helps keep cloud costs low, and dbt models can take advantage of partitioning methods like clustering keys, as well as Snowflake’s caching, to reduce data-scanning overhead.

On its own, Snowflake does not lend itself to being managed with tools like Git. Since dbt is built on the same principles as other codebases, we can use Git-based workflows, including CI/CD pipelines, to manage our development process more easily. This provides version control for our pipelines, which would be challenging to achieve entirely within Snowflake.

To help developers collaborate more easily, dbt also has a UI which showcases all your dbt jobs in one place. This helps teams track processes together and make sure they are running appropriately. Also, dbt has recently introduced a Studio IDE which allows for easy Git and database integration with testing workflows. Some teams may even choose to use Snowflake’s UI for full integration.

Many data teams have started utilizing dbt and Snowflake together in order to get the most out of their databases. Continue to learn more about dbt in this hands-on introduction to dbt for data engineers and get a sense for how Snowflake data modeling works with this course.

From Snowflake Quickstarts: Accelerating with dbt Core & Snowflake

dbt Snowflake Implementation and Configuration

Let’s go over some basics of how to configure Snowflake and dbt for proper integration. We will also cover some basics on project structure design best practices.

Prerequisites and setup

We’ll begin with the Snowflake database setup, then go over dbt setup.

Snowflake prerequisites

You will need a dedicated database, schema, and virtual warehouse that will store your dbt data. On top of that, you’ll want a specific dbt user role that dbt can access. When setting up dbt, you’ll provide it with the credentials of that user so that it can communicate with Snowflake. In this section, I’ll give you a few basic guidelines, but for more details on how to manage Snowflake databases and user roles, please go through this Introduction to Snowflake course.

  1. Create a dedicated database, schema, and virtual warehouse
	CREATE DATABASE analytics_db;
	CREATE SCHEMA analytics_db.transformations;
	CREATE WAREHOUSE dbt_wh
	  WITH WAREHOUSE_SIZE = 'XSMALL'
	  AUTO_SUSPEND = 60
	  AUTO_RESUME = TRUE
	  INITIALLY_SUSPENDED = TRUE;
	
  2. Create a dedicated dbt role and user. This is critical for allowing dbt access to your Snowflake account
	CREATE ROLE dbt_role;
	GRANT USAGE ON WAREHOUSE dbt_wh TO ROLE dbt_role;
	GRANT USAGE ON DATABASE analytics_db TO ROLE dbt_role;
	GRANT USAGE, CREATE SCHEMA ON DATABASE analytics_db TO ROLE dbt_role;
	GRANT ALL PRIVILEGES ON SCHEMA analytics_db.transformations TO ROLE dbt_role;

	CREATE USER dbt_user PASSWORD='StrongPassword123'
	  DEFAULT_ROLE = dbt_role
	  DEFAULT_WAREHOUSE = dbt_wh
	  DEFAULT_NAMESPACE = analytics_db.transformations;
	GRANT ROLE dbt_role TO USER dbt_user;
	
  3. Lastly, find the account identifier for your Snowflake account; we’ll need it later when setting up dbt. It can be found under your settings and account details in Snowflake. Please refer to the Snowflake documentation for more details.
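If you prefer SQL to clicking through the UI, Snowflake’s context functions can help. Note that the exact identifier format dbt expects depends on your account setup (for example, organization-account name versus locator plus region):

	-- Returns the account locator and region, which can be combined
	-- into the account identifier used in profiles.yml
	SELECT CURRENT_ACCOUNT(), CURRENT_REGION();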

dbt configuration

In order to run dbt, we’ll need to install it using pip and then configure our profiles.yml file.

  1. Install dbt on your Python environment. You will need both dbt-core and dbt-snowflake packages
pip install dbt-core dbt-snowflake
  2. Configure the profiles.yml file and save it to ~/.dbt/profiles.yml
snowflake-db:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: [account_id from earlier]
      user: [dbt_user]
      password: [dbt_pw]
      role: [dbt_role]
      warehouse: [dbt_warehouse]
      database: [dbt_database]
      schema: [dbt_schema]
      threads: 4
      client_session_keep_alive: False
      query_tag: [anything]
  3. Initialize a dbt project
dbt init --profile snowflake-db
  4. (Alternate initialization) If you’d rather not pre-configure the YAML file, you can instead go through dbt’s interactive initialization with the following command.
dbt init snowflake-profile-name

This will go through a step-by-step initialization that generates a profiles.yml file if needed!

  5. To make sure your connection works, we can use the debug command:
dbt debug --connection

This command will check to see if your connection settings are valid and will throw an error if they are not.

Something to consider: instead of having your user and password directly in the profiles.yml file, try using environment variables. For instance, the dbt password could be saved to a SNOWFLAKE_PASSWORD environment variable. Then, in profiles.yml, you would write password: "{{ env_var('SNOWFLAKE_PASSWORD') }}". This keeps credentials out of the configuration file and adds a little extra security.
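For example, the relevant lines of profiles.yml might look like the following, assuming SNOWFLAKE_USER and SNOWFLAKE_PASSWORD are exported in the environment where dbt runs:

# profiles.yml excerpt: credentials pulled from environment variables
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"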

Version control setup

Connecting to Git helps you get the most out of your dbt configuration. The exact steps may vary if you are doing this as part of your enterprise, but the general premise is straightforward. Create a repository on a platform like GitHub or GitLab, then link your dbt account to that GitHub/GitLab account. Once that is done, you can git clone your repository! For more information on connecting Git to dbt, look through dbt’s Git documentation for your particular platform.

Project structure design

A consistent and templated project structure helps new projects get going more quickly by providing fixed places where everything should go. 

Organizing models

Structure your project into staging, intermediate, and marts folders. Within each, you can create subfolders for different departments, each with its own models and SQL.

	models/
	│   ├── intermediate
	│   │   └── finance
	│   │       ├── _int_finance__models.yml
	│   │       └── int_payments_pivoted_to_orders.sql
	│   ├── marts
	│   │   ├── finance
	│   │   │   ├── _finance__models.yml
	│   │   │   ├── orders.sql
	│   │   │   └── payments.sql
	│   │   └── marketing
	│   │       ├── _marketing__models.yml
	│   │       └── customers.sql
	│   ├── staging
	│   │   ├── jaffle_shop
	│   │   │   ├── _jaffle_shop__docs.md
	│   │   │   ├── _jaffle_shop__models.yml
	│   │   │   ├── _jaffle_shop__sources.yml
	│   │   │   ├── base
	│   │   │   │   ├── base_jaffle_shop__customers.sql
	│   │   │   │   └── base_jaffle_shop__deleted_customers.sql
	│   │   │   ├── stg_jaffle_shop__customers.sql
	│   │   │   └── stg_jaffle_shop__orders.sql
	│   │   └── stripe
	│   │       ├── _stripe__models.yml
	│   │       ├── _stripe__sources.yml
	│   │       └── stg_stripe__payments.sql
	│   └── utilities
	│       └── all_dates.sql

Each folder within the models has a purpose:

  • Staging: Cleaned, renamed raw data.
  • Intermediate: Logic-heavy joins or transformations.
  • Marts: Final business-ready tables.

For each of the staging, intermediate, and marts models’ configuration files, you may want to set the +materialized configuration to define how the SQL output is created in Snowflake. We can choose views, tables, or incremental models:

  • View: Default, lightweight.
  • Table: Cached but uses more storage.
  • Incremental: For large datasets with frequent appends.

That may look something like this:

# The following dbt_project.yml configures a project that looks like this:
# .
# └── models
#     ├── csvs
#     │   ├── employees.sql
#     │   └── goals.sql
#     └── events
#         ├── stg_event_log.sql
#         └── stg_event_sessions.sql

name: my_project
version: 1.0.0
config-version: 2

models:
  my_project:
    events:
      # materialize all models in models/events as tables
      +materialized: table
    csvs:
      # this is redundant, and does not need to be set
      +materialized: view

Advanced Optimization and Macros

One huge advantage of integrating dbt and Snowflake is the fact that we get optimizations from both platforms. For Snowflake, we can use different ways of handling queries to make them run faster. For dbt, we can optimize our coding with macros and materialization techniques.

Performance tuning techniques

There are some unique Snowflake features we can take advantage of to improve our query performance. 

Query optimization

Two big ways we optimize in Snowflake are predicate pushdown and clustering keys.

Predicate pushdown is the idea of applying filters as early as possible in query execution so Snowflake reads less data. For instance, imagine we were querying our data with two layers of filtering: one security filter and one categorical filter.

The security filter defines who can see certain parts of our table based on security roles. The categorical filter sits in the WHERE clause of the query. Snowflake will choose which filter to apply first, picking the order that requires reading the least amount of data.

To help this predicate pushdown, we might consider clustering keys. While Snowflake does a decent job at clustering tables, when data hits multiple TB, it gets challenging for Snowflake to maintain sensible partitions. 

By defining clustering keys, we can help Snowflake partition the data in a way that matches how we usually query the table. For instance, maybe we select two columns that always go together, like dates and property types. Each property type might contain a small enough set of dates that repartitioning stays sensible and data pruning stays effective.

Combining clustering keys with Snowflake’s predicate pushdown optimization can make navigating large tables much more efficient.
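As a sketch, defining a clustering key on the date and property-type columns from that scenario is a single statement (table and column names are hypothetical):

-- Cluster a large table on the columns we filter by most often
ALTER TABLE analytics_db.transformations.listings
  CLUSTER BY (listing_date, property_type);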

Warehouse scaling

Managing resource usage in a pay-per-use system like Snowflake is important! You want to keep to a few simple principles:

  • Use Snowflake’s auto-suspend to suspend warehouses when your pipelines are offline for an extended period of time
  • When you access the warehouse after a length of time, auto-resume will allow you to restart the warehouse as needed
  • Make sure to use XS or S warehouses for development and scale up as needed for production

These simple warehouse adjustments will make your Snowflake usage more efficient in the long-run.
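For example, scaling the warehouse we created earlier up for a heavy production run and back down afterward is a one-line change each way:

-- Temporarily scale up for a heavy dbt run
ALTER WAREHOUSE dbt_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Scale back down once the run is finished
ALTER WAREHOUSE dbt_wh SET WAREHOUSE_SIZE = 'XSMALL';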

Dynamic Tables vs incremental models

On the dbt side, we can leverage dynamic tables and incremental models to simplify update logic. 

Dynamic Tables are a Snowflake table type that refreshes automatically based on a defined target lag. dbt can materialize dynamic tables as part of its configuration and set them up much as you would in Snowflake.

models:
  <resource-path>:
    +materialized: dynamic_table
    +on_configuration_change: apply | continue | fail
    +target_lag: downstream | <time-delta>
    +snowflake_warehouse: <warehouse-name>
    +refresh_mode: AUTO | FULL | INCREMENTAL
    +initialize: ON_CREATE | ON_SCHEDULE

This is convenient for having simple data transfer pipelines that do not require complex SQL. The best part is that we can connect to these dynamic tables and have fresh data as needed.
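A concrete model using this materialization might look something like the sketch below; the model name, warehouse, and lag are hypothetical, so check the dbt-snowflake documentation for the options your version supports:

-- models/orders_live.sql: materialized as a Snowflake dynamic table
{{
    config(
        materialized='dynamic_table',
        target_lag='30 minutes',
        snowflake_warehouse='dbt_wh'
    )
}}

select order_id, customer_id, amount
from {{ ref('stg_orders') }}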

Another flexible way to build tables is the incremental model. When writing our SQL, we filter on a column that tells us which data is new. Then we use the is_incremental() macro so dbt only applies that filter on incremental runs of a model configured with the incremental materialization.

Your sample SQL may look something like this, from the dbt documentation:

{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)

from {{ ref('app_data_events') }}

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records whose timestamp occurred since the last run of this model)
  -- (If event_time is NULL or the table is truncated, the condition will always be true and load all records)
where event_time >= (select coalesce(max(event_time),'1900-01-01') from {{ this }} )

{% endif %}

Reusable logic with macros

I was just alluding to something called a “macro”. Macros are reusable snippets of code in dbt, such as calculations that get recycled or transformations that occur frequently.

Snowflake-specific macros

We create macros using Jinja templating. A common use case may be to clone databases prior to working on them.

	{% macro clone_tables(table_to_clone) -%}

	-- list all tables within the provided schema
	    {% set all_tables_query %}
	        show tables in schema {{ table_to_clone }}
	    {% endset %}

	-- only run the statements during execution (not while parsing)
	    {% if execute %}
	        {% set results = run_query(all_tables_query) %}

	-- create (or replace) the target schema for the clones
	        {% do run_query("create or replace schema " ~ generate_schema_name(var("custom_tables_list"))) %}

	-- clone each table into the new schema
	        {% for result_row in results %}
	            {{ log("create table " ~ generate_schema_name(var("custom_tables_list")) ~ "." ~ result_row[1] ~ " clone " ~ table_to_clone ~ "." ~ result_row[1]) }}
	            {% do run_query("create table " ~ generate_schema_name(var("custom_tables_list")) ~ "." ~ result_row[1] ~ " clone " ~ table_to_clone ~ "." ~ result_row[1]) %}
	        {% endfor %}
	    {% endif %}
	{%- endmacro %}

This template takes a provided schema, generates a list of its tables, and then clones them into the schema generated from your dbt config. You can then call this macro from your dbt code or run it directly with dbt run-operation.

To get more practice with Jinja in dbt, I really recommend this case study on building e-commerce data models with dbt. It walks through some complex Jinja templating with dbt and gets you some pretty advanced hands-on experience!

Masking PII with dbt

As we work with customer information, we may encounter personally identifiable information (PII) that needs to be hidden in our database. This is simple to do with Snowflake! We can use CREATE MASKING POLICY to tell Snowflake which columns and data types need to be masked when queries are run. These policies ensure no PII is leaked while querying sensitive information.

We can then use dbt tests with custom SQL that checks that tables are masking properly. For instance, if we define a masking policy that only allows the ANALYST role to see PII, like so:

	CREATE OR REPLACE MASKING POLICY email_mask AS (val string) returns string ->
	  CASE
	    WHEN current_role() IN ('ANALYST') THEN VAL
	    ELSE '*********'
	  END;
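Creating the policy alone does nothing until it is attached to a column. A minimal example, assuming a hypothetical customers table produced by one of our models:

	-- Attach the masking policy to the email column
	ALTER TABLE analytics_db.transformations.customers
	  MODIFY COLUMN email SET MASKING POLICY email_mask;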

Then, querying the data with our dbt user’s role (which is not ANALYST) should return the masked string '*********'.

Our test can then be something simple like:

	SELECT *
	FROM schema.table
	WHERE email <> '*********'

If this test returns results, then it should be considered a failure. No emails should come back as anything other than the masked string. We should then check the masking policy on Snowflake and keep testing until it is working properly.

Security and Governance

As always, security is a key pillar of cloud access. We’ll go over some basics for security and data protection in this section.

Access control framework

Snowflake’s main security framework is role-based access control (RBAC), supported by the use of service accounts.

Role-Based Access Control (RBAC)

Defining RBAC privileges allows us to apply security settings to a broad group of users without having to assign them to each user individually, which makes security easier to maintain.

As noted above, we create roles using the CREATE ROLE statement. To give permissions we use GRANT <privilege> … TO ROLE and likewise remove them with REVOKE <privilege> … FROM ROLE.

Here are some best practices:

  • Use least-privilege roles (sketched below). For instance, only give read access to a role like dbt_reader, table creation access to dbt_writer, and delete access only to an admin role like dbt_admin.
  • Consider creating roles that have access to only certain environments like dev, staging, and prod.
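A rough sketch of what those least-privilege roles could look like (role and object names are hypothetical):

-- Hypothetical least-privilege roles for a dbt deployment
CREATE ROLE dbt_reader;
CREATE ROLE dbt_writer;

-- Readers can only query existing objects
GRANT USAGE ON DATABASE analytics_db TO ROLE dbt_reader;
GRANT USAGE ON SCHEMA analytics_db.transformations TO ROLE dbt_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.transformations TO ROLE dbt_reader;

-- Writers inherit read access and can also create tables
GRANT ROLE dbt_reader TO ROLE dbt_writer;
GRANT CREATE TABLE ON SCHEMA analytics_db.transformations TO ROLE dbt_writer;

-- Privileges that are no longer needed can be revoked
REVOKE CREATE TABLE ON SCHEMA analytics_db.transformations FROM ROLE dbt_writer;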

Service account

Service accounts are special machine accounts that operate without human interaction. Rather than relying on usernames and passwords typed by a person, they authenticate with public keys and are locked down with network restrictions to minimize risk.

Creating a service account might look like this:

CREATE USER dbt_service
  RSA_PUBLIC_KEY = '<public_key>';

This way, you have a single account that is managed by admins, and you no longer have to worry about individual users. It is recommended to rotate the keys regularly to limit the damage if one is ever leaked. Also, keep a close eye on login activity using Snowflake’s LOGIN_HISTORY.
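Key rotation can be staged without downtime by adding a second key before removing the first, and recent logins can be reviewed with the LOGIN_HISTORY_BY_USER table function; a sketch, with the key value as a placeholder:

-- Stage the new key alongside the old one, then remove the old key
ALTER USER dbt_service SET RSA_PUBLIC_KEY_2 = '<new_public_key>';
ALTER USER dbt_service UNSET RSA_PUBLIC_KEY;

-- Review recent authentication attempts for the service account
SELECT event_timestamp, client_ip, is_success
FROM TABLE(information_schema.login_history_by_user(USER_NAME => 'DBT_SERVICE'))
ORDER BY event_timestamp DESC;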

Data protection

Protect your data and logins. Although we hardcoded credentials in the earlier examples, that is poor practice in reality. Use secret managers like AWS Secrets Manager or GitHub Secrets to hold important credentials.

These let you store credentials so they are only accessible to identities with the proper keys. Nobody sees the actual value, and rotating passwords and keys is easy with their automated management.

Make sure any connections are properly secured and users are on the appropriate enterprise network to access data. This is often established by using VPNs like GlobalProtect. Since we are connecting to Snowflake, a lot of the end-to-end encryption is handled by the cloud. Just another benefit of being on Snowflake! 

Finally, we want to audit our activity. Snowflake logs queries through views and table functions like QUERY_HISTORY and QUERY_HISTORY_BY_USER. We can monitor usage, errors, and any security abuse in these logs. Through dbt, we can track new runs, changes, or edits to projects by using our version control pipelines.
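For example, a quick audit of what the dbt user has been running over the past week might look like this (the ACCOUNT_USAGE views lag real time slightly):

-- Recent queries issued by the dbt user, slowest first
SELECT query_text, warehouse_name, total_elapsed_time, error_message
FROM snowflake.account_usage.query_history
WHERE user_name = 'DBT_USER'
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 50;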

Version control and coding templates help maintain change tracking and compliance. Any issues introduced by changes can be easily noted during mandatory pull requests and addressed before they become a concern.

Monitoring and Cost Management

Let’s talk about how we can save some money! In dbt, of course. Monitoring our model performance is vital. It’s easy for models that used to run without issue to get slower and less performant as table size increases.

Performance monitoring

With dbt, the best way to monitor performance is to use built-in packages like dbt-snowflake-monitoring or external tools like Looker and Datafold to help with visualization and data comparisons.

dbt-snowflake-monitoring usage

Getting started with dbt-snowflake-monitoring is pretty easy! In your dbt packages.yml add the following:

packages:
  - package: get-select/dbt_snowflake_monitoring
    version: [">=5.0.0", "<6.0.0"] # We'll never make a breaking change without creating a new major version.

Then, in each project’s YAML file, add the following:

dispatch:
  - macro_namespace: dbt
    search_order:
      - <YOUR_PROJECT_NAME>
      - dbt_snowflake_monitoring
      - dbt
query-comment:
  comment: '{{ dbt_snowflake_monitoring.get_query_comment(node) }}'
  append: true # Snowflake removes prefixed comments.

Now you’re ready to use the package for monitoring things like your warehouse usage, query durations, and failed models! It uses your dbt user’s existing access to write tables containing useful information to Snowflake. For example, if I wanted to see my monthly warehouse spend, I could use the following SQL:

select
    date_trunc(month, date)::date as month,
    warehouse_name,
    sum(spend_net_cloud_services) as spend
from daily_spend
where service in ('Compute', 'Cloud Services')
group by 1, 2

External tooling

External tools like Looker or Tableau can be integrated directly with dbt for easy visualization of data and performance. For alerts, dbt can send emails or Slack notifications.

For deeper comparisons, external stacks like Datafold provide deeper insight into how data quality is impacted by changes. Datafold automates data diffs with every code change you make. This lets other team members easily review your code and the impact it will have on data. This sort of automated regression testing allows us to maintain quality data without having to constantly build new checks and tools.

Cost optimization techniques

We can’t avoid running our ETL processes, but we can plan ahead to minimize the cost impact of computation and storage. Thanks to Snowflake’s split design, we can optimize our warehouses and storage separately for a more flexible, cost-optimized environment.

For warehouses, the best approach is to minimize idle time. Spread out your automated pipelines as much as possible to avoid spikes in computational need and competition for resources.

Monitor regularly using the WAREHOUSE_LOAD_HISTORY and QUERY_HISTORY tables to understand how resources are being used. The best practice is to start with XS. If queries are running too slowly, then it is time to size up based on your team’s SLA.
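A sketch of a load check against the ACCOUNT_USAGE view; sustained queuing is the usual sign that a warehouse is undersized:

-- Average running vs. queued queries per warehouse over the last week
SELECT warehouse_name,
       AVG(avg_running) AS avg_running,
       AVG(avg_queued_load) AS avg_queued
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name;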

For storage, the best practice is to get rid of unnecessary tables. While Snowflake itself doesn’t offer cold storage tiers the way AWS and Google Cloud do, you can move rarely used data out of Snowflake into those cold tiers to reduce storage cost.

Data retention policies can clearly outline a timeline for tables that are rarely used and when they will be moved to cold storage. 

For large tables that grow frequently, monitor them closely. As they get larger, Snowflake’s ability to automatically and efficiently partition breaks down. Take advantage of the automatic clustering mentioned above to help Snowflake better partition your data. Better partitions lead to less time reading storage and thus saving cost.
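Two small levers along these lines, sketched below: shortening Time Travel retention on a large table and unloading a cold table to an external stage before dropping it (the table and stage names are hypothetical):

-- Shorten Time Travel retention on a large, rarely restored table
ALTER TABLE analytics_db.transformations.event_log
  SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- Unload a cold table to external cloud storage before dropping it
COPY INTO @cold_storage_stage/event_log/
FROM analytics_db.transformations.event_log
FILE_FORMAT = (TYPE = PARQUET);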

CI/CD and Automation

Thanks to dbt’s native integration with version control, CI/CD and automated pipelines are a breeze. We’ll cover some automated orchestration options and testing best practices.

Pipeline orchestration

With dbt Projects on Snowflake, we can use Snowflake Tasks to trigger scheduled runs of our dbt projects. The way it works is that we first create a dbt project object in Snowflake from our Git repository, and then we create a task that executes that project.

For instance, we could do something like the following to create a dbt project:

CREATE DBT PROJECT sales_db.dbt_projects_schema.sales_model
  FROM '@sales_db.integrations_schema.sales_dbt_git_stage/branches/main'
  COMMENT = 'generates sales data models';

Then we use the following code to run that project:

CREATE OR ALTER TASK sales_db.dbt_projects_schema.run_dbt_project
  WAREHOUSE = sales_warehouse
  SCHEDULE = '6 hours'
AS
  EXECUTE DBT PROJECT sales_db.dbt_projects_schema.sales_model args='run --target prod';

In fact, it’s so comprehensive, we can even run tests afterward to make sure the project ran properly:

CREATE OR ALTER TASK sales_db.dbt_projects_schema.test_dbt_project
        WAREHOUSE = sales_warehouse
        AFTER run_dbt_project
AS
  EXECUTE DBT PROJECT sales_db.dbt_projects_schema.sales_model args='test --target prod';

For scheduling, we have the option to either set a fixed interval like 60 MINUTES or 12 HOURS, or use a cron expression like USING CRON 0 9 * * * UTC.

Our models rarely stay static, however, and we often have to iterate on them. It can be time-consuming to manually build and test every piece of our pipeline. This is where GitHub Actions comes in: we often use GitHub Actions in CI/CD pipelines to automate the testing and building of code.

GitHub Actions workflows are defined in YAML files inside a .github/workflows folder in the repository. We can make them trigger any time we open a pull request, so that our dbt tests run automatically.

name: dbt pull test

# Run dbt tests whenever a pull request is opened or reopened
on:
  pull_request:
    types:
      - opened
      - reopened

# Environment variables used by profiles.yml
# These should come from your repository's Snowflake secrets
env:
  DBT_PROFILE_TARGET: prod
  DBT_PROFILE_USER: ${{ secrets.DBT_PROFILE_USER }}
  DBT_PROFILE_PASSWORD: ${{ secrets.DBT_PROFILE_PASSWORD }}

jobs:
  dbt_run:
    name: dbt testing on pull request
    runs-on: ubuntu-latest
    timeout-minutes: 90

    # Steps of the workflow:
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Setup Python environment
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt

      - name: Install dbt packages
        run: dbt deps
        # optionally set the main directory of the dbt project:
        # working-directory: ./my_dbt_project

      - name: Run tests
        run: dbt test
        # working-directory: ./my_dbt_project

Setting up your automations properly can make your life a lot easier and help other developers see your tested models. This streamlines the development process by reducing the need for manual testing.

Testing framework

dbt comes with a powerful testing suite. It ships with a few out-of-the-box generic data tests that can be set directly on your models and also supports custom SQL logic. The generic tests are set on particular columns in your model and check whether those columns pass. The out-of-the-box tests are the following:

  • unique: every value in this column should be unique
  • not_null: there should be no null values in the column
  • accepted_values: checks if the values are in a list of accepted values
  • relationships: checks if the values in this column exist in another related table

Here is an example of using these tests (from the dbt documentation on tests):

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          # makes sure this column is unique and has no nulls
          - unique
          - not_null
      - name: status
        data_tests:
          # makes sure this column only has the values below
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        data_tests:
          # makes sure the customer ids in this column exist as ids in the customers table
          - relationships:
              to: ref('customers')
              field: id

We can use custom SQL to create more specific tests, as well as our own generic tests. The goal is for the test query to return failing rows, so if any results appear, the test is considered failed. These custom SQL files live in our tests directory.

For example, if we wanted to know if any sales were negative, we could use the following:

	select
	    order_id,
	    sum(amount) as total_amount
	from {{ ref('sales') }}
	group by 1
	having total_amount < 0

We would then reference that test in our schema.yml file inside our tests directory:

version: 2
data_tests:
  - name: assert_sales_amount_is_positive
    description: >
      Sales should always be positive and are not inclusive of refunds

If we wanted to use this kind of check on more tables and columns, we may want to turn this into a generic test. That might look like this:

{% test negative_values(model, group, sum_column) %}

    select {{group}},
    sum({{sum_column}}) as total_amt
    from {{ model }}
    group by 1
    having total_amt < 0
{% endtest %}

We can now use this test in our model YAML files just like the other generic tests dbt comes with! You could check for negative passengers on a trip, negative profit, and so on. The possibilities are endless.
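For instance, the hypothetical sales model from above could call the new generic test from its YAML file like this:

models:
  - name: sales
    data_tests:
      # dbt passes the model automatically; the other arguments come from here
      - negative_values:
          group: order_id
          sum_column: amount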

To make this testing more seamless and simple, we can take advantage of Snowflake’s zero-copy cloning. Snowflake’s CREATE TABLE <object> CLONE <source_object> can create a zero-copy clone in your dev/test environment. 

Then, all your testing and CI can be done against the cloned object. The benefit of this is that we do not require additional storage space to have a safe testing environment. Plus, whenever we need more testing space, we can create more testing environments with this instant cloning.
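A CI clone might be created and torn down like this (database names are hypothetical):

-- Instant, zero-copy clone of production for CI runs
CREATE DATABASE analytics_ci CLONE analytics_db;

-- Drop it when the pipeline finishes
DROP DATABASE analytics_ci;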

Let’s now look at the future improvements and trends coming to dbt and Snowflake.

Native integration developments

dbt is constantly improving its integration both with its core product and with AI. In Snowflake, we are able to use Snowsight as a UI interface to help manage our dbt workspaces. By integrating more seamlessly into our Snowflake environment, we minimize the need for working on dbt models externally. Instead, we can centralize our workflows by allowing teams to edit and collaborate through Snowflake’s UI.

To help speed up development, dbt offers an AI-powered assistant called dbt Copilot. This assistant can be accessed through the dbt Studio IDE, Canvas, or Insights and uses natural language processing to help speed up the more time-consuming parts of model development, such as writing documentation and creating tests. 

The dbt Copilot can even be used to help onboard new analysts by providing them with concise summaries of projects and models.

Use both of these tools to speed up your time to production and shorten development cycles, freeing your team to tackle the harder and more challenging data problems in your organization.

Machine Learning Integration

Machine learning pipelines require consistently clean and well-designed data. dbt is the perfect tool for the data transformation pipelines that feed machine learning models. We can leverage the flexibility and power of dbt to automate data transformations that integrate seamlessly into our machine learning pipelines.

For instance, we can transform our data straight into a machine learning warehouse. This data can then be used with Snowflake’s Cortex AI to garner deeper insights. If we have multiple models that require similar transformations, dbt’s templating can simplify the way we clean our data and provide more consistency and easier data governance. To get a better sense of Snowflake’s AI tools, read this guide on Snowflake Arctic, which is Snowflake’s LLM.
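As a rough sketch (table and column names are hypothetical), a dbt-built table can be queried directly with a Cortex function:

-- Score the sentiment of customer reviews cleaned by a dbt model
SELECT review_id,
       SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score
FROM analytics_db.transformations.stg_reviews;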

Containerized execution

Thanks to Snowflake’s native Python support and the Snowpark API for Python, we can write code that runs against our data without having to move it out of Snowflake. The next level of that is using Snowpark Containers, which are (as of writing) only available in AWS and Azure regions, to run completely isolated executions of our dbt code.

To do so, we package our dbt models and environment into a container, which then lives in Snowflake’s image repository.

We can then use this containerized dbt project easily within Snowpark. The main benefit? The ability to more easily use Snowflake’s resources for complex Python-based data transformations and to integrate dbt transformations within the larger Snowflake environment.

If you’re curious to learn more about Snowpark, check out this information that describes Snowflake Snowpark in detail.

Conclusion

Integrating dbt with Snowflake empowers data teams to build modular, governed, and scalable transformation pipelines. With features like automated testing, Git-driven CI/CD workflows, and seamless scalability, this stack is ideal for modern data operations. 

To stay competitive, data teams should prioritize:

  • Better data governance for maintaining data life cycles
  • Enhanced cost visibility and constant performance monitoring to rightsize warehouses
  • Containerization for reproducibility and unifying workflows for analytics and ML

As both tools continue to develop, the dbt + Snowflake integration will only become more powerful for data engineering teams. If you have an interest in learning more about dbt or Snowflake, check out the courses and tutorials linked throughout this guide.

dbt Snowflake FAQs

What are materializations in dbt, and how do they affect Snowflake?

Materializations define how dbt builds models in Snowflake: as views, tables, incremental tables, or dynamic tables. The choice affects how much storage and compute each model consumes in Snowflake.

How do dynamic tables in Snowflake differ from dbt’s incremental models?

dbt incremental models are managed and triggered by dbt runs, while Snowflake dynamic tables refresh automatically based on a defined target lag and upstream changes. Use incremental models for flexibility and orchestration control; use dynamic tables for near real-time data freshness.

What are the CI/CD options for dbt and Snowflake?

dbt supports CI/CD using GitHub Actions, GitLab CI, or Bitbucket Pipelines. Combine version control with automated testing and dbt runs on merge or pull requests for a robust workflow.

What future trends should I watch in the dbt + Snowflake ecosystem?

Key developments include:

  • Snowsight-native dbt editing,
  • AI-powered assistants (e.g., Ask dbt),
  • Deeper Snowpark and ML integration,
  • Data mesh support via modular projects,
  • Dynamic tables for near real-time data ops.

Author: Tim Lu

I am a data scientist with experience in spatial analysis, machine learning, and data pipelines. I have worked with GCP, Hadoop, Hive, Snowflake, Airflow, and other data science/engineering processes.
