CREATE TABLE in Databricks: A Full Guide to Building and Managing Tables
Databricks tables are a core component of the Databricks Lakehouse platform, designed to handle large-scale data storage and processing. These tables enable collaboration across teams by unifying structured and unstructured data while supporting analytical workloads.
In this article, I will guide you through the different techniques for creating tables in Databricks using the CREATE TABLE command, covering several syntax variations to accommodate different use cases. I will also offer tips on best practices for building and managing tables in Databricks for efficiency and reliability.
As we get started, I highly recommend checking out DataCamp’s How to Learn Databricks: A Beginner’s Guide to the Unified Data Platform blog post to help you understand the core features and applications of Databricks and provide a structured path to start your learning. I also recommend taking our Introduction to Databricks course. Even if you have experience in SQL, you will see that things can look different in the Databricks Lakehouse platform. The interactive exercises in our course will guide you through SQL queries as well as platform features and external system connections.
Introduction to Databricks Tables
To start, we should say Databricks tables are structured datasets that facilitate data organization and access within the Databricks Lakehouse environment. There are two types:
- Managed Tables: These tables are fully controlled by Databricks, which manages both the storage and metadata. When a managed table is created, Databricks takes care of the underlying data files, and if the table is deleted, the corresponding data is also removed.
- External Tables: External tables reference data that resides outside of Databricks, typically in cloud storage systems like AWS S3 or Azure Blob Storage. While Databricks manages the metadata for these tables, the actual data remains intact even if the table is dropped. These are also known as unmanaged tables.
Tables in Databricks default to Delta Lake because of its advanced features and optimizations. Delta Lake provides ACID transactions, scalability, performance enhancements such as caching and indexing, and unified batch and streaming data processing.
As a note, early versions of Azure Databricks used legacy “Hive format” tables, which modern Delta tables have since replaced. Delta tables surpass legacy Hive tables in reliability and performance, making them the default choice for Databricks workflows.
Here is a summary that I think could help you see the difference between the two:
Feature | Hive format table | Delta table |
---|---|---|
Data format | Apache Hive (CSV, Parquet, ORC, etc.) | Delta Lake (based on Parquet with metadata) |
Transactions | No ACID support (eventual consistency) | Full ACID transactions |
Schema management | Limited enforcement | Strict schema enforcement |
Performance optimization | Manual tuning required | Built-in optimizations (caching, compaction) |
Time travel | No built-in support for historical queries | Supports time travel for querying previous versions of data (see the example after this table) |
Concurrency | Prone to conflicts during updates or deletes | Handles concurrent operations efficiently |
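To illustrate the time travel feature listed above, here is what querying an earlier version of a Delta table might look like. This is a minimal sketch assuming a Delta table named sales_data with at least one earlier version:
-- List the table's version history
DESCRIBE HISTORY sales_data;
-- Query the table as of a specific version
SELECT * FROM sales_data VERSION AS OF 1;
-- Or as of a point in time
SELECT * FROM sales_data TIMESTAMP AS OF '2024-01-01';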
CREATE TABLE Syntax Variations
The CREATE TABLE command in Databricks supports multiple variations to accommodate different cases. Each variation has its purpose, with targeted options like file location, cloning, and metadata management. If you need to refresh your knowledge about Databricks SQL, I recommend reading our Databricks SQL tutorial to learn, among other things, how to use a Notebook in a Databricks SQL warehouse.
Below are the main syntax styles with examples and explanations.
CREATE TABLE [USING]
This syntax creates a new table with a specified schema and data format. The following query creates a table named sales_data by reading data from Parquet files stored at a specific location.
-- Create a table from files in a specific format
CREATE TABLE sales_data
USING PARQUET
LOCATION '/mnt/data/sales';
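The USING clause accepts other formats as well. As a quick sketch with a hypothetical table name and path, the same pattern works for CSV files, with reader options passed through OPTIONS:
-- Create a table over CSV files with an explicit schema and reader options
CREATE TABLE raw_customers (
  customer_id INT,
  customer_name STRING
)
USING CSV
OPTIONS (header 'true', delimiter ',')
LOCATION '/mnt/data/customers_csv';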
CREATE TABLE (Hive format)
This legacy syntax is similar to the standard SQL CREATE TABLE command and is used when defining a table whose metadata is managed by Databricks without referencing external files. The query below creates a Hive-style table named hive_table with a predefined schema and additional metadata properties.
-- Hive-style table creation with schema definition
CREATE TABLE hive_table (
id INT,
name STRING
)
COMMENT 'Hive format example'
TBLPROPERTIES ('created_by' = 'databricks');
In the above query, the COMMENT clause adds descriptive metadata, while TBLPROPERTIES stores custom properties about the table.
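You can confirm this metadata after creation, for example:
-- View the table's custom properties
SHOW TBLPROPERTIES hive_table;
-- View the full table definition, including the comment
DESCRIBE TABLE EXTENDED hive_table;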
CREATE TABLE LIKE
This command creates a new table using the schema and properties of an existing table. You should use this syntax when you want to replicate the structure of an existing table without copying its data. This is useful for testing or development purposes.
-- Create a table with the same schema as another table
CREATE TABLE new_table LIKE existing_table;
CREATE TABLE CLONE
This syntax allows you to create a new table by cloning an existing Delta Lake table, either as a deep or a shallow clone. Use the CREATE TABLE CLONE syntax when you need an independent copy of a table's data and metadata (deep clone) or just the metadata while referencing the original data (shallow clone). This is useful for versioning or creating backups.
The following example shows the queries for creating deep clone and shallow clone tables in Databricks.
-- Deep clone by default
CREATE TABLE cloned_table CLONE original_table;
-- Shallow clone
CREATE TABLE shallow_clone_table SHALLOW CLONE original_table;
CREATE OR REPLACE TABLE
This command replaces an existing table with a new one in a single operation, keeping the same table name. Use this syntax when you want to completely overwrite an existing table while preserving its name. This is particularly useful for updating the structure or data of a table without needing to drop it first.
The example query below creates or replaces the sales table with a new schema.
-- Create or replace a sales table with a new schema
CREATE OR REPLACE TABLE sales (
sale_id INT,
amount FLOAT
)
USING DELTA
LOCATION '/mnt/delta/sales';
There are other relevant clauses you can use with the CREATE TABLE syntax in Databricks to manage your tables according to specific requirements. Some of these clauses already appeared in the examples above; the example after this list combines several of them.
- IF NOT EXISTS: Avoids errors by skipping table creation if the table already exists.
- USING: Specifies the file format (e.g., Delta, Parquet, CSV).
- EXTERNAL: Declares a table whose data resides outside Databricks' managed storage.
- PARTITIONED BY: Defines partitions for optimizing large datasets.
- LOCATION: Points to the directory containing the table data.
- TBLPROPERTIES: Adds metadata for custom table configurations and optimizations.
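As a quick illustration, the sketch below (the table name, path, and property are hypothetical) combines several of these clauses in a single statement:
-- Hypothetical example combining IF NOT EXISTS, USING, PARTITIONED BY, LOCATION, and TBLPROPERTIES
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  event_date DATE,
  country STRING
)
USING DELTA
PARTITIONED BY (country)
LOCATION '/mnt/data/events'
TBLPROPERTIES ('owner' = 'data_team');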
Step-by-Step Demonstrations
In this section, I will demonstrate different ways to create and manage tables in Databricks, with concise examples and explanations of key clauses. I recommend checking our Databricks Tutorial: 7 Must-know Concepts For Any Data Specialist tutorial to learn how to run SQL commands in the Databricks platform.
How to create an empty table
To create an empty Delta table, define the table schema with column names and data types. Use the USING DELTA clause to define the table format and COMMENT to add descriptive metadata.
This query creates an empty Delta table named employees with a predefined schema, including columns for id, name, department, and hire_date.
-- Create an empty Delta table with a defined schema
CREATE TABLE employees (
id BIGINT,
name STRING,
department STRING,
hire_date DATE
)
USING DELTA
COMMENT 'Table to store employee information';
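If you want to confirm the table was created with the expected (empty) schema, you can inspect it, for example:
-- Check the schema of the new table
DESCRIBE TABLE employees;
-- The table exists but contains no rows yet
SELECT COUNT(*) AS row_count FROM employees;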
How to create a table from existing data
To create a table from existing data, link the table to external data files in formats like CSV, Parquet, or JSON. You can use the LOCATION clause to link the table to the file path and the USING clause to specify the data format.
For example, the following query creates a table named sales_data by referencing Parquet files stored at /mnt/data/sales_data/.
-- Create a table from Parquet files in a specified location
CREATE TABLE sales_data
USING PARQUET
LOCATION '/mnt/data/sales_data/';
How to create a table as select (CTAS)
You can also create a table as select by executing a query and storing the result. For this scenario, the AS SELECT clause populates the table with the query result.
For example, the query below creates a new table named high_value_sales by selecting and transforming data from the existing sales_data table. It filters rows where total_amount is greater than 1000, keeping only the sale_id, customer_id, and total_amount columns.
-- Create a table by selecting and transforming data from another table
CREATE TABLE high_value_sales AS
SELECT
sale_id,
customer_id,
total_amount
FROM sales_data
WHERE total_amount > 1000;
How to create an external table
Creating an external table involves referencing data stored outside of Databricks' managed storage. You should use the LOCATION clause to specify the external data location.
The query below creates an external Delta table named external_table by referencing data stored at /mnt/external_storage/external_table/.
-- Create an external table referencing data outside Databricks storage
CREATE TABLE external_table
USING DELTA
LOCATION '/mnt/external_storage/external_table/';
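Because the table is external, dropping it removes only the metadata registered in Databricks; the underlying files remain and can be registered again later:
-- Dropping an external table removes only its metadata; the files remain
DROP TABLE external_table;
-- The same files can be registered again as a table
CREATE TABLE external_table
USING DELTA
LOCATION '/mnt/external_storage/external_table/';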
How to create a table LIKE
To create a table LIKE, duplicate the schema of an existing table without copying its data. In the example below, the query creates a new table named sales_backup with the same schema as an existing sales table.
-- Create a new sales table with the same schema as an existing table
CREATE TABLE sales_backup LIKE sales;
How to create a table CLONE
You can create a table clone by using the CLONE clause to create a shallow or deep copy of a Delta table for backups or testing.
In the following example, the query creates a new table, sales_shallow_clone, copying only the metadata (schema and table properties) while referencing the original data files.
-- Shallow clone: copies only metadata (references original data)
CREATE TABLE sales_shallow_clone SHALLOW CLONE sales_data;
The example below creates a new table, sales_deep_clone, copying both the metadata and the actual data. Deep clone is the default behavior, so the DEEP keyword is optional.
-- Deep clone: copies both metadata and actual data
CREATE TABLE sales_deep_clone DEEP CLONE sales_data;
How to create a temporary table (temp table)
Databricks implements session-scoped temporary objects as temporary views rather than temporary tables, so use the TEMPORARY keyword with CREATE VIEW to ensure the object is session-specific.
The query below creates a temporary view named temp_table that exists only for the current session. It selects id, name, and sales_amount from sales_data, filtering for rows where region = 'US'.
-- Create a temporary view that exists only in the session
CREATE TEMPORARY VIEW temp_table AS
SELECT
id,
name,
sales_amount
FROM sales_data
WHERE region = 'US';
How to replace a table
You can use the CREATE OR REPLACE clause to replace a table's definition and content. The query below replaces the existing table updated_table with a new schema and data. It defines the table with three columns (id, name, and total_sales).
-- Replace an existing table with a new schema or data
CREATE OR REPLACE TABLE updated_table (
id BIGINT,
name STRING,
total_sales DOUBLE
)
USING DELTA
LOCATION '/mnt/delta/updated_table/';
Advanced Topics
In this section, I will explore advanced Databricks table features that enhance functionality, performance, and security.
Generated columns and identity columns
Databricks supports identity columns, which automatically generate unique values for each new row. This feature is useful for creating surrogate keys in data warehousing.
The query below creates a table employees with three columns: id, name, and department. The id column is an identity column that auto-increments with each new row.
-- Identity column: auto-incremented value
CREATE TABLE employees (
id BIGINT GENERATED ALWAYS AS IDENTITY, -- Auto-incremented column
name STRING,
department STRING
);
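When inserting into this table, omit the id column and Databricks generates its values automatically. A minimal, hypothetical insert:
-- id values are generated automatically for each new row
INSERT INTO employees (name, department) VALUES
  ('Alice', 'Finance'),
  ('Bob', 'Engineering');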
This other query creates a table sales with three columns: sale_id, sale_date, and sale_year. The sale_year column is a generated column, automatically calculated from the sale_date column using the YEAR(sale_date) function.
-- Generated column: derived from other columns
CREATE TABLE sales (
sale_id BIGINT,
sale_date DATE,
sale_year INT GENERATED ALWAYS AS (YEAR(sale_date)) -- Derived column
);
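Likewise, you supply only sale_id and sale_date when inserting, and sale_year is computed for you. A hypothetical example:
-- sale_year is derived from sale_date at write time
INSERT INTO sales (sale_id, sale_date) VALUES (1, DATE'2024-06-15');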
Partitioning and clustering
Partitioning divides data into smaller chunks, improving query performance, while clustering organizes data within partitions for faster reads.
For example, the query below creates a partitioned table where data is split into partitions based on the region column, improving query performance when filtering by region.
-- Partitioned table
CREATE TABLE partitioned_table (
id BIGINT,
region STRING,
sales_amount DOUBLE
)
USING DELTA
PARTITIONED BY (region); -- Querying by region will be faster
The query below optimizes the partitioned table by clustering the data within each partition based on the sales_amount column, improving the data layout so that queries filtering on sales_amount read less data. Note that the ZORDER BY clause clusters data within partitions, not across them.
-- Clustered table
OPTIMIZE partitioned_table
ZORDER BY (sales_amount); -- Clusters data within partitions for efficient reads
Column masking and row filters
Column masking and row filters restrict data visibility based on user roles or conditions, protecting sensitive information.
For example, the query below creates a view named masked_table based on the employees table. The salary column is redacted (replaced with REDACTED) for all users except members of the managers group, checked here with the is_account_group_member() function. The CASE expression ensures that only managers can see the actual salary, while other users see the masked value.
-- Column masking: redact sensitive data through a view
CREATE OR REPLACE VIEW masked_table AS
SELECT
id,
name,
CASE
WHEN is_account_group_member('managers') THEN CAST(salary AS STRING)
ELSE 'REDACTED'
END AS salary
FROM employees;
Similarly, the query below applies a row filter by creating or replacing a view named filtered_table based on the sales table. The view only includes rows where region = 'US', restricting access to sales data from the US.
-- Row filters: Fine-grained access control
CREATE OR REPLACE VIEW filtered_table AS
SELECT *
FROM sales
WHERE region = 'US'; -- Limit data visibility to a specific region
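If your workspace uses Unity Catalog, masks and row filters can also be attached directly to tables rather than implemented through views. The sketch below is an illustration under those assumptions: an employees table with a salary column, a sales table with a region column, and an account group named managers.
-- Define a masking function and attach it to the salary column
CREATE FUNCTION salary_mask(salary DOUBLE)
RETURNS DOUBLE
RETURN CASE WHEN is_account_group_member('managers') THEN salary ELSE NULL END;
ALTER TABLE employees ALTER COLUMN salary SET MASK salary_mask;
-- Define a row filter function and attach it to the sales table
CREATE FUNCTION us_only(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('managers') OR region = 'US';
ALTER TABLE sales SET ROW FILTER us_only ON (region);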
Schema evolution
Databricks supports schema evolution through CREATE OR REPLACE TABLE and table alterations. Use the CREATE OR REPLACE TABLE statement to overwrite the schema and data, and the ALTER TABLE statement to modify the schema incrementally without affecting existing data.
The query below replaces the existing employees table with a new schema, adding a new hire_date column while keeping the data stored in Delta format.
-- Schema evolution with CREATE OR REPLACE TABLE
CREATE OR REPLACE TABLE employees (
id BIGINT,
name STRING,
department STRING,
hire_date DATE -- New column added
)
USING DELTA;
This other query modifies the employees table by adding a new salary column to the existing schema.
-- Altering a table schema
ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
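Delta can also merge new columns from incoming data during writes. As a sketch, and depending on your environment (some SQL warehouses restrict Spark configuration changes), the session-level setting below enables that behavior for operations like INSERT and MERGE:
-- Allow Delta to add new columns found in incoming data (session-level setting)
SET spark.databricks.delta.schema.autoMerge.enabled = true;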
Performance tuning
To optimize performance for large-scale tables, you can utilize various table properties and techniques such as auto-optimization and Z-ordering.
In the following example, the query enables auto-optimization on the delta_table by setting properties that write better-sized files during writes and automatically compact small files afterward, improving storage efficiency and read performance.
-- Enable automatic optimizations
ALTER TABLE delta_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Write fewer, larger files during writes
'delta.autoOptimize.autoCompact' = 'true' -- Compact small files after writes
);
Similarly, the query below optimizes the delta_table by clustering data based on region and sales_date to enhance query performance, especially for filters on these columns.
-- Perform clustering for faster query performance
OPTIMIZE delta_table
ZORDER BY (region, sales_date); -- Improves query efficiency for frequent filters
Best practices and considerations
Here are some things to consider when creating tables in Databricks:
- Table naming conventions: Always use clear and consistent names to improve discoverability and reduce errors. Avoid special characters or excessively long names to prevent compatibility issues.
- Security and permissions: Utilize Unity Catalog for centralized access control across all data assets. Use Access Control Lists (ACLs) to restrict who can view or modify data. Also, mask sensitive data with column masking or row filters.
- Backup and recovery: Rely on Delta Lake's version history for easy recovery and time travel (see the example after this list). Regularly monitor and manage the retention of historical versions to balance storage costs and recovery needs.
- Managed vs. external tables: It is also important to understand managed versus external tables. In managed tables, Databricks handles both metadata and data. This is ideal for scenarios where you want full control over data lifecycle management. However, in external tables, the referenced data is stored outside of Databricks. This is useful for maintaining control over data that may be shared across different platforms or systems.
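For the backup and recovery point above, Delta's version history can be inspected and a table rolled back to a previous version. A minimal sketch, assuming a Delta table named sales_data:
-- Review the table's change history
DESCRIBE HISTORY sales_data;
-- Roll the table back to an earlier version
RESTORE TABLE sales_data TO VERSION AS OF 3;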
Common Pitfalls and Troubleshooting
Now that we have discussed the best practices for managing tables in Databricks, let me cover the frequent issues you may encounter with Databricks tables, their typical error messages, and their solutions.
Non-empty location errors when using CREATE vs. CREATE OR REPLACE
When using CREATE TABLE with a specified location (the LOCATION clause), if the target directory is not empty, you may encounter an error like:
Error: Cannot create table because the location is not empty.
This error occurs because the CREATE TABLE command expects an empty directory at the specified location, and existing files or metadata conflict with this requirement.
To solve this error, use the CREATE OR REPLACE TABLE statement to overwrite the existing data and metadata, or manually clean up the directory before running the CREATE TABLE statement.
CREATE OR REPLACE TABLE delta_table
USING DELTA
LOCATION '/mnt/data/delta_table'; -- Overwrites table and files
Schema mismatch or issues with certain data types
Inserting or querying data results in schema mismatch errors such as:
Error: Schema mismatch: Expected column type STRING, but found INT.
The error is typically caused by a mismatch between the table schema and the incoming data types, or by unsupported or ambiguous data type conversions (e.g., complex nested types). To solve it, define the schema explicitly when creating the table; Delta tables enforce the declared schema on write by default and reject data that does not match.
-- Define the schema explicitly; Delta enforces it on write by default
CREATE TABLE schema_table (
id INT,
name STRING
)
USING DELTA;
Also, you should allow schema evolution when adding new columns or altering data types.
ALTER TABLE schema_table ADD COLUMNS (new_column DOUBLE);
Using explicit type casting when inserting data should also help troubleshoot this type of error.
INSERT INTO schema_table SELECT CAST(id AS INT), CAST(name AS STRING) FROM source_table;
Concurrent transaction limitations with identity columns
When using identity columns, you may encounter issues related to concurrent transactions, such as:
Error: Concurrent transactions detected on identity column table
Identity columns are stateful and can encounter conflicts when multiple transactions attempt to insert data concurrently. To work around this error, you can replace identity columns with surrogate keys generated using functions like uuid() or monotonically_increasing_id().
CREATE TABLE better_table AS
SELECT
monotonically_increasing_id() AS id, -- Replaces identity column
name, department
FROM source_data;
You can also use Delta Lake's MERGE statement for complex insert and update workflows without relying on identity columns.
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Conclusion
In Databricks, the CREATE TABLE command is a versatile tool for defining data structures, linking to external data, and deriving new datasets through queries. It is important to understand the different syntax variations for creating and managing tables so you can match them to your needs. As you have seen, there are several table-creation methods to choose from.
If you want to explore the foundational concepts of Databricks, I highly recommend taking DataCamp’s Introduction to Databricks course to learn about Databricks as a data warehousing solution for Business Intelligence (BI). I also recommend checking out our Databricks Certifications In 2025 blog post to learn how to obtain Databricks certifications, explore career benefits, and learn how to choose the right certification for your career goals.
Databricks CREATE TABLE FAQs
What is the purpose of the CREATE TABLE command in Databricks?
The CREATE TABLE command is used to define a new table in Databricks, allowing users to specify the table's schema, data format, and storage location.
What is the difference between managed and external tables?
Managed tables store both metadata and data in Databricks, while external tables store only metadata, leaving data in an external location.
Is an external table the same as an unmanaged table?
Yes, external tables and unmanaged tables refer to the same concept in Databricks. They store metadata in Databricks but keep the actual data at an external location, allowing you to manage the data lifecycle independently.
What is a shallow vs. deep clone?
A shallow clone copies metadata only, while a deep clone duplicates both data and metadata.
What are temporary tables?
Temporary tables (implemented in Databricks as temporary views) exist only within the current session and are removed automatically when the session ends.
What is schema evolution in Databricks?
Schema evolution allows you to modify an existing table's schema (e.g., adding or modifying columns) without needing to recreate the entire table, which is particularly beneficial for Delta tables.