CREATE TABLE in Databricks: A Full Guide to Building and Managing Tables
Databricks tables are a core component of the Databricks Lakehouse platform, designed to handle large-scale data storage and processing. These tables enable collaboration across teams by unifying structured and unstructured data while supporting analytical workloads.
In this article, I will guide you through the different techniques for creating tables in Databricks using the CREATE TABLE command, covering several syntax variations to accommodate different use cases. I will also offer tips on best practices for building and managing tables in Databricks for efficiency and reliability.
As we get started, I highly recommend checking out DataCamp’s How to Learn Databricks: A Beginner’s Guide to the Unified Data Platform blog post to help you understand the core features and applications of Databricks and provide a structured path to start your learning. I also recommend taking our Introduction to Databricks course. Even if you have experience in SQL, you will see that things can look different in the Databricks Lakehouse platform. The interactive exercises in our course will guide you through SQL queries as well as platform features and external system connections.
Introduction to Databricks Tables
To start, we should say Databricks tables are structured datasets that facilitate data organization and access within the Databricks Lakehouse environment. There are two types:
- Managed Tables: These tables are fully controlled by Databricks, which manages both the storage and metadata. When a managed table is created, Databricks takes care of the underlying data files, and if the table is deleted, the corresponding data is also removed.
- External Tables: External tables reference data that resides outside of Databricks, typically in cloud storage systems like AWS S3 or Azure Blob Storage. While Databricks manages the metadata for these tables, the actual data remains intact even if the table is dropped. These are also known as unmanaged tables.
Tables in Databricks default to Delta Lake because of its advanced features and optimizations. Delta Lake provides ACID transactions, scalability, performance enhancements such as caching and indexing, and unified batch and streaming data processing.
As a note, early versions of Azure Databricks used legacy “Hive format” tables, which modern Delta tables have since replaced. Delta tables surpass legacy Hive tables in reliability and performance, making them the default choice for Databricks workflows.
Here is a summary that I think could help you see the difference between the two:
Feature | Hive format table | Delta table |
---|---|---|
Data format | Apache Hive (CSV, Parquet, ORC, etc.) | Delta Lake (based on Parquet with metadata) |
Transactions | No ACID support (eventual consistency) | Full ACID transactions |
Schema management | Limited enforcement | Strict schema enforcement |
Performance optimization | Manual tuning required | Built-in optimizations (caching, compaction) |
Time travel | No built-in support for historical queries | Supports time travel for querying previous versions of data (see the example after this table) |
Concurrency | Prone to conflicts during updates or deletes | Handles concurrent operations efficiently |
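To illustrate the time travel feature listed above, here is what querying an earlier version of a Delta table might look like. This is a minimal sketch assuming a Delta table named sales_data with at least one earlier version:
-- List the table's version history
DESCRIBE HISTORY sales_data;
-- Query the table as of a specific version
SELECT * FROM sales_data VERSION AS OF 1;
-- Or as of a point in time
SELECT * FROM sales_data TIMESTAMP AS OF '2024-01-01';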
CREATE TABLE Syntax Variations
The CREATE TABLE command in Databricks supports multiple variations to accommodate different cases. Each variation has its purpose, with targeted options like file location, cloning, and metadata management. If you need to refresh your knowledge about Databricks SQL, I recommend reading our Databricks SQL tutorial to learn, among other things, how to use a Notebook in a Databricks SQL warehouse.
Below are the main syntax styles with examples and explanations.
CREATE TABLE [USING]
This syntax creates a new table with a specified schema and data format. The following query creates a table named sales_data by reading data from Parquet files stored at a specific location.
-- Create a table from files in a specific format
CREATE TABLE sales_data
USING PARQUET
LOCATION '/mnt/data/sales';
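The USING clause accepts other formats as well. As a quick sketch with a hypothetical table name and path, the same pattern works for CSV files, with reader options passed through OPTIONS:
-- Create a table over CSV files with an explicit schema and reader options
CREATE TABLE raw_customers (
  customer_id INT,
  customer_name STRING
)
USING CSV
OPTIONS (header 'true', delimiter ',')
LOCATION '/mnt/data/customers_csv';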
CREATE TABLE (Hive format)
This legacy syntax is similar to the standard SQL CREATE TABLE command and is used when defining a table whose metadata is managed by Databricks without referencing external files. The query below creates a Hive-style table named hive_table with a predefined schema and additional metadata properties.
-- Hive-style table creation with schema definition
CREATE TABLE hive_table (
id INT,
name STRING
)
COMMENT 'Hive format example'
TBLPROPERTIES ('created_by' = 'databricks');
In the above query, the COMMENT clause adds descriptive metadata, while TBLPROPERTIES stores custom properties about the table.
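You can confirm this metadata after creation, for example:
-- View the table's custom properties
SHOW TBLPROPERTIES hive_table;
-- View the full table definition, including the comment
DESCRIBE TABLE EXTENDED hive_table;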
CREATE TABLE LIKE
This command creates a new table using the schema and properties of an existing table. You should use this syntax when you want to replicate the structure of an existing table without copying its data. This is useful for testing or development purposes.
-- Create a table with the same schema as another table
CREATE TABLE new_table LIKE existing_table;
CREATE TABLE CLONE
This syntax allows you to create a new table by cloning an existing Delta Lake table, either as a deep or a shallow clone. Use the CREATE TABLE CLONE syntax when you need an independent copy of a table's data and metadata (deep clone) or just the metadata while referencing the original data (shallow clone). This is useful for versioning or creating backups.
The following example shows the queries for creating deep clone and shallow clone tables in Databricks.
-- Deep clone by default
CREATE TABLE cloned_table CLONE original_table;
-- Shallow clone
CREATE TABLE shallow_clone_table SHALLOW CLONE original_table;
CREATE OR REPLACE TABLE
This command replaces an existing table with a new one in a single operation, keeping the same table name. Use this syntax when you want to completely overwrite an existing table while preserving its name. This is particularly useful for updating the structure or data of a table without needing to drop it first.
The example query below creates or replaces the sales table with a new schema.
-- Create or replace a sales table with a new schema
CREATE OR REPLACE TABLE sales (
sale_id INT,
amount FLOAT
)
USING DELTA
LOCATION '/mnt/delta/sales';
There are other relevant clauses you can use with the CREATE TABLE syntax in Databricks to manage your tables according to specific requirements. Some of these clauses already appeared in the examples above; the example after this list combines several of them.
- IF NOT EXISTS: Avoids errors by skipping table creation if the table already exists.
- USING: Specifies the file format (e.g., Delta, Parquet, CSV).
- EXTERNAL: Declares a table whose data resides outside Databricks' managed storage.
- PARTITIONED BY: Defines partitions for optimizing large datasets.
- LOCATION: Points to the directory containing the table data.
- TBLPROPERTIES: Adds metadata for custom table configurations and optimizations.
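As a quick illustration, the sketch below (the table name, path, and property are hypothetical) combines several of these clauses in a single statement:
-- Hypothetical example combining IF NOT EXISTS, USING, PARTITIONED BY, LOCATION, and TBLPROPERTIES
CREATE TABLE IF NOT EXISTS events (
  event_id BIGINT,
  event_date DATE,
  country STRING
)
USING DELTA
PARTITIONED BY (country)
LOCATION '/mnt/data/events'
TBLPROPERTIES ('owner' = 'data_team');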
Step-by-Step Demonstrations
In this section, I will demonstrate different ways to create and manage tables in Databricks, with concise examples and explanations of key clauses. I recommend checking our Databricks Tutorial: 7 Must-know Concepts For Any Data Specialist tutorial to learn how to run SQL commands in the Databricks platform.
How to create an empty table
To create an empty Delta table, define the table schema with column names and data types. Use the USING DELTA clause to define the table format and COMMENT to add descriptive metadata.
This query creates an empty Delta table named employees with a predefined schema, including columns for id, name, department, and hire_date.
-- Create an empty Delta table with a defined schema
CREATE TABLE employees (
id BIGINT,
name STRING,
department STRING,
hire_date DATE
)
USING DELTA
COMMENT 'Table to store employee information';
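If you want to confirm the table was created with the expected (empty) schema, you can inspect it, for example:
-- Check the schema of the new table
DESCRIBE TABLE employees;
-- The table exists but contains no rows yet
SELECT COUNT(*) AS row_count FROM employees;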
How to create a table from existing data
To create a table from existing data, link the table to external data files in formats like CSV, Parquet, or JSON. You can use the LOCATION clause to link the table to the file path and the USING clause to specify the data format.
For example, the following query creates a table named sales_data by referencing Parquet files stored at /mnt/data/sales_data/.
-- Create a table from Parquet files in a specified location
CREATE TABLE sales_data
USING PARQUET
LOCATION '/mnt/data/sales_data/';
How to create a table as select (CTAS)
You can also create a table as select by executing a query and storing the result. For this scenario, the AS SELECT clause populates the table with the query result.
For example, the query below creates a new table named high_value_sales by selecting and transforming data from the existing sales_data table. It filters rows where total_amount is greater than 1000, keeping only the sale_id, customer_id, and total_amount columns.
-- Create a table by selecting and transforming data from another table
CREATE TABLE high_value_sales AS
SELECT
sale_id,
customer_id,
total_amount
FROM sales_data
WHERE total_amount > 1000;
How to create an external table
Creating an external table involves referencing data stored outside of Databricks' managed storage. You should use the LOCATION clause to specify the external data location.
The query below creates an external Delta table named external_table by referencing data stored at /mnt/external_storage/external_table/.
-- Create an external table referencing data outside Databricks storage
CREATE TABLE external_table
USING DELTA
LOCATION '/mnt/external_storage/external_table/';
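Because the table is external, dropping it removes only the metadata registered in Databricks; the underlying files remain and can be registered again later:
-- Dropping an external table removes only its metadata; the files remain
DROP TABLE external_table;
-- The same files can be registered again as a table
CREATE TABLE external_table
USING DELTA
LOCATION '/mnt/external_storage/external_table/';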
How to create a table LIKE
To create a table LIKE, duplicate the schema of an existing table without copying its data. In the example below, the query creates a new table named sales_backup with the same schema as an existing sales table.
-- Create a new sales table with the same schema as an existing table
CREATE TABLE sales_backup LIKE sales;
How to create a table CLONE
You can create a table clone by using the CLONE clause to create a shallow or deep copy of a Delta table for backups or testing.
In the following example, the query creates a new table, sales_shallow_clone, copying only the metadata (schema and table properties) while referencing the original data files.
-- Shallow clone: copies only metadata (references original data)
CREATE TABLE sales_shallow_clone SHALLOW CLONE sales_data;
The example below creates a new table, sales_deep_clone, copying both the metadata and the actual data. Deep clone is the default behavior, so the DEEP keyword is optional.
-- Deep clone: copies both metadata and actual data
CREATE TABLE sales_deep_clone DEEP CLONE sales_data;
How to create a temporary table (temp table)
Databricks implements session-scoped temporary objects as temporary views rather than temporary tables, so use the TEMPORARY keyword with CREATE VIEW to ensure the object is session-specific.
The query below creates a temporary view named temp_table that exists only for the current session. It selects id, name, and sales_amount from sales_data, filtering for rows where region = 'US'.
-- Create a temporary view that exists only in the session
CREATE TEMPORARY VIEW temp_table AS
SELECT
id,
name,
sales_amount
FROM sales_data
WHERE region = 'US';
How to replace a table
You can use the CREATE OR REPLACE clause to replace a table's definition and content. The query below replaces the existing table updated_table with a new schema and data. It defines the table with three columns (id, name, and total_sales).
-- Replace an existing table with a new schema or data
CREATE OR REPLACE TABLE updated_table (
id BIGINT,
name STRING,
total_sales DOUBLE
)
USING DELTA
LOCATION '/mnt/delta/updated_table/';
Advanced Topics
In this section, I will explore advanced Databricks table features that enhance functionality, performance, and security.
Generated columns and identity columns
Databricks supports identity columns, which automatically generate unique values for each new row. This feature is useful for creating surrogate keys in data warehousing.
The query below creates a table employees with three columns: id, name, and department. The id column is an identity column that auto-increments with each new row.
-- Identity column: auto-incremented value
CREATE TABLE employees (
id BIGINT GENERATED ALWAYS AS IDENTITY, -- Auto-incremented column
name STRING,
department STRING
);
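When inserting into this table, omit the id column and Databricks generates its values automatically. A minimal, hypothetical insert:
-- id values are generated automatically for each new row
INSERT INTO employees (name, department) VALUES
  ('Alice', 'Finance'),
  ('Bob', 'Engineering');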
This other query creates a table sales with three columns: sale_id, sale_date, and sale_year. The sale_year column is a generated column, automatically calculated from the sale_date column using the YEAR(sale_date) function.
-- Generated column: derived from other columns
CREATE TABLE sales (
sale_id BIGINT,
sale_date DATE,
sale_year INT GENERATED ALWAYS AS (YEAR(sale_date)) -- Derived column
);
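Likewise, you supply only sale_id and sale_date when inserting, and sale_year is computed for you. A hypothetical example:
-- sale_year is derived from sale_date at write time
INSERT INTO sales (sale_id, sale_date) VALUES (1, DATE'2024-06-15');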
Partitioning and clustering
Partitioning divides data into smaller chunks, improving query performance, while clustering organizes data within partitions for faster reads.
For example, the query below creates a partitioned table where data is split into partitions based on the region column, improving query performance when filtering by region.
-- Partitioned table
CREATE TABLE partitioned_table (
id BIGINT,
region STRING,
sales_amount DOUBLE
)
USING DELTA
PARTITIONED BY (region); -- Querying by region will be faster
The query below optimizes the partitioned table by clustering the data within each partition based on the sales_amount column, improving the data layout so that queries filtering on sales_amount read less data. Note that the ZORDER BY clause clusters data within partitions, not across them.
-- Clustered table
OPTIMIZE partitioned_table
ZORDER BY (sales_amount); -- Clusters data within partitions for efficient reads
Column masking and row filters
Column masking and row filters restrict data visibility based on user roles or conditions, protecting sensitive information.
For example, the query below creates a view named masked_table based on the employees table. The salary column is redacted (replaced with REDACTED) for all users except members of the managers group, checked here with the is_account_group_member() function. The CASE expression ensures that only managers can see the actual salary, while other users see the masked value.
-- Column masking: redact sensitive data through a view
CREATE OR REPLACE VIEW masked_table AS
SELECT
id,
name,
CASE
WHEN is_account_group_member('managers') THEN CAST(salary AS STRING)
ELSE 'REDACTED'
END AS salary
FROM employees;
Similarly, the query below applies a row filter by creating or replacing a view named filtered_table based on the sales table. The view only includes rows where region = 'US', restricting access to sales data from the US.
-- Row filters: Fine-grained access control
CREATE OR REPLACE VIEW filtered_table AS
SELECT *
FROM sales
WHERE region = 'US'; -- Limit data visibility to a specific region
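If your workspace uses Unity Catalog, masks and row filters can also be attached directly to tables rather than implemented through views. The sketch below is an illustration under those assumptions: an employees table with a salary column, a sales table with a region column, and an account group named managers.
-- Define a masking function and attach it to the salary column
CREATE FUNCTION salary_mask(salary DOUBLE)
RETURNS DOUBLE
RETURN CASE WHEN is_account_group_member('managers') THEN salary ELSE NULL END;
ALTER TABLE employees ALTER COLUMN salary SET MASK salary_mask;
-- Define a row filter function and attach it to the sales table
CREATE FUNCTION us_only(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('managers') OR region = 'US';
ALTER TABLE sales SET ROW FILTER us_only ON (region);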
Schema evolution
Databricks supports schema evolution through CREATE OR REPLACE TABLE and table alterations. Use the CREATE OR REPLACE TABLE statement to overwrite the schema and data, and the ALTER TABLE statement to modify the schema incrementally without affecting existing data.
The query below replaces the existing employees table with a new schema, adding a new hire_date column while keeping the data stored in Delta format.
-- Schema evolution with CREATE OR REPLACE TABLE
CREATE OR REPLACE TABLE employees (
id BIGINT,
name STRING,
department STRING,
hire_date DATE -- New column added
)
USING DELTA;
This other query modifies the employees table by adding a new salary column to the existing schema.
-- Altering a table schema
ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
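Delta can also merge new columns from incoming data during writes. As a sketch, and depending on your environment (some SQL warehouses restrict Spark configuration changes), the session-level setting below enables that behavior for operations like INSERT and MERGE:
-- Allow Delta to add new columns found in incoming data (session-level setting)
SET spark.databricks.delta.schema.autoMerge.enabled = true;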
Performance tuning
To optimize performance for large-scale tables, you can utilize various table properties and techniques such as auto-optimization and Z-ordering.
In the following example, the query enables auto-optimization on the delta_table by setting properties that write better-sized files during writes and automatically compact small files afterward, improving storage efficiency and read performance.
-- Enable automatic optimizations
ALTER TABLE delta_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Write fewer, larger files during writes
'delta.autoOptimize.autoCompact' = 'true' -- Compact small files after writes
);
Similarly, the query below optimizes the delta_table by clustering data based on region and sales_date to enhance query performance, especially for filters on these columns.
-- Perform clustering for faster query performance
OPTIMIZE delta_table
ZORDER BY (region, sales_date); -- Improves query efficiency for frequent filters
Best practices and considerations
Here are some things to consider when creating tables in Databricks:
- Table naming conventions: Always use clear and consistent names to improve discoverability and reduce errors. Avoid special characters or excessively long names to prevent compatibility issues.
- Security and permissions: Utilize Unity Catalog for centralized access control across all data assets. Use Access Control Lists (ACLs) to restrict who can view or modify data. Also, mask sensitive data with column masking or row filters.
- Backup and recovery: Rely on Delta Lake's version history for easy recovery and time travel (see the example after this list). Regularly monitor and manage the retention of historical versions to balance storage costs and recovery needs.
- Managed vs. external tables: It is also important to understand managed versus external tables. In managed tables, Databricks handles both metadata and data. This is ideal for scenarios where you want full control over data lifecycle management. However, in external tables, the referenced data is stored outside of Databricks. This is useful for maintaining control over data that may be shared across different platforms or systems.
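For the backup and recovery point above, Delta's version history can be inspected and a table rolled back to a previous version. A minimal sketch, assuming a Delta table named sales_data:
-- Review the table's change history
DESCRIBE HISTORY sales_data;
-- Roll the table back to an earlier version
RESTORE TABLE sales_data TO VERSION AS OF 3;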
Common Pitfalls and Troubleshooting
Now that we have discussed the best practices for managing tables in Databricks, let me cover the frequent issues you may encounter with Databricks tables, their typical error messages, and their solutions.
Non-empty location errors when using CREATE vs. CREATE OR REPLACE
When using CREATE TABLE with a specified location (the LOCATION clause), if the target directory is not empty, you may encounter an error like:
Error: Cannot create table because the location is not empty.
This error occurs because the CREATE TABLE command expects an empty directory at the specified location, and existing files or metadata conflict with this requirement.
To solve this error, use the CREATE OR REPLACE TABLE statement to overwrite the existing data and metadata, or manually clean up the directory before running the CREATE TABLE statement.
CREATE OR REPLACE TABLE delta_table
USING DELTA
LOCATION '/mnt/data/delta_table'; -- Overwrites table and files
Schema mismatch or issues with certain data types
Inserting or querying data results in schema mismatch errors such as:
Error: Schema mismatch: Expected column type STRING, but found INT.
The error is typically caused by a mismatch between the table schema and the incoming data types, or by unsupported or ambiguous data type conversions (e.g., complex nested types). To solve it, define the schema explicitly when creating the table; Delta tables enforce the declared schema on write by default and reject data that does not match.
-- Define the schema explicitly; Delta enforces it on write by default
CREATE TABLE schema_table (
id INT,
name STRING
)
USING DELTA;
Also, you should allow schema evolution when adding new columns or altering data types.
ALTER TABLE schema_table ADD COLUMNS (new_column DOUBLE);
Using explicit type casting when inserting data should also help troubleshoot this type of error.
INSERT INTO schema_table SELECT CAST(id AS INT), CAST(name AS STRING) FROM source_table;
Concurrent transaction limitations with identity columns
When using identity columns, you may encounter issues related to concurrent transactions, such as:
Error: Concurrent transactions detected on identity column table
Identity columns are stateful and can encounter conflicts when multiple transactions attempt to insert data concurrently. To work around this error, you can replace identity columns with surrogate keys generated using functions like uuid() or monotonically_increasing_id().
CREATE TABLE better_table AS
SELECT
monotonically_increasing_id() AS id, -- Replaces identity column
name, department
FROM source_data;
You can also use Delta Lake's MERGE statement for complex insert and update workflows without relying on identity columns.
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Conclusion
In Databricks, the CREATE TABLE command is a versatile tool for defining data structures, linking to external data, and deriving new datasets through queries. It is important to understand the different syntax variations for creating and managing tables so you can match them to your needs. As you have seen, there are several table-creation methods to choose from.
If you want to explore the foundational concepts of Databricks, I highly recommend taking DataCamp’s Introduction to Databricks course to learn about Databricks as a data warehousing solution for Business Intelligence (BI). I also recommend checking out our Databricks Certifications In 2025 blog post to learn how to obtain Databricks certifications, explore career benefits, and learn how to choose the right certification for your career goals.
Databricks CREATE TABLE FAQs
What is the purpose of the CREATE TABLE command in Databricks?
The CREATE TABLE command is used to define a new table in Databricks, allowing users to specify the table's schema, data format, and storage location.
What is the difference between managed and external tables?
Managed tables store both metadata and data in Databricks, while external tables store only metadata, leaving data in an external location.
Is an external table the same as an unmanaged table?
Yes, external tables and unmanaged tables refer to the same concept in Databricks. They store metadata in Databricks but keep the actual data at an external location, allowing you to manage the data lifecycle independently.
What is a shallow vs. deep clone?
A shallow clone copies metadata only, while a deep clone duplicates both data and metadata.
What are temporary tables?
Temporary tables (implemented in Databricks as temporary views) exist only within the current session and are removed automatically when the session ends.
What is schema evolution in Databricks?
Schema evolution allows you to modify an existing table's schema (e.g., adding or modifying columns) without needing to recreate the entire table, which is particularly beneficial for Delta tables.