
Databricks Tutorial: 7 Must-know Concepts For Any Data Specialist

Learn the most popular unified platform for big data analytics - Databricks. The tutorial covers the seven core concepts and features of Databricks and how they interconnect to solve real-world issues in the modern data world.
Jan 2024  · 12 min read

First, there were data warehouses. They stored data in rows and columns because the computers and networks of the time mostly handled simple, structured text. Much later came data lakes, which could store nearly any type of data you could collect. They were great for the social media and YouTube age.

But both had disadvantages: data warehouses were expensive and unsuitable for modern data science, while data lakes were messy and often turned into data swamps. So, companies ended up running two separate tech stacks: warehouses for BI and analytics, and lakes for machine learning.

However, managing two different data architectures was such a pain that companies often had poor results. This issue gave rise to the lakehouse architecture, which is precisely what Databricks is famous for.

Databricks is a cloud-based platform that allows users to derive value from both warehouses and lakes in a unified environment. This article will give an overview of the platform, showing its most important features and how to use them.

What We’ll Cover in this Databricks Tutorial

Databricks is such a massive platform that its documentation could be turned into a book. So, the article’s goal is to give you a hierarchy of concepts: linearly ordered explanations of Databricks features that will take you from a beginner to a decent Databricks practitioner. If you’re a total newcomer, you may also want to check out our Introduction to Databricks course.

Let’s get started!

1. What is Databricks?

When you read the word Databricks, you should immediately think of a platform, not a framework or a Python library. Typically, platforms offer a wide range of features, and Databricks is no exception. It is one of the very few platforms that can be used by any data professional, from data engineers to machine learning engineers (or what the press calls AI programmers today).

Databricks has the following core components:

  1. Workspace: Databricks provides a centralized environment where teams can collaborate without any hassles. The environment is accessible through a user-friendly web interface.
  2. Notebooks: Databricks has a version of Jupyter notebooks specifically designed for collaboration and flexibility.
  3. Apache Spark: Databricks loves Apache Spark. It is the engine that powers all parallel processing of humongous datasets, making it suitable for big data analytics.
  4. Delta Lake: A storage layer that enhances data lakes with ACID transactions. Delta Lake ensures data reliability and consistency, addressing the traditional challenges associated with data lakes.
  5. Scalability: The platform scales horizontally rather than vertically, which is ideally suited for organizations dealing with ever-increasing data demands.
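
To make the Delta Lake item in the list above more concrete, here is a minimal sketch of writing and reading a Delta table through the standard Spark reader and writer. The function names and the example path are hypothetical. The sketch assumes you pass in a DataFrame and a Spark session from a Databricks cluster (or any Spark setup with Delta Lake configured).

```python
def save_as_delta(df, path, mode="overwrite"):
    """Write a Spark DataFrame to `path` in Delta format.

    Each Delta write is committed as one atomic transaction, so readers
    see either the previous version of the table or the new one.
    """
    df.write.format("delta").mode(mode).save(path)
    return path


def load_delta(spark, path):
    """Read a Delta table back into a Spark DataFrame."""
    return spark.read.format("delta").load(path)


# Usage in a Databricks notebook, where `spark` is predefined
# (the path below is a placeholder, not a real location):
#   save_as_delta(df, "dbfs:/tmp/example_delta_table")
#   df2 = load_delta(spark, "dbfs:/tmp/example_delta_table")
```

Both helpers take their Spark objects as arguments, so nothing runs until you call them from a notebook attached to a cluster.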

Databricks benefits

These components, in combination, unlock a wide range of benefits:

  • Cross-team collaboration: engineers, analysts, scientists and ML engineers can work seamlessly in the same platform.
  • Consistency: with notebooks, users can transition between tasks and programming languages without the need for context-switching.
  • Efficient workflows: users can perform tasks such as data cleaning, transformation, and machine learning in a cohesive manner.
  • Integrated data management: users can ingest data into the platform from multiple sources, create tables, and run SQL.
  • Real-time collaboration: shared notebooks and collaborative editing features enable real-time collaboration. Multiple team members can work on the same notebook simultaneously.

If I’ve got you convinced of Databricks’ importance in the data world, let’s get you up and running with the platform.

2. Account Setup

To set up your account, go to https://www.databricks.com/try-databricks and sign up for the Community Edition.

image4.png

Community Edition has fewer features than the Enterprise version, but it doesn’t require a cloud provider setup, which is great for small use cases like tutorials.

If you see this page after email verification, you are good to go:

image3.png

3. Databricks Workspace

The interface you just landed on is the Workspace tied to your email address (the Community Edition workspace is easy to find). In practice, an account admin at your company usually creates a single Databricks account and manages access to the workspace.

Now, let’s understand the UI of the platform. On the left panel, we have the menu for the different components Databricks offers. The enterprise version will have even more buttons:

image6.png

The first option in the menu is the type of workspace, which is set to data science and engineering by default. If you change it to machine learning, a new Experiments option pops up:

image5.gif

On the surface, it may look like it doesn’t do much, but once you upgrade your account and start tinkering, you will notice some great features of the platform:

  • Central hub for resources: notebooks, clusters, tables, libraries and dashboards
  • Notebooks in multiple languages
  • Cluster management: managing computational resources for the workspace to execute code
  • Table management
  • Dashboard creation: Databricks users can collect visuals into dashboards right in the workspace
  • Collaborative real-time editing of notebooks
  • Version control for notebooks
  • Job scheduling (a powerful feature): users can execute notebooks and scripts at specified intervals

and so on.
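
To illustrate the job scheduling item, here is a rough sketch of what a scheduled notebook job can look like as a request body for the Databricks Jobs API (`jobs/create`). The job name, notebook path, and cluster ID are placeholders, and the helper function is hypothetical. It only builds and prints the JSON, so treat it as a starting point rather than a complete client.

```python
import json


def build_job_payload(name, notebook_path, cluster_id, cron, timezone="UTC"):
    """Build the JSON body for a scheduled single-notebook job."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
        # Databricks schedules use Quartz cron syntax (seconds field first).
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
    }


# Placeholder values: run a notebook every day at 06:00 UTC.
payload = build_job_payload(
    name="daily-diamonds-report",
    notebook_path="/Users/someone@example.com/diamonds_report",
    cluster_id="<your-cluster-id>",
    cron="0 0 6 * * ?",
)
print(json.dumps(payload, indent=2))
```

The same schedule can be set up through the workspace UI. The payload form is mainly useful once you want job definitions under version control.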

Now, let’s look at some of these components more closely.

4. Databricks Clusters

Clusters in Databricks are the computational resources used to execute data processing tasks. Usually, clusters run on the cloud provider you chose during account setup.

The community edition clusters are limited in RAM and CPU power, and GPUs aren’t included. However, premium users can often do the following tasks with clusters in a straightforward way:

  • Data processing: Clusters are used to process and transform large volumes of data, using parallel processing powers of Spark.
  • Machine learning: You can use Python (or another supported language, such as R or Scala) and its libraries for model training and inference.
  • ETL workflows: Clusters also support Extract, Transform, Load workflows by efficiently processing and transforming data from source to destination.

To create a cluster, you can use the “Create” button or the “Compute” options from the menu:

image12.gif

When creating the cluster, choose an appropriate Spark version for your environment and wait a few minutes for it to become operational.
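
The same choices you make in the UI can also be written down as a cluster definition for the Databricks Clusters API (`clusters/create`). The sketch below only assembles such a definition. The helper name is hypothetical, and the runtime string and node type are example values that depend on your workspace and cloud provider.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers=1,
                       autotermination_minutes=30):
    """Build a cluster definition; all values are supplied by the caller."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or positive")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Shut the cluster down after this many idle minutes to save cost.
        "autotermination_minutes": autotermination_minutes,
    }


spec = build_cluster_spec(
    name="tutorial-cluster",
    spark_version="13.3.x-scala2.12",  # example runtime; use one your workspace lists
    node_type_id="i3.xlarge",          # example AWS node type; differs per cloud
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```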

5. Databricks Notebooks

Once you have a running cluster, you are ready to create notebooks. If you’ve worked with Jupyter, Colab, or DataCamp Workspaces, this will be familiar:

image1.gif

But in a world where real Jupyter exists, why would you go for something “similar to Jupyter”? Well, Databricks notebooks have the following advantages over Jupyter notebooks:

  • Collaboration: Built-in collaborative features allow multiple users to work on the same notebook at the same time. Changes are tracked in real-time.
  • Execution environment: Most Jupyter environment providers or local instances rely on single machines with predefined hardware. Users must install external libraries and dependencies on their own. In contrast, Databricks notebooks are powered by clusters, which automatically handle resources and scale according to the workload. They also come with pre-populated environments.
  • Integration with Big Data tech: Jupyter can work with Apache Spark, but users need to manage Spark sessions and dependencies manually. Since Databricks was founded by Spark creators, it supports the framework natively. Spark sessions and clusters are automatically managed by the Databricks platform.

There are many other advantages of Databricks notebooks over Jupyter, so here is a table summarizing the differences:

| Feature | Jupyter Notebooks | Databricks Notebooks |
| --- | --- | --- |
| Platform | Open-source, runs locally or on cloud platforms | Exclusive to the Databricks platform |
| Collaboration and Sharing | Limited collaboration features, manual sharing | Built-in collaboration, real-time concurrent editing |
| Execution | Relies on local or external servers | Execution on Databricks clusters |
| Integration with Big Data | Can be integrated with Spark, requires additional configurations | Native integration with Apache Spark, optimized for big data |
| Built-in Features | External tools/extensions for version control, collaboration, and visualization | Integrated with Databricks-specific features like Delta Lake, built-in support for collaboration and analytics tools |
| Cost and Scaling | Local installations are often free, cloud-based solutions may have costs | Paid service, costs depend on usage, scales seamlessly with Databricks clusters |
| Ease of Use | Familiar and widely used in the data science community | Tailored for big data analytics, may have a steeper learning curve for Databricks-specific features |
| Data Visualization | Limited built-in support for data visualization | Built-in support for data visualization within the notebook environment |
| Cluster Management | Users need to manage Spark sessions and dependencies manually | Databricks platform handles cluster management and scaling automatically |
| Use Cases | Versatile for various data science tasks | Specialized for collaborative big data analytics within the Databricks platform |

Ultimately, the above advantages of Databricks notebooks come into effect in specific use cases. If you want to play around with a CSV dataset with Pandas on your laptop, Jupyter is much better.

But, for enterprise-level applications, Databricks as a platform may be a better option.

6. Data Ingestion into Databricks

Data ingestion refers to the process of importing data from various sources. Databricks supports ingestion from a variety of sources including:

  • AWS S3
  • Azure Blob Storage
  • Google Cloud Storage
  • Relational databases (MySQL, PostgreSQL, etc.)
  • Data lakes (Delta Lake, Parquet, Avro, etc.)
  • Streaming platforms (Apache Kafka)
  • Google BigQuery
  • That local CSV file you have

and so on.

Now, let’s actually see how you can load certain types of data into Databricks. We will start with local files:

image10.gif

Once you follow the steps in the GIF, you will have a file stored in the workspace. Here is how you can load it with Spark:

# Importing necessary libraries
from pyspark.sql import SparkSession

# Getting the Spark session (Databricks notebooks already provide one as `spark`)
spark = SparkSession.builder.appName("LocalFileImportExample").getOrCreate()

# Defining the DBFS path to the uploaded CSV file
path = "dbfs:/FileStore/tables/diamonds.csv"

# Reading the CSV file into a DataFrame
diamonds_df = spark.read.csv(path, header=True, inferSchema=True)

# Displaying the imported data
diamonds_df.show()

Pay attention to the dbfs: prefix. DBFS stands for Databricks File System, which is where files uploaded to the workspace are stored, and the prefix tells Spark to look for the file there.
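
The same DBFS file can be addressed in two ways: Spark readers take the `dbfs:/...` form, while local file APIs such as Python's `open()` use a `/dbfs/...` path on clusters where that mount is available (it may not be on every cluster type). The two helper functions below are hypothetical and only convert between the two spellings.

```python
def to_spark_path(path):
    """Return the `dbfs:/...` form that Spark readers accept."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


def to_local_path(path):
    """Return the `/dbfs/...` form used by local file APIs such as open()."""
    return "/dbfs/" + to_spark_path(path)[len("dbfs:/"):]


print(to_spark_path("/FileStore/tables/diamonds.csv"))
# dbfs:/FileStore/tables/diamonds.csv
print(to_local_path("dbfs:/FileStore/tables/diamonds.csv"))
# /dbfs/FileStore/tables/diamonds.csv
```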

Importing data from an S3 bucket is similar (for enterprise accounts):

# Importing necessary libraries
from pyspark.sql import SparkSession

# Creating a Spark session
spark = SparkSession.builder.appName("S3ImportExample").getOrCreate()

# Defining the S3 path to the data
s3_path = "s3://your-bucket/your-data.csv"

# Reading data from S3 into a DataFrame
data_from_s3 = spark.read.csv(s3_path, header=True, inferSchema=True)

# Displaying the imported data
data_from_s3.show()

For other types of data, you can check the Data engineering and Connect to data sources sections of the Databricks documentation.
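
For file formats beyond CSV, the generic Spark reader (`spark.read.format(...).load(...)`) follows the same pattern as the snippets above. Here is a small sketch of a wrapper around it. The function name is hypothetical and the paths in the usage comments are placeholders.

```python
def read_dataset(spark, path, fmt="csv", **options):
    """Load `path` with the generic Spark reader for the given format."""
    reader = spark.read.format(fmt)
    if options:
        # Reader options such as header=True are passed through unchanged.
        reader = reader.options(**options)
    return reader.load(path)


# Usage in a notebook where `spark` is predefined (paths are placeholders):
#   csv_df = read_dataset(spark, "s3://your-bucket/your-data.csv",
#                         fmt="csv", header=True, inferSchema=True)
#   parquet_df = read_dataset(spark, "s3://your-bucket/your-data.parquet",
#                             fmt="parquet")
```

Parquet files carry their own schema, so they typically need no `header` or `inferSchema` options.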

7. Running SQL in Databricks

When we uploaded the diamonds.csv file, it became a Databricks table in a database called default:

image9.png

This default database is created whenever we try to load structured files without creating the database first.

If we’ve got a database, that means we can query it with SQL, not just with Spark. To do so, create a new notebook or change the language of the current notebook to SQL. Then, try the following code snippet:

SELECT * FROM default.diamonds_1_csv
LIMIT 5;

It should return the first five rows of the diamonds table:

image11.png

Note: I am using an SQL notebook for the above snippet

You can load this table in Pandas as well. Within the same notebook, paste this snippet:

%python
# Import the necessary libraries
import pandas as pd

# Assuming 'default' is the database name and 'diamonds_1_csv' is the table name
# Use the spark.sql function to query the table and retrieve the data
table_df = spark.sql("SELECT * FROM default.diamonds_1_csv")

# Convert the Spark DataFrame to a Pandas DataFrame
pandas_df = table_df.toPandas()

# Display the Pandas DataFrame
pandas_df.head()

It should print the head of the table:

image8.png

Now, you can do any typical data analysis task on the table with both SQL and Pandas.
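
One caveat when moving between Spark and Pandas: `toPandas()` pulls every row onto a single machine, which is fine for the diamonds table but risky for large ones. A cautious sketch is to cap the row count before converting. The function name and the default limit are hypothetical choices.

```python
def table_sample_to_pandas(spark, table_name, max_rows=1000):
    """Fetch at most `max_rows` rows of a table as a pandas DataFrame."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is applied by Spark before any rows are collected.
    spark_df = spark.table(table_name).limit(max_rows)
    return spark_df.toPandas()


# Usage in a notebook where `spark` is predefined:
#   pandas_df = table_sample_to_pandas(spark, "default.diamonds_1_csv", max_rows=500)
#   pandas_df.head()
```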

Conclusion and Further Steps

We’ve managed to learn and do a lot using our bare-bones Databricks community edition account. To continue learning about the platform, the first step is to use the two-week free trial Databricks offers for premium accounts.

Then, you can fully enjoy the lessons of the Introduction to Databricks course offered by DataCamp. Apart from account setup, you will learn and practice using the following core features of Databricks:

  • Administering a Databricks workspace
  • Reading and writing to external databases
  • Data transformations
  • Data orchestration, aka scheduling jobs
  • Comprehensive overview of Databricks SQL
  • Using Lakehouse AI for large-scale machine learning

Author
Bex Tuychiev

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.
