Skip to main content

How to Execute SQL Queries in Python and R Tutorial

Abid Ali Awan,
April 4, 2022 min read
LinkedInFacebookTwitterCopy
Learn easy and effective ways to run SQL queries in Python and R for data analysis and database management.

What is SQL

Structured Query Language (SQL) is the most common language used for running various data analysis tasks. It is also used for maintaining a relational database, for example: adding tables, removing values, and optimizing the database. A simple relational database consists of multiple tables that are interconnected, and each table consists of rows and columns.

On average, a technology company generates millions of data points every day. A storage solution that is robust and effective is required so they can use the data to improve the current system or come up with a new product. A relational database such as MySQL, PostgreSQL, and SQLite solve these problems by providing robust database management, security, and high performance.

Core functionalities of SQL

  • Create new tables in a database
  • Execute queries against a database
  • Retrieve data from a database
  • Insert records into a database
  • Update records in a database
  • Delete records from a database
  • Optimize any database

SQL is a high-demand skill that will help you land any job in the tech industry. Companies like Meta, Google, and Netflix are always on the lookout for data professionals that can extract information from SQL databases and come up with innovative solutions. You can learn the basics of SQL by taking the Introduction to SQL tutorial on DataCamp.

Why use SQL with Python and R?

SQL can help us discover the company's performance, understand customer behaviors, and monitor the success metrics of marketing campaigns. Most data analysts can perform the majority of business intelligence tasks by running SQL queries, so why do we need tools such as PoweBI, Python, and R? By using SQL queries, you can tell what has happened in the past, but you cannot predict future projections. These tools help us understand more about the current performance and potential growth.

Python and R are multipurpose languages that allow professionals to run advanced statistical analysis, build machine learning models, create data APIs, and eventually help companies to think beyond KPIs. In this tutorial, we will learn to connect SQL databases, populate databases, and run SQL queries using Python and R.

Note: If you are new to SQL, then take the SQL skill track to understand the fundamentals of writing SQL queries.

Python Tutorial

The Python tutorial will cover the basics of connecting with various databases (MySQL and SQLite), creating tables, adding records, running queries, and learning about the Pandas function read_sql.

Setting Up

We can connect the database using SQLAlchemy, but in this tutorial, we are going to use the inbuilt Python package SQLite3 to run queries on the database. SQLAlchemy provides support for all kinds of databases by providing a unified API. If you are interested in learning more about SQLAlchemy and how it works with other databases then check out the Introduction to Databases in Python course.

MySQL is the most popular database engine in the world, and it is widely used by companies like Youtube, Paypal, LinkedIn, and GitHub. Here we will learn how to connect the database. The rest of the steps for using MySQL are similar to the SQLite3 package.

First, install the mysql package using '!pip install mysql' and then create a local database engine by providing your username, password, and database name.

import mysql.connector as sql

conn = sql.connect(
  host="localhost",
  user="abid",
  password="12345",
  database="datacamp_python"
)

Similarly, we can create or load a SQLite database by using the sqlite3.connect function. SQLite is a library that implements a self-contained, zero-configuration, and serverless database engine. It is DataCamp Workspace friendly, so we will use it in our project to avoid local host errors.

import sqlite3
import pandas as pd

conn= sqlite3.connect("datacamp_python.db")

Creating a Database

In this part, we will learn how to load the COVID-19's impact on airport traffic dataset, under the CC BY-NC-SA 4.0 license, into our SQLite database. We will also learn how to create tables from scratch.

Image from Kaggle Dataset

The airport traffic dataset consists of a percentage of the traffic volume during the baseline period from February 1, 2020, to March 15, 2020. We will load a CSV file by using Pandas function read_csv and then use the to_sql function to transfer the dataframe into our SQLite table. The to_sql function requires a table name (String) and connection to the SQLite engine.

data = pd.read_csv("data/covid_impact_on_airport_traffic.csv")


data.to_sql(
            'airport',             # Name of the sql table
            conn,                 # sqlite.Connection or sqlalchemy.engine.Engine
            if_exists='replace'
           )

We will now test whether we have succeeded by running a quick SQL query. Before running a query, we need to create a cursor that will help us execute queries, as shown in the code block below. You can have multiple cursors on the same database within a single connection.

In our case, the SQL query returned three columns and five rows from the airport table. To display the first row, we will use cursor.fetchone().

cursor = conn.cursor()
cursor.execute("""SELECT Date, AirportName, PercentOfBaseline  
                  FROM airport
                  LIMIT 5""")
cursor.fetchone()


>>> ('2020-04-03', 'Kingsford Smith', 64)

To display the rest of the records, we will use cursor.fetchall(). The airport dataset is successfully loaded into the database with a few lines of code.

cursor.fetchall()

>>> [('2020-04-13', 'Kingsford Smith', 29),
     ('2020-07-10', 'Kingsford Smith', 54),
     ('2020-09-02', 'Kingsford Smith', 18),
     ('2020-10-31', 'Kingsford Smith', 22)]

Now, let’s learn to create a table from scratch and populate it by adding sample values. We will create a studentinfo table with id (integer, primary key, auto-increment), name (text), and subject (text).

Note: The SQLite syntax is a bit different. It is recommended to check out the SQLite cheat sheet to understand the SQL queries mentioned in this tutorial.

cursor.execute("""
                CREATE TABLE studentinfo
                (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT,
                subject TEXT
                )
                 """)

Let’s check how many tables we have added to the database by running a simple SQLite query.

cursor.execute("""
                SELECT name
                FROM sqlite_master
                WHERE type='table'
                 """)
cursor.fetchall()

>>> [('airport',), ('studentinfo',)]

Running Queries

In this section, we will add values to the studentinfo table and run simple SQL queries. By using INSERT INTO, we can add a single row to the studentinfo table.

To insert values, we need to provide a query and value arguments to the execute function. The function fills the “?” inputs with the values that we have provided.

query = """
            INSERT INTO studentinfo
            (name, subject)
            VALUES (?, ?)
        """
value = ("Marry", "Math")
cursor.execute(query,value)

Repeat the above query by adding multiple records.

query = """
           INSERT INTO studentinfo
           (name, subject)
           VALUES (?, ?)
        """
values = [("Abid", "Stat"),
          ("Carry", "Math"),
          ("Ali","Data Science"),
          ("Nisha","Data Science"),
          ("Matthew","Math"),
          ("Henry","Data Science")]

cursor.executemany(query,values)

It’s time to verify the record. To do this we will run a simple SQL query that will return rows where the subject is Data Science.

cursor.execute("""
                SELECT *
                FROM studentinfo
                WHERE subject LIKE 'Data Science'
                 """)
cursor.fetchall()

>>> [(4, 'Ali', 'Data Science'),
     (5, 'Nisha', 'Data Science'),
     (7, 'Henry', 'Data Science')]

The DISTINCT subject command is used to display unique values present in subject columns. In our case, it’s Math, Stat, and Data Science.

cursor.execute("SELECT DISTINCT subject from studentinfo")
cursor.fetchall()

>>> [('Math',), ('Stat',), ('Data Science',)]

To save all of the changes, we will use the commit() function. Without a commit, the data will be lost after the machine restarts.

conn.commit()

SQL with Pandas

In this part we will learn how to extract the data from the SQLite database and convert it into a Pandas dataframe with one line of code. read_sql provides more than just running SQL queries. We can use it for setting index columns, parsing date and time, adding values, and filtering out column names. Learn more about importing data in Python by taking a short DataCamp course.

read_sql requires two arguments: a SQL query and connection to the SQLite engine. The output contains the top five rows from the airport table where PercentOfBaseline is greater than 20.

data_sql_1 = pd.read_sql("""
                        SELECT Date,City,PercentOfBaseline
                        FROM airport
                        WHERE PercentOfBaseline > 20
                        ORDER BY Date DESC
                        """,
                        conn)
print(data_sql_1.head())

         Date          City  PercentOfBaseline
0  2020-12-02        Sydney                 27
1  2020-12-02      Santiago                 48
2  2020-12-02       Calgary                 99
3  2020-12-02  Leduc County                100
4  2020-12-02      Richmond                 86

Running data analysis on relational databases has become easier with Pandas integration. We can also use this data to forecast the values and run complex statistical analyses.

The plot function is used to visualize the histogram of the PercentOfBaseline column.

data_sql_1.plot(y="PercentOfBaseline",kind="hist");

Similarly, we can limit the values to the top 20 and display a time series line chart.

data_sql_2 = pd.read_sql("""
                        SELECT Date,City,PercentOfBaseline
                        FROM airport
                        WHERE PercentOfBaseline > 20
                        ORDER BY Date DESC
                        LIMIT 20
                        """,
                        conn)


data_sql_2.plot(x="Date",y="PercentOfBaseline",kind="line");

Finally, we will close the connection to free up resources. Most packages do this automatically, but it is preferred to close the connections after finalizing the changes.

conn.close()

R Tutorial

We are going to replicate all the tasks from the Python tutorial using R. The tutorial includes creating connections, writing tables, appending rows, running queries, and data analysis with dplyr.

Setting Up

The DBI package is used for connecting with the most popular databases such as MariaDB, Postgres, Duckdb, and SQLite. For example, install the RMySQL package and create a database by providing a username, password, database name, and host address.

install.packages("RMySQL")
library(RMySQL)
conn = dbConnect(
                MySQL(),
                user = 'abid',
                password = '1234',
                dbname = 'datacamp_R',
                host = 'localhost'
                )

In this tutorial, we are going to create an SQLite database by providing a name and the SQLite function.

library(RSQLite)
library(DBI)
library(tidyverse)

conn = dbConnect(SQLite(), dbname = 'datacamp_R.db')

Creating Database

By importing the tidyverse library, we will have access to the dplyr, ggplot, and defaults datasets.

dbWriteTable function takes data.frame and adds it into the SQL table. It takes three arguments: connection to SQLite, name of the table, and data frame. With dbReadTable, we can view the entire table. To view the top 6 rows, we have used head.

dbWriteTable(conn, "cars", mtcars)
head(dbReadTable(conn, "cars"))

    mpg    cyl    disp    hp    drat    wt    qsec    vs    am    gear    carb
1    21.0    6    160    110    3.90    2.620    16.46    0    1    4    4
2    21.0    6    160    110    3.90    2.875    17.02    0    1    4    4
3    22.8    4    108    93    3.85    2.320    18.61    1    1    4    1
4    21.4    6    258    110    3.08    3.215    19.44    1    0    3    1
5    18.7    8    360    175    3.15    3.440    17.02    0    0    3    2
6    18.1    6    225    105    2.76    3.460    20.22    1    0    3    1

dbExecute lets us execute any SQLite query, so we will be using it to create a table called idcard.

To display the names of the tables in the database, we will use dbListTables.

dbExecute(conn, 'CREATE TABLE idcard (id int, name text)')
dbListTables(conn)

>>> 'cars''idcard'

Let’s add a single row to the idcard table and use dbGetQuery to display results.

Note: dbGetQuery runs a query and returns the records whereas dbExecute runs SQL query but does not return any records.

dbExecute(conn, "INSERT INTO idcard (id,name)\
                 VALUES(1,'love')")
dbGetQuery(conn,"SELECT * FROM idcard")


id    name
1    love

We will now add two more rows and display results by using dbReadTable.

dbExecute(conn,"INSERT INTO idcard (id,name)\
                VALUES(2,'Kill'),(3,'Game')
               ")
dbReadTable(conn,'idcard')


id    name
1    love
2    Kill
3    Game

dbCreateTable lets us create a hassle-free table. It requires three arguments; connection, name of the table, and either a character vector or a data.frame. The character vector consists of names (column names) and values (types). In our case, we are going to provide a default population data.frame to create the initial structure.

dbCreateTable(conn,'population',population)
dbReadTable(conn,'population')

country    year    population

Then, we are going to use dbAppendTable to add values in the population table.

dbAppendTable(conn,'population',head(population))
dbReadTable(conn,'population')

country    year    population
Afghanistan    1995    17586073
Afghanistan    1996    18415307
Afghanistan    1997    19021226
Afghanistan    1998    19496836
Afghanistan    1999    19987071
Afghanistan    2000    20595360

Running Queries

We will use dbGetQuery to perform all of our data analytics tasks. Let’s try to run a simple query and then learn more about other functions.

dbGetQuery(conn,"SELECT * FROM idcard")

id    name
1    love
2    Kill
3    Game

You can also run a complex SQL query to filter horsepower and display limited rows and columns.

dbGetQuery(conn, "SELECT mpg,hp,gear\
                  FROM cars\
                  WHERE hp > 50\
                  LIMIT 5")


mpg    hp    gear
21.0    110    4
21.0    110    4
22.8    93    4
21.4    110    3
18.7    175    3

To remove tables, use dbRemoveTable. As we can now see, we have successfully removed the idcard table.

dbRemoveTable(conn,'idcard')
dbListTables(conn)
>>> 'cars''population'

To understand more about tables we will use dbListFields which will display the column names in a particular table.

dbListFields(conn, "cars")


>>> 'mpg''cyl''disp''hp''drat''wt''qsec''vs''am''gear''carb'

SQL with dplyr

In this section, we will use dplyr to read tables and then run queries using filter, select, and collect. If you don’t want to learn SQL syntax and want to perform all the tasks using pure R, then this method is for you. We have pulled the cars table, filtered it by gears and mpg, and then selected three columns as shown below.

cars_results <-
  tbl(conn, "cars") %>%
  filter(gear %in% c(4, 3),
         mpg >= 14,
         mpg <= 21) %>%
  select(mpg, hp, gear) %>%
  collect()
cars_results

mpg    hp    gear
21.0    110    4
21.0    110    4
18.7    175    3
18.1    105    3
14.3    245    3
...   ...   ...

We can use the filtered data frame to display a boxplot graph using ggplot.

ggplot(cars_results,aes(fill=as.character(gear), y=mpg)) +
  geom_boxplot()

Or we can display a facet point plot divided by the number of gears.

ggplot(cars_results,
  aes(mpg, ..count.. ) ) +
  geom_point(stat = "count", size = 4) +
  coord_flip()+
  facet_grid( as.character(gear) ~ . )

Conclusion

In this tutorial, we have learned the importance of running SQL queries with Python and R, creating databases, adding tables, and performing data analysis using SQL queries. We have also learned how Pandas and dplyr help us run queries with a single line of code.

SQL is a must-learn skill for all tech-related jobs. If you are starting your career as a data analyst, we recommend completing the Data Analyst with SQL Server career track within two months. This career track will teach you everything about SQL queries, servers, and managing resources.

You can run all the scripts used in this tutorial for free by clicking on the green button Open In Workspace.