Tutorials
r programming
+1

SQLite in R

In this tutorial, you will learn about using SQLite, an extremely light-weight relational database management system (RDBMS) in R.

As mentioned at the end of my previous tutorial, Beginners Guide to SQLite, SQLite databases are most powerful when used along R and Python. So far in DataCamp, you have been exposed to how to operate with SQLite databases from Python (see the SQLite in Python tutorial by Sayak Paul to learn how to manipulate SQLite databases through the sqlite3 package in Python). However, in this tutorial, we will focus on how to use SQLite databases in R using the RSQLite package.

We will go over the basics of how to perform essential tasks such as sending queries to a SQLite database or creating tables using RSQLite. Furthermore, I will cover how to use parameterized queries and operations like INSERT or DELETE that do not return tabular results.

Creating Databases and Tables

The first step as you may have guessed is to create a database. RSQLite can create ephemeral in-memory transient SQLite databases just as it happens when you open the SQLite command line. However, this is not usually what you want, so let's create a proper database for the mtcars dataset using the function dbConnect(), which takes the following arguments:

  • drv: A database driver
  • path: The path to a SQLite database. If you are creating a new one, simply give it a name of your choice as I do below. If you want to operate with a transient in-memory database; however, you can omit the path argument or type ":memory:").
# Load the RSQLite Library
library(RSQLite)
# Load the mtcars as an R data frame put the row names as a column, and print the header.
data("mtcars")
mtcars$car_names <- rownames(mtcars)
rownames(mtcars) <- c()
head(mtcars)
# Create a connection to our new database, CarsDB.db
# you can check that the .db file has been created on your working directory
conn <- dbConnect(RSQLite::SQLite(), "CarsDB.db")
mpgcyldisphpdratwtqsecvsamgearcarbcar_names
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant

Once you have the database created and your data in proper shape, you can go ahead and create a table inside the database using the dbWriteTable() function. This function can take multiple arguments, but, for now, let's focus on the following:

  • conn: The connection to your SQLite database
  • name: The name you want to use for your table
  • value: The data that you want to insert. It should be an R data frame or an object coercible to an R data frame.

After that, you can use the function dbListTables() with the SQLite database connection as an argument to check that you have successfully created the table.

# Write the mtcars dataset into a table names mtcars_data
dbWriteTable(conn, "cars_data", mtcars)
# List all the tables available in the database
dbListTables(conn)

'cars_data'

An extremely useful feature when creating tables using RSQLite, is that you can append more data to an existing table using a loop in case that you have several data frames by setting the optional argument append = TRUE inside dbWriteTable(). For instance, let us create a new toy table with some cars and the manufacturers by appending two different data frames:

# Create toy data frames
car <- c('Camaro', 'California', 'Mustang', 'Explorer')
make <- c('Chevrolet','Ferrari','Ford','Ford')
df1 <- data.frame(car,make)
car <- c('Corolla', 'Lancer', 'Sportage', 'XE')
make <- c('Toyota','Mitsubishi','Kia','Jaguar')
df2 <- data.frame(car,make)
# Add them to a list
dfList <- list(df1,df2)
# Write a table by appending the data frames inside the list
for(k in 1:length(dfList)){
    dbWriteTable(conn,"Cars_and_Makes", dfList[[k]], append = TRUE)
}
# List all the Tables
dbListTables(conn)
  1. 'Cars_and_Makes'
  2. 'cars_data'

Let us now make sure that all the data is in the new table:

dbGetQuery(conn, "SELECT * FROM Cars_and_Makes")
carmake
Camaro Chevrolet
CaliforniaFerrari
Mustang Ford
Explorer Ford
Corolla Toyota
Lancer Mitsubishi
Sportage Kia
XE Jaguar

Executing SQL Queries

As you have seen above, it is possible to execute valid SQL queries through RSQLite using the dbGetQuery() function, which has the following arguments:

  • conn: The connection to the SQLite Database
  • query: The SQL query you want to execute as a string

To further showcase the capability to execute SQL queries with RSQLite, let's go over some more query examples on the cars_data table:

Note: It is important to keep in mind that, through RSQLIte, you can execute any query that is valid for a SQLite database from simple SELECT statements to JOINS (Except RIGHT OUTER JOINS and FULL OUTER JOINS, which are not allowed in SQLite).

# Gather the first 10 rows in the cars_data table
dbGetQuery(conn, "SELECT * FROM cars_data LIMIT 10")
mpgcyldisphpdratwtqsecvsamgearcarbcar_names
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
# Get the car names and horsepower of the cars with 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
                 WHERE cyl = 8")
car_nameshpcyl
Hornet Sportabout 175 8
Duster 360 245 8
Merc 450SE 180 8
Merc 450SL 180 8
Merc 450SLC 180 8
Cadillac Fleetwood 205 8
Lincoln Continental215 8
Chrysler Imperial 230 8
Dodge Challenger 150 8
AMC Javelin 150 8
Camaro Z28 245 8
Pontiac Firebird 175 8
Ford Pantera L 264 8
Maserati Bora 335 8
# Get the car names and horsepower starting with M that have 6 or 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
                 WHERE car_names LIKE 'M%' AND cyl IN (6,8)")
car_nameshpcyl
Mazda RX4 110 6
Mazda RX4 Wag110 6
Merc 280 123 6
Merc 280C 123 6
Merc 450SE 180 8
Merc 450SL 180 8
Merc 450SLC 180 8
Maserati Bora335 8
# Get the average horsepower and mpg by number of cylinder groups
dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp', AVG(mpg) AS 'average_mpg' FROM cars_data
                 GROUP BY cyl
                 ORDER BY average_hp")
cylaverage_hpaverage_mpg
4 82.6363626.66364
6 122.2857119.74286
8 209.2142915.10000

To store the results of your queries to do further operations in R down the line as a data frame, it is as simple as assigning the result of your query to a variable.

avg_HpCyl <- dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp'FROM cars_data
                 GROUP BY cyl
                 ORDER BY average_hp")
avg_HpCyl
class(avg_HpCyl)
cylaverage_hp
4 82.63636
6 122.28571
8 209.21429

'data.frame'

Insert Variables into Queries (Parameterised Queries)

One of the greatest advantages of operating with SQLite databases from R is the ability to use parameterized queries. That is, the ability to gather variables available in your R workspace and use them to query your SQLite database. Let me show an example of how to use a variable inside a SQLite query:

# Lets assume that there is some user input that asks us to look only into cars that have over 18 miles per gallon (mpg)
# and more than 6 cylinders
mpg <-  18
cyl <- 6
Result <- dbGetQuery(conn, 'SELECT car_names, mpg, cyl FROM cars_data WHERE mpg >= ? AND cyl >= ?', params = c(mpg,cyl))
Result
car_namesmpgcyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Hornet 4 Drive 21.4 6
Hornet Sportabout18.7 8
Valiant 18.1 6
Merc 280 19.2 6
Pontiac Firebird 19.2 8
Ferrari Dino 19.7 6

As you can see the only difference between sending a normal query and a parameterized query lies in the placeholder value in the query (>= ?) and the params argument of dbGetQuery(), which takes a list or a vector with the values you want to assign to the placeholders (in this case we have a vector containing the mpg and cyl variables).

Now, what happens when you want to execute a different query? In the example above, as you can see, I pretty much hand-crafted the query. It just allows the input of mpg and cyl values, and the query will only retrieve the cars with higher or equal mpg and cyl. However, there may be situations where you may want to be more flexible. What happens if a user wants to look at the cars that have the higher or equal horsepower and weight than a specific value as well? With the query above, you would have to go back and rewrite it, but you could write a function that would make that step unnecessary. Let's take a look at one:

# Assemble an example function that takes the SQLite database connection, a base query,
# and the parameters you want to use in the WHERE clause as a list
assembleQuery <- function(conn, base, search_parameters){
    parameter_names <- names(search_parameters)
    partial_queries <- ""
    # Iterate over all the parameters to assemble the query
    for(k in 1:length(parameter_names)){
        filter_k <- paste(parameter_names[k], " >= ? ")
        # If there is more than 1 parameter, add an AND statement before the parameter name and placeholder
        if(k > 1){
            filter_k <- paste("AND ", parameter_names[k], " >= ?")
        }
        partial_queries <- paste(partial_queries, filter_k)
    }
    # Paste all together into a single query using a WHERE statement
    final_paste <- paste(base, " WHERE", partial_queries)
    # Print the assembled query to show how it looks like
    print(final_paste)
    # Run the final query. I unlist the values from the search_parameters list into a vector since it is needed
    # when using various anonymous placeholders (i.e. >= ?)
    values <- unlist(search_parameters, use.names = FALSE)
    result <- dbGetQuery(conn, final_paste, params = values)
    # return the executed query
    return(result)
}

base <- "SELECT car_names, mpg, hp, wt FROM cars_data"
search_parameters <- list("mpg" = 16, "hp" = 150, "wt" = 2.1)
result <- assembleQuery(conn, base, search_parameters)
result
[1] "SELECT car_names, mpg, hp, wt FROM cars_data  WHERE  mpg  >= ?  AND  hp  >= ? AND  wt  >= ?"
car_namesmpghpwt
Hornet Sportabout18.7 175 3.440
Merc 450SE 16.4 180 4.070
Merc 450SL 17.3 180 3.730
Pontiac Firebird 19.2 175 3.845
Ferrari Dino 19.7 175 2.770

The function above may be basic, but it illustrates how it is possible to write R code that generates SQL queries that can be executed on a SQLite database. I encourage you to keep exploring if you are interested in this use case. The function above can be enhanced in several ways starting with the removal of the assumption that the user providing the parameters will want to see the values that are higher or equal than those inputted.

Statements That Do Not Return Tabular Results

Sometimes, you may want to execute SQL queries that do not necessarily return tabular data. Examples of these operations are inserting, updating or deleting table records. For this, we can use the function dbExecute(), which takes as arguments a SQLite database connection and a SQL query. A couple of examples can be seen below:

# Visualize the table before deletion
dbGetQuery(conn, "SELECT * FROM cars_data LIMIT 10")
# Delete the column belonging to the Mazda RX4. You will see a 1 as the output.
dbExecute(conn, "DELETE FROM cars_data WHERE car_names = 'Mazda RX4'")
# Visualize the new table after deletion
dbGetQuery(conn, "SELECT * FROM cars_data LIMIT 10")
mpgcyldisphpdratwtqsecvsamgearcarbcar_names
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280

1

mpgcyldisphpdratwtqsecvsamgearcarbcar_names
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C

Now lets try inserting the data of the Mazda RX4 back:

# Insert the data for the Mazda RX4. This will also ouput a 1
dbExecute(conn, "INSERT INTO cars_data VALUES (21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4,'Mazda RX4')")
# See that we re-introduced the Mazda RX4 succesfully at the end
dbGetQuery(conn, "SELECT * FROM cars_data")

1

mpgcyldisphpdratwtqsecvsamgearcarbcar_names
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SE
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Merc 450SLC
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Cadillac Fleetwood
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Lincoln Continental
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Chrysler Imperial
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat 128
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 Dodge Challenger
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 AMC Javelin
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Camaro Z28
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat X1-9
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche 914-2
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus Europa
15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino
15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati Bora
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 Volvo 142E
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4

As you can see, the last line of code added the Mazda RX4 to the end of the table. Once you are done operating with your SQLite database within R, it is important to call the function dbDisconnect(). This ensures that we release any resources that the database connection has been using, which is always a good practice.

# Close the database connection to CarsDB
dbDisconnect(conn)

Conclusion

In this tutorial, we have covered the essential functions needed to manipulate SQLite databases in R using RSQLite. SQLite databases, if appropriately used, are an incredibly useful tool when combined inside your R scripts. An example of the power of the intersection between R and SQLite can be seen in parameterized queries, which could be used if you need to query a database in order to display information based on user input inside an R Shiny app. Another example use case or parameterized queries could be virtual assistants or chatbots. If you want to learn more about this, I encourage you to look into DataCamp's Building Chatbots with Python course.

Writing SQLite tables by appending data frames, can also be very powerful. As I noted in the Beginners Guide to SQLite tutorial, this capability allowed me to complete a social network analysis project by appending the followers of several Twitter users of interest to a table as I gathered them. I used a SQLite database for this, to save time by avoiding the need to start from scratch in the case of a power outage or Windows update, which could forcefully shut down my computer. If something like that happened, all that I had to do was restart gathering the followers of the user that was next to the last one that I had in the database. To put it into perspective, the computation time to gather the followers of all my Twitter users of interest, took nearly 4 weeks to complete due to the rate limits of the Twitter API. Therefore, starting from zero due to a power outage or Windows update would have been very undesirable.

As always, I encourage you to keep learning about SQLite databases and how to interact with them through R. I strongly encourage you to dive deeper an check the previous tutorials covering their usage with Python and the command line, which can further help you become a full SQLite wiz. Keep learning; the sky is the limit!

Want to leave a comment?