Tutorials
sql

Introduction to Indexing in SQL

In this tutorial, learn about indexing in databases and different types of indexing techniques.

Being a data scientist, you will often have to deal with loads of data. Dealing with data (which is present in massive quantities) is not at all easy. So, to deal with it as efficiently as possible you need to have a clear understanding of how the data is organized physically so that you can process it with handy techniques.

SQL is a must-have skill for any modern software engineer because most of the software's depend on some kind of data and integrates well with an RDBMS (Relational Database Management System). Be it a web application, be it an API or be it an in-house application, RDBMS is always there. And SQL is the language for querying an RDBMS.

As a data scientist, it is vital to know SQL and its related techniques. For being able to query an RDBMS and get answers to specific questions that you will have about the data you are dealing with, SQL is the minimum need.

In his latest video with DataCamp, David Robinson(Chief Data Scientist @ DataCamp) showed us how he uses SQL in a Data Science problem. Please check it out. His workflow is very interesting.

Today you are going to learn about a technique called Indexing that primarily concerns organization of data inside a database, and you are going to implement some of them using SQL. This will give you an overview of how indexing can be used to store information inside a database and how it can result in faster execution times.

Following is going to be the outline of this tutorial:

  • A brief note on the organization of records in a file
  • Why indexing is required?
  • Structure of index files
  • Classification of indexing
  • Anatomy of each of the types
  • Examples in SQL

N.B.: Before you start reading this tutorial, it is highly recommended that you learn the basics of SQL if you are not familiar with them. DataCamp's Intro to SQL for Data Science course is an excellent resource in case you want to refresh your SQL basics.

A brief note on the organization of records in a file:

Before you start studying about anything on indexing, it is imperative that you understand how data is organized physically inside of files. Obviously, you are not going to go into all the details but having a good overview of the organization will help you in understanding indexing very clearly. Let's get started.

Organization of records/data in general deals with how the records are stored and broadly the organizations can be divided into two types:

  • Ordered organization: All the records in a file are ordered in some search key value. Binary search is generally incorporated to search for any amount residing inside the ordered file. This makes the search very efficient as the search time complexity remains within logarithmic time. But insertion to the file becomes an expensive operation because you might have to reorganize the entire file to facilitate the new item.

  • Unordered organization: In this type, records are inserted in the file wherever the place is available, usually to the very end of the file. As linear search is used for this type, the search is not as efficient as the previous variant but insertion is not that of an expensive operation.

Although you have an ordered organization of files what if the block sizes of the files and size of each record inside the files are very large? Let's explore this in the next section.

Need for indexing:

You will start this section by learning about the motivation behind using indexing in order to store files/information efficiently and how it enhances the other operations related to these files/information.

Consider you have a table (relation) consisting several records in a database. The records are further divided into 1000 blocks. Pictorially, the organization looks like the following:

From the above organization, it is clear that -

  • The order is applied to the first column of records (consider that this is a relational database system inside which data is organized in a tabular manner)
  • The number of blocks in which the total data is divided - 1000 blocks.

Remember that, each record contains other columns as well but for this organization, the order is applied to the first column and the blocks are accordingly divided. Now, if you do a binary search in order to search for anything inside this organization or records, the total time it will take is - $\log_{2}1000$ which is equal to 10 units of time. Can this search time be improved?

Of course, yes.

Consider this with respect to reading a book and not having the index page. You open a random page (or in case of binary search you open the middle page of the book) and turn the pages left and right accordingly to go to the page that you want. This certainly takes time. Doesn't it?

But if the book had contained the index page, this would certainly make the search even more efficient. You could have gone to the page just by looking at the index. You can apply indexing in the same way to the above situation.

Here is a summary of how you can apply indexing here:

  • Maintain a block pointer for each block along with the ordered values used for the previous organization.

This will result in lesser number of blocks used to store the data files. Now, in order to search for a particular record, you will just have to search through this new organization consisting of a lesser number of blocks and eventually you will get your record (if at all it is present) in much less time. Let's visualize this thing. Considering the number of newly indexed records resulted in 8 blocks, you will get an organization like the following:

The search time is going to be drastically reduced for the lesser number of blocks. Suppose, you want to search the record 90. In this new indexed scheme, you will first have to locate its respective block pointer which in this case is 3 and then using this information you will find the original data records just in one go.

So, the search time, in this case, is going to be $\log_{2}8 + 1$ = 4 units of time which are significantly less as compared to the earlier one.

The above example introduces you to the need of performing indexing. Here are some points about indexing worth considering:

  • Ordering on the original data records can be done using only one field. And then the field entries can be indexed accordingly. Only if you are trying to search using that field, you will get an improved search time. (Extremely important)
  • The indices are also ordered.
  • Index record contains two fields (structure of the index file)-
    • The key of the original file
    • Pointer to the block where the key is available in the original data records
  • Binary search is used to search through the indices.
  • To access a record using the indexed entries, the average number of block accesses required-
    $\log_{2}B_i + 1$ , where $B_i$ is the number of blocks in the indexed records
  • Index can be created on any field of the relation (primary key, candidate keys, non-keys).

Now, based on the ordering of the original data records and how many records you are going to maintain in the indexed file there can be many indexing schemes. In the next section, you will study about the most popular ones.

Different indexing strategies:

  • Dense Indexing: If an index entry is created for every search key value, then it is dense indexing. Consider the following diagram for understanding this visually.

  • Sparse Indexing: If an index entry is created only for some records, then it is sparse indexing. Here's a diagram describing this pictorially.

The diagrams above make both the indexing schemes quite easy to understand. One crucial point you have to make here is that the above example of sparse indexing is a combination of both dense and sparse indexing. That is because for every unique search key values (1,2 and 3) there is an index, but for every data record, there is not an index.

Now you will study the other types of indexing schemes based on the level of records. In single-level indexing, the number of the index file is only one. But, sometimes the size of the index file becomes so large that the index file itself gets indexed. In that case, it called multi-level indexing. Let's delve further.

Here's an overview of the further indexing strategies made based on leveling.

Single-level indexing:

  • Primary indexing
  • Clustered indexing
  • Secondary indexing

Multi-level indexing:

  • B Tree
  • B+ Tree

You will study all of the indexing strategies belonging to single-level indexing. To the end of this tutorial, you will have a link to explore the multi-level indexing schemes in case if you are interested. Let's examine Primary Indexing now.

Primary indexing:

A primary index is an ordered file whose records are of fixed length with two fields:

  • The first field is the same as the primary key of data file.
  • The second field is a pointer to the data block where the primary key is available. - Source

Index created for the first record of each block is called block anchors. In primary indexing, the number of index entries = the number of original data blocks. The average number of block accesses using primary index is:

$\log_{2}B_i + 1$ , where $B_i$ is the number of blocks in the indexed records.

Refer to the following diagram in order to understand this more clearly:

The rightmost figure refers to the original data records divided into several blocks. Note that column containing the numbers 1,2,3,...,9 are the primary keys. The leftmost figure refers to the indexed entries where each entry consists of:

  • The first entry of each data block (block anchor)
  • The second entry denotes the block pointer.

Take a moment and think what of kind of indexing this is (Sparse or Dense). Use the comments section for posting your answers.

You will now study Clustering Indexing.

Clustering indexing:

A Clustering index is created on data file whose file records are physically ordered on a non-key field which does not have a distinct value for each record. This field is known as clustering field based on which the indexing is performed. Hence the name - clustering index.

Diagrams always make it easier to understand. As you can see the original data records are ordered on a non-key attribute and for each distinct value of the non-key attribute, an index entry is created. The number of average block accesses needs to locate a particular record using this scheme is $\geq$ $\log_{2}B_i + 1$ , where $B_i$ is the number of blocks in the indexed records. Notice the $\geq$ sign here.

In the clustering index, after locating block to which a particular key is present, you may traverse other blocks as well (which is quite evident from the above figure).

You might want to think what kind of indexing this is (Sparse or Dense). Use the comments section for posting your answers. Let's study Secondary Indexing now.

Secondary indexing:

Suppose you have a tabled called Employee in your database. The table has the following attributes:

  • employee_id
  • employee_name
  • employee_department
  • employee_salary

employee_id is its primary key. Now, you have already created primary indexing on this table based on employee_id. But while developing an application, you found out that most of the database queries are using the attribute employee_name. In that case, the primary indexing will not help much, and it will be a good practice to maintain separate indexing for all values belonging to employee_name. In any way, the employee names are not going to be present in an ordered way inside the database. So, indexing them will undoubtedly speed up the queries involving employee names.

This is a classic example of secondary indexing. Let's try to construct a suitable image for secondary indexing as well:

Please feel free to post a better representation in the comments section if you got one.

You will now see how you can create indices in PostgreSQL. Take DataCamp's Joining Data in PostgreSQL course if you want to get its basics down.

Creating indices in PostgreSQL:

Before you begin to create indices in a PostgreSQL database, you will need some data present in a table. Let's create a simple table named Student consisting of the following records:

  • student_id
  • student_name
  • student_year

You will make student_id as the primary key, and you will not allow the student names and student years to be null.

Following is going to be the query for this:

CREATE TABLE STUDENT(
   student_id TEXT PRIMARY KEY,
   student_name  TEXT NOT NULL,
   student_year  TEXT NOT NULL
);

After the table is successfully created, you will have to insert some suitable data into the table. For convenience, you can use this .csv file and import it into Student. You can import a compatible .csv file into a PostgreSQL table using the following query:

COPY STUDENT FROM '/path/to/csv/Student.csv' WITH (FORMAT csv);

The above query assumes that the name of the .csv file from which data is being copied is Student.

Let's take a look at the data now. Running select * from STUDENT; returns the following records:

The select query should return a total of 86 records. Now, you are in a position to create indices. You can create single-column indices using the following syntax:

CREATE INDEX index_name
ON table_name (column_name);

Let's create an index on the student_id field (primary indexing).

CREATE INDEX id_index
ON STUDENT (student_id);

Multi-column indices can also be created:

CREATE INDEX id_index
ON STUDENT (student_id,student_name);

You can drop an index using the following syntax:

DROP INDEX index_name;

It is hard to get a feeling of what indexing is doing in a small table like STUDENT. But if the table were large (consider a database table containing similar records of a big university), indexing would undoubtedly play a vital role.

You did it!

Congratulations for making it till the end. In this tutorial, you learned about indexing, its needs, different indexing schemes. You also saw how you can perform simple indexing in PostgreSQL.

But you did not study about multi-level indexing in this tutorial. The following are some excellent resources if you are interested in exploring them:

But does it always help to use indexing? Well, there are some cases, where using indexing is not advisable.

Source

I hope the tutorial helped you to get clear on the basics of indexing. Let me know about your interesting findings in the comments section.

Want to leave a comment?