Database Indexing - 101

Database Indexing - 101

Learn how database indexes works internally.

What is a Database Index?

Quoting wikipedia : "A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed."

Why use a Database Index?

Being a fast paced industry, we must avoid latency at all cost. Running slow queries may cause bad experience for users, especially as they do not like to wait endlessly for data. A survey conducted by Amazon found out that with every 100ms increase in latency, their sales decreases by 1%. The stats for other tech companies are not very different. Thus it is important to reduce as much time as possible to return a response to the user.

How is a Database Index implemented ?

Now it is clear that the time taken by an application to respond to the user has to be reduced, let's now discuss how indexes are used to reduce data access time. Indexing is not just throwing an index at every column of our where clause, hoping that one of the indexes will be used by our query. It is a lot more nuanced. Having a bad index can actually make performance worse.

First, we need to understand how indexes are implemented and how they work, and what is the best way to understand this other than taking the classic example of telephone directory. A telephone directory usually contains a list of first names, last names and phone numbers, always ordered by the last names.

This means if someone knows your last name "foo", it would be very easy for them to find you in a telephone directory, but if the same person knows only your first name, this task would be more difficult. They would have to look through each and every page of the directory and try all the numbers that come with the first name in order to find your last name.

Indexes can be seen the same way: a telephone directory, where the index is on the last name.

Let us now step into the digital world to briefly discuss what indexes are made of. Now comes the obnoxious and deadly B-trees, where B stands for Balanced. But how do we create balanced trees? Well, we simply keep balancing it.

Example

Say we have a list of numbers and we want to store them and take a look at them up in order.

1.png The easiest way to do this would be to keep a sorted array.

2.png

Now if we have to find 6, all we have to do is apply a binary search and any element in the array can be found in log(N) time can be found, where N is the size of the array. Though this is a good data structure to find an element, it is not so good to mutate. There is always a chance of having to move all other elements to insert or delete a number. As databases use disk to save data, there would be too many disk operations and since disk I/O is expensive, binary trees won't be of much help.

What if we divide the above array into multiple small sorted arrays of a fixed size? Let us take 5 as the size:

3.png Now if we need to insert 11, we would get:

4.png But, what if the block into which we need to insert is already full?

Then, we divide the existing block into two and thus create more space. Now, inserting 5 would be like:

5.png It is also possible to keep a percentage of how much space a block has to create. For eg: if a block has to be 50% empty, the division can be done in such a way that every new block created will have at least 50% empty space. Of course, we can also merge two blocks on deletion if they are too small. When too many blocks are generated, we can have another set of blocks which carry the entries in the blocks we created earlier, and what data each has. This is called a tree structure. The layer of blocks containing actual data is the same; the layers above it store pointers to the next layer.

All blocks are doubly linked to each other so that they are aware of what comes before and after them. This would make performing a range scan much faster.

We could go more detailed into it, but this is a good bird's eye view about what data structure index uses and how insert and delete operations on indexes are handled.