Database Indexing 101 (Part 2)

Learn how to optimise your MySQL queries, the easy way.

Let's take an example query to better understand how indexes work and how they can help. We'll use a simple payment table which records a timestamp whenever a user makes a payment in our system.

create table payment (
   payment_time TIMESTAMP NOT NULL,
   amount FLOAT NOT NULL,
   user INT NOT NULL
);
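
If you want to follow along, one way to fill the table with dummy rows is the sketch below (not from the original post; it assumes MySQL 8.0+ for the recursive CTE, and you would repeat the INSERT to reach 10 million rows):

-- Generate one million random payments spread across ~3 years.
SET SESSION cte_max_recursion_depth = 1000000;
INSERT INTO payment (payment_time, amount, user)
SELECT
    FROM_UNIXTIME(UNIX_TIMESTAMP('2018-01-01') + FLOOR(RAND() * 94608000)),
    ROUND(RAND() * 1000, 2),        -- random amount
    FLOOR(1 + RAND() * 100000)      -- random user id
FROM (
    WITH RECURSIVE seq AS (
        SELECT 1 AS n
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 1000000
    )
    SELECT n FROM seq
) AS gen;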

Let's assume this table has 10 million rows spanning the years 2018 to 2020. To calculate the total amount of payments received in a given year, our simple database query would look like:

> select sum(amount) from payment where year(payment_time) = 2018;
+-------------+
| sum(amount) |
+-------------+
|    24042400 |
+-------------+
1 row in set (4.33 sec)

The query takes a total of 4.33 sec, scanning 10 million+ rows to calculate the total amount. To make it run faster, the first thing any developer thinks of is adding an index. As we have only one column in our where clause, let's try adding an index on the payment_time column and see how it performs. We add the index with a simple statement:

ALTER TABLE payment ADD INDEX (payment_time);

Let's run our query again:

> select sum(amount) from payment where year(payment_time) = 2018;
+-------------+
| sum(amount) |
+-------------+
|    24042400 |
+-------------+
1 row in set (4.35 sec)

Hmm, 4.35 seconds: no change at all. What could be the reason? Time to pull out our Swiss Army knife, EXPLAIN. Almost all database software provides this tool so that users can better understand query execution paths and take steps towards optimisation. Let's run EXPLAIN on our query and see the output.

> EXPLAIN select sum(amount) from payment where year(payment_time) = 2018 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 12453594
        Extra: Using where

As we discussed about the type column in our earlier blog, the value ALL has to be avoided at all costs: it means that even with the added index, our query is still doing a complete table scan. What's more surprising is that possible_keys and key are both NULL; the database is not even considering the index we added. Let's take a step back and think in terms of the telephone directory mental model we built. Is it easy to find the telephone numbers of everyone with the first name Foo when the directory is sorted by last name? Of course not; we would have to scan the entire directory. Similarly, the index stores whole timestamps, so the database knows nothing about values derived from them by a function like year(), and our index is not even considered. But we have learnt a thing or two about range scans, and a range scan beats a full table scan, so let's change our query to use a range of dates instead of a function and see if performance improves.
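
To make the rewrite explicit, here are both forms side by side (the half-open upper bound below is a slight variation for illustration; the queries in the rest of this post use an inclusive one):

-- Not index-friendly: the function hides the indexed column.
SELECT SUM(amount) FROM payment WHERE YEAR(payment_time) = 2018;

-- Index-friendly: the bare column is compared against constant bounds.
SELECT SUM(amount) FROM payment
WHERE payment_time >= '2018-01-01 00:00:00'
  AND payment_time <  '2019-01-01 00:00:00';

But first, let's run EXPLAIN on the range version to see the planned execution path.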

> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: ALL
possible_keys: payment_time
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 12453594
        Extra: Using where
1 row in set, 2 warnings (0.00 sec)

So this time the database did identify that the payment_time index could possibly be used, but it is not actually using it, which is evident from the key column being NULL and from the type column, which still denotes a complete table scan. We can also see in the above output that the rows value is close to 12 million, again indicating a full table scan. Let's force the index and see if that improves performance or helps reduce the total number of rows scanned.

> select sum(amount) from payment force index (payment_time) where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' \G
*************************** 1. row ***************************
sum(amount): 24042400
1 row in set, 2 warnings (1 min 36.71 sec)

Well, this is weird: using the index took far more time (over a minute and a half versus about 4.5 seconds without it). If we read the MySQL documentation, this behaviour is actually documented: when the number of rows to be scanned via an index exceeds roughly 40% of the table, the optimiser uses a full table scan instead of a range scan. The reason is that MySQL is optimised to fetch table rows in bulk when it knows in advance it has to scan the whole table, whereas with a range scan it first queries the index and then fetches each matching row from disk one by one, which is expensive.
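
If you are curious about the optimiser's arithmetic here, the JSON EXPLAIN format exposes its cost estimates; this is just a way to poke at the ~40% behaviour yourself, and the exact numbers will differ on your data:

-- Compare read_cost and rows_examined_per_scan across access paths:
EXPLAIN FORMAT=JSON
SELECT SUM(amount) FROM payment
WHERE payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59';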

So now we are back to square one: our index results in scanning more rows, not fewer. Let's re-look at the query and see what we can actually do to reduce the row scan. We are using the columns amount and payment_time, but our index covers only payment_time. Imagine we had an aggregate table which sums up the amount for each date, month, or year: a query like the one above would then run extremely fast, as the number of rows to scan would drop drastically. The rule of thumb is to always try to keep the row scan below that 40% threshold.
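
As a rough sketch of that roll-up idea (payment_daily is a hypothetical table, not something the rest of this post relies on):

-- One row per day instead of thousands of raw payments,
-- so a yearly total scans at most ~366 rows.
CREATE TABLE payment_daily (
    pay_date DATE PRIMARY KEY,
    total_amount DOUBLE NOT NULL
);

-- Refresh periodically (or maintain it from the application):
INSERT INTO payment_daily (pay_date, total_amount)
SELECT DATE(payment_time), SUM(amount)
FROM payment
GROUP BY DATE(payment_time)
ON DUPLICATE KEY UPDATE total_amount = VALUES(total_amount);

A roll-up table is a bigger change than we need here, though. Let's first see how far a better index gets us: we will drop our existing index on payment_time and create a composite index on payment_time and amount together.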

> alter table payment drop index payment_time;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
> create index pay_amount on payment(payment_time, amount);
Query OK, 0 rows affected (27.94 sec)
Records: 0  Duplicates: 0  Warnings: 0

Now let's run the same query again with EXPLAIN and see whether it picks up our new index on its own.

> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: range
possible_keys: pay_amount
          key: pay_amount
      key_len: 4
          ref: NULL
         rows: 6226797
        Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)

Now we see that it's using the index automatically, without any force. This time let's focus on the Extra column: it has the value Using index, which means all the data required by the query is available in the index itself, so MySQL can answer it from the index pages (typically cached in memory) without fetching any table rows from disk. A covering index like this is one of the most powerful tools a database has: not only the columns in the where clause but also the columns in the select clause are included in the index. Of course this comes with a tradeoff: it is an exact index, created by the developer specifically to optimise this one query. But we all know that requirements keep changing, so let's consider a case where we have to sum the amount in a given year for a given user id. Our query would look like:

> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' and user=7 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: ALL
possible_keys: pay_amount
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 12453594
        Extra: Using where
1 row in set, 3 warnings (0.00 sec)

Now we are back to the same problem we started with: a complete table scan. But not to worry; let's reuse our previous solution. We will drop the old index and add a new one which also includes the user column.

> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
> create index pay_amount on payment(payment_time, amount, user);
Query OK, 0 rows affected (25.72 sec)
Records: 0  Duplicates: 0  Warnings: 0

Let's look at our EXPLAIN output now,

> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' and user=7 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: range
possible_keys: pay_amount
          key: pay_amount
      key_len: 4
          ref: NULL
         rows: 6226797
        Extra: Using where; Using index
1 row in set, 2 warnings (0.01 sec)

So we are back to using the index, but it is still scanning 6 million rows. That doesn't look right: one user can't possibly have paid 6 million times and be responsible for half of our revenue. Something looks fishy. And surprisingly, the number of rows to scan is exactly the same as when the index covered only payment_time and amount. Let's dive a little deeper into how our index looks. We have an index on (payment_time, amount, user), which means the payment_time column is stored in sorted order first; then amount is sorted within each payment_time value, meaning all amounts are sorted for a given timestamp before the next timestamp starts; and finally user is sorted within each amount. Now the thing with multi-column indexes is that we cannot skip columns: they are usable strictly from left to right, so for the whole index to be used, every column up to the last one we filter on must appear in the where clause. In our query the amount column is missing from the where clause, so the index narrows only on payment_time and then scans that entire range.
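
To make that concrete, here is a toy slice of how entries in the (payment_time, amount, user) index are ordered (values invented for illustration), plus a query you can EXPLAIN to confirm that a non-prefix column cannot be used to seek:

-- Index entries are sorted left to right:
--   ('2018-01-01 09:14:00', 120.00, 42)
--   ('2018-01-01 09:14:00', 250.00,  7)
--   ('2018-01-01 10:02:00',  80.00, 13)
-- user is ordered only *within* a (payment_time, amount) pair, so a
-- predicate that skips amount cannot seek on user; at best the whole
-- index gets scanned.
EXPLAIN SELECT SUM(amount) FROM payment WHERE user = 7 \G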

Moral of the story: the order of columns in an index matters. An index on (payment_time, amount) is not the same as an index on (amount, payment_time).

So in order to fix the above query, all we have to do is change the order of the columns in the index and we are sorted, right? Let's try it out and run our EXPLAIN query.

> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
> create index pay_amount on payment(payment_time, user, amount);
Query OK, 0 rows affected (27.45 sec)
Records: 0  Duplicates: 0  Warnings: 0
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' and user=7 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: range
possible_keys: pay_amount
          key: pay_amount
      key_len: 8
          ref: NULL
         rows: 6226797
        Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)

Hmm, still nothing changed; it is still scanning 6 million rows. That brings us to the last catch of this blog: inequality operators. Whenever a column is matched with an inequality (range) operator, index usage stops at that column; the columns after it cannot narrow the search further, so the database ends up scanning the whole matching range of the index. If that's the problem, the fix is simple: let's put user, which we match with equality, as the first column in the index. Let's try that out.

> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0
> create index pay_amount on payment(user, payment_time, amount);
Query OK, 0 rows affected (28.49 sec)
Records: 0  Duplicates: 0  Warnings: 0
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59' and user=7 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: payment
         type: range
possible_keys: pay_amount
          key: pay_amount
      key_len: 8
          ref: NULL
         rows: 10
        Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)

At last, our query is optimised and performing at its best. This was just a gentle introduction to query optimisation; we hope to uncover more catches like these in the upcoming posts of this series.