Database Indexing 101 (Part 2)
Learn how to optimise your MySQL queries, the easy way.
Let's take an example query to better understand how indexes work and how they can help. We'll use a simple payment table which saves a timestamp whenever a user makes a payment in our system.
create table payment (
payment_time TIMESTAMP NOT NULL,
amount FLOAT NOT NULL,
user INT NOT NULL
);
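If you want to experiment along without a full MySQL setup, a scaled-down copy of this table can be populated in a few lines. This is just a sketch using Python's built-in sqlite3 module, with 100,000 random rows instead of 10 million; the amounts and user ids are made up for illustration:

```python
import random
import sqlite3
from datetime import datetime, timedelta

# In-memory stand-in for the payment table (SQLite stores the
# timestamp as text; the schema otherwise mirrors the one above).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment ("
    " payment_time TIMESTAMP NOT NULL,"
    " amount FLOAT NOT NULL,"
    " user INT NOT NULL)"
)

random.seed(42)
start = datetime(2018, 1, 1)
span = 3 * 365 * 24 * 3600  # three years, in seconds
rows = [
    (
        (start + timedelta(seconds=random.randrange(span))).strftime("%Y-%m-%d %H:%M:%S"),
        round(random.uniform(1, 500), 2),  # made-up payment amount
        random.randrange(1, 1000),         # made-up user id
    )
    for _ in range(100_000)
]
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", rows)

(count,) = conn.execute("SELECT COUNT(*) FROM payment").fetchone()
print(count)  # 100000
```

The timings below come from the real 10-million-row MySQL table, but the query plans in this post can all be reproduced against a miniature table like this one.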
Let's assume this table has 10 million rows spanning the years 2018 to 2020. To calculate the total amount of payments received in a given year, our simple database query would look like:
> select sum(amount) from payment where year(payment_time) = 2018;
+-------------+
| sum(amount) |
+-------------+
|    24042400 |
+-------------+
1 row in set (4.33 sec)
The query takes a total of 4.33 seconds to scan 10 million+ rows and calculate the total. To make it run fast, the first thing any developer thinks of is adding an index. As we have only one column in our WHERE clause, let's try adding an index on the payment_time column and see how it performs. We add the index with a simple statement:
ALTER TABLE payment ADD INDEX (payment_time);
Let's run our query again:
> select sum(amount) from payment where year(payment_time) = 2018;
+-------------+
| sum(amount) |
+-------------+
|    24042400 |
+-------------+
1 row in set (4.35 sec)
Hmm, 4.35 seconds. No change at all. What could be the reason? Time to pull out our Swiss Army knife, EXPLAIN. Almost all database software provides this tool so users can better understand query paths and take steps towards optimisation. Let's run EXPLAIN on our query and look at the output.
> EXPLAIN select sum(amount) from payment where year(payment_time) = 2018 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 12453594
Extra: Using where
As we discussed about the type column in the earlier post, the value ALL has to be avoided at all costs: it means that even with the added index our query is still doing a complete table scan. What's more surprising is that possible_keys and key are both NULL; the database is not even considering the index we added.
Let's take a step back and think in terms of the telephone-directory mental model we built earlier. Is it easy to find all the telephone numbers for the first name Foo when the directory is sorted by last name? Of course not; we would need to scan the entire directory. Similarly, when the whole timestamp is indexed, the database has no knowledge of any value generated by applying a function to it, so year(payment_time) cannot use the index.
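This behaviour is easy to reproduce in miniature. The sketch below uses Python's sqlite3 module rather than MySQL (and COUNT(*) for simplicity), but the rule is the same: a function wrapped around an indexed column hides it from the optimizer, while a plain range predicate on the same column can use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_time TEXT NOT NULL, amount REAL NOT NULL)")
conn.execute("CREATE INDEX idx_time ON payment (payment_time)")

def plan(sql):
    # The last field of an EXPLAIN QUERY PLAN row describes the access
    # path: 'SCAN' means a full scan, 'SEARCH' means the index narrows it.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Function on the indexed column: the planner falls back to a full scan.
print(plan("SELECT COUNT(*) FROM payment "
           "WHERE strftime('%Y', payment_time) = '2018'"))

# Range on the bare column: the planner can seek into the index.
print(plan("SELECT COUNT(*) FROM payment "
           "WHERE payment_time >= '2018-01-01' AND payment_time < '2019-01-01'"))
```

The exact wording of the plan differs between SQLite versions, but the first query reports a SCAN and the second a SEARCH, which is exactly the distinction MySQL's type column draws between ALL and range.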
So our index is not even being considered. But wait a minute: we have learnt a thing or two about range scans, and a range scan is better than an ALL table scan. Let's change our query to use a range of dates instead of a function and see if we get any performance improvement. But first, let's run EXPLAIN on the new query to see the planned execution path.
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: ALL
possible_keys: payment_time
key: NULL
key_len: NULL
ref: NULL
rows: 12453594
Extra: Using where
1 row in set, 2 warnings (0.00 sec)
So this time the optimizer did identify that the payment_time index could possibly be used, but it's not actually using it, which is evident from the key column being NULL and the type column denoting a complete table scan. We can also see that the rows value is close to 12 million, which confirms the full table scan. Let's force the index and see if that improves performance or helps reduce the total number of rows scanned.
> select sum(amount) from payment force index (payment_time) where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' \G
*************************** 1. row ***************************
sum(amount): 24042400
1 row in set, 2 warnings (1 min 36.71 sec)
Well, this is weird: using the index took far more time (over a minute and a half versus about 4 seconds without it). The MySQL documentation describes this behaviour: roughly speaking, when the number of rows to be scanned via the index exceeds a large fraction of the table (around 40%), the optimizer prefers a table scan over a range scan. The reason is that MySQL is optimised to fetch table rows in bulk when it knows in advance it has to scan the whole table, whereas with a range scan it first queries the index and then fetches each matching row from disk one by one, which is expensive.
So now we are back to square one: in trying to optimise this query, our index ends up scanning more rows than a plain table scan. Let's look at the query again and see what we can actually do to reduce the number of rows scanned.
We are using the columns amount and payment_time, but our index is only on payment_time. Imagine we had an aggregate table which sums up the amount for each date, month, or year: a query like the one above would then run extremely fast, because the number of rows to scan would drop drastically. The rule of thumb is to always try to keep the rows scanned well below that threshold.
Now let's add a combined index on (payment_time, amount) and drop our existing index on payment_time.
> alter table payment drop index payment_time;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
> create index pay_amount on payment(payment_time, amount);
Query OK, 0 rows affected (27.94 sec)
Records: 0 Duplicates: 0 Warnings: 0
Now let's run the same query again with EXPLAIN and see whether it uses our index.
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: range
possible_keys: pay_amount
key: pay_amount
key_len: 4
ref: NULL
rows: 6226797
Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)
Now we see that it's using the index automatically, without any FORCE. This time let's focus on the Extra column: it has the value Using index, which means the entire data required by the query is available in the index itself. MySQL can serve the query from the index (which it tries to keep in memory), so it does not need to touch the table rows on disk at all. A covering index like this is one of the most powerful tools a database offers: not only the columns in your WHERE clause but also the columns in your SELECT clause are included in the index. This of course comes with a trade-off: it is an exact index, created specifically to optimise this one query. But we all know requirements keep changing, so let's consider a case where we have to sum the amounts in a given year for a given user id. Our query would look like:
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' and user=7 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: ALL
possible_keys: pay_amount
key: NULL
key_len: NULL
ref: NULL
rows: 12453594
Extra: Using where
1 row in set, 3 warnings (0.00 sec)
Now we are back to the same problem we started with: a complete table scan. Not to worry; let's reuse our previous solution. We will drop our old index and add a new one which also includes the user column.
> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
> create index pay_amount on payment(payment_time, amount, user);
Query OK, 0 rows affected (25.72 sec)
Records: 0 Duplicates: 0 Warnings: 0
Let's look at our EXPLAIN output now:
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' and user=7 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: range
possible_keys: pay_amount
key: pay_amount
key_len: 4
ref: NULL
rows: 6226797
Extra: Using where; Using index
1 row in set, 2 warnings (0.01 sec)
So we are back to Using index, but our index is still scanning 6 million rows. That doesn't look right: one user can't possibly have paid 6 million times and be responsible for half of our revenue. Something looks fishy. And surprisingly, the number of rows to scan is exactly the same as when we only had the index on payment_time and amount. Let's dive a little deeper into how our index is laid out. We have an index on (payment_time, amount, user), which means payment_time is stored in sorted order first; then amount is sorted within each payment_time value; and finally user is sorted within each amount. A multi-column index can only be used up to a leftmost prefix of its columns: since amount is not present in our WHERE clause, the index can match on payment_time alone, and then it has to scan that entire range to filter user = 7.
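We can watch this leftmost-prefix rule in miniature with SQLite's planner (a sketch; SQLite's optimizer differs from MySQL's in many ways, but it applies the same prefix rule to multi-column indexes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_time TEXT, amount REAL, user INT)")
conn.execute("CREATE INDEX pay_amount ON payment (payment_time, amount, user)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM payment "
    "WHERE payment_time >= '2018-01-01' AND payment_time <= '2018-12-31' "
    "AND user = 7"
).fetchone()[3]

# Only the payment_time range shows up as a seek condition; user = 7 has
# to be checked row by row, because amount, the middle column, is skipped.
print(detail)
```

The plan's seek conditions mention payment_time but not user: the index narrows the search only as far as its leftmost usable prefix, and everything after the gap is filtered the slow way.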
The moral of the story is that the order of columns in an index matters: an index on (payment_time, amount) is not the same as an index on (amount, payment_time).
So in order to fix the above query, all we have to do is change the order of the index columns and we are sorted, right? Let's try it out and run EXPLAIN again:
> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
> create index pay_amount on payment(payment_time, user, amount);
Query OK, 0 rows affected (27.45 sec)
Records: 0 Duplicates: 0 Warnings: 0
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' and user=7 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: range
possible_keys: pay_amount
key: pay_amount
key_len: 8
ref: NULL
rows: 6226797
Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)
Hmm, still nothing changed; it is still scanning 6 million rows. That brings us to the last catch of this post: inequality operators. Whenever an index column is compared with an inequality (a range), the index stops right there; the columns after it cannot be used to narrow the search further, so the whole range has to be scanned. If that's the problem, the fix is simple: let's put user as the first column of the index, so that the equality narrows the search first. Let's try that out.
> alter table payment drop index pay_amount;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
> create index pay_amount on payment(user, payment_time, amount);
Query OK, 0 rows affected (28.49 sec)
Records: 0 Duplicates: 0 Warnings: 0
> EXPLAIN select sum(amount) from payment where payment_time >= '2018-01-01 00:00' AND payment_time <= '2018-12-31 23:59:59' and user=7 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: payment
type: range
possible_keys: pay_amount
key: pay_amount
key_len: 8
ref: NULL
rows: 10
Extra: Using where; Using index
1 row in set, 2 warnings (0.00 sec)
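To see why the row count collapsed from roughly 6 million to 10, we can simulate the two index layouts with plain sorted lists standing in for B-trees (a toy model with made-up data: 10,000 entries and user ids 1 to 99):

```python
import bisect
import random

random.seed(1)
# (payment_time, user) pairs, all within 2018, standing in for index entries.
entries = [(f"2018-{random.randrange(1, 13):02d}-01", random.randrange(1, 100))
           for _ in range(10_000)]

# Layout A: (payment_time, user). The range on payment_time selects a slice,
# but every entry in that slice must be examined to filter user = 7.
idx_a = sorted(entries)
lo = bisect.bisect_left(idx_a, ("2018-01-01",))
hi = bisect.bisect_right(idx_a, ("2018-12-31", 10**9))
scanned_a = hi - lo  # the whole year: all 10,000 entries

# Layout B: (user, payment_time). The equality on user narrows the search
# first, so the binary search lands directly on user 7's slice.
idx_b = sorted((u, t) for t, u in entries)
lo = bisect.bisect_left(idx_b, (7, ""))
hi = bisect.bisect_right(idx_b, (7, "9999"))
scanned_b = hi - lo  # only user 7's payments, roughly 1/99 of the entries

print(scanned_a, scanned_b)
```

The exact numbers differ from the real table, but the shape matches what EXPLAIN reported: the time-first index had to walk the whole date range, while the user-first index jumps straight to the handful of rows that matter.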
At last, our query is optimised and performing at its best. This was just a gentle introduction to query optimisation; we hope to uncover more catches like these in upcoming posts in this series.