Indexes (MySQL)

Uit De Vliegende Brigade
Ga naar: navigatie, zoeken
Wauw: Na toevoeging van een index via alter table dimensions_tmp add index sku(sku); is de executietijd van 3,3 minuten teruggebracht naar 2 seconde! En da's niet de eerste keer dat ik zo'n dramatische verbetering meemaak!

Ik heb dramatische verbeteringen meegemaakt na het toevoegen van indexes aan query's. Trage query? Indexes is een van de eerste dingen om aan te denken!

Inleiding

[1]:

Basically an index on a table works like an index in a book (that's where the name came from):

Let's say you have a book about databases and you want to find some information about, say, storage. Without an index (assuming no other aid, such as a table of contents) you'd have to go through the pages one by one, until you found the topic (that's a full table scan). On the other hand, an index has a list of keywords, so you'd consult the index and see that storage is mentioned on pages 113-120,231 and 354. Then you could flip to those pages directly, without searching (that's a search with an index, somewhat faster).

Of course, how useful the index will be, depends on many things - a few examples, using the simile above:

  • If you had a book on databases and indexed the word "database", you'd see that it's mentioned on pages 1-59,61-290, and 292 to 400. In such case, the index is not much help and it might be faster to go through the pages one by one (in a database, this is "poor selectivity").
  • For a 10-page book, it makes no sense to make an index, as you may end up with a 10-page book prefixed by a 5-page index, which is just silly - just scan the 10 pages and be done with it.
  • The index also needs to be useful - there's generally no point to index e.g. the frequency of the letter "L" per page.

[2]:

There are different kinds of indexes and they're implemented in the storage layer, so there's no standard between them and they also depend on the storage engine that you're using.

InnoDB and the B+Tree index

For InnoDB, the most common index type is the B+Tree based index, that stores the elements in a sorted order. Also, you don't have to access the real table to get the indexed values, which makes your query return way faster.

The "problem" about this index type is that you have to query for the leftmost value to use the index. So, if your index has two columns, say last_name and first_name, the order that you query these fields matters a lot.

So, given the following table:

CREATE TABLE person 
(
    last_name VARCHAR(50) NOT NULL,
    first_name VARCHAR(50) NOT NULL,
    INDEX (last_name, first_name)
);

This query would take advantage of the index:

SELECT 
   last_name, 
   first_name 
FROM person
WHERE 
   last_name = "John" 
   AND 
   first_name LIKE "J%"

But the following one would not

SELECT 
   last_name, 
   first_name 
FROM person 
WHERE first_name = "Constantine"

Because you're querying the first_name column first and it's not the leftmost column in the index.

This last example is even worse:

SELECT 
   last_name, 
   first_name 
FROM person 
WHERE first_name LIKE "%Constantine"

Because now, you're comparing the rightmost part of the rightmost field in the index.

The hash index

This is a different index type that unfortunately, only the memory backend supports. It's lightning fast but only useful for full lookups, which means that you can't use it for operations like >, < or LIKE.

Since it only works for the memory backend, you probably won't use it very often. The main case I can think of right now is the one that you create a temporary table in the memory with a set of results from another select and perform a lot of other selects in this temporary table using hash indexes.

If you have a big VARCHAR field, you can "emulate" the use of a hash index when using a B-Tree, by creating another column and saving a hash of the big value on it. Let's say you're storing a url in a field and the values are quite big. You could also create an integer field called url_hash and use a hash function like CRC32 or any other hash function to hash the url when inserting it. And then, when you need to query for this value, you can do something like this:

SELECT url 
FROM url_table 
WHERE url_hash=CRC32("http://gnu.org");

The problem with the above example is that since the CRC32 function generates a quite small hash, you'll end up with a lot of collisions in the hashed values. If you need exact values, you can fix this problem by doing the following:

SELECT url FROM url_table 
WHERE 
   url_hash=CRC32("http://gnu.org") 
   AND 
url="http://gnu.org";

It's still worth to hash things even if the collision number is high cause you'll only perform the second comparison (the string one) against the repeated hashes.

Unfortunately, using this technique, you still need to hit the table to compare the url field. Wrap up

Some facts that you may consider every time you want to talk about optimization:

  • Integer comparison is way faster than string comparison. It can be illustrated with the example about the emulation of the hash index in InnoDB.
  • Maybe, adding additional steps in a process makes it faster, not slower. It can be illustrated by the fact that you can optimize a SELECT by splitting it into two steps, making the first one store values in a newly created in-memory table, and then execute the heavier queries on this second table.

MySQL has other indexes too, but I think the B+Tree one is the most used ever and the hash one is a good thing to know, but you can find the other ones in the MySQL documentation.

I highly recommend you to read the "High Performance MySQL" book, the answer above was definitely based on its chapter about indexes.

Zo dus

drop table if exists tool_tmp;

create table tool_tmp select * from tool;

alter table tool_tmp
    add index(tool_id),
    add index(tool_kind);
# Ik geloof dat de naam tussen haakjes, de naam van de index is. 
# Ik weet niet waarom je indexes een naam zou geven
################################################################
#
alter table dimensions_tmp add index sku(sku);

Waarschijnlijk een handige vuistregel:

Voeg altijd indexes toe aan tabellen,
op velden die in joins gebruikt worden. 

Verder:

  • Als je dit commando nu opnieuw uitvoert, krijg je een foutmelding
  • TXT- & BLOB-velden kunnen niet geïndexeerd worden
  • PK-velden zijn altijd geïndexeerd
  • Het lijkt erop, dat als je een index aanbrengt in een tabel zonder PK, dat die kolom dan automatisch de PK wordt - Hmm, lijkt me eigenlijk nogal sterk.

Casus: Update-query (april 2017)

Het probleem

Deze query is zo traag, dat ik 'm na 10 minuten heb onderbroken:

update ean_code
inner join ean_code_tmp on ean_code.ean_id = ean_code_tmp.ean_id
set 

    ean_code.sku = ean_code_tmp.sku,
    ean_code.note = ean_code_tmp.note;
  • Tabel ean_code heeft 100.000 regels
  • Tabel ean_code_tmp heeft 13.997 regels

Aanvullende gegevens

  • Ik heb ooit ergens opgepikt dat sommige soorten query's onthutsend traag zijn, omdat de engine daarbij per record van de ene tabel, alle records van de andere tabel moet langslopen (full table scan). Wellicht speelt dat hier een rol
  • Geen indexen gebruikt, terwijl dit waarschijnlijk precies een situatie is, waarin dat helpt om full table scans te voorkomen

De oplossing

-- Indexes aanmaken
-- ================
--
alter table ean_code     add index ean_id (ean_id);
alter table ean_code_tmp add index ean_id (ean_id);

update ean_code
inner join ean_code_tmp on ean_code.ean_id = ean_code_tmp.ean_id
set 

    ean_code.sku = ean_code_tmp.sku,
    ean_code.note = ean_code_tmp.note;

... Executietijd was nu nihil!

Zie ook

Bronnen