MySQLのインデックスの効果を理解する

CREATE TABLE `fuga` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `category` int(11) NOT NULL,
  `name` varchar(40) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

のようなテーブルを用意して、0〜9999までのランダムなcategoryで適当なnameを含むデータを30万件ほど突っ込んでみる。1つのcategoryあたり30件くらい取れる計算。

普通にwhereでcategoryを指定して取ると結構遅い。はかってみる。

#!/usr/bin/perl
use strict;
use warnings;

use DBI;
use Benchmark 'timethis';

my $dbh = DBI->connect('dbi:mysql:hoge', 'root', '');

my $sth = $dbh->prepare(q{ SELECT name FROM fuga WHERE ( category = ? ) });

my $select = sub {
    $sth->execute(int(rand 10000));
    while (my $row = $sth->fetchrow_hashref) {
        warn $row->{name};
    }
};
timethis(300, $select);

$ benchmark.pl 2> /dev/null
timethis 300: 40 wallclock secs ( 0.11 usr +  0.02 sys =  0.13 CPU) @ 2307.69/s (n=300)

300回繰り返すと40秒くらい。
explainの結果は以下。

mysql> explain SELECT me.id, me.category, me.name FROM fuga me WHERE ( category = 0 );
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | me    | ALL  | NULL          | NULL | NULL    | NULL | 300432 | Using where | 
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+

これを高速化するために"category"にインデックスを貼る。

mysql> ALTER TABLE fuga ADD INDEX `category_idx` (category);
Query OK, 300000 rows affected (6.73 sec)
Records: 300000  Duplicates: 0  Warnings: 0
mysql> explain SELECT me.id, me.category, me.name FROM fuga me WHERE ( category = 0 );
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows | Extra |
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+
|  1 | SIMPLE      | me    | ref  | category_idx  | category_idx | 4       | const |   31 |       | 
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+

インデックスが使われるようになる。もう一度はかり直してみると

$ benchmark.pl 2> /dev/null
timethis 300:  0 wallclock secs ( 0.11 usr +  0.02 sys =  0.13 CPU) @ 2307.69/s (n=300)

300回が一瞬で終わる。

カバリングインデックス

インデックスを使ってselectする場合も、どの列を取得するかによって挙動が変わる。

mysql> explain SELECT id, category FROM fuga WHERE ( category = '1000' );
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+
|  1 | SIMPLE      | fuga  | ref  | category_idx  | category_idx | 4       | const |   33 | Using index | 
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

mysql> explain SELECT id, category, name FROM fuga WHERE ( category = '1000' );
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows | Extra |
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+
|  1 | SIMPLE      | fuga  | ref  | category_idx  | category_idx | 4       | const |   33 |       | 
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------+
1 row in set (0.00 sec)

上記の場合、id, categoryにはそれぞれインデックスがあるので行を読み取る必要がなく、高速に値を返してくれる。nameも取得する場合はインデックスに無いので行の読み取りが必要になる。

#!/usr/bin/perl
use strict;
use warnings;

use DBI;
use Benchmark 'cmpthese';

my $dbh = DBI->connect('dbi:mysql:hoge', 'root', '');

my $sth1 = $dbh->prepare(q{ SELECT *  FROM fuga WHERE ( category = ? ) });
my $sth2 = $dbh->prepare(q{ SELECT id FROM fuga WHERE ( category = ? ) });

my $select1 = sub {
    $sth1->execute(int(rand 10000));
    while (my $row = $sth1->fetchrow_hashref) {
        warn $row->{id};
    }
};
my $select2 = sub {
    $sth2->execute(int(rand 10000));
    while (my $row = $sth2->fetchrow_hashref) {
        warn $row->{id};
    }
};
cmpthese(2000, {
    select1 => $select1,
    select2 => $select2,
});

$ benchmark2.pl 2> /dev/null
          Rate select1 select2
select1 2000/s      --    -36%
select2 3125/s     56%      --

約50%はやくなった。idだけを引く必要がある場合などはこれを意識しておくといいのかもしれない。それほどクリティカルなものではないと思うけど。