Optimizing GROUP BY and DISTINCT
MySQL optimizes these two kinds of queries similarly in many cases, and in fact converts
between them as needed internally during the optimization process. Both types
of queries benefit from indexes, as usual, and that’s the single most important way to
optimize them.
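As a rough illustration of both points, the two statements below ask for the same set of values, once with DISTINCT and once with GROUP BY. This is only a sketch: it assumes the standard Sakila sample schema, in which film_actor has a primary key on (actor_id, film_id), and the plan you see will depend on your server version and data.

```sql
-- Both queries return each actor_id once, so the optimizer can treat them alike.
-- With an index that leads with actor_id, neither should need a temporary
-- table or a filesort.
EXPLAIN SELECT DISTINCT actor_id FROM sakila.film_actor;
EXPLAIN SELECT actor_id FROM sakila.film_actor GROUP BY actor_id;
```

When the index is used, the Extra column of EXPLAIN usually mentions it (for example, "Using index" or "Using index for group-by") rather than "Using temporary" or "Using filesort".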
MySQL has two kinds of GROUP BY strategies when it can’t use an index: it can use a
temporary table or a filesort to perform the grouping. Either one can be more efficient
for any given query. You can force the optimizer to choose one method or the other
with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints, as discussed earlier in
this chapter.
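A minimal sketch of how the two hints might be compared follows. It assumes the Sakila sample database and groups by a pair of columns that no single index covers, so the server cannot group by index alone. Which plan is actually faster depends on your data.

```sql
-- Hint that the grouped result is large: the optimizer then prefers sorting
-- over a temporary table keyed on the GROUP BY columns.
EXPLAIN SELECT SQL_BIG_RESULT first_name, last_name, COUNT(*)
FROM sakila.actor
GROUP BY first_name, last_name;

-- Hint that the grouped result is small: the optimizer then prefers an
-- in-memory temporary table over sorting.
EXPLAIN SELECT SQL_SMALL_RESULT first_name, last_name, COUNT(*)
FROM sakila.actor
GROUP BY first_name, last_name;
```

Comparing the Extra column of the two plans (and timing the queries themselves) is a reasonable way to decide whether either hint is worth keeping.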
If you need to group a join by a value that comes from a lookup table, it’s usually more
efficient to group by the lookup table’s identifier than by the value. For example, the
following query isn’t as efficient as it could be:
mysql> SELECT actor.first_name, actor.last_name, COUNT(*)
    -> FROM sakila.film_actor
    ->    INNER JOIN sakila.actor USING(actor_id)
    -> GROUP BY actor.first_name, actor.last_name;
The query is more efficiently written as follows:
mysql> SELECT actor.first_name, actor.last_name, COUNT(*)
    -> FROM sakila.film_actor
    ->    INNER JOIN sakila.actor USING(actor_id)
    -> GROUP BY film_actor.actor_id;
Grouping by actor.actor_id could be even more efficient than grouping by
film_actor.actor_id. You should test on your specific data to see.
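One way to run that test is to compare the plans for the two variants side by side. The sketch below assumes the Sakila sample database and a server whose SQL_MODE permits selecting nongrouped columns. It only shows how to look, not which variant wins.

```sql
-- Variant 1: group by the identifier in the child table.
EXPLAIN SELECT actor.first_name, actor.last_name, COUNT(*)
FROM sakila.film_actor
   INNER JOIN sakila.actor USING(actor_id)
GROUP BY film_actor.actor_id;

-- Variant 2: group by the identifier in the lookup table.
EXPLAIN SELECT actor.first_name, actor.last_name, COUNT(*)
FROM sakila.film_actor
   INNER JOIN sakila.actor USING(actor_id)
GROUP BY actor.actor_id;
```

Look for differences in join order and for "Using temporary" or "Using filesort" in the Extra column, then confirm with actual timings.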
This query takes advantage of the fact that the actor’s first and last name are dependent
on the actor_id, so it will return the same results, but it’s not always the case that you
can blithely select nongrouped columns and get the same result. You might even have
the server’s SQL_MODE configured to disallow it. You can use MIN() or MAX() to work
around this when you know the values within each group are all the same because they
depend on the grouped-by column, or if you don’t care which value you get:
mysql> SELECT MIN(actor.first_name), MAX(actor.last_name), ...;
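The statement above is abbreviated. One possible full form is sketched below, on the assumption that the rest of the query matches the earlier join against sakila.film_actor. The column aliases are illustrative and do not come from the original.

```sql
-- Each actor_id maps to exactly one first_name and one last_name, so MIN()
-- and MAX() would return the same value here; MIN() is used for both.
SELECT MIN(actor.first_name) AS first_name,
       MIN(actor.last_name)  AS last_name,
       COUNT(*)              AS cnt
FROM sakila.film_actor
   INNER JOIN sakila.actor USING(actor_id)
GROUP BY film_actor.actor_id;
```

Because every selected column is now an aggregate, this form should also be accepted when the server's SQL_MODE disallows nongrouped columns.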
Purists will argue that you’re grouping by the wrong thing, and they’re right. A spurious
MIN() or MAX() is a sign that the query isn’t structured correctly. However, sometimes
your only concern will be making MySQL execute the query as quickly as possible. The
purists will be satisfied with the following way of writing the query:
mysql> SELECT actor.first_name, actor.last_name, c.cnt
    -> FROM sakila.actor
    ->    INNER JOIN (
    ->       SELECT actor_id, COUNT(*) AS cnt
    ->       FROM sakila.film_actor
    ->       GROUP BY actor_id
    ->    ) AS c USING(actor_id);
But the cost of creating and filling the temporary table required for the subquery may
be high compared to the cost of fudging pure relational theory a little bit. Remember,
the temporary table created by the subquery has no indexes. (This is another limitation
that’s fixed in MariaDB, by the way.)
It’s generally a bad idea to select nongrouped columns in a grouped query, because the
results will be nondeterministic and could easily change if you change an index or the
optimizer decides to use a different strategy. Most such queries we see are accidents
(because the server doesn’t complain), or are the result of laziness rather than being
designed that way for optimization purposes. It’s better to be explicit. In fact, we suggest
that you set the server’s SQL_MODE configuration variable to include
ONLY_FULL_GROUP_BY so it produces an error instead of letting you write a bad query.
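A minimal sketch of enabling the mode for the current session follows, assuming the Sakila sample database. The test query is an illustration chosen because film_id is not determined by actor_id, so there is no single correct value for the server to return.

```sql
-- Append ONLY_FULL_GROUP_BY to whatever modes the session already has.
-- NULLIF() avoids a leading comma when the current mode list is empty.
SET SESSION sql_mode =
   CONCAT_WS(',', NULLIF(@@SESSION.sql_mode, ''), 'ONLY_FULL_GROUP_BY');

-- film_id is neither grouped nor aggregated, so the server should now
-- reject this statement instead of returning an arbitrary film_id.
SELECT actor_id, film_id, COUNT(*)
FROM sakila.film_actor
GROUP BY actor_id;
```

The SET SESSION statement affects only the current connection. To apply the mode to every connection, set sql_mode in the server's configuration instead.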
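MySQL automatically orders grouped queries by the columns in the GROUP BY clause,
unless you specify an ORDER BY clause explicitly. If you don’t care about the order and
you see this causing a filesort, you can use ORDER BY NULL to skip the automatic sort.
You can also add an optional DESC or ASC keyword right after the GROUP BY clause to
order the results in the desired direction by the clause’s columns. Both behaviors
belong to older servers: as of MySQL 8.0, grouped results are no longer sorted
implicitly, and the ASC and DESC keywords on GROUP BY were removed in 8.0.13, so on
those versions use an explicit ORDER BY when you need a particular order.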
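The two techniques can be sketched as follows, assuming the Sakila sample database. The second statement relies on a MySQL extension that older servers accept and newer ones may reject, so treat it as version-dependent.

```sql
-- Suppress the implicit sort when the output order does not matter.
-- On servers that sort grouped results, this can remove "Using filesort"
-- from the plan.
EXPLAIN SELECT first_name, last_name, COUNT(*)
FROM sakila.actor
GROUP BY first_name, last_name
ORDER BY NULL;

-- Older-server extension: group and sort in descending order in one clause.
-- Where this syntax is not accepted, write
--   GROUP BY last_name ORDER BY last_name DESC
-- instead.
SELECT last_name, COUNT(*)
FROM sakila.actor
GROUP BY last_name DESC;
```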