http://code.openark.org/blog/mysql/sql-selecting-top-n-records-per-group
SQL: selecting top N records per group
January 6, 2011
A while back I presented(*) an SQL trick to present with non-aggregated column on a GROUP BY query, without use of subquery or derived tables.
Based on a similar concept, combined with string walking, I now present a query which selects top-n records for each group, ordered by some condition. It will require no subqueries. It executes faster than its more conventional alternatives.
[UPDATE: this is MySQL only. Others can use Window Functions where available]
Using the simple world database, we answer the following question:
What are the top 5 largest (by area) countries for each continent? What are their names, surface area and population?
Similar questions would be:
What were the latest 5 films rented by each customer?
What were the most presented advertisements for each user?
etc.
Step 1: getting the top
We already know how to get a single column's value for the top country, as presented in the aforementioned post:
SELECT
Continent,
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', 1) AS Name
FROM
Country
GROUP BY
Continent
;
+---------------+--------------------+
| Continent | Name |
+---------------+--------------------+
| Asia | China |
| Europe | Russian Federation |
| North America | Canada |
| Africa | Sudan |
| Oceania | Australia |
| Antarctica | Antarctica |
| South America | Brazil |
+---------------+--------------------+
Step 2: adding columns
This part is easy: just throw in the rest of the columns (again, only indicating the top country in each continent)
SELECT
Continent,
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', 1) AS Name,
SUBSTRING_INDEX(
GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
',', 1) AS SurfaceArea,
SUBSTRING_INDEX(
GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
',', 1) AS Population
FROM
Country
GROUP BY
Continent
;
+---------------+--------------------+-------------+------------+
| Continent | Name | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia | China | 9572900.00 | 1277558000 |
| Europe | Russian Federation | 17075400.00 | 146934000 |
| North America | Canada | 9970610.00 | 31147000 |
| Africa | Sudan | 2505813.00 | 29490000 |
| Oceania | Australia | 7741220.00 | 18886000 |
| Antarctica | Antarctica | 13120000.00 | 0 |
| South America | Brazil | 8547403.00 | 170115000 |
+---------------+--------------------+-------------+------------+
Step 3: casting
You'll notice that the Population column from this last execution is aligned to the left. This is because it is believed to be a string. The GROUP_CONCAT clause concatenates values in one string, and SUBSTRING_INDEX parses a substring. The same applies to the SurfaceArea column. We'll cast Population as UNSIGNED and SurfaceArea as DECIMAL:
SELECT
Continent,
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', 1) AS Name,
CAST(
SUBSTRING_INDEX(
GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
',', 1)
AS DECIMAL(20,2)
) AS SurfaceArea,
CAST(
SUBSTRING_INDEX(
GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
',', 1)
AS UNSIGNED
) AS Population
FROM
Country
GROUP BY
Continent
;
+---------------+--------------------+-------------+------------+
| Continent | Name | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia | China | 9572900.00 | 1277558000 |
| Europe | Russian Federation | 17075400.00 | 146934000 |
| North America | Canada | 9970610.00 | 31147000 |
| Africa | Sudan | 2505813.00 | 29490000 |
| Oceania | Australia | 7741220.00 | 18886000 |
| Antarctica | Antarctica | 13120000.00 | 0 |
| South America | Brazil | 8547403.00 | 170115000 |
+---------------+--------------------+-------------+------------+
Step 4: top n records
It's time to use string walking. Examples for string walking (described in the excellent SQL Cookbook) can be found here, here and here. We'll be using a numbers table: a simple table which lists ascending integer numbers. For example, you can use the following:
DROP TABLE IF EXISTS `tinyint_asc`;
CREATE TABLE `tinyint_asc` (
`value` tinyint(3) unsigned NOT NULL default '0',
PRIMARY KEY (value)
) ;
INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);
The trick is to apply the same technique as used above, not for a single row, but for several rows. Here's how to present the top 5 countries:
SELECT
Continent,
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS Name,
CAST(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS DECIMAL(20,2)
) AS SurfaceArea,
CAST(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS UNSIGNED
) AS Population
FROM
Country, tinyint_asc
WHERE
tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
GROUP BY
Continent, value
;
+---------------+----------------------------------------------+-------------+------------+
| Continent | Name | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia | China | 9572900.00 | 1277558000 |
| Asia | India | 3287263.00 | 1013662000 |
| Asia | Kazakstan | 2724900.00 | 16223000 |
| Asia | Saudi Arabia | 2149690.00 | 21607000 |
| Asia | Indonesia | 1904569.00 | 212107000 |
| Europe | Russian Federation | 17075400.00 | 146934000 |
| Europe | Ukraine | 603700.00 | 50456000 |
| Europe | France | 551500.00 | 59225700 |
| Europe | Spain | 505992.00 | 39441700 |
| Europe | Sweden | 449964.00 | 8861400 |
| North America | Canada | 9970610.00 | 31147000 |
| North America | United States | 9363520.00 | 278357000 |
| North America | Greenland | 2166090.00 | 56000 |
| North America | Mexico | 1958201.00 | 98881000 |
| North America | Nicaragua | 130000.00 | 5074000 |
| Africa | Sudan | 2505813.00 | 29490000 |
| Africa | Algeria | 2381741.00 | 31471000 |
| Africa | Congo | 2344858.00 | 51654000 |
| Africa | The Democratic Republic of the | 1759540.00 | 5605000 |
| Africa | Libyan Arab Jamahiriya | 1284000.00 | 7651000 |
| Oceania | Australia | 7741220.00 | 18886000 |
| Oceania | Papua New Guinea | 462840.00 | 4807000 |
| Oceania | New Zealand | 270534.00 | 3862000 |
| Oceania | Solomon Islands | 28896.00 | 444000 |
| Oceania | New Caledonia | 18575.00 | 214000 |
| Antarctica | Antarctica | 13120000.00 | 0 |
| Antarctica | French Southern territories | 7780.00 | 0 |
| Antarctica | South Georgia and the South Sandwich Islands | 3903.00 | 0 |
| Antarctica | Heard Island and McDonald Islands | 359.00 | 0 |
| Antarctica | Bouvet Island | 59.00 | 0 |
| South America | Brazil | 8547403.00 | 170115000 |
| South America | Argentina | 2780400.00 | 37032000 |
| South America | Peru | 1285216.00 | 25662000 |
| South America | Colombia | 1138914.00 | 42321000 |
| South America | Bolivia | 1098581.00 | 8329000 |
+---------------+----------------------------------------------+-------------+------------+
Limitations
You should have:
Enough numbers in the numbers table (I've used 5 out of 255)
Reasonable setting for group_concat_max_len (see this post). Actually it would be better to have a smaller value here, while you make sure it's large enough; this way you do not waste memory for large groups.
(*) This was two years ago! I'm getting old