MySQL Groupwise Maximum Question

Cloud · 02-05-2005 02:31PM #1

I'm having problems with a slow query for Planet Of The Blogs - I have three tables, and in the blog entries table, I want to select the entry title corresponding to the row with the maximum date for each feeds ID...

describe entries;
+-----------------+--------------+------+-----+---------------------+-------+
| Field           | Type         | Null | Key | Default             | Extra |
+-----------------+--------------+------+-----+---------------------+-------+
| feedsID         | int(11)      |      | MUL | 0                   |       |
| title           | tinytext     | YES  |     | NULL                |       |
| dc_date         | datetime     | YES  |     | 0000-00-00 00:00:00 |       |
.
.
.
+-----------------+--------------+------+-----+---------------------+-------+

describe blogs;
+-------------------+--------------+------+-----+-------------------+----------------+
| Field             | Type         | Null | Key | Default           | Extra          |
+-------------------+--------------+------+-----+-------------------+----------------+
| ID                | int(11)      |      | PRI | NULL              | auto_increment |
| title             | varchar(255) |      |     |                   |                |
.
.
.
+-------------------+--------------+------+-----+-------------------+----------------+

describe feeds;
+---------+--------------+------+-----+-------------------+----------------+
| Field   | Type         | Null | Key | Default           | Extra          |
+---------+--------------+------+-----+-------------------+----------------+
| ID      | int(11)      |      | PRI | NULL              | auto_increment |
| blogsID | int(11)      |      | MUL | 0                 |                |
.
.
.
+---------+--------------+------+-----+-------------------+----------------+

Here's the SQL I'm using - it takes 16 seconds to run. Any ideas to speed it up? I'm not too well up on indexing, and this idea of a groupwise maximum is new to me...

select b1.title as blogs_title, b1.ID as blogs_ID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 left join feeds f1 on e1.feedsID = f1.ID left join blogs b1 on f1.blogsID = b1.ID where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) group by e1.feedsID order by e1.dc_date desc

Thanks,

John.
--

bonkey · 02-05-2005 04:25PM

SQL looks more-or-less ok, although I can't figure out why the group by is there. You should be able to get rid of it, I woulda thought.

After that, I'd make sure that e1.feedsID and f1.blogsID are indexed fields - the optimiser may choose to join the tables in an order other than how the query is written, and this would be the most likely culprit I can see for performance problems.

bonkey · 02-05-2005 04:33PM

Ah....the group by....I see what it does now.

Assuming its supported in the MySQL syntax, drop the group by, and use SELECT DISTINCT instead.

oscarBravo · 02-05-2005 06:15PM

I'd index "entries" on "dc_date". In the subquery, you're looking for max(dc_date) - in the absence of an index, the only way it can determine the maximum value of a field is a full table scan. The most important rule in index optimisation is, avoid table scans.

For query optimisation hints, prefix the select query with "explain " and check the output. If possible_keys for a table is NULL, you'll end up reading every row in the table.

Hope this makes sense.

bonkey · 02-05-2005 06:27PM

oscarBravo wrote:

I'd index "entries" on "dc_date". In the subquery, you're looking for max(dc_date) - in the absence of an index, the only way it can determine the maximum value of a field is a full table scan.

Not so. Its looking for the highest date for a specific id on any given iteration. Ergo, the ideal index would be id, dc_date. Indeed, this would be a covering index - which would remove the need to read from the table at all!

But indexing the date would do nothing, as you'd still have to read the record to see if it was a matching id.

jc

Cloud · 03-05-2005 11:41AM

dc_date is something like 2005-04-06 13:47:00 - could that be part of the problem?

Replacing "group by" with "distinct" still makes it 16 seconds.

I should also say that the join with the other two tables doesn't affect performance - changing the query to:

select distinct e1.feedsID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) order by e1.dc_date desc;

lasts 16 seconds too.

Adding dc_date as an index, no difference:

Here's the explain output:

mysql> explain select distinct e1.feedsID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) order by e1.dc_date desc;
+----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
| id | select_type        | table | type | possible_keys | key         | key_len | ref                  | rows | Extra                                        |
+----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
|  1 | PRIMARY            | e1    | ALL  | NULL          | NULL        |    NULL | NULL                 | 8280 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | e2    | ref  | rss_feed_ID   | rss_feed_ID |       4 | planetphp.e1.feedsID |   23 |                                              |
+----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
2 rows in set (0.00 sec)

bonkey · 03-05-2005 12:22PM

Cloud wrote:

dc_date is something like 2005-04-06 13:47:00 - could that be part of the problem?

Shouldn't. Thats just the string representation of the field - its stored and handled as a native datetime. So unless MySQL is notoriously bad with datetime fields, thats not a problem.

Adding dc_date as an index, no difference:

Have you tried adding feedsId, dc_date as a (composite) index? As I explained in my last post, I don't think that dc_date on its own will make a huge diff.

jc

bonkey · 03-05-2005 12:37PM

Cloud,

I've been thinking about this a bit more...

Most probably, the reason the query is slow is because the nested query is correlated. What this means is that the nested query references a field in the "wrapping" query. This means that the nested query typically ends up being run once for *every* row in the wrapping query. So if you have 1,000 rows, the inner query will be run 1000 times. 2000 rows => 2000 executions of the inner query. And so on.

What should provide better performance is a way of de-correlating the query.

Copnsider this:

SELECT      e1.feedsID, 
            e1.title, 
            e1.ID, 
            date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date 
  FROM      entries e1 
  INNER JOIN (
            SELECT      feedsID,
                        MAX(dc_date) as max_date
              FROM      entries
              GROUP BY  feedsId
) e2
    ON    e1.feedsID = e2.feedsID
    AND   e1.dc_date = e2.max_date

I don't speak MySQL, so this may need some tweaking to fit...and I'm assuming whatever version you're using fully supports nested queries (cause you use one in your WHERE clause).

But, if you look at this query, the nested part needs to be run only once - it is non-correlated as it isn't dependant on the current value of any field in e1. This nested query can then be joined to the "wrapping" query. Id expect that if there's an index on feedsId, dc_date, this would be used both in the nested query and in the join of the nest with the main query.

Hope that helps.

jc

Cloud · 03-05-2005 12:41PM

Wow - it certainly did - 0.23 seconds - thanks bonkey, you rock!

bonkey · 03-05-2005 12:44PM

Excellent

MySQL Groupwise Maximum Question

Comments