Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

MySQL Groupwise Maximum Question

Options
  • 02-05-2005 2:31pm
    #1
    Registered Users Posts: 5,049 ✭✭✭


    I'm having problems with a slow query for Planet Of The Blogs - I have three tables, and in the blog entries table, I want to select the entry title corresponding to the row with the maximum date for each feeds ID...
    describe entries;
    +-----------------+--------------+------+-----+---------------------+-------+
    | Field           | Type         | Null | Key | Default             | Extra |
    +-----------------+--------------+------+-----+---------------------+-------+
    | feedsID         | int(11)      |      | MUL | 0                   |       |
    | title           | tinytext     | YES  |     | NULL                |       |
    | dc_date         | datetime     | YES  |     | 0000-00-00 00:00:00 |       |
    .
    .
    .
    +-----------------+--------------+------+-----+---------------------+-------+
    
    describe blogs;
    +-------------------+--------------+------+-----+-------------------+----------------+
    | Field             | Type         | Null | Key | Default           | Extra          |
    +-------------------+--------------+------+-----+-------------------+----------------+
    | ID                | int(11)      |      | PRI | NULL              | auto_increment |
    | title             | varchar(255) |      |     |                   |                |
    .
    .
    .
    +-------------------+--------------+------+-----+-------------------+----------------+
    
    describe feeds;
    +---------+--------------+------+-----+-------------------+----------------+
    | Field   | Type         | Null | Key | Default           | Extra          |
    +---------+--------------+------+-----+-------------------+----------------+
    | ID      | int(11)      |      | PRI | NULL              | auto_increment |
    | blogsID | int(11)      |      | MUL | 0                 |                |
    .
    .
    .
    +---------+--------------+------+-----+-------------------+----------------+
    
    

    Here's the SQL I'm using - it takes 16 seconds to run. Any ideas to speed it up? I'm not too well up on indexing, and this idea of a groupwise maximum is new to me...

    select b1.title as blogs_title, b1.ID as blogs_ID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 left join feeds f1 on e1.feedsID = f1.ID left join blogs b1 on f1.blogsID = b1.ID where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) group by e1.feedsID order by e1.dc_date desc

    Thanks,

    John.
    --


Comments

  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    SQL looks more-or-less ok, although I can't figure out why the group by is there. You should be able to get rid of it, I woulda thought.

    After that, I'd make sure that e1.feedsID and f1.blogsID are indexed fields - the optimiser may choose to join the tables in an order other than how the query is written, and this would be the most likely culprit I can see for performance problems.


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Ah....the group by....I see what it does now.

    Assuming its supported in the MySQL syntax, drop the group by, and use SELECT DISTINCT instead.


  • Technology & Internet Moderators Posts: 28,804 Mod ✭✭✭✭oscarBravo


    I'd index "entries" on "dc_date". In the subquery, you're looking for max(dc_date) - in the absence of an index, the only way it can determine the maximum value of a field is a full table scan. The most important rule in index optimisation is, avoid table scans.

    For query optimisation hints, prefix the select query with "explain " and check the output. If possible_keys for a table is NULL, you'll end up reading every row in the table.

    Hope this makes sense.


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    oscarBravo wrote:
    I'd index "entries" on "dc_date". In the subquery, you're looking for max(dc_date) - in the absence of an index, the only way it can determine the maximum value of a field is a full table scan.

    Not so. Its looking for the highest date for a specific id on any given iteration. Ergo, the ideal index would be id, dc_date. Indeed, this would be a covering index - which would remove the need to read from the table at all!

    But indexing the date would do nothing, as you'd still have to read the record to see if it was a matching id.


    jc


  • Registered Users Posts: 5,049 ✭✭✭Cloud


    dc_date is something like 2005-04-06 13:47:00 - could that be part of the problem?

    Replacing "group by" with "distinct" still makes it 16 seconds.

    I should also say that the join with the other two tables doesn't affect performance - changing the query to:

    select distinct e1.feedsID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) order by e1.dc_date desc;

    lasts 16 seconds too.

    Adding dc_date as an index, no difference:

    Here's the explain output:
    mysql> explain select distinct e1.feedsID, e1.title, e1.ID, date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date from entries e1 where dc_date=(select max(dc_date) from entries e2 where e1.feedsID = e2.feedsID) order by e1.dc_date desc;
    +----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
    | id | select_type        | table | type | possible_keys | key         | key_len | ref                  | rows | Extra                                        |
    +----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
    |  1 | PRIMARY            | e1    | ALL  | NULL          | NULL        |    NULL | NULL                 | 8280 | Using where; Using temporary; Using filesort |
    |  2 | DEPENDENT SUBQUERY | e2    | ref  | rss_feed_ID   | rss_feed_ID |       4 | planetphp.e1.feedsID |   23 |                                              |
    +----+--------------------+-------+------+---------------+-------------+---------+----------------------+------+----------------------------------------------+
    2 rows in set (0.00 sec)
    


  • Advertisement
  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Cloud wrote:
    dc_date is something like 2005-04-06 13:47:00 - could that be part of the problem?

    Shouldn't. Thats just the string representation of the field - its stored and handled as a native datetime. So unless MySQL is notoriously bad with datetime fields, thats not a problem.
    Adding dc_date as an index, no difference:
    Have you tried adding feedsId, dc_date as a (composite) index? As I explained in my last post, I don't think that dc_date on its own will make a huge diff.

    jc


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Cloud,

    I've been thinking about this a bit more...

    Most probably, the reason the query is slow is because the nested query is correlated. What this means is that the nested query references a field in the "wrapping" query. This means that the nested query typically ends up being run once for *every* row in the wrapping query. So if you have 1,000 rows, the inner query will be run 1000 times. 2000 rows => 2000 executions of the inner query. And so on.

    What should provide better performance is a way of de-correlating the query.

    Copnsider this:
    SELECT      e1.feedsID, 
                e1.title, 
                e1.ID, 
                date_format(e1.dc_date, "%e.%c.%Y, %H:%i") as date 
      FROM      entries e1 
      INNER JOIN (
                SELECT      feedsID,
                            MAX(dc_date) as max_date
                  FROM      entries
                  GROUP BY  feedsId
    ) e2
        ON    e1.feedsID = e2.feedsID
        AND   e1.dc_date = e2.max_date
    

    I don't speak MySQL, so this may need some tweaking to fit...and I'm assuming whatever version you're using fully supports nested queries (cause you use one in your WHERE clause).

    But, if you look at this query, the nested part needs to be run only once - it is non-correlated as it isn't dependant on the current value of any field in e1. This nested query can then be joined to the "wrapping" query. Id expect that if there's an index on feedsId, dc_date, this would be used both in the nested query and in the join of the nest with the main query.

    Hope that helps.

    jc


  • Registered Users Posts: 5,049 ✭✭✭Cloud


    Wow - it certainly did - 0.23 seconds - thanks bonkey, you rock!


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Excellent :)


Advertisement