Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

SQL - a "light" challenge

Options
  • 04-08-2009 3:48pm
    #1
    Registered Users Posts: 15,443 ✭✭✭✭


    Here's a "light" challenge for any SQL junkies out there.

    Lets say I have the following table structure:
    id          integer
    valid_from  date
    valid_until date
    val         varchar2
    
    What I will have is a series of records giving a history for an ID.

    In some cases (ID 01 below), the same value appears in contiguous history records.
    In some cases (ID 03 below), there can be gaps in the time-series.

    No two records exist where the same ID is valid for the same point in time in both.

    Some sample data:
    id  valid_from  valid_until val
    --  ----------  ----------- ---
    
    01  01-01-2009  31-01-2009  01A
    01  01-02-2009  28-02-2009  01B
    01  01-03-2009  31-03-2009  01C
    01  01-04-2009  30-04-2009  01A
    
    
    02  01-01-2009  31-01-2009  02A
    02  01-02-2009  28-02-2009  02A
    02  01-03-2009  31-03-2009  02A
    02  01-04-2009  30-04-2009  02B
    
    
    03  01-01-2009  31-01-2009  03A
    03  01-03-2009  31-03-2009  03A
    03  01-04-2009  30-04-2009  03B
    
    What I want to do is build a query which "compresses" this data so that records have the same ID and the same value, and lie adjacent to each other,I want 1 record in output.

    In other words, the output for the aboce input would be :
    id  valid_from  valid_until val
    --  ----------  ----------- ---
    
    01  01-01-2009  31-01-2009  01A
    01  01-02-2009  28-02-2009  01B
    01  01-03-2009  31-03-2009  01C
    01  01-04-2009  30-04-2009  01A
    
    
    02  01-01-2009  31-03-2009  02A
    02  01-04-2009  30-04-2009  02B
    
    
    03  01-01-2009  31-01-2009  03A
    03  01-03-2009  31-03-2009  03A
    03  01-04-2009  30-04-2009  03B
    
    Nothing happens to ID 01, because the only records with the same value don't lie in adjacent time-periods.
    Nothing happens to ID 03, because (again) the only records with the same value don't lie in adjacent time-periods...even though there are no records in-between.
    In the case of 02, there were three records in a row, covering all dates from 01-01 to 31-03, all with the same value (02A)...so I get one value as an output.

    In case its relevant to you, I'm working on Oracle.

    So...anyone any suggestions how to solve this?

    For the record...this is not homework. Its a generalisation of a problem I'm looking at here in work.

    The following SQL generates the test-set and solves it, but I find it pretty horrible....particularly when I look at the optimiser cost of processing millions of rows.

    I've spoilered the solution out so that anyone who wants to tackle this isn't influenced by my approach....but I've left the generation of the test-data unspoilered, in case anyone on Oracle wants to use it.
    with
    myTable as (
        select  1 as ID
            ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
            ,   '01A' as val
        from DUAL
        union
        select  1 as ID
            ,   to_date('01-02-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('28-02-2009', 'dd-mm-yyyy') as valid_until
            ,   '01B' as val
        from DUAL
        union
        select  1 as ID
            ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
            ,   '01C' as val
        from DUAL
        union
        select  1 as ID
            ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
            ,   '01D' as val
        from DUAL
        union
        select  2 as ID
            ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
            ,   '02A' as val
        from DUAL
        union
        select  2 as ID
            ,   to_date('01-02-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('28-02-2009', 'dd-mm-yyyy') as valid_until
            ,   '02A' as val
        from DUAL
        union
        select  2 as ID
            ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
            ,   '02A' as val
        from DUAL
        union
        select  2 as ID
            ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
            ,   '02B' as val
        from DUAL
        union
        select  3 as ID
            ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
            ,   '03A' as val
        from DUAL
        union
        select  3 as ID
            ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
            ,   '03A' as val
        from DUAL
        union
        select  3 as ID
            ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
            ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
            ,   '03B' as val
        from DUAL
    )
    


Comments

  • Closed Accounts Posts: 7 bladez


    I do not have oracle so cant test. A semi cartision join should do the trick.

    I am bit foggy on oracle joins, i am trying to do a left join.

    What are you working on? It looks like you are working on a temporal database. Which would genearally be used for historical locations of an object ie person residence.

    I assume you cant have two values as once.

    SELECT A.ID,
    MIN(A.VALID_FROM) AS VALID_FROM,
    MAX(CASE WHEN B.VALID_UNTIL IS NULL THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END) AS VALID_UNTIL,
    A.VAL
    FROM
    MYTABLE A , MYTABLE B
    WHERE (A.ID = B.ID OR A.ID = (+)) AND (A.VAL = B.ID OR A.VAL = (+)) AND (A.VALID_UNTIL >= B.VALID_FROM OR A.VALID_UNTIL = (+))
    GROUP BY A.ID,A.VAL


  • Closed Accounts Posts: 7 bladez


    Just checking there

    Updated the join and added a day so that consective days are continous

    SELECT A.ID,
    MIN(A.VALID_FROM) AS VALID_FROM,
    MAX(CASE WHEN B.VALID_UNTIL IS NULL THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END) AS VALID_UNTIL,
    A.VAL
    FROM
    MYTABLE A , MYTABLE B
    WHERE A.ID = B.ID(+) AND A.VAL = B.ID(+) AND A.VALID_UNTIL+1 >= B.VALID_FROM(+)
    GROUP BY A.ID,A.VAL


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Thanks for the suggestion. Unfortunately, I've already tried that approach, and I don't think it can work.

    With the self-join, you identify (in A) each record which has a record following it with the same value (you wrote it wrong, but I assume you wanted to join val against val).

    So far, so good.

    Then, the problems start.
    With the min, you find the earliest valid_from for a given ID and value. The max finds the latest.

    Imagine that you hve an ID which oscillates between values like this : A -> B -> A -> B -> A -> B. Here, we should perform no "compression" because no two adjacent time-slots have the same value...but the grouping that you've defined only allows one output for a value of A and one for B.

    If we change the outer to an inner join to (somehow) get around that problem, it still won't work for A->A->B->B->A->A.

    Its given me half of an idea for a nested-query approach though, which I might try...

    Again..thanks.


  • Registered Users Posts: 2,800 ✭✭✭voxpop


    just something that popped into my head that you might be able to use - again not similar with Oracle - but recently I was doing some duration calculations using sql - you know something like a bunch of timestamp records show a value of a, that changes to b at some stage. Anyway I used something like
    with states as ( 
    	SELECT ROW_NUMBER() OVER(ORDER BY val,[timestamp]) AS RowNum,val,[timestamp] from log
    	)
    
    select a.val
    		b.timestamp as start,
    		a.timestamp as [end]
    	from states a
    	inner join states b on a.RowNum = b.RowNum +1
    	where a.val <> b.val 
    

    Sorry the sql is a bit ropey as im doing it from memory buy it was basically adding row numbers over a particular order, then comparing row n to row n+1 and taking the start and end where val changes


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    voxpop wrote: »
    again not similar with Oracle

    Doesn't matter. I speak a number of different SQL flavours, and can generally rewrite one as the other.

    Thanks :)


  • Advertisement
  • Closed Accounts Posts: 7 bladez


    This works on mysql

    SELECT A.ID,
    CASE WHEN B.VALID_FROM BETWEEN A.VALID_FROM AND A.VALID_UNTIL
    THEN A.VALID_FROM ELSE B.VALID_FROM END AS VALID_FROM,
    CASE WHEN A.VALID_UNTIL BETWEEN B.VALID_FROM AND B.VALID_UNTIL
    THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END AS VALID_UNTIL,
    A.VAL
    FROM
    MYTABLE A left join MYTABLE B ON
    A.ID = B.ID AND A.VAL = B.VAL
    GROUP BY A.ID,
    CASE WHEN B.VALID_FROM BETWEEN A.VALID_FROM AND A.VALID_UNTIL
    THEN A.VALID_FROM ELSE B.VALID_FROM END,
    CASE WHEN A.VALID_UNTIL BETWEEN B.VALID_FROM AND B.VALID_UNTIL
    THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END,
    A.VAL


Advertisement