SQL - a "light" challenge

bonkey · 04-08-2009 3:48pm #1

Here's a "light" challenge for any SQL junkies out there.

Lets say I have the following table structure:

id          integer
valid_from  date
valid_until date
val         varchar2

What I will have is a series of records giving a history for an ID.

In some cases (ID 01 below), the same value appears in contiguous history records.
In some cases (ID 03 below), there can be gaps in the time-series.

No two records exist where the same ID is valid for the same point in time in both.

Some sample data:

id  valid_from  valid_until val
--  ----------  ----------- ---

01  01-01-2009  31-01-2009  01A
01  01-02-2009  28-02-2009  01B
01  01-03-2009  31-03-2009  01C
01  01-04-2009  30-04-2009  01A


02  01-01-2009  31-01-2009  02A
02  01-02-2009  28-02-2009  02A
02  01-03-2009  31-03-2009  02A
02  01-04-2009  30-04-2009  02B


03  01-01-2009  31-01-2009  03A
03  01-03-2009  31-03-2009  03A
03  01-04-2009  30-04-2009  03B

What I want to do is build a query which "compresses" this data so that records have the same ID and the same value, and lie adjacent to each other,I want 1 record in output.

In other words, the output for the aboce input would be :

id  valid_from  valid_until val
--  ----------  ----------- ---

01  01-01-2009  31-01-2009  01A
01  01-02-2009  28-02-2009  01B
01  01-03-2009  31-03-2009  01C
01  01-04-2009  30-04-2009  01A


02  01-01-2009  31-03-2009  02A
02  01-04-2009  30-04-2009  02B


03  01-01-2009  31-01-2009  03A
03  01-03-2009  31-03-2009  03A
03  01-04-2009  30-04-2009  03B

Nothing happens to ID 01, because the only records with the same value don't lie in adjacent time-periods.
Nothing happens to ID 03, because (again) the only records with the same value don't lie in adjacent time-periods...even though there are no records in-between.
In the case of 02, there were three records in a row, covering all dates from 01-01 to 31-03, all with the same value (02A)...so I get one value as an output.

In case its relevant to you, I'm working on Oracle.

So...anyone any suggestions how to solve this?

For the record...this is not homework. Its a generalisation of a problem I'm looking at here in work.

The following SQL generates the test-set and solves it, but I find it pretty horrible....particularly when I look at the optimiser cost of processing millions of rows.

I've spoilered the solution out so that anyone who wants to tackle this isn't influenced by my approach....but I've left the generation of the test-data unspoilered, in case anyone on Oracle wants to use it.

with
myTable as (
    select  1 as ID
        ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
        ,   '01A' as val
    from DUAL
    union
    select  1 as ID
        ,   to_date('01-02-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('28-02-2009', 'dd-mm-yyyy') as valid_until
        ,   '01B' as val
    from DUAL
    union
    select  1 as ID
        ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
        ,   '01C' as val
    from DUAL
    union
    select  1 as ID
        ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
        ,   '01D' as val
    from DUAL
    union
    select  2 as ID
        ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
        ,   '02A' as val
    from DUAL
    union
    select  2 as ID
        ,   to_date('01-02-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('28-02-2009', 'dd-mm-yyyy') as valid_until
        ,   '02A' as val
    from DUAL
    union
    select  2 as ID
        ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
        ,   '02A' as val
    from DUAL
    union
    select  2 as ID
        ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
        ,   '02B' as val
    from DUAL
    union
    select  3 as ID
        ,   to_date('01-01-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-01-2009', 'dd-mm-yyyy') as valid_until
        ,   '03A' as val
    from DUAL
    union
    select  3 as ID
        ,   to_date('01-03-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('31-03-2009', 'dd-mm-yyyy') as valid_until
        ,   '03A' as val
    from DUAL
    union
    select  3 as ID
        ,   to_date('01-04-2009', 'dd-mm-yyyy') as valid_from
        ,   to_date('30-04-2009', 'dd-mm-yyyy') as valid_until
        ,   '03B' as val
    from DUAL
)

bladez · 04-08-2009 10:39pm

I do not have oracle so cant test. A semi cartision join should do the trick.

I am bit foggy on oracle joins, i am trying to do a left join.

What are you working on? It looks like you are working on a temporal database. Which would genearally be used for historical locations of an object ie person residence.

I assume you cant have two values as once.

SELECT A.ID,
MIN(A.VALID_FROM) AS VALID_FROM,
MAX(CASE WHEN B.VALID_UNTIL IS NULL THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END) AS VALID_UNTIL,
A.VAL
FROM
MYTABLE A , MYTABLE B
WHERE (A.ID = B.ID OR A.ID = (+)) AND (A.VAL = B.ID OR A.VAL = (+)) AND (A.VALID_UNTIL >= B.VALID_FROM OR A.VALID_UNTIL = (+))
GROUP BY A.ID,A.VAL

bladez · 04-08-2009 10:45pm

Just checking there

Updated the join and added a day so that consective days are continous

SELECT A.ID,
MIN(A.VALID_FROM) AS VALID_FROM,
MAX(CASE WHEN B.VALID_UNTIL IS NULL THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END) AS VALID_UNTIL,
A.VAL
FROM
MYTABLE A , MYTABLE B
WHERE A.ID = B.ID(+) AND A.VAL = B.ID(+) AND A.VALID_UNTIL+1 >= B.VALID_FROM(+)
GROUP BY A.ID,A.VAL

bonkey · 05-08-2009 1:52pm

Thanks for the suggestion. Unfortunately, I've already tried that approach, and I don't think it can work.

With the self-join, you identify (in A) each record which has a record following it with the same value (you wrote it wrong, but I assume you wanted to join val against val).

So far, so good.

Then, the problems start.
With the min, you find the earliest valid_from for a given ID and value. The max finds the latest.

Imagine that you hve an ID which oscillates between values like this : A -> B -> A -> B -> A -> B. Here, we should perform no "compression" because no two adjacent time-slots have the same value...but the grouping that you've defined only allows one output for a value of A and one for B.

If we change the outer to an inner join to (somehow) get around that problem, it still won't work for A->A->B->B->A->A.

Its given me half of an idea for a nested-query approach though, which I might try...

Again..thanks.

voxpop · 05-08-2009 2:26pm

just something that popped into my head that you might be able to use - again not similar with Oracle - but recently I was doing some duration calculations using sql - you know something like a bunch of timestamp records show a value of a, that changes to b at some stage. Anyway I used something like

with states as ( 
	SELECT ROW_NUMBER() OVER(ORDER BY val,[timestamp]) AS RowNum,val,[timestamp] from log
	)

select a.val
		b.timestamp as start,
		a.timestamp as [end]
	from states a
	inner join states b on a.RowNum = b.RowNum +1
	where a.val <> b.val

Sorry the sql is a bit ropey as im doing it from memory buy it was basically adding row numbers over a particular order, then comparing row n to row n+1 and taking the start and end where val changes

bonkey · 05-08-2009 3:57pm

voxpop wrote: »

again not similar with Oracle

Doesn't matter. I speak a number of different SQL flavours, and can generally rewrite one as the other.

Thanks

bladez · 05-08-2009 7:32pm

This works on mysql

SELECT A.ID,
CASE WHEN B.VALID_FROM BETWEEN A.VALID_FROM AND A.VALID_UNTIL
THEN A.VALID_FROM ELSE B.VALID_FROM END AS VALID_FROM,
CASE WHEN A.VALID_UNTIL BETWEEN B.VALID_FROM AND B.VALID_UNTIL
THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END AS VALID_UNTIL,
A.VAL
FROM
MYTABLE A left join MYTABLE B ON
A.ID = B.ID AND A.VAL = B.VAL
GROUP BY A.ID,
CASE WHEN B.VALID_FROM BETWEEN A.VALID_FROM AND A.VALID_UNTIL
THEN A.VALID_FROM ELSE B.VALID_FROM END,
CASE WHEN A.VALID_UNTIL BETWEEN B.VALID_FROM AND B.VALID_UNTIL
THEN A.VALID_UNTIL ELSE B.VALID_UNTIL END,
A.VAL

SQL - a "light" challenge

Comments