SQL Hell!

les_steaks · 25-01-2007 7:18pm #1

Hi all,

I'm trying to write an SQL query that will clean up data in a given table.

Its difficult to describe so hopefully the SQL I've come up with so far will help:

(select grouping_id from tableX where
charges_id = 5 and tax_id = 2 and commissions_id = 24)
intersect
(select grouping_id from tableX where
charges_id = 2 and tax_id = 0 and commissions_id = 0)

(these are multiple records and there can be more than two!)

This works for the case where the values are 5,2,24 and 2,0,0. The problem is that over time different users entered data in different compinations/ permutations so data which may look different can actually be a duplicate entry. (I hope that makes sense!)

So is there a way to return the group_id without getting all the permutations manually and writting a very long sql statement. There may be far more than two rows so this could get very complicated.

Thanks in advance,

Les

bonkey · 25-01-2007 8:01pm

I'm not quite seeing what it is you want to do.

Are you trying to identify all records for all grouping_ids where the grouping_id occurs more than once in the table?

I think thats what you're trying to do, but it might not be.

Also...what flavour of SQL are you using (i.e. for which DB). It could be useful in terms of making sure that whatever suggestions you get don't use some technique not supported on your database.

amen · 25-01-2007 10:38pm

The problem is that over time different users entered data in different compinations/ permutations so data which may look different can actually be a duplicate entry

what caused this to happend and has it been rectified so it can't happen again?

If not waste of time cleaning up the data as you will be back doing it again.

Make sure there won't be any more crappy data before fixing the data
I'd also suggest correcting the data on a backup copy first and ensure it is correct there. If there are any reports on the data run them on the backup and make sure they are correct.

Just curious since you have TaxIDs and Commissions what impact does this have on historical data? Will someone be owed/owe you money?
Does this data appear on a legal document or a document a customer have?
If you clean it up and for some reason your company ends up in dispute/court what happens if the document is rerun with data now missing and it doesn't match the customers?

les_steaks · 26-01-2007 12:52pm

Thanks for the replies, hopefully I'll explain it a bit better this time!

Firstly, I'm using SQL Server 2005. Its all in a development environment so there will not be any implications legally etc.

The inconsistency occurred over time as different people entered data manually. Go forward I've automated the process so consistency should prevail but I use nned to identify duplicate data and eliminate it.

Take two sample records:

GRP_ID CHAR_ID TAX_ID COMM_ID etc.
1 2 5 20
1 5 0 0

These may also appear as:

GRP_ID CHAR_ID TAX_ID COMM_ID etc.
10 5 5 20
10 2 0 0

The table is used to populate a front end on an application and will can correctly work using either record so neither is technically incorrect but for redundancy etc. I want to eliminate the duplicates in the future.

There could be 10 rows in a group so is without having to query using each permutation can anybody recommend a simpler way?

Thanks,

Les

bonkey · 26-01-2007 1:50pm

What you're describing are rows where TAX_ID, COMM_ID etc. are duplicated, but *not* with GRP_ID CHAR_ID the same. Funnily...I've just been solving exactly the same problem in an Oracle database

OK...here's one way of doing it. I'm going to assume that TAX_ID and COMM_ID and ETC are the full set of duplicated fields. If there are more fields (which your etc. suggests) you need to add them....

There might be some syntax problems in how I nested the SELECT as a table, but I don't have MSSQL to hand so I'm not going to be easily able to check it up for you.

But the idea is right...the SELECT / HAVING COUNT(*) > 1 gives you a unique set of the duplicated details. Join this back to the table on every field in the set, and it will give you all the records which hit those duplicates.

The ORDER BY at the end puts them in a sensible order.

The next trick comes when you want to delete all but one copy of each. Let me know if you can't get that bit either....

SELECT details.*
FROM myTable AS details
INNER JOIN (
SELECT tax_id, comm_id, etc, count(*)
FROM mytable
GROUP BY tax_id, comm_id, etc
HAVING count(*) > 1
) nested_table
ON details.tax_id = nested_table.tax_id
AND details.comm_id = nested_table.comm_id
AND details.etc = nested_table.etc

ORDER BY tax_id, comm_id, etc, grp_id, char_id

les_steaks · 26-01-2007 2:57pm

Thanks bonkey, but I'm not sure that will do it.

There may be two groups of records say:

GRP_ID CHG_ID TAX_ID COMM_ID
1 2 5 10
1 5 0 0

and

GRP_ID CHG_ID TAX_ID COMM_ID
10 5 5 10
10 2 0 0

I need a query that will return both grp_ids, i.e., to me (in this circumstance) the two groups are the same so I need to identify them, delete one and reuse the remaining one.

They have different grp_ids and the chg_id's are reversed but the front end's query gets all charges, taxes and comms for a particular group so both groups above would return the same.

What I need is a query like eg.

(((SELECT GRP_ID FROM
TABLE_NAME WHERE
CHR_ID = 2 AND TAX_ID = 5 AND COMM_ID =10)
INTERSECT
(SELECT GRP_ID FROM
TABLE_NAME WHERE
CHR_ID = 5 AND TAX_ID = 0 AND COMM_ID =0))
UNION
((SELECT GRP_ID FROM
TABLE_NAME WHERE
CHR_ID = 5 AND TAX_ID = 5 AND COMM_ID =10)
INTERSECT
(SELECT GRP_ID FROM
TABLE_NAME WHERE
CHR_ID = 2 AND TAX_ID = 0 AND COMM_ID =0)))

As I could have ten rows in a particular group to explicitly include all permutations in the query in not practical?

Thanks,

Les

amen · 26-01-2007 3:39pm

when deleting these is a nasty little hack you can do in MS SQL
if you have 5 duplicated rows you could use
SET ROWCOUNT 1
and run the delete statement 4 times
each time only one row will be delete as ROWCOUNT is 1

or try
SET ROWCOUNT 4 and run delete once
remember to
SET ROWCOUNT 0 afterwards

les_steaks · 26-01-2007 3:53pm

Thanks Amen, had that in mind alright, but its actaully identifying the groups to delete that the major headache! Any thoughts????

bonkey · 26-01-2007 4:19pm

IF you look at my query, it finds rows which are the same except for the grp_id / chg_id. Its trivial to change that to find rows which are the same except for just the group_id.

Then it identifies where two seperate grp_ids have the same content in the rest of the row. This doesn't verify that *every* row in one group matches *every* row in another group, but it does get you close.

To be honest, I don't think you can easily reduce this down to a piece of SQL.

Using a cursor, some temporary tables (or cursors) etc. you could do it alright, but I can't think of any easier way.

To do it with a cursor, you'd do something like the following (written in pseudocode)

Open cursor1 on : select distinct grp_id from MainTable.
For each grp_id X in cursor1
  - select * from MainTable where grp_id = X into temptable1
  - open cursor2 on select distinct grp_id from MainTable
  - for each grp_id Y in cursor2
      - select * from MainTable where grp_id = Y into temptable2
      - run a select, joining the contents of temptable1 to temptable2 
        on all fields except grp_id.
      - if the rowcount of this select == rowcount of temptable1
        and rowcount of this select == rowcount of temptable2 then you have a matching group.
          - delete from the main table where grp_id = Y
        end if
    loop
loop

You can make this smarter by writing a first version which instead of doing deletes simply writes out messages saying Grp X is a clone of Grp Y...and potentially even writing out the contents of each group.

You can then verify this as you like, and when you're happy, put in the delete and off you go.

les_steaks · 26-01-2007 5:28pm

Think I've cracked it. Seems to do what I need in anyhow!

SELECT grouping_id,count(grouping_id) as counter
FROM table_name
WHERE(charges_id IN (0,12,14,11)) AND (tax_id IN (1,0)) AND
(commissions_id IN (6,7,0))
group by grouping_id
order by counter desc

Now anything in this case with a count of 4 (i.e., (0,12,14,11))) are the groups I'm looking for.

Thanks all for your help,

Les

SQL Hell!

Comments