
Slow query needs optimizing
I've got a table that is just a bunch of raw data, no primary key; basically a logfile saying we did X to Y on date Z. I'm trying to extract a table of unique items using the last date (after picking a few obvious fields to make a primary key out of), but there are duplicates, so I join that result to one with the same date and the pkey fields, grouped on a separate field. There are still duplicates, so I join _that_ to another selection grouped on another field...

I get a bunch of nested subqueries looking something like this pseudo:

pkey values + max (date)
        + max (c) within max(date)
                + max (d) within max(c) within max(date)
                        + max (e) within d within c within date
                                ...


By the time it's 3 subqueries deep, it takes three hours to run. I need to make this more efficient somehow. I figure I'd have better luck reworking the logic than the query text, but I don't see a way of doing it. I tried selecting max(date), max(c) with group by date, but that undid the max(date), as if I had just selected date.

Essentially, I need to set up a grouping/exclusion order: group first to max(date), and if that's not enough to get a unique list, group that result on another field, and so on, for every field that's not part of the result's primary key.

Come to think of it, I could restructure the problem and come up with something which halfway works, but I'd like to see a solution to this inefficiency as I might run into it again with the kind of stuff I'm working with.

System is postgres, btw.
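
A minimal sketch of that grouping/exclusion order using Postgres's DISTINCT ON, which keeps the first row per group under a given sort order (tbl and the column names are assumed from the sample data further down):

select distinct on (a, b) *
  from tbl
 order by a, b, date desc, c desc, d desc, e desc;

The order by clause is the exclusion order: last date first, then the most important c, then d, then e.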
Re: Slow query needs optimizing
Add one of [link|http://h18002.www1.hp.com/alphaserver/sc/sys_sc45_features.html|these].


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
Just imagine a Beowulf cluster of those.
Great, now we're turning into /.
Next thing you know, somebody will post...

In Soviet Russia, a Beowulf cluster of insensitive clods imagines you!

/me shudders.
Nobody wins in a butter eating contest
Running DOSiX!
Can you post
A few records of your raw data and what the result should look like from that data. I must be getting old 'cuz I don't quite follow what you are trying to accomplish. Or it could be that I just like working with "visual" information. I keep thinking that the solution is relatively straightforward, so I must be missing something.
'fraid not
With the Internet policy at work, I can't get any information out without surreptitiously installing ssh software on one of the internet terminals, which is likely to get me asked to leave and not come back. When I need to get help from lists or newsgroups, I have to write down error messages etc. on a piece of paper and sneak it out the door.

Here's some bogus data that might help explain what I'm dealing with:
a  b  date   c   d   e
----------------------
a  1  1999   1   2   3
a  2  2000   1   2   3
a  1  1999   2   2   3
a  1  1999   2   3   3


The new table will have primary key (a,b), and is supposed to contain all the data in the most recent record from the original table for each a,b item.
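
Under the tie-break order described below (date first, then c, then d), that most-recent record works out to:

a  b  date   c   d   e
----------------------
a  1  1999   2   3   3
a  2  2000   1   2   3

The first attempt: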
select B.* from
  (select a,b, max(date) as xDate from tbl group by a,b) as A,
  tbl as B where B.date = A.xDate


max(date) works for a,2 because the data is nice and there is only one last record, but for a,1 it will return 3 records. To pare that down, I decide that c is the next most important field, that some values are more important than others, and, for the sake of simplicity, that max(c) will return the most important value.
select B.* from
  (select * from
    (select a,b, max(date) as xDate from tbl group by a,b) as AA,
    (select a,b, date, max(c) as xC from tbl group by a,b,date) as BB
    where BB.date = AA.xDate
  ) as A, tbl as B where B.c = A.xC


That still leaves duplicates, so I pick the next most important field. As the query takes three hours to run, I start to think that I'm going about this the wrong way, but I don't see another way of doing it.
pump to file, cat file |sort -k3 >outputfile

"You're just like me streak. You never left the free-fire zone.You think aspirins and meetings and cold showers are going to clean out your head. What you want is God's permission to paint the trees with the bad guys. That wont happen big mon." Clete
questions, help? [link|mailto:pappas@catholic.org|email pappas at catholic.org]
If you're allowed correlated subquery, then:
SELECT aa.*
  FROM table_name aa,
       (SELECT a, b, max(date) AS date
          FROM table_name
         GROUP BY a, b) bb
 WHERE aa.a = bb.a
   AND aa.b = bb.b
   AND aa.date = bb.date

You shouldn't need more than one subquery to do this.

Edit: Sorry. I didn't see what you said about getting multiples of "a, b, date" back from the GROUP BY in the subquery. If that's the case, and you need to go to a variable number of tie-breakers, then you've effectively blown the limits of relational calculus.

IMO, your only choice at this point is to drop to a procedural language, stream the records by, and pick out the ones you want to keep. Straight SQL simply will not work if you have *any* duplicate records. Even if each row is unique, nested subqueries will make your SELECT move like pondwater.
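
A rough sketch of that procedural approach in plpgsql, assuming the tbl name and columns from the sample data upthread, and a hypothetical target table tbl2 with the same columns:

create function pick_latest() returns integer as '
declare
    r      tbl%rowtype;
    last_a tbl.a%type;
    last_b tbl.b%type;
begin
    for r in select * from tbl
              order by a, b, date desc, c desc, d desc, e desc loop
        -- the first row per (a,b) under this sort order is the keeper
        if last_a is null or r.a <> last_a or r.b <> last_b then
            insert into tbl2 values (r.a, r.b, r.date, r.c, r.d, r.e);
        end if;
        last_a := r.a;
        last_b := r.b;
    end loop;
    return 0;
end;
' language 'plpgsql';

One sorted pass through the table replaces the nested grouping entirely.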


Second Edit:

Actually, combining with Ben's solution, you can do something like:

select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date)||c||d||e) sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date)||aa.c||aa.d||aa.e = bb.sortkey


Like I said above, though, if you have a duplicate row, then you're screwed. Unless, of course, you're using Oracle and can append the 'rowid' pseudocolumn to the sortkey field.
Edited by morganek April 16, 2004, 11:33:52 PM EDT
Edited by morganek April 16, 2004, 11:43:11 PM EDT
Re: If you're allowed correlated subquery, then:
combining with Ben's solution, you can do something like:
select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date)||c||d||e) sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date)||aa.c||aa.d||aa.e = bb.sortkey


Thanks, I think that'll do it.

I didn't understand what Ben meant when I first read his post, but it looks clear in retrospect. I'm in the habit of thinking of data as discrete objects and didn't consider stringing their text together.
Re: 'fraid not
I am going to assume that in the real table there is some data that would preclude you simply doing this.

select a,b, max(date), max(c), max(d) from tbl group by a,b

I can see two possible solutions other then what Morgan and Ben suggested.

The ugliest is to use a loop and scan through the records one by one. Create an index that contains all the fields you want to include in the selection criteria and then use that to pick through the records one by one, inserting the records you want into a second table.

I do not think you need to go that far here, but I have ended up using them when the criteria for selecting records would not condense into a single query, or when the criteria depended on looking at more than one record at a time.

The second goes like this. Run a query to group by (a,b) and stick them in the second table, with data filled in with nulls. Then do a series of queries like this to fill in the data.
Update tbl2 set (all the data fields) from
  (select B.* from
    (select a,b, max(date) as xDate, count(1) as RowCount, c,d
      from tbl group by a,b,c,d) as A,
  tbl as B where B.date = A.xDate and RowCount = 1) as GD,
tbl2 where GD.a = tbl2.a and GD.b = tbl2.b and tbl2.date is null

The idea in this case is that the query picks out all the rows where there is only one last date and moves them over. Then you can write another query that looks at the next field and moves those over, and so on.
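
In the same pseudo style, the second pass might break the remaining ties on c, again touching only the (a,b) groups that are still null (hand-wavy in the same way as the query above):

Update tbl2 set (all the data fields) from
  (select B.* from
    (select a,b, date, max(c) as xC, count(1) as RowCount
      from tbl group by a,b,date) as A,
  tbl as B where B.date = A.date and B.c = A.xC and RowCount = 1) as GD,
tbl2 where GD.a = tbl2.a and GD.b = tbl2.b and tbl2.date is null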

I've used this a couple of times when 90% or more of the records were clean and there were only a handful of records that needed scrubbing, often by hand.

Given what you have said, in SQL Sever I would probably end up using the same solution you came up with, but would put a primary key on the table and create an index that covers the fields that are part of the criteria.

Jay
Is "SQL Sever" intentional? (Regardless, I like it. :-)
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
I don't recognize the "within" clause
But I would suggest trying to format everything as a string whose ASCIIbetical sort is the order you want, and then doing one max. For instance, format numbers with leading 0's to some fixed width, format dates as YYYYMMDD:HH:MM:SS, leave strings as themselves, etc.
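
Concretely, that might look like this in Postgres (a sketch only; it assumes c, d, and e are non-negative integers of at most ten digits):

select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date, 'YYYYMMDD')
                   || lpad(c::text, 10, '0')
                   || lpad(d::text, 10, '0')
                   || lpad(e::text, 10, '0')) as sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date, 'YYYYMMDD')
       || lpad(aa.c::text, 10, '0')
       || lpad(aa.d::text, 10, '0')
       || lpad(aa.e::text, 10, '0') = bb.sortkey;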

No, it is not perfect. But it should run fairly quickly.

Cheers,
Ben
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
Not quite sure what you are doing
I'm not quite sure what you are doing from your description.

But in MS SQL Server, one thing that can help many complex queries running on a table with no key is to create a temporary primary key. Just throw an identity column on the starting table. This speeds up the process because SQL Server can optimize a query much better if there is a primary key. You end up gaining much more in processing time than you lose creating the bogus primary key.
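
A minimal sketch in T-SQL (tbl, row_id, and pk_tbl are illustrative names):

alter table tbl add row_id int identity(1,1) not null;
alter table tbl add constraint pk_tbl primary key (row_id);

Drop the column again once the extraction is done.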

Jay
Whew, glad it's not just me that didn't understand.
Pet peeve on SQLServer
is that temporary tables can have only one index (the primary key takes that spot).
Rarely use real temporary tables
Working in SQL Server I rarely used real temporary tables anyway.

In those few cases where I was writing a stored procedure so complex I had to use a temporary table to hold a partial result, it was so complex I wanted to keep the partial results so I could trace my process backwards.

Jay
I just finished writing about 15 stored procedures.
Where I heavily used temporary tables. Why? Because I was given an impossible deadline with incomplete specs that kept changing ("here are some Access queries, replicate them in T-SQL, and have it done by the end of today" - four hours). So 40+ MS Access queries got condensed to 15 sp's - but 20+ of the Access queries were "Make Table" queries, so, well, you know. I typically try to avoid temp tables as well. Here's what my fear is: the design of these sp's is atrocious (I am genuinely ashamed that I wrote them), and I fear I will never be given the time to go back, look at the idiocy of the requirements I was given, and write clean T-SQL that provides the required output with much less lunacy than was in the original design of the Access queries. My guess is that if I were granted that time, I probably wouldn't have to use any temp tables at all.

That said, I wonder if you or Chris have made much use of "table variables". I haven't really any experience with them. I have a sense of the limitations, but was wondering how you liked them.

bcnu,
Mikem
Table variables are great
They run faster since they don't use logging (no transaction overhead). I just wish they'd taken the next logical step and allowed you to pass them as a parameter to another procedure. As it is, all the processing on a table variable has to happen inside one procedure, which makes modularity hard to accomplish.
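
A quick sketch of the pattern, with column types assumed from the sample data upthread:

declare @latest table (a char(1), b int, xdate datetime);

insert into @latest (a, b, xdate)
select a, b, max(date)
  from tbl
 group by a, b;

-- everything that reads @latest has to live in this same procedure;
-- a table variable can't be handed to another stored procedure
select * from @latest;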
Thanks.
If I ever get time to re-design the mess I've made, I'll give them a whirl.
     Slow query needs optimizing - (tangaroa) - (19)
         Re: Slow query needs optimizing - (pwhysall) - (3)
              Just imagine a Beowulf cluster of those. -NT - (ChrisR) - (2)
                 Great, now we're turning into /. - (inthane-chan)
                 Running DOSiX! -NT - (Another Scott)
         Can you post - (jbrabeck) - (6)
             'fraid not - (tangaroa) - (5)
                 pump to file, cat file |sort -k3 >outputfile - (boxley)
                 If you're allowed correlated subquery, then: - (morganek) - (1)
                     Re: If you're allowed correlated subquery, then: - (tangaroa)
                 Re: 'fraid not - (JayMehaffey) - (1)
                     Is "SQL Sever" intentional? (Regardless, I like it. :-) -NT - (ben_tilly)
         I don't recognize the "within" clause - (ben_tilly)
         Not quite sure what you are doing - (JayMehaffey) - (6)
             Whew, glad it's not just me that didn't understand. -NT - (jbrabeck)
             Pet peeve on SQLServer - (ChrisR) - (4)
                 Rarely use real temporary tables - (JayMehaffey) - (3)
                     I just finished writing about 15 stored procedures. - (mmoffitt) - (2)
                         Table variables are great - (ChrisR) - (1)
                             Thanks. - (mmoffitt)
