
Slow query needs optimizing
I've got a table that is just a bunch of raw data, no primary key; basically a logfile saying we did X to Y on date Z. I'm trying to extract a table of unique items using the last date (after picking a few obvious fields to make a primary key out of), but there are duplicates, so I join that result to one with the same date and the pkey fields, grouped on a separate field. There are still duplicates, so I join _that_ to another selection grouped on another field...

I get a bunch of nested subqueries looking something like this pseudo:

pkey values + max (date)
        + max (c) within max(date)
                + max (d) within max(c) within max(date)
                        + max (e) within d within c within date
                                ...


By the time it's 3 subqueries deep, it takes three hours to run. I need to make this more efficient somehow. I figure I'd have better luck reworking the logic than the query text, but I don't see a way of doing it. I tried selecting max(date), max(c) with group by date, but that undid the max(date), as if I had just selected date.

Essentially, I need to set up a grouping/exclusion order: group first to max(date), and if that's not enough to get a unique list, group that result on another field, and so on, for every field that's not part of the result's primary key.

Come to think of it, I could restructure the problem and come up with something which halfway works, but I'd like to see a solution to this inefficiency as I might run into it again with the kind of stuff I'm working with.

System is postgres, btw.
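
A minimal sketch of that grouping/exclusion order using Postgres's DISTINCT ON, which keeps the first row per group under a given sort order (tbl and the column names are assumed from the sample data further down):

select distinct on (a, b) *
  from tbl
 order by a, b, date desc, c desc, d desc, e desc;

The order by clause is the exclusion order: last date first, then the most important c, then d, then e.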
Re: Slow query needs optimizing
Add one of [link|http://h18002.www1.hp.com/alphaserver/sc/sys_sc45_features.html|these].


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
Just imagine a Beowulf cluster of those.
Great, now we're turning into /.
Next thing you know, somebody will post...

In Soviet Russia, a Beowulf cluster of insensitive clods imagines you!

/me shudders.
Nobody wins in a butter eating contest
Running DOSiX!
Can you post
A few records of your raw data and what the result should look like from that data. I must be getting old 'cuz I don't quite follow what you are trying to accomplish. Or it could be that I just like working with "visual" information. I keep thinking that the solution is relatively straightforward, so I must be missing something.
'fraid not
With the Internet policy at work, I can't get any information out without surreptitiously installing ssh software on one of the internet terminals, which is likely to get me asked to leave and not come back. When I need to get help from lists or newsgroups, I have to write down error messages etc. on a piece of paper and sneak it out the door.

Here's some bogus data that might help explain what I'm dealing with:
a  b  date   c   d   e
----------------------
a  1  1999   1   2   3
a  2  2000   1   2   3
a  1  1999   2   2   3
a  1  1999   2   3   3


The new table will have primary key (a,b), and is supposed to contain all the data in the most recent record from the original table for each a,b item.
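
Under the tie-break order described below (date first, then c, then d), that most-recent record works out to:

a  b  date   c   d   e
----------------------
a  1  1999   2   3   3
a  2  2000   1   2   3

The first attempt: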
select B.* from
  (select a,b, max(date) as xDate from tbl group by a,b) as A,
  tbl as B where B.date = A.xDate


max(date) works for a,2 because the data is nice and there is only one last record, but for a,1 it will return 3 records. To pare that down, I decide that c is the next most important field, that some values are more important than others, and, for the sake of simplicity, that max(c) will return the most important value.
select B.* from
  (select * from
    (select a,b, max(date) as xDate from tbl group by a,b) as AA,
    (select a,b, date, max(c) as xC from tbl group by a,b,date) as BB
    where BB.date = AA.xDate
  ) as A, tbl as B where B.c = A.xC


That still leaves duplicates, so I pick the next most important field. As the query takes three hours to run, I start to think that I'm going about this the wrong way, but I don't see another way of doing it.
pump to file, cat file |sort -k3 >outputfile

"You're just like me streak. You never left the free-fire zone.You think aspirins and meetings and cold showers are going to clean out your head. What you want is God's permission to paint the trees with the bad guys. That wont happen big mon." Clete
questions, help? [link|mailto:pappas@catholic.org|email pappas at catholic.org]
If you're allowed correlated subquery, then:
SELECT aa.*
  FROM table_name aa,
       (SELECT a, b, max(date) AS date
          FROM table_name
         GROUP BY a, b) bb
 WHERE aa.a = bb.a
   AND aa.b = bb.b
   AND aa.date = bb.date

You shouldn't need more than one subquery to do this.

Edit: Sorry. I didn't see what you said about getting multiples of "a, b, date" back from the GROUP BY in the subquery. If that's the case, and you need to go to a variable number of tie-breakers, then you've effectively blown the limits of relational calculus.

IMO, your only choice at this point is to drop to a procedural language, stream the records by, and pick out the ones you want to keep. Straight SQL simply will not work if you have *any* duplicate records. Even if each row is unique, nested subqueries will make your SELECT move like pondwater.
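
A rough sketch of that procedural approach in plpgsql, assuming the tbl name and columns from the sample data upthread, and a hypothetical target table tbl2 with the same columns:

create function pick_latest() returns integer as '
declare
    r      tbl%rowtype;
    last_a tbl.a%type;
    last_b tbl.b%type;
begin
    for r in select * from tbl
              order by a, b, date desc, c desc, d desc, e desc loop
        -- the first row per (a,b) under this sort order is the keeper
        if last_a is null or r.a <> last_a or r.b <> last_b then
            insert into tbl2 values (r.a, r.b, r.date, r.c, r.d, r.e);
        end if;
        last_a := r.a;
        last_b := r.b;
    end loop;
    return 0;
end;
' language 'plpgsql';

One sorted pass through the table replaces the nested grouping entirely.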


Second Edit:

Actually, combining with Ben's solution, you can do something like:

select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date)||c||d||e) sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date)||aa.c||aa.d||aa.e = bb.sortkey


Like I said above, though, if you have a duplicate row, then you're screwed. Unless, of course, you're using Oracle and can append the 'rowid' pseudocolumn to the sortkey field.
Edited by morganek April 16, 2004, 11:33:52 PM EDT
Edited by morganek April 16, 2004, 11:43:11 PM EDT
Re: If you're allowed correlated subquery, then:
combining with Ben's solution, you can do something like:
select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date)||c||d||e) sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date)||aa.c||aa.d||aa.e = bb.sortkey


Thanks, I think that'll do it.

I didn't understand what Ben meant when I first read his post, but it looks clear in retrospect. I'm in the habit of thinking of data as discrete objects and didn't consider stringing their text together.
Re: 'fraid not
I am going to assume that in the real table there is some data that would preclude you simply doing this.

select a,b, max(date), max(c), max(d) from tbl group by a,b

I can see two possible solutions other then what Morgan and Ben suggested.

The ugliest is to use a loop and scan through the records one by one. Create an index that contains all the fields you want to include in the selection criteria and then use that to pick through the records one by one, inserting the records you want into a second table.

I do not think you need to go that far here, but I have ended up using them when the criteria for selecting records would not condense into a single query, or when the criteria depended on looking at more than one record at a time.

The second goes like this. Run a query to group by (a,b) and stick them in the second table, with data filled in with nulls. Then do a series of queries like this to fill in the data.
Update tbl2 set (all the data fields) from
  (select B.* from
    (select a,b, max(date) as xDate, count(1) as RowCount, c,d
      from tbl group by a,b,c,d) as A,
  tbl as B where B.date = A.xDate and RowCount = 1) as GD,
tbl2 where GD.a = tbl2.a and GD.b = tbl2.b and tbl2.date is null

The idea in this case is that the query picks out all the rows where there is only one last date and moves them over. Then you can write another query that looks at the next field and moves those over, and so on.
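
In the same pseudo style, the second pass might break the remaining ties on c, again touching only the (a,b) groups that are still null (hand-wavy in the same way as the query above):

Update tbl2 set (all the data fields) from
  (select B.* from
    (select a,b, date, max(c) as xC, count(1) as RowCount
      from tbl group by a,b,date) as A,
  tbl as B where B.date = A.date and B.c = A.xC and RowCount = 1) as GD,
tbl2 where GD.a = tbl2.a and GD.b = tbl2.b and tbl2.date is null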

I've used this a couple of times when 90% or more of the records were clean and there were only a handful of records that needed scrubbing, often by hand.

Given what you have said, in SQL Sever I would probably end up using the same solution you came up with, but would put a primary key on the table and create an index that covers the fields that are part of the criteria.

Jay
Is "SQL Sever" intentional? (Regardless, I like it. :-)
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
I don't recognize the "within" clause
But I would suggest trying to format everything as a string whose ASCIIbetical sort is the order you want, and then doing one max. For instance, format numbers with leading 0's to some fixed width, format dates as YYYYMMDD:HH:MM:SS, leave strings as themselves, etc.
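
Concretely, that might look like this in Postgres (a sketch only; it assumes c, d, and e are non-negative integers of at most ten digits):

select aa.*
  from tbl aa,
       (select a, b,
               max(to_char(date, 'YYYYMMDD')
                   || lpad(c::text, 10, '0')
                   || lpad(d::text, 10, '0')
                   || lpad(e::text, 10, '0')) as sortkey
          from tbl
         group by a, b) bb
 where aa.a = bb.a
   and aa.b = bb.b
   and to_char(aa.date, 'YYYYMMDD')
       || lpad(aa.c::text, 10, '0')
       || lpad(aa.d::text, 10, '0')
       || lpad(aa.e::text, 10, '0') = bb.sortkey;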

No, it is not perfect. But it should run fairly quickly.

Cheers,
Ben
To deny the indirect purchaser, who in this case is the ultimate purchaser, the right to seek relief from unlawful conduct, would essentially remove the word consumer from the Consumer Protection Act
- [link|http://www.techworld.com/opsys/news/index.cfm?NewsID=1246&Page=1&pagePos=20|Nebraska Supreme Court]
Not quite sure what you are doing
I'm not quite sure what you are doing from your description.

But in MS SQL Server, one thing that can help many complex queries running on a table with no key is to create a temporary primary key. Just throw an identity column on the starting table. This speeds up the process because SQL Server can optimize a query much better if there is a primary key. You end up gaining much more in processing time than you lose creating the bogus primary key.
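
A minimal sketch in T-SQL (tbl, row_id, and pk_tbl are illustrative names):

alter table tbl add row_id int identity(1,1) not null;
alter table tbl add constraint pk_tbl primary key (row_id);

Drop the column again once the extraction is done.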

Jay
Whew, glad it's not just me that didn't understand.
Pet peeve on SQLServer
is that temporary tables can have only one index (the primary key takes that spot).
Rarely use real temporary tables
Working in SQL Server I rarely used real temporary tables anyway.

In those few cases where I was writing a stored procedure so complex I had to use a temporary table to hold a partial result, it was so complex I wanted to keep the partial results so I could trace my process backwards.

Jay
I just finished writing about 15 stored procedures.
Where I heavily used temporary tables. Why? Because I was given an impossible deadline with incomplete specs that kept changing ("here are some Access queries, replicate them in T-SQL, and have it done by the end of today" - four hours). So 40+ MS Access queries got condensed to 15 sp's - but 20+ of the Access queries were "Make Table" queries, so, well, you know. I typically try to avoid temp tables as well. Here's what my fear is: the design of these sp's is atrocious (I am genuinely ashamed that I wrote them), and I fear I will never be given the time to go back, look at the idiocy of the requirements I was given, and write clean T-SQL that provides the required output with much less lunacy than was in the original design of the Access queries. My guess is that if I were granted that time, I probably wouldn't have to use any temp tables at all.

That said, I wonder if you or Chris have made much use of "table variables". I haven't really any experience with them. I have a sense of the limitations, but was wondering how you liked them.

bcnu,
Mikem
Table variables are great
They run faster since they don't use logging (no transaction overhead). I just wish they'd taken the next logical step and allowed you to pass them as a parameter to another procedure. As it is, all the processing on a table variable has to happen inside one procedure, which makes modularity hard to accomplish.
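
A quick sketch of the pattern, with column types assumed from the sample data upthread:

declare @latest table (a char(1), b int, xdate datetime);

insert into @latest (a, b, xdate)
select a, b, max(date)
  from tbl
 group by a, b;

-- everything that reads @latest has to live in this same procedure;
-- a table variable can't be handed to another stored procedure
select * from @latest;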
Thanks.
If I ever get time to re-design the mess I've made, I'll give them a whirl.
     Slow query needs optimizing - (tangaroa) - (19)
         Re: Slow query needs optimizing - (pwhysall) - (3)
              Just imagine a Beowulf cluster of those. -NT - (ChrisR) - (2)
                 Great, now we're turning into /. - (inthane-chan)
                 Running DOSiX! -NT - (Another Scott)
         Can you post - (jbrabeck) - (6)
             'fraid not - (tangaroa) - (5)
                 pump to file, cat file |sort -k3 >outputfile - (boxley)
                 If you're allowed correlated subquery, then: - (morganek) - (1)
                     Re: If you're allowed correlated subquery, then: - (tangaroa)
                 Re: 'fraid not - (JayMehaffey) - (1)
                     Is "SQL Sever" intentional? (Regardless, I like it. :-) -NT - (ben_tilly)
         I don't recognize the "within" clause - (ben_tilly)
         Not quite sure what you are doing - (JayMehaffey) - (6)
             Whew, glad it's not just me that didn't understand. -NT - (jbrabeck)
             Pet peeve on SQLServer - (ChrisR) - (4)
                 Rarely use real temporary tables - (JayMehaffey) - (3)
                     I just finished writing about 15 stored procedures. - (mmoffitt) - (2)
                         Table variables are great - (ChrisR) - (1)
                             Thanks. - (mmoffitt)
