What if relational expressions were defined as a sequence of transformations on record sets (virtual or actual)?
The idea was partly inspired by:
[link|http://www.mcjones.org/System_R/bs12.html|http://www.mcjones...._R/bs12.html]
and from my XBase days.
The format I am thinking about would be something like:
Tnew = operation(T1...Tn, expression1...expressionN)
Where Tx is a table or record-set reference. These references may be temporary, virtual, or actual. Expressions would reference (have scope over) at most two tables per operation. It is sort of like a math expression where the base operations have only two operands.
Limiting expression scope to two tables at a time simplifies things. If you want to filter based on a joined result, then you join it in a *prior* statement so that you are usually dealing with a "flat" view at any given operation, or at least no more than 2 "data sources" at a time.
Note that a field-list is considered a (virtual) table with one column. Thus, I did not include a "field list" parameter type. For example, if one of the parameters was "[name, start_date, serial]", then it would be internally converted into a table like:
columnName
----------
"name"
"start_date"
"serial"
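As a minimal sketch of that conversion, assuming a table is represented as a list of row dictionaries (the representation is my choice, not something the draft specifies):

```python
# Sketch: a field list such as [name, start_date, serial] is internally
# just a one-column table, here with the column named "columnName".
def field_list_to_table(fields):
    """Represent a field list as a one-column (virtual) table."""
    return [{"columnName": f} for f in fields]

print(field_list_to_table(["name", "start_date", "serial"]))
```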
Anyhow, typical queries may resemble:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T3 = groupSum(T2, [empNo], [Amt])
....
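To make the sequential style concrete, here is a rough Python sketch of the filter and groupSum steps, assuming tables are lists of row dictionaries and predicates are passed as functions (the real language would parse an expression like "x > y"; the function signatures are my guesses, not the draft's):

```python
def filter_op(table, pred):
    """filter: keep rows for which the predicate expression is true."""
    return [row for row in table if pred(row)]

def group_sum(table, group_cols, sum_cols):
    """groupSum: group by the given columns and total the sum columns."""
    out = {}
    for row in table:
        key = tuple(row[c] for c in group_cols)
        if key not in out:
            out[key] = {**{c: row[c] for c in group_cols},
                        **{c: 0 for c in sum_cols}}
        for c in sum_cols:
            out[key][c] += row[c]
    return list(out.values())

fooTable = [{"empNo": 1, "x": 5, "y": 2, "amt": 10},
            {"empNo": 1, "x": 1, "y": 2, "amt": 99},
            {"empNo": 2, "x": 7, "y": 3, "amt": 20}]
T1 = filter_op(fooTable, lambda r: r["x"] > r["y"])  # filter(fooTable, x > y)
T3 = group_sum(T1, ["empNo"], ["amt"])               # groupSum(T1, [empNo], [amt])
```

Each intermediate result (T1, T3) can be inspected on its own, which is the debugging advantage described below.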
You don't have to name the temporary record aliases "Tx"; it is just to make clear which tables are external and which are internal, kind of like SQL table aliases. (Perhaps a special symbol should be used for temporary/virtual tables, such as an ampersand.)
The advantage over SQL is that, first, one can think more sequentially and in smaller chunks at a time. Second, it can be more easily debugged step by step. For example, you could do:
T1 = filter(fooTable, "x > y")
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
usr:TempTable4 = copy(T2)
T3 = groupSum(T2, [empNo], [Amt])
....
"Copy" would copy the work table T2 to a specified database saving area.
Note that other programmers besides me have complained about the non-sequential nature of SQL. It is not just the thought process, but also that there is insufficient separation of scope in SQL. A column-name reference has to be unambiguous, and if you have many Selects in SQL, it gets harder and harder to keep the name-space clean without lots of explicit table or alias references. Things get too "global" in SQL, such that you can't really test in pieces because the pieces are all "integrated".
It needs to be broken into more discrete chunks, or at least smaller chunks, to better mentally manage and debug as isolated units.
Let's take a look at a slightly more complex statement.
T2 = lookup(T1, barTable, [[amt][charge],[time][when]], 1.serial = 2.ssn)
This would copy over the amount and time of a bar tab based on a serial number lookup. (The example idea came from the "bar" in "foobar", and not from watching Cheers, BTW.)
Generally the "many" table of one-to-many joins would be the first table. (I will describe the management of non-matches some other time.)
The middle expression in the square brackets is actually a field translation table. It would be translated into a table like:
ToField  FromField
-------  ---------
amt      charge
time     when
If you have more rows, you may even want to write the statement as something like:

T2 = lookup(T1, barTable, [ [amt] [charge],
                            [time][when],
                            [nog] [tog],
                            [grig][snig],
                          ], 1.serial = 2.ssn)
if you want to make it visually resemble such a table. You could also reference an actual table that contained just such columns. The square brackets are simply a "short-cut" way to define and populate a temporary table.
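As a rough sketch of how a translation table could drive the Lookup, again assuming tables as lists of row dictionaries and the translation as (ToField, FromField) pairs (all representation and signature choices here are mine):

```python
def lookup(t1, t2, translation, key1, key2):
    """Lookup sketch: for each row of t1, find the matching t2 row
    (key1 == key2) and copy each FromField of t2 into the ToField of t1."""
    index = {row[key2]: row for row in t2}   # assumes key2 is unique in t2
    result = []
    for row in t1:
        new_row = dict(row)
        match = index.get(row[key1])
        if match is not None:                # non-match handling deferred, per the text
            for to_field, from_field in translation:
                new_row[to_field] = match[from_field]
        result.append(new_row)
    return result

barTable = [{"ssn": 111, "charge": 9.5, "when": "fri"}]
T1 = [{"serial": 111}]
T2 = lookup(T1, barTable, [("amt", "charge"), ("time", "when")],
            "serial", "ssn")
```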
I was thinking of allowing a table of expressions instead of just field names, but then the same kind of big-ball-of-twine SQL issues may start to creep in. However, even if that were done, it would still be dealing with only two tables at a time. I am still considering this issue. The left column would still be the destination field, but the right column could be something like "1.x + 2.y".
If expressions are voted out and you need calculated fields, then calculate the values into a temporary table before or after performing the look-up, not during. There could be a "Calc" operation, for example. For more complex calculations, you may need to add some "work columns". These would usually be virtual. For example, one could first create some temporary columns that match the columns that will come from a later Lookup operation. After the columns are copied over, calculations could then be done on the new view.
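One possible shape for such a Calc operation, sketched in Python with assumed semantics (each destination column is computed from a per-row expression, and later assignments can see work columns created by earlier ones):

```python
def calc(table, assignments):
    """Calc sketch: assignments maps destColumn -> function(row) -> value."""
    out = []
    for row in table:
        new_row = dict(row)
        for col, fn in assignments.items():
            new_row[col] = fn(new_row)   # earlier work columns are visible here
        out.append(new_row)
    return out

rows = calc([{"amt": 10, "bonus": 5}],
            {"total": lambda r: r["amt"] + r["bonus"],  # work column
             "double": lambda r: r["total"] * 2})       # uses the work column
```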
I have not decided on what the operations would be yet. Here are some candidates:
lookup - Similar to a Join (examples given), but could
also act as kind of an Update.
Filter - Filter a table based on an expression
groupSum - Summarize or group by an expression.
(see note below).
copy - Copy one table to another. An optional
field selection list can be supplied. (Copy
and Lookup may be consolidatable into one;
this requires some pondering.)
Calc - Column calculations on a given table. May not
be needed if we go with the expression-enabled
Lookup command, since one just does an expression-based
look-up on the same table.
Order - Change or supply an ordering to a record-set view.
Schema - Get a column description table
I am not sure if the primitives should include things like averaging or not. Averaging is not very common, and can be calculated using temporary work columns. I am also thinking of having something like this instead of GroupSum:
x = Group(origTable, grouping_expression, grouped_text_columns,
summed_columns, averaged_columns, count_columns)
If you are not using one or the other, then
simply give an empty set "[]". (If you are not
using any of the parameters to the right, then
you don't have to supply them all.)
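A sketch of that Group signature in Python, with the parameter meanings taken from the draft but the exact behavior assumed (the trailing defaults stand in for the "don't have to supply all parameters" rule, and grouped text columns take the first value seen in each group, which is my assumption):

```python
def group(table, group_cols, text_cols=(), sum_cols=(), avg_cols=(), count_cols=()):
    """Group sketch: a column should appear in at most one aggregate list."""
    groups = {}
    for row in table:
        key = tuple(row[c] for c in group_cols)
        groups.setdefault(key, []).append(row)
    out = []
    for key, rows in groups.items():
        rec = dict(zip(group_cols, key))
        for c in text_cols:
            rec[c] = rows[0][c]                            # first text value
        for c in sum_cols:
            rec[c] = sum(r[c] for r in rows)
        for c in avg_cols:
            rec[c] = sum(r[c] for r in rows) / len(rows)
        for c in count_cols:
            rec[c] = len(rows)                             # row count per group
        out.append(rec)
    return out

sales = [{"dept": "a", "amt": 4}, {"dept": "a", "amt": 6}, {"dept": "b", "amt": 5}]
x = group(sales, ["dept"], sum_cols=["amt"], count_cols=["n"])
```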
PERFORMANCE
Past suggestions to do things similar to this drew the criticism that, since the SQL approach is generally not defined as a sequence, there is room for the database engine to optimize the query process for maximum efficiency. In other words, SQL allegedly defines what the result is to be, not how to achieve it.
However, the suggested approach also does not necessarily define a strict sequence. As long as the *result* is the same, it does not matter how the database engine goes about performing the request. I suppose there could be a "non-atomic mode" that could be used to inspect intermediate results, but during production a given set of such commands would be assumed atomic.
For example, these commands generally have the form:
Y = op(X, blah)
This works kind of like a *copy* of X to Y, but with some transformations. The optimizer might notice that X is not being used for anything else after this operation. (There will be some kind of atomic block unit, similar to the way an SQL statement is considered atomic.) Thus, it can alter X in place rather than make an actual copy if it decides that is more efficient.
Whether defining it in such a way provides fewer optimization choices than SQL probably would require detailed expert analysis. However, it should be clear that some optimization is possible.
It is not really that different from SQL anyhow. SQL tends to be infix, while this approach is more prefix-based. Infix is harder to read for long expressions in my opinion.
SQL pattern:
y a (x b c d) e
Suggested pattern:
T1 = x(b c d)
T2 = y(a, T1, e)
It is true that each operation is more isolated in scope in the suggested pattern. This may limit "wide-reaching" automated optimizations, reducing potential performance improvements. On the other hand, it could make parsing faster because the context is smaller.
Also note that the second pattern offers more opportunity to reuse a given record set, since it is reference-based instead of nesting-based. You can reference a given record-set view multiple times, but you generally cannot do that in a typical SQL expression.
SUMMARY
This is just a preliminary suggestion. SQL was purposely designed to be more like natural language rather than optimized to be programmer-friendly.