What if relational expressions were defined as a sequence of transformations on record sets (virtual or actual)?
The idea was partly inspired by:
[link|http://www.mcjones.org/System_R/bs12.html|http://www.mcjones...._R/bs12.html]
and from my XBase days.
The format I am thinking about would be something like:
Tnew = operation(T1...Tn, expression1...expressionN)
Where Tx is a table or record-set reference. These references may be temporary, virtual, or actual. Expressions would reference (have scope over) at most two tables per operation. It is sort of like a math expression where the base operations have only two operands.
Limiting expression scope to two tables at a time simplifies things. If you want to filter based on a joined result, then you join it in a *prior* statement so that you are usually dealing with a "flat" view at any given operation, or at least no more than 2 "data sources" at a time.
Note that a field-list is considered a (virtual) table with one column. Thus, I did not include a "field list" parameter type. For example, if one of the parameters was "[name, start_date, serial]", then it would be internally converted into a table like:
columnName
----------
"name"
"start_date"
"serial"
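As a minimal sketch of that conversion, assuming a table is represented as a list of row dictionaries (the representation is my choice, not something the draft specifies):

```python
# Sketch: a field list such as [name, start_date, serial] is internally
# just a one-column table, here with the column named "columnName".
def field_list_to_table(fields):
    """Represent a field list as a one-column (virtual) table."""
    return [{"columnName": f} for f in fields]

print(field_list_to_table(["name", "start_date", "serial"]))
```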
Anyhow, typical queries may resemble:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T3 = groupSum(T2, [empNo], [Amt])
....
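To make the sequential style concrete, here is a rough Python sketch of the filter and groupSum steps, assuming tables are lists of row dictionaries and predicates are passed as functions (the real language would parse an expression like "x > y"; the function signatures are my guesses, not the draft's):

```python
def filter_op(table, pred):
    """filter: keep rows for which the predicate expression is true."""
    return [row for row in table if pred(row)]

def group_sum(table, group_cols, sum_cols):
    """groupSum: group by the given columns and total the sum columns."""
    out = {}
    for row in table:
        key = tuple(row[c] for c in group_cols)
        if key not in out:
            out[key] = {**{c: row[c] for c in group_cols},
                        **{c: 0 for c in sum_cols}}
        for c in sum_cols:
            out[key][c] += row[c]
    return list(out.values())

fooTable = [{"empNo": 1, "x": 5, "y": 2, "amt": 10},
            {"empNo": 1, "x": 1, "y": 2, "amt": 99},
            {"empNo": 2, "x": 7, "y": 3, "amt": 20}]
T1 = filter_op(fooTable, lambda r: r["x"] > r["y"])  # filter(fooTable, x > y)
T3 = group_sum(T1, ["empNo"], ["amt"])               # groupSum(T1, [empNo], [amt])
```

Each intermediate result (T1, T3) can be inspected on its own, which is the debugging advantage described below.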
You don't have to name the temporary record aliases "Tx"; it is just to make clear which tables are external and which are internal, kind of like SQL table aliases. (Perhaps a special symbol should be used for temporary/virtual tables, such as an ampersand.)
The advantage over SQL is that, first, one can think more sequentially and in smaller chunks at a time. Second, it can be more easily debugged step by step. For example, you could do:
T1 = filter(fooTable, "x > y")
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
usr:TempTable4 = copy(T2)
T3 = groupSum(T2, [empNo], [Amt])
....
"Copy" would copy the work table T2 to a specified database saving area.
Note that other programmers besides me have complained about the non-sequential nature of SQL. It is not just the thought process, but also that there is insufficient separation of scope in SQL. A column-name reference has to be unambiguous, and if you have many Selects in SQL, it gets harder and harder to keep the name-space clean without lots of explicit table or alias references. Things get too "global" in SQL, such that you can't really test in pieces because the pieces are all "integrated".
It needs to be broken into more discrete chunks, or at least smaller chunks, to better mentally manage and debug as isolated units.
Let's take a look at a slightly more complex statement.
T2 = lookup(T1, barTable, [[amt][charge],[time][when]], 1.serial = 2.ssn)
This would copy over the amount and time of a bar tab based on a serial number lookup. (The example idea came from the "bar" in "foobar", and not from watching Cheers, BTW.)
Generally the "many" table of one-to-many joins would be the first table. (I will describe the management of non-matches some other time.)
The middle expression in the square brackets is actually a field translation table. It would be translated into a table like:
ToField  FromField
-------  ---------
amt      charge
time     when
If you have more rows, you may even want to write the statement as something like:

T2 = lookup(T1, barTable, [ [amt] [charge],
                            [time][when],
                            [nog] [tog],
                            [grig][snig],
                          ], 1.serial = 2.ssn)
if you want to make it visually resemble such a table. You could also reference an actual table that contained just such columns. The square brackets are simply a "short-cut" way to define and populate a temporary table.
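As a rough sketch of how a translation table could drive the Lookup, again assuming tables as lists of row dictionaries and the translation as (ToField, FromField) pairs (all representation and signature choices here are mine):

```python
def lookup(t1, t2, translation, key1, key2):
    """Lookup sketch: for each row of t1, find the matching t2 row
    (key1 == key2) and copy each FromField of t2 into the ToField of t1."""
    index = {row[key2]: row for row in t2}   # assumes key2 is unique in t2
    result = []
    for row in t1:
        new_row = dict(row)
        match = index.get(row[key1])
        if match is not None:                # non-match handling deferred, per the text
            for to_field, from_field in translation:
                new_row[to_field] = match[from_field]
        result.append(new_row)
    return result

barTable = [{"ssn": 111, "charge": 9.5, "when": "fri"}]
T1 = [{"serial": 111}]
T2 = lookup(T1, barTable, [("amt", "charge"), ("time", "when")],
            "serial", "ssn")
```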
I was thinking of allowing a table of expressions instead of just field names, but then the same kind of big-ball-of-twine SQL issues may start to creep in. However, even if that were done, it would still be dealing with only two tables at a time. I am still considering this issue. The left column would still be the destination field, but the right column could be something like "1.x + 2.y".
If expressions are voted out and you need calculated fields, then calculate the values into a temporary table before or after performing the look-up, not during. There could be a "Calc" operation, for example. For more complex calculations, you may need to add some "work columns". These would usually be virtual. For example, one could first create some temporary columns that match the columns that will come from a later Lookup operation. After the columns are copied over, calculations could then be done on the new view.
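One possible shape for such a Calc operation, sketched in Python with assumed semantics (each destination column is computed from a per-row expression, and later assignments can see work columns created by earlier ones):

```python
def calc(table, assignments):
    """Calc sketch: assignments maps destColumn -> function(row) -> value."""
    out = []
    for row in table:
        new_row = dict(row)
        for col, fn in assignments.items():
            new_row[col] = fn(new_row)   # earlier work columns are visible here
        out.append(new_row)
    return out

rows = calc([{"amt": 10, "bonus": 5}],
            {"total": lambda r: r["amt"] + r["bonus"],  # work column
             "double": lambda r: r["total"] * 2})       # uses the work column
```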
I have not decided on what the operations would be yet. Here are some candidates:
lookup - Similar to a Join (examples given), but could
also act as kind of an Update.
Filter - Filter a table based on an expression
groupSum - Summarize or group by an expression.
(see note below).
copy - Copy one table to another. An optional
field selection list can be supplied. (Copy
and Lookup may be consolidatable into one;
this requires some pondering.)
Calc - Column calculations on a given table. May not
be needed if we go with the expression-enabled
Lookup command, since one just does an expression-based
look-up on the same table.
Order - Change or supply an ordering to a record-set view.
Schema - Get a column description table
I am not sure if the primitives should include things like averaging or not. Averaging is not very common, and can be calculated using temporary work columns. I am also thinking of having something like this instead of GroupSum:
x = Group(origTable, grouping_expression, grouped_text_columns,
summed_columns, averaged_columns, count_columns)
If you are not using one or the other, then
simply give an empty set "[]". (If you are not
using any of the parameters to the right, then
you don't have to supply them all.)
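A sketch of that Group signature in Python, with the parameter meanings taken from the draft but the exact behavior assumed (the trailing defaults stand in for the "don't have to supply all parameters" rule, and grouped text columns take the first value seen in each group, which is my assumption):

```python
def group(table, group_cols, text_cols=(), sum_cols=(), avg_cols=(), count_cols=()):
    """Group sketch: a column should appear in at most one aggregate list."""
    groups = {}
    for row in table:
        key = tuple(row[c] for c in group_cols)
        groups.setdefault(key, []).append(row)
    out = []
    for key, rows in groups.items():
        rec = dict(zip(group_cols, key))
        for c in text_cols:
            rec[c] = rows[0][c]                            # first text value
        for c in sum_cols:
            rec[c] = sum(r[c] for r in rows)
        for c in avg_cols:
            rec[c] = sum(r[c] for r in rows) / len(rows)
        for c in count_cols:
            rec[c] = len(rows)                             # row count per group
        out.append(rec)
    return out

sales = [{"dept": "a", "amt": 4}, {"dept": "a", "amt": 6}, {"dept": "b", "amt": 5}]
x = group(sales, ["dept"], sum_cols=["amt"], count_cols=["n"])
```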
PERFORMANCE
Past suggestions to do things similar to this drew the criticism that, since the SQL approach is generally not defined as a sequence, there is room for the database engine to optimize the query process for maximum efficiency. In other words, SQL allegedly defines what the result is to be, not how to achieve it.
However, the suggested approach also does not necessarily define a strict sequence. As long as the *result* is the same, it does not matter how the database engine goes about performing the request. I suppose there could be a "non-atomic mode" that could be used to inspect intermediate results, but during production a given set of such commands would be assumed atomic.
For example, these commands generally have the form:
Y = op(X, blah)
This works kind of like a *copy* of X to Y, but with some transformations. The optimizer might notice that X is not being used for anything else after this operation. (There will be some kind of atomic block unit, similar to the way an SQL statement is considered atomic.) Thus, it can alter X in place rather than make an actual copy if it decides that is more efficient.
Whether defining it in such a way provides fewer optimization choices than SQL probably would require detailed expert analysis. However, it should be clear that some optimization is possible.
It is not really that different from SQL anyhow. SQL tends to be infix, while this approach is more prefix-based. Infix is harder to read for long expressions in my opinion.
SQL pattern:
y a (x b c d) e
Suggested pattern:
T1 = x(b c d)
T2 = y(a, T1, e)
It is true that each operation is more isolated in scope in the suggested pattern. This may limit "wide-reaching" automated optimizations, reducing potential performance improvements. On the other hand, it could make parsing faster because the context is smaller.
Also note that the second pattern offers more opportunity to reuse a given record set, since it is reference-based instead of nesting-based. You can reference a given record-set view multiple times, but you generally cannot do that in a typical SQL expression.
SUMMARY
This is just a preliminary suggestion. SQL was purposely designed to be more like natural language rather than optimized to be programmer-friendly.