Both Josh Berkus and Hubert made blog entries about our last excerpt. In general I will start off by saying that we are more or less in agreement on what is good SQL coding format.
Here are the things I think we can all agree on
SQL Keywords should be upper cased
Field names should be prefixed with their tables especially when you have a multi-join statement involved
Use JOIN syntax instead of stuffing everything in the WHERE though we are highly against just saying JOIN. We prefer INNER JOIN
The major points of contention I think are
Should you use aliases over tables and if you use aliases should you keep them short or more descriptive. Josh thinks table names should be used where possible and when aliases are used they should be longer than a few characters
and Hubert agrees with us that short aliases are fine and in fact desirable. I think we all agree aliases should be meaningful when used, but our idea of what it means to be meaningful is a little different.
In use of JOIN syntax -- we prefer using INNER JOIN instead of using just JOIN and in fact find it quite irritating that PostgreSQL rewrites our INNERs as JOIN. I suspect Hubert and Josh and many other PostgreSQL folk are on
the other side of the fence on this. The reason we feel strongly about this is there are so many kinds of JOINS - INNER JOIN, RIGHT JOIN, LEFT JOIN, CROSS JOIN, FULL JOIN, and the horrid NATURAL JOIN (that should be shot and put out of its misery). To just say JOIN to us is just confusing.
While you can write LEFT OUTER JOIN, the OUTER is kind of pointless because no one goes around writing LEFT INNER JOINS
Use well supported standards where possible. This means CURRENT_TIMESTAMP instead of now(). now() is not in all databases, but most relational databases support CURRENT_TIMESTAMP
So let us start by looking at Josh's last SQL statement and compare to how we would write it.
Josh I hope you didn't come up with this table structure. We personally hate the use of id as a field identifier for every table, but that's another story. We prefer something like forum_id, post_id etc. and whats the deal with giving a field
name of date. Stay away from data types as field names.
--JoshSELECT persons.id, persons.first_name, persons.last_name, forums.category,
COUNT(DISTINCT posts.id)as num_posts,
COALESCE(MAX(comments.rating), 0)AS highest_rating,
COALESCE(MIN(comments.rating), 0)AS lowest_rating
FROM persons JOIN posts ON persons.id= posts.author
JOIN forums on posts.forum = forums.idLEFTOUTERJOIN comments ON posts.id= comments.post
WHERE persons.status >0AND forums.ratings = TRUE
AND comments.date >(now()- INTERVAL '1year')GROUPBY persons.id, persons.first_name, persons.last_name, forums.category
HAVINGcount(DISTINCT posts.id)>0ORDERBY persons.last_name, persons.first_name;
--Us--Leotendstopreferuppercasealiasfortablelowercaseorpropercaseforfields.--LeoalsolikestoseehisSELECTfieldsacrossonelineregardlesshowmanybecause--hehas2widescreenubberhugemonitorsandwantstotakefulladvantageofthem--Icouldn'tdemonstratethiswithoutmessinguptheformatofthepage.SELECT P.id, P.first_name, P.last_name, F.category, COUNT(DISTINCT PO.id)as num_posts,
COALESCE(MAX(C.rating), 0)AS highest_rating, COALESCE(MIN(C.rating), 0)AS lowest_rating
FROM persons AS P INNERJOIN posts AS PO ON P.id= PO.author
INNERJOIN forums As F ON PO.forum = F.idLEFTJOIN comments AS C ON PO.id= C.post
WHERE P.status >0AND F.ratings = TRUE
AND C.date >(CURRENT_TIMESTAMP- INTERVAL '1year')GROUPBY P.id, P.first_name, P.last_name, F.category
HAVINGCOUNT(DISTINCT PO.id)>0ORDERBY P.last_name, P.first_name;
--Reginaflipflopsandoftendependsondatabase--sheisworkingwiththatdayormoment--whethersheuppercasesorlowercasesorpropercases.--ItmustbenotedthatReginaalsohas2wideubberscreenstoworkwith--becauseLeobelieveseveryprogrammershouldhaveatleast36inchestoworkwith.--Onceshefinallygetsadjustedtotheenormityofthesethings,--shewillusethespacetokeep20applicationsopenatthesametimeandtrainhermind--formassiveparallelconcentrationandget20timesmoreworkdoneSELECT pe.id, pe.first_name, pe.last_name, f.category, COUNT(DISTINCT po.id)AS num_posts,
COALESCE(MAX(c.rating), 0)AS highest_rating, COALESCE(MIN(c.rating), 0)AS lowest_rating
FROM persons AS pe INNERJOIN posts AS po ON pe.id= po.author
INNERJOIN forums AS f ON po.forum = f.idLEFTJOIN comments AS c ON po.id= c.post
WHERE pe.status >0AND f.ratings = TRUE
AND c.date >(CURRENT_TIMESTAMP- INTERVAL '1year')GROUPBY pe.id, pe.first_name, pe.last_name, f.category
HAVINGCOUNT(DISTINCT po.id)>0ORDERBY pe.last_name, pe.first_name;
I will start off by saying for the above example -- Josh is lucky because his table names are naturally short and we are
disadvantaged in this example because two tables start with the same letter. Imagine if the table names were much more descriptive how repetitive and noisy Josh's scene would become.
The reason we prefer short aliases over long is that it allows you to fit more fields on the same line and scan them quickly and still prefix with the table names.
Leo prefers upper case his table because he can quickly spot where a table name ends and field begins. Generally speaking you usually have 4 tables or fewer in a statement, the FROM clause is not that far away
as a legend, and the tables
usually start with different letters, so its not a large mind cramp for us to remember F is for Forum and C is for Comments.
Unlike other languages, you then need to keep on repeating this variable
over and over again which is much more typing as well as just makes your statement that much longer to digest.
As a last gripe to what Josh said. Just because you have 255 characters at your disposal as opposed to the 8 you once had, doesn't mean you should feel compelled to use them all. Long live Fortran. It is older than
any language I can think of including COBOL and has stood the test of time and is still undergoing innovation.
In regards to the table.id argument, why not consider semantics? table.id is semantic. table.table_id is just plain redundant and insulting to coding intelligence.
Why do the extra work for no explicit benefit? Note how you guys state, "Field names should be prefixed with their tables." You might as well go on to say, "... and id columns should also be prefixed with their tables." Read that back to yourselves and let your mind wobble. Thanks for helping to push non beneficial "standards" to make my day-to-day tasks more of a pain in the ass.
First of all these are just opinions and part of the point of this article is that standards are not quite as cut and dry as some people think. Josh I'm sure feels strongly about his style just as much as we feel strongly about our style.
The 2 problems I see with id is
1) You always have to alias it in the SELECT because who wants to see id in a multi join output when you can't see the tables that are being joined.
In many cases we need to output the primary key for example if the query is a view to be joined later with other things and if you are going to have to alias to a meaningful name, you might as well do it in the table.
2) Its just distracting to see all fields named the same unless they mean the same thing (yah we can argue an id is an id, but an id is not always an integer so why should it always have the same name and that's a whole other argument as to whether we should always use surrogate keys). first_name is a first_name and always means the same thing okay I get it.
When you are using short aliasing of tables its especially annoying.
PO.id is just not quite as useful to see as PO.post_id. In Josh' model its not quite as bad since he always spells out the table name. So I guess in Josh's model it works if you never need to output the primary key in the query.
I'll have to think about this a bit more. You are probably right. Rereading Celko's paper I could read it a couple of ways.
The way I have always understood it, a column defines the whole set of data of a specific slot (column) in a table. A field defines a specific element of data that lives in a specific row in a specific column.
So when I think of writing SQL statements -- I think of them as fields, because I am not selecting the whole column 0 table (I am only selecting a set of fields that live in a column). But yes I am selecting from columns so perhaps I should call it columns.
So when I am defining a table structure I think columns. When I am selecting data, I think fields.
The other reason I just assume not bother talking about columns is we do a lot of financial consulting and when you start talking columns - the first crazy thought that pops into finance peoples minds is (Ah so a database is a spreadsheet - we have columns too). I just assume not get into that mindset of thinking of databases as spreadsheets even though they both have columns and I have partaken in the guilty pleasure of simulating spreadsheet behavior in databases :)