Pedantic SQL is a style guide and code beautifier for ANSI SQL intended to make
SELECT statements more readable. I use the term ‘beautifier’ loosely because one may well argue that no amount of formatting will be sufficient to make SQL look pretty. I call it ‘pedantic’ because it is insistently rigid in how it is applied. But this rigidity is what makes the resulting queries readable. The lack of consistency in SQL writing styles makes it entirely too difficult to read others’ queries. A lot of bad practices have evolved which detract from query readability and extensibility. A query should be understandable at a quick glance. Following this style guide will make your queries a lot easier to grasp. It will also make them a lot easier to extend or build upon later.
Nobody seems to care very much about how to format SQL code. As a result, most SQL is absolutely godawful to look at. Both analysts and programmers are guilty of this. Many analysts who do not have experience writing code in other programming languages do not give much thought to how they format their code. Programmers typically look at SQL queries as a chore, a distraction from their primary language, and thus they put only the tiniest amount of effort towards formatting their SQL.
This style guide is an attempt to change all that. By following these rules, you will be able to write SQL queries that are neat, readable, and easy to share and modify.
SQL query workflow recommendations
You should use a proper code editor with a monospaced font and syntax highlighting that recognizes the ANSI SQL standard. The dialect is not particularly important, just choose one that covers the basics (keywords, literals, operators).
Most All GUI SQL clients suck. They are not an acceptable substitute for a good text editor. You will frequently be required to interact with shell or terminal clients since most database servers are on headless remote machines. Avoid the practice of opening a shell client and typing in your query free-form. SQL queries are not sentences. When you submit a query in a shell, it is not easy to edit or run it again. Compose your query in a text editor so you can save it locally and have a convenient means of editing. Then copy and paste into your client. Yes, this is a painful extra step, but is the only way to maintain control of your code. Strongly consider using a version control system to backup your queries and share them with your org.
Pedantic SQL style guide
Here are the rules of Pedantic SQL, in no particular order because they are all equally pedantic.
Do not capitalize keywords
This isn’t COBOL. Uppercase keywords take longer to type and look ugly. If you are using them to differentiate between keywords and column names or table names, white space and proper indentation will do a much better job.
Do not use leading commas
Punctuation at the beginning of a line is atrocious. This unfortunate habit is an attempt to solve for a truly annoying problem of SQL syntax errors: missing commas. SQL interpreters are dumb as stumps, they won’t give you any hints about where a syntax error is, but 98.5% of the time the problem is a missing comma in a long list of statements in the
SELECT clause. In a mis-guided attempt to protect against this comic omission, some smart query writer decided one day to use leading commas and thereby forced the rest of us to endure this disgusting looking hack. And it doesn’t even solve the problem! You still have to remember not to put the comma on the first line, so how is that any better than remembering not to put the comma on the last line?? Sentences don’t begin with punctuation
,so please avoid this because it merely serves to make your query look awful.
Always use table aliases
This is a good practice even if you are only working with one table. What happens later when you want to join to another table? If you have aliased everything in your
WHERE clauses, you’ll have nothing to worry about. Long complex queries with multiple tables usually begin as short simple queries on one table. A habit of aliasing early will make it easier to build your simple queries into more complicated ones.
Use short relevant abbreviations as table aliases
Avoid using ‘a’ and ‘b’ unless you will only ever have two tables in your entire schema (and then only if the table names actually start with ‘a’ and ‘b’). Here is a good list of table aliases:
- customers c (or cust)
- devices d (or dev)
- orders o
- order_items oi
- products p (or prod)
- users u
- impressions i (or impr)
- clicks c (or ck)
- global_weekly_sales_aggregates gwsa (or agg)
When you need to self-join (not ideal but sometimes necessary), append the alias with a 1 or 2 to distinguish which copy of the table you are operating on. For pre/post analysis, try appending ‘_pre’ and ‘_post’ or just use ‘pre’ and ‘post’ to keep it simple.
Limit the use of one-liners
Unless your query is short and simple and will really fit on one line, where one line is something around 80 characters, you should use the more verbose line formatting and indentation rules below. Again, the focus is on readability and extensibility. You might start with something simple like:
But as soon as you need to add more filter predicates or columns in your
SELECT clause, your one-liner can quickly become ungainly:
Begin each major keyword on a new line. Indent with four spaces.
The Pedantic way to format a query uses newlines and indentation to put things in consistent locations relative to each other. This makes it easy to see what is happening at a quick glance.
Use ANSI join syntax and format as follows:
Joins to inline queries
When joining to an inline query, indent the query in the join clause to visually separate the inner query from the outer query. The following pattern is useful to avoid joining two large fact tables together:
A common query pattern that arises is generating a frequency distribution. Consider a table
devices with a primary key of
device_id and a foreign key
user_id referencing the user associated with the device. You want to generate the distribution of the number of devices per user. The Pedantic way to write this query is:
Group by and order by
Use column references when available. Technically, this violates the ANSI SQL standard because the
GROUP BY clause can be processed before the engine even begins to look at the
SELECT clause, but it is a widely supported and useful feature. Interpreters that do not support this feature are annoying, but also common. It is tiresome and an unnecessary burden to be forced to type the entire contents of the
SELECT statement again in the
GROUP BY clause. To add insult to injury, however, those Ellison-esque interpreters refuse to allow you to put aliases in the
GROUP BY references, forcing a lot of extra editing and room for creating syntax errors. Where your server allows them, using column references is highly convenient.
Group by and having
HAVING clause modifies a
GROUP BY. Here are some good ways to format:
There are two ways of building a case expression.
There are two important reasons to avoid the second form of the case statement: you are restricted to equality operations and you are restricted to evaluating the expression using only one column. You may want to use the second method, but as soon as you need to do anything with other operators or columns, you will have to rewrite it, so just use the first method. As an added bonus, the first method is more naturally readable. It is almost poetic whereas the second method sounds stilted.
There are all kinds of special functions that try to deal with null values:
nanvl, and probably some more that you should immediately forget about. The only one that you need is
coalesce and it is part of ANSI so portability is not an issue. For anything that requires more complexity than what
coalesce can provide, use a
Special indentation for
Operator precedence rules are tricky and an easy source of logical errors. The filter predicates in a
WHERE clause are typically linked with
AND, so when you need to use an
OR, wrap it in a set of parentheses to make sure your logic is understood by the interpreter. Additionally, provide some extra indentation and formatting to make the
OR stand out visually.
In PostgreSQL, MySQL, Vertica, Oracle, and SQL Server,
AND always has higher operator precedence than
OR. The nomenclature of levels of operator precedence is somewhat confusing. What does it mean to have higher or lower precedence? An operator with higher precedence is executed before an operator with lower precedence. Operators at the same level of precedence are executed in sequence left to right. To disambiguate the situation, think of a set of parentheses around the operator with higher precedence.
In most queries, filter predicates are meant to be conjunctive, and are thus joined together with the
AND operator. Where this can lead to trouble is when you want to nest an
OR condition within a long list of
This query will probably not do what the author intended. It will return all backordered books ordered on July 1 in addition to all orders with a ship status of pending. What you probably want is the following, made explicit with the parentheses:
It is generally advisable when mixing conditional operators to use parentheses to make it clear both to the interpreter and to other human readers exactly what logic you are intending.
Date and time literals are a special class of hell for SQL query writers. Your database may support date literals like this
'2015-07-01' but will it perform date arithmetic using this form, or will it treat it sometimes like a date and other times like a string?
Be very careful with date literals, since the syntax between one platform and another can vary greatly. Use one of these date functions appropriate to your platform for treating date literals as date column types.
When joining, use
ON instead of
USING only works if the join columns are named the same in both tables and, for some bizarre reason, you can’t reference these columns in your
SELECT clause. Why? Who cares, just use
ON to specify your join conditions.
Only use ANSI join syntax
The ANSI join syntax allows you to cleanly separate join conditions from filter predicates. When using old-style joins (comma-separated lists of tables in the
FROM clause), this essentially overloads the
WHERE clause with join conditions and this makes it easy to inadvertently miss a crucial join condition.
You’ve just done a cross join between your orders and order_items tables which will massively inflate the resulting sum. Instead, if you use ANSI join syntax, it is a lot more difficult to lose your join conditions in a long list of filter predicates:
Avoid putting join conditions in the where clause
The previous example makes a case for using ANSI join syntax as a way of avoiding mistaken omissions of join conditions. But this only works if you are vigilant about keeping join conditions in the
JOIN clause and filter predicates in the
WHERE clause. It’s easy to be lax about this because you can actually put all your filters in the
ON clause of a join, even if they have nothing to do with joining the two tables! That’s because the optimizer doesn’t know or care about the difference between a join condition and a filter predicate, they are simply logical rules used to limit the result set from the initial Cartesian product of the tables. Likewise, you can just as easily put join conditions in the
WHERE clause. But that’s not the Pedantic way. Clean separation of join conditions from filter predicates is more of a convenience to you, the query writer, than it is to the SQL engine.
But if these are only conveniences, why bother? Well, there are more than just aesthetic factors involved in this rule. Consider this query where we want to count the number of impressions and clicks on each ad for each user:
After running this we realize that it will only return impressions if the user has clicked at least once. Since the query is using an inner join, it will only return records in both tables. So, what we really want to do is change this to a left join:
These queries will return identical results. Why? Because the
WHERE clause contains what should really be a join condition. Your intent to make this a left join is subverted by the presence of the line
and i.ad_source = ck.ad_source.
When performing an outer join, ensure that there are no references to the optional table in the
WHERE clause, or you will instead be doing an inner join.
Avoid putting filter predicates in the join clause
As a corollary, you don’t want to overload your
JOIN clause with what are effectively filter predicates that belong in the
WHERE clause. Maintaining a clean distinction between these two concepts helps ensure that unintended side effects don’t crop up when making changes to the join type. It also aids readability because you know that when you are looking at the join conditions, you are looking at only the information that is necessary to join the tables together, while the
WHERE clause contains the filters that reduce your result set to only the records you are interested in retrieving.
Avoid Oracle-style outer join syntax
Oracle introduced the
(+) operator to signify an optional (null-able) column many years before the ANSI outer join syntax was standardized. For several years after that, Oracle’s optimizer was far more efficient performing outer joins using the old-style syntax. However, current versions of Oracle fully support the ANSI syntax without any performance penalties. Oracle now recommends using the ANSI syntax because there are several important deficiencies in the Oracle syntax that are solved by using the ANSI syntax. SQL Server had a similar outer join operator
*= but it never worked correctly anyway. Definitely stay away from that one!
When you do a
count(*), you are instructing the database to examine every column in the result set looking for a non-null value. In addition to being sub-optimal, this may also lead to inaccurate results in cases where you have nulls where you aren’t expecting them. In most cases, you do not need to examine the entire record, you are only interested in counting non-null values on one column, typically a primary key. Where you want to count all records in the result set, regardless of whether some or all columns are null, use
count(1). With the first method, the optimizer needs to evaluate only one column, and using the second method, it doesn’t even need to do that, it simply returns a count for each record in the result set.
Most of the time, cross joins (aka Cartesian joins), are not desirable. However, there are valid cases in which you really do want this type of join. The syntax for this is
CROSS JOIN without an
ON condition. Consider the following scenario in which you have a fact table with a user registration date and you would like to generate a report showing the cumulative counts of registered users by week. Build a dimension table called
weeks where each row contains a week_starting and week_ending day. When you do a cross join between the user registration table and the weeks table, the intermediate result set will repeat your user registration table for each week. This makes it easy to generate the desired report of cumulative user registrations.
Strongly prefer left join to right join
It is more natural to think of outer joins as occurring between a required table on the left and an optional table on the right. This is an arbitrary choice, but one where consistency will greatly aid query readability and ease of interpretation. It’s possible to write the same query using either method, but the apparent simplicity of altering a single keyword from
RIGHT actually causes a huge disconnect in our brains. If you are used to conceptualizing left joins, it takes a lot of mental headache to rewire your brain to look at it from the opposite direction.
In my SQL code repository, I currently count 1,523 occurrences of “left join” and 16 occurrences of “right join”. The only reasons I have used “right join” were to maintain consistency in queries inherited from another analyst, and a literal translation of a pig script where I again wanted to maintain consistency with the original script. For most purposes, you should stick to left join because being consistent will make it easier to wrap your head around what is happening.
Omit the “outer” when doing a left join
It’s redundant. When you say
LEFT JOIN, that is by definition an outer join. There’s no other type of left join and no reason to type the extra word. The only time you should use the
OUTER keyword is with a
FULL OUTER JOIN. These are relatively rare and it’s worth the extra effort to make explicit what you intend.
Likewise omit “inner” from inner joins
See above. Without any qualifier, a
JOIN will be an inner join (aka an equi-join).
Be careful with “distinct”.
DISTINCT keyword is often misunderstood and misplaced. What appears natural in human language is not necessarily how the interpreter will interpret your use of
union. You meant to use
union all. Seriously.
UNION construct is a rarity compared to
UNION ALL. If you are using
UNION by itself, your query probably needs a re-write. If you do not understand the difference, be sure to do some reading and save yourself angst when staring at unexpected query output.
Comments can cause problems with many interpreters. In production code, you may need to strip them out entirely. When organizing a large SQL script, begin the file with a comment block demarcated with
*/. A good, descriptive comment block should include the filename, title, date, summary of the script, and any other notes that may help you or others understand the purpose of the analysis.
Keep inline comments to a minimum. Use
-- to indicate a quick single-line comment, or to comment out a line which indicates an optional variant of your query. A well formatted query can be understood without the need for comments.