A (somewhat opinionated) list of SQL tips and tricks that I've picked up over the years.
There's so much you can do with SQL, but I've focused on what I find most useful in my day-to-day work as a data analyst and what I wish I had known when I first started writing SQL.
Please note that some of these tips might not be relevant for all RDBMSs.
- Use a leading comma to separate fields
- Use a dummy value in the WHERE clause
- Indent your code
- Consider CTEs when writing complex queries
- Anti-joins will return rows from one table that have no match in another table
- `NOT EXISTS` is faster than `NOT IN` if your column allows `NULL`
- Use `QUALIFY` to filter window functions
- You can (but shouldn't always) `GROUP BY` column position
- You can create a grand total with `GROUP BY ROLLUP`
- Use `EXCEPT` to find the difference between two tables
- Be aware of how `NOT IN` behaves with `NULL` values
- Avoid ambiguity when naming calculated fields
- Always specify which column belongs to which table
- Understand the order of execution
- Comment your code!
- Read the documentation (in full)
- Use descriptive names for your saved queries
Use a leading comma to separate fields in the SELECT clause rather than a trailing comma.
- It clearly marks a new column, as opposed to code that's wrapped onto multiple lines.
- It's a visual cue that makes a missing comma easy to spot. With trailing commas, varying line lengths make this harder to see.

```sql
SELECT
employee_id
, employee_name
, job
, salary
FROM employees
;
```

- Also use a leading `AND` in the `WHERE` clause, for the same reasons (the following tip demonstrates this).
Use a dummy value in the WHERE clause so you can easily comment out conditions when testing or tweaking a query.
```sql
/*
If I want to comment out the job condition
the following query will break:
*/
SELECT *
FROM employees
WHERE
--job IN ('Clerk', 'Manager')
AND dept_no != 5
;

/*
With a dummy value there's no issue.
I can comment out all the conditions and
1=1 will ensure the query still runs:
*/
SELECT *
FROM employees
WHERE 1=1
-- AND job IN ('Clerk', 'Manager')
AND dept_no != 5
;
```

Indent your code to make it more readable to colleagues and your future self.
Opinions will vary on what this looks like, so be sure to follow your company/team's guidelines or, if those don't exist, go with whatever works for you.
You can also use an online formatter like poorsql or a linter like sqlfluff.
```sql
-- Bad:
SELECT vc.video_id
, CASE WHEN meta.GENRE IN ('Drama', 'Comedy') THEN 'Entertainment' ELSE meta.GENRE END as content_type
FROM video_content AS vc
INNER JOIN metadata AS meta ON vc.video_id = meta.video_id
;

-- Good:
SELECT
vc.video_id
, CASE
    WHEN meta.GENRE IN ('Drama', 'Comedy') THEN 'Entertainment'
    ELSE meta.GENRE
END AS content_type
FROM video_content AS vc
INNER JOIN metadata AS meta
    ON vc.video_id = meta.video_id
;
```

For longer than I'd care to admit I would nest inline views, which led to queries that were hard to understand, particularly when revisited after a few weeks.
If you find yourself nesting inline views more than 2 or 3 levels deep, consider using common table expressions, which can help you keep your code more organised and readable.
```sql
-- Using inline views:
SELECT
vhs.movie
, vhs.vhs_revenue
, cs.cinema_revenue
FROM
    (
    SELECT
    movie_id
    , SUM(ticket_sales) AS cinema_revenue
    FROM tickets
    GROUP BY movie_id
    ) AS cs
    INNER JOIN
    (
    SELECT
    movie
    , movie_id
    , SUM(revenue) AS vhs_revenue
    FROM blockbuster
    GROUP BY movie, movie_id
    ) AS vhs
    ON cs.movie_id = vhs.movie_id
;

-- Using CTEs:
WITH cinema_sales AS
    (
    SELECT
    movie_id
    , SUM(ticket_sales) AS cinema_revenue
    FROM tickets
    GROUP BY movie_id
    ),
vhs_sales AS
    (
    SELECT
    movie
    , movie_id
    , SUM(revenue) AS vhs_revenue
    FROM blockbuster
    GROUP BY movie, movie_id
    )
SELECT
vhs.movie
, vhs.vhs_revenue
, cs.cinema_revenue
FROM cinema_sales AS cs
INNER JOIN vhs_sales AS vhs
    ON cs.movie_id = vhs.movie_id
;
```

Use anti-joins when you want to return rows from one table that don't have a match in another table.
For example, you only want video IDs of content that hasn't been archived.
There are multiple ways to do an anti-join:
```sql
-- Using a LEFT JOIN:
SELECT
vc.video_id
FROM video_content AS vc
LEFT JOIN archive
    ON vc.video_id = archive.video_id
WHERE 1=1
AND archive.video_id IS NULL -- Any rows with no match will have a NULL value.
;

-- Using NOT IN/subquery:
SELECT
video_id
FROM video_content
WHERE 1=1
AND video_id NOT IN (SELECT video_id FROM archive) -- Be mindful of NULL values.
;

-- Using NOT EXISTS/correlated subquery:
SELECT
video_id
FROM video_content AS vc
WHERE 1=1
AND NOT EXISTS
    (
    SELECT 1
    FROM archive AS a
    WHERE a.video_id = vc.video_id
    )
;
```

Note that I advise against using NOT IN - see the following tip.
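To try the three anti-join forms side by side, here is a minimal sketch that runs them against an in-memory SQLite database through Python's built-in `sqlite3` module. The sample rows are made up for illustration, and the `archive.video_id` column is declared `NOT NULL` on purpose so that `NOT IN` behaves like the other two forms.

```python
import sqlite3

# In-memory database with illustrative (made-up) sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE video_content (video_id INTEGER PRIMARY KEY);
    CREATE TABLE archive (video_id INTEGER NOT NULL);
    INSERT INTO video_content (video_id) VALUES (1), (2), (3), (4);
    INSERT INTO archive (video_id) VALUES (2), (4);
""")

queries = {
    "left_join": """
        SELECT vc.video_id
        FROM video_content AS vc
        LEFT JOIN archive
            ON vc.video_id = archive.video_id
        WHERE archive.video_id IS NULL
        ORDER BY vc.video_id
    """,
    "not_in": """
        SELECT video_id
        FROM video_content
        WHERE video_id NOT IN (SELECT video_id FROM archive)
        ORDER BY video_id
    """,
    "not_exists": """
        SELECT video_id
        FROM video_content AS vc
        WHERE NOT EXISTS (
            SELECT 1 FROM archive AS a WHERE a.video_id = vc.video_id
        )
        ORDER BY video_id
    """,
}

# Run each form and collect the un-archived video IDs.
results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in queries.items()
}
print(results)
```

With no `NULL` in `archive.video_id`, all three forms should return the same video IDs (1 and 3 for this sample data).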
If you're using an anti-join with NOT IN, you'll likely find it's slower than NOT EXISTS when the column you're comparing against allows NULL.
I've experienced this when using Snowflake, and the PostgreSQL Wiki explicitly calls this out:
"...NOT IN (SELECT ...) does not optimize very well."
Aside from being slow, NOT IN will not work as intended if there is a NULL in the values being compared against - see the tip on how NOT IN behaves with NULL values.
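The NULL problem is easy to reproduce. The sketch below is an illustration only: it assumes Python's built-in `sqlite3` module as a stand-in database, and the `employees` and `departments` rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER);
    CREATE TABLE employees (employee_name TEXT, department_id INTEGER);
    INSERT INTO departments (id) VALUES (1), (2), (NULL);
    INSERT INTO employees (employee_name, department_id)
    VALUES ('Ann', 1), ('Bob', 3);
""")

# Bob's department (3) is not in departments, but the NULL in the list
# makes the NOT IN comparison evaluate to unknown, so Bob is filtered out.
not_in_rows = conn.execute("""
    SELECT employee_name
    FROM employees
    WHERE department_id NOT IN (SELECT id FROM departments)
""").fetchall()

# NOT EXISTS only checks whether a matching row exists, so NULL is harmless.
not_exists_rows = conn.execute("""
    SELECT employee_name
    FROM employees AS e
    WHERE NOT EXISTS (
        SELECT 1 FROM departments AS d WHERE d.id = e.department_id
    )
""").fetchall()

print(not_in_rows)      # []
print(not_exists_rows)  # [('Bob',)]
```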
QUALIFY lets you filter the results of a query based on a window function. This means you don't need an inline view to filter your result set, which reduces the number of lines of code needed.
For example, if I want to return the top 10 markets per product I can use QUALIFY rather than an inline view:
```sql
-- Using QUALIFY:
SELECT
product
, market
, SUM(revenue) AS market_revenue
FROM sales
GROUP BY product, market
QUALIFY DENSE_RANK() OVER (PARTITION BY product ORDER BY SUM(revenue) DESC) <= 10
ORDER BY product, market_revenue
;

-- Without QUALIFY:
SELECT
product
, market
, market_revenue
FROM
    (
    SELECT
    product
    , market
    , SUM(revenue) AS market_revenue
    , DENSE_RANK() OVER (PARTITION BY product ORDER BY SUM(revenue) DESC) AS market_rank
    FROM sales
    GROUP BY product, market
    )
WHERE market_rank <= 10
ORDER BY product, market_revenue
;
```

Unfortunately QUALIFY is mostly limited to the big data warehouses (such as Snowflake, Amazon Redshift and Google BigQuery), but I had to include it because it's so useful.
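If your database doesn't support QUALIFY, the same top-N-per-group result can be written with CTEs. The following is a rough sketch rather than a definitive pattern: it assumes Python's built-in `sqlite3` module with a bundled SQLite recent enough to support window functions (3.25 or later), and the `sales` rows, the CTE names and the `top_n` variable are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, market TEXT, revenue INTEGER);
    INSERT INTO sales (product, market, revenue) VALUES
        ('Widget', 'UK', 100), ('Widget', 'UK', 50),
        ('Widget', 'US', 300), ('Widget', 'FR', 20),
        ('Gadget', 'UK', 10),  ('Gadget', 'US', 5);
""")

top_n = 2  # hypothetical cut-off; the tip above uses 10

rows = conn.execute("""
    WITH market_totals AS (
        -- Step 1: aggregate revenue per product and market.
        SELECT product, market, SUM(revenue) AS market_revenue
        FROM sales
        GROUP BY product, market
    ),
    ranked AS (
        -- Step 2: rank markets within each product.
        SELECT
        product
        , market
        , market_revenue
        , DENSE_RANK() OVER (
            PARTITION BY product ORDER BY market_revenue DESC
          ) AS market_rank
        FROM market_totals
    )
    -- Step 3: filter on the rank, which is what QUALIFY would do inline.
    SELECT product, market, market_revenue
    FROM ranked
    WHERE market_rank <= ?
    ORDER BY product, market_revenue DESC
""", (top_n,)).fetchall()

for row in rows:
    print(row)
```

Splitting the aggregation and the ranking into separate CTEs keeps each step readable and avoids relying on engine-specific rules about mixing aggregates and window functions in one SELECT.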
Instead of using the column name, you can GROUP BY or ORDER BY using the column position.
- This can be useful for ad-hoc/one-off queries, but for production code you should always refer to a column by its name.
```sql
SELECT
dept_no
, SUM(salary) AS dept_salary
FROM employees
GROUP BY 1 -- dept_no is the first column in the SELECT clause.
ORDER BY 2 DESC
;
```

Creating a grand total (or sub-totals) is possible thanks to GROUP BY ROLLUP.
For example, if you've aggregated a company's employee salaries per department, you can use GROUP BY ROLLUP to add a grand total row that sums the aggregated dept_salary column.
```sql
SELECT
COALESCE(dept_no, 'Total') AS dept_no
, SUM(salary) AS dept_salary
FROM employees
GROUP BY ROLLUP(dept_no)
ORDER BY dept_salary -- Be sure to order by this column to ensure the Total appears last/at the bottom of the result set.
;
```

EXCEPT returns rows from the first query's result set that don't appear in the second query's result set.
```sql
/*
Miles Davis will be returned from
this query
*/
SELECT artist_name
FROM artist
WHERE artist_name = 'Miles Davis'
EXCEPT
SELECT artist_name
FROM artist
WHERE artist_name = 'Nirvana'
;

/*
Nothing will be returned from this
query as 'Miles Davis' appears in
both queries' result sets.
*/
SELECT artist_name
FROM artist
WHERE artist_name = 'Miles Davis'
EXCEPT
SELECT artist_name
FROM artist
WHERE artist_name = 'Miles Davis'
;
```

You can also utilise EXCEPT with UNION ALL to verify whether two tables have the same data.
If no rows are returned the tables are identical - otherwise, what's returned are rows causing the difference:
```sql
/*
The first query will return rows from employees that aren't present in department.

The second query will return rows from department that aren't present in employees.

The UNION ALL will ensure that the final result set combines all of these rows
so you know which rows are causing the difference.
*/
(
SELECT
id
, employee_name
FROM employees
EXCEPT
SELECT
id
, employee_name
FROM department
)
UNION ALL
(
SELECT
id
, employee_name
FROM department
EXCEPT
SELECT
id
, employee_name
FROM employees
)
;
```

NOT IN doesn't work if NULL is present in the values being checked against. As NULL represents Unknown, the SQL engine can't verify that the value being checked is not present in the list.
- Instead use `NOT EXISTS`.
```sql
INSERT INTO departments (id)
VALUES (1), (2), (NULL);

-- Doesn't work due to NULL:
SELECT *
FROM employees
WHERE department_id NOT IN (SELECT DISTINCT id FROM departments)
;

-- Solution.
SELECT *
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM departments d
    WHERE d.id = e.department_id
    )
;
```

When creating a calculated field, you might be tempted to name it the same as an existing column, but this can lead to unexpected behaviour, such as a window function operating on the wrong field.
```sql
CREATE TABLE products (
    product VARCHAR(50) NOT NULL,
    revenue INT NOT NULL
)
;

INSERT INTO products (product, revenue)
VALUES
    ('Shark', 100),
    ('Robot', 150),
    ('Alien', 90);

/*
The window function will rank the 'Robot' product as 1 when it should be 3.
*/
SELECT
product
, CASE product WHEN 'Robot' THEN 0 ELSE revenue END AS revenue
, RANK() OVER (ORDER BY revenue DESC)
FROM products
;

/*
You can instead do this:
*/
SELECT
product
, CASE product WHEN 'Robot' THEN 0 ELSE revenue END AS revenue
, RANK() OVER (ORDER BY CASE product WHEN 'Robot' THEN 0 ELSE revenue END DESC)
FROM products
;
```

When you have complex queries with multiple joins, it pays to be able to trace back an issue with a value to its source.
Additionally, your RDBMS might raise an error if two tables share the same column name and you don't specify which column you are using.
```sql
SELECT
vc.video_id
, vc.series_name
, metadata.season
, metadata.episode_number
FROM video_content AS vc
INNER JOIN video_metadata AS metadata
    ON vc.video_id = metadata.video_id
;
```

If I had to give one piece of advice to someone learning SQL, it'd be to understand the order of execution (of clauses). It will completely change how you write queries. This blog post is a fantastic resource for learning.
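One place where the order of execution shows up directly is the difference between WHERE and HAVING: WHERE filters rows before they are grouped, while HAVING filters the groups after aggregation. The sketch below illustrates this under a few assumptions: it uses Python's built-in `sqlite3` module as a stand-in database, and the `employees` rows and salary thresholds are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_name TEXT, dept_no INTEGER, salary INTEGER);
    INSERT INTO employees (employee_name, dept_no, salary) VALUES
        ('Ann', 1, 3000), ('Bob', 1, 1000),
        ('Cat', 2, 2500), ('Dan', 2, 500);
""")

# WHERE runs before GROUP BY: Dan's row is dropped first,
# then the remaining salaries are summed per department.
where_rows = conn.execute("""
    SELECT dept_no, SUM(salary) AS dept_salary
    FROM employees
    WHERE salary >= 1000
    GROUP BY dept_no
    ORDER BY dept_no
""").fetchall()

# HAVING runs after GROUP BY: every salary is summed first,
# then departments whose total is too small are dropped.
having_rows = conn.execute("""
    SELECT dept_no, SUM(salary) AS dept_salary
    FROM employees
    GROUP BY dept_no
    HAVING SUM(salary) >= 3500
    ORDER BY dept_no
""").fetchall()

print(where_rows)   # [(1, 4000), (2, 2500)]
print(having_rows)  # [(1, 4000)]
```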
While in the moment you know why you did something, if you revisit the code weeks, months or years later you might not remember.
- In general you should strive to write comments that explain why you did something, not how.
- Your colleagues and future self will thank you!
```sql
SELECT
video_content.*
FROM video_content
LEFT JOIN archive -- New CMS cannot process archive video formats.
    ON video_content.video_id = archive.video_id
WHERE 1=1
AND archive.video_id IS NULL
;
```

Using Snowflake I once needed to return the latest date from a list of columns and so I decided to use GREATEST().
What I didn't realise was that if one of the arguments is NULL then the function returns NULL.
If I'd read the documentation in full I'd have known! In many cases it can take just a minute or less to scan the documentation and it will save you the headache of having to work out why something isn't working the way you expected:
```sql
/*
If I'd read the documentation further I'd also have realised that my solution
to the NULL problem with GREATEST()...
*/
SELECT COALESCE(GREATEST(signup_date, consumption_date), signup_date, consumption_date);

/*
... could have been solved with the following function:
*/
SELECT GREATEST_IGNORE_NULLS(signup_date, consumption_date);
```

There's almost nothing worse than not being able to find a query you need to re-run/refer back to.
Use a descriptive name when saving your queries so you can easily find what you're looking for.
I usually include the subject of the query, the month the query was run and the name of the requester (if there is one). For example: Lapsed users analysis - 2023-09-01 - Olivia Roberts