SQL gotchas
I’ve been reviewing some SQL queries recently, and I’ve noticed some things that people do that lead to far from optimal performance. Here’s a few, and what people should be doing.
COUNT(*) vs. COUNT(col)
At first blush, these may seem the same, but they actually mean very different things. COUNT(*) will aggregate the number of rows in the result set or group, whereas COUNT(col) will give you the number of non-null values that column has in the current result set or group. You almost always want COUNT(*). If you don’t know which you want, use COUNT(*). Treat that like it’s the law.
If you’re thinking that this isn’t a big deal, keep in mind that COUNT(*) requires much work on the part of the DBMS than COUNT(col); the value of COUNT(*) falls naturally out of what the DBMS is doing whereas COUNT(col) requires the DBMS to keep track of an extra counter for non-null values for that column.
Applying functions to fields you’re searching against
Say you have a big table, and you want to get all the rows whose created fields is between two dates. created is indexed. If you were to try this:
SELECT id, title, created
FROM entries
WHERE (TO_DAYS(NOW()) - TO_DAYS(created)) > ? AND (TO_DAYS(NOW()) - TO_DAYS(created)) <= ?;
You’re going to subject the DBMS to doing a full table scan just to get those entries. The reason for this is TO_DAYS(created). By doing this, you prevent the DBMS from using the index on created. It’s easy enough to rearrange this to work a lot better:
SELECT id, title, created
FROM entries
WHERE created BETWEEN CURRENT_DATE + INTERVAL ? DAY AND CURRENT_DATE + INTERVAL ? DAY;
It’s a simple change, but now the DBMS is able to make effective use of the index.