I'm wondering why the following query performs so badly in PostgreSQL.
SELECT * FROM "users" WHERE user_id IN (
SELECT user_id FROM "visited_users" GROUP BY user_id HAVING COUNT(*) = 3)
When I run EXPLAIN on it, the plan shows no index being used for the subquery. It does show an index on "users", but it looks like the wrong one: an index on the user_id, status_id, and user_date columns. Since that index doesn't match the columns the subquery uses, why would the planner ever choose it? Could it be because the subquery would be slower without the index?
Any ideas why this query is running so slowly, and/or any ideas on how to speed it up? I'm running PostgreSQL 9.1.3.
A:
What's wrong with (the obvious) alternative?
SELECT * FROM users WHERE user_id IN (
SELECT user_id
FROM visited_users
GROUP BY user_id
HAVING COUNT(*) = 3)
The question says 9.1.3; this is untested there, but based on the EXPLAIN output it should not be slower.
Perhaps you want EXISTS instead, which is often faster, although that depends on the version and the data:
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM visited_users v
WHERE u.user_id = v.user_id
GROUP BY v.user_id
HAVING COUNT(*) = 3);
This returns the same rows as the IN version. A user with no rows in visited_users produces no group in the subquery, so EXISTS is false for that user and the row is left out, just as it would be with IN.
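A quick way to check how the EXISTS form treats users with no rows in visited_users is to run both queries on a few made-up rows. This is only a sketch: the demo_users and demo_visited_users tables and their contents are hypothetical stand-ins, not the real schema.

```sql
-- Hypothetical demo tables; names and rows are made up for illustration.
CREATE TEMPORARY TABLE demo_users (user_id integer PRIMARY KEY);
CREATE TEMPORARY TABLE demo_visited_users (user_id integer NOT NULL);

INSERT INTO demo_users VALUES (1), (2), (3), (4);
-- user 1: three visits, user 2: one visit, user 3: four visits, user 4: none
INSERT INTO demo_visited_users VALUES (1), (1), (1), (2), (3), (3), (3), (3);

-- IN form
SELECT u.user_id
FROM demo_users u
WHERE u.user_id IN (
    SELECT v.user_id
    FROM demo_visited_users v
    GROUP BY v.user_id
    HAVING COUNT(*) = 3);

-- EXISTS form
SELECT u.user_id
FROM demo_users u
WHERE EXISTS (
    SELECT 1
    FROM demo_visited_users v
    WHERE v.user_id = u.user_id
    GROUP BY v.user_id
    HAVING COUNT(*) = 3);

-- Both queries return only user_id = 1.
```

User 4 has no visits, so the correlated subquery yields no group at all and EXISTS is false. The two forms agree on results, and only the plan may differ.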
A:
The problem is the HAVING clause. PostgreSQL has to group and count every row of visited_users before it can apply the filter, and that is a lot of extra computation.
One way around this is to compute the list of user ids once into a temporary table and then query that table:
create temporary table user_ids_for_query as
select user_id from visited_users
group by user_id having count(*) = 3; -- keep the filter from the original query
SELECT * FROM users u WHERE u.user_id IN (select user_id from user_ids_for_query);
The query can be faster because the grouping runs only once, when the temporary table is built. It is not index-only, though: index-only scans arrived in PostgreSQL 9.2, so on 9.1 the planner still has to visit the table rows.
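The same aggregate-once idea can also be sketched without a temporary table, as a join against the grouped subquery. This is an alternative formulation, not the method described above, and whether 9.1 plans it any differently from the IN form is untested:

```sql
-- Alternative sketch: aggregate once in a derived table, then join to it.
-- The derived table has one row per user_id, so the join cannot duplicate users.
SELECT u.*
FROM users u
JOIN (
    SELECT user_id
    FROM visited_users
    GROUP BY user_id
    HAVING COUNT(*) = 3
) v ON v.user_id = u.user_id;
```

Because the subquery groups by user_id, the result matches the IN version row for row. Comparing the two with EXPLAIN ANALYZE on real data is the only reliable way to tell which one the planner handles better.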
A:
I'm using the following code and it seems to work fine:
DELETE FROM visits
WHERE visits.user_id IN (
SELECT user_id from visited_users
GROUP BY user_id
HAVING COUNT(*) = 3
);
Not quite as elegant as the one-liner suggested by mikl, but the following is also faster:
DELETE FROM visits
WHERE visits.user_id IN (
SELECT user_id from visits
GROUP BY user_id
HAVING COUNT(*) = 3
);
Here is the explain for the slow SQL:
+----+-------------+---------------+------+----------------+---------+---------+-------------+------+-------------+
| id | select_type | table         | type | possible_keys  | key     | key_len | ref         | rows | Extra       |
+----+-------------+---------------+------+----------------+---------+---------+-------------+------+-------------+
|  1 | SIMPLE      | visited_users | ref  | user_id_visits | user_id | 9       | const,const |  180 | Using where |
+----+-------------+---------------+------+----------------+---------+---------+-------------+------+-------------+
And here is the EXPLAIN output for the slow delete:
+----+-------------+--------+------+---------------+---------------+---------+--------------------------+------+-------------+
| id | select_type | table  | type | possible_keys | key           | key_len | ref                      | rows | Extra       |
+----+-------------+--------+------+---------------+---------------+---------+--------------------------+------+-------------+
|  1 | PRIMARY     | users  | ref  | user_id       | user_id       | 9       | db.visited_users.user_id |      | Using where |
|  3 | DELETE      | visits | ref  | visited_users | visited_users | 4       | const,const              |      | Using where |
+----+-------------+--------+------+---------------+---------------+---------+--------------------------+------+-------------+
If anyone has any ideas why the two queries behave so differently, I would appreciate the input.
UPDATE:
It turns out that my slow query was returning the count of rows in a different table. If I use count(distinct user_id) instead, the delete query is 10x faster. Thanks!
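How COUNT(DISTINCT user_id) was used is not shown above, so the following is only a sketch of how the two aggregates differ, run on made-up rows in a hypothetical demo_counts table. Under GROUP BY user_id every group holds a single user_id, so the distinct count is always 1:

```sql
-- Hypothetical demo table; rows are made up for illustration.
CREATE TEMPORARY TABLE demo_counts (user_id integer NOT NULL);
-- user 1: three rows, user 2: one row, user 3: four rows
INSERT INTO demo_counts VALUES (1), (1), (1), (2), (3), (3), (3), (3);

SELECT user_id,
       COUNT(*)                AS row_count,       -- rows in the group
       COUNT(DISTINCT user_id) AS distinct_users   -- always 1 per group
FROM demo_counts
GROUP BY user_id
ORDER BY user_id;
-- user_id | row_count | distinct_users
--       1 |         3 |              1
--       2 |         1 |              1
--       3 |         4 |              1
```

If the delete kept the same GROUP BY user_id, a HAVING COUNT(DISTINCT user_id) = 3 filter would match no groups at all, and a statement that removes nothing is naturally fast. It is worth checking the affected row count before trusting the speedup.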
A:
Not sure if this is faster, but I have a solution that returns the result I need.
I query the visited_users table for the user_ids that appear exactly three times, then use that list in a subquery to pull the matching rows from the users table.
SELECT users.*
FROM users
WHERE user_id IN (
SELECT user_id FROM visited_users GROUP BY user_id HAVING COUNT(*) = 3
);
This works even if the subquery returns more than three unique users.
I had a closer look at the EXPLAIN plan and found that the slow part was matching the subquery result against the user_id column in the users table. If I also restrict the outer query to a range of user_id values, it performs much faster:
SELECT users.*
FROM users
WHERE user_id IN (
SELECT user_id FROM visited_users GROUP BY user_id HAVING COUNT(*) = 3
) AND user_id < 1234;
This query took 3 milliseconds to run.
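Timings like this are easiest to compare with EXPLAIN ANALYZE, which runs the statement and reports the actual time spent in each plan node. A minimal sketch using the query above; the BUFFERS option is optional, and the output will look different on every installation:

```sql
-- Runs the query for real and prints actual row counts and timings per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT users.*
FROM users
WHERE user_id IN (
    SELECT user_id FROM visited_users GROUP BY user_id HAVING COUNT(*) = 3
) AND user_id < 1234;
```

Since EXPLAIN ANALYZE really executes the statement, wrap it in BEGIN and ROLLBACK when trying it on a DELETE.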
Even better, if you only need the result for a specific date range, you can add a filter on user_date to the outer query:
SELECT users.*
FROM users
WHERE user_id IN (
SELECT user_id FROM visited_users GROUP BY user_id HAVING COUNT(*) = 3
) AND user_date BETWEEN '2014-01-01