PostgreSQL: Correct way for subset calculations

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.

From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
7 hours ago

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
7 hours ago

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
5 hours ago

add a comment |

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
7 hours ago

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
7 hours ago

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
5 hours ago

add a comment |

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id

---------------------

1       | 1

1       | 2

1       | 3

2       | 1

2       | 2

postgresql

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

edited 5 hours ago

Evan Carroll

31.4k865209

edited 5 hours ago

Evan Carroll

31.4k865209

edited 5 hours ago

Evan Carroll

31.4k865209

asked 9 hours ago

Ximik

1063

New contributor

asked 9 hours ago

Ximik

1063

asked 9 hours ago

Ximik

1063

New contributor

Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
7 hours ago

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
7 hours ago

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
5 hours ago

add a comment |

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
7 hours ago

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
7 hours ago

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
5 hours ago

Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
7 hours ago

@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
7 hours ago

When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
5 hours ago

add a comment |

3 Answers
3

active

oldest

votes

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3

If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered 9 hours ago

sticky bit

1,723314

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3))

answered 5 hours ago

Evan Carroll

31.4k865209

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

Ximik is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');

var $window = $(window),
onScroll = function(e) {
var $elem = $('.new-login-left'),
docViewTop = $window.scrollTop(),
docViewBottom = docViewTop + $window.height(),
elemTop = $elem.offset().top,
elemBottom = elemTop + $elem.height();
if ((docViewTop elemBottom)) {
StackExchange.using('gps', function() { StackExchange.gps.track('embedded_signup_form.view', { location: 'question_page' }); });
$window.unbind('scroll', onScroll);
}
};
$window.on('scroll', onScroll);

});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

add a comment |

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

add a comment |

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities

from users

group by user_id

having array_agg(activity_id) @> array[1,2]

   and not 3 = any(array_agg(activity_id));

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

edited 8 hours ago

answered 9 hours ago

a_horse_with_no_name

38.9k775112

answered 9 hours ago

a_horse_with_no_name

38.9k775112

answered 9 hours ago

a_horse_with_no_name

38.9k775112

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered 9 hours ago

sticky bit

1,723314

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered 9 hours ago

sticky bit

1,723314

add a comment |

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered 9 hours ago

sticky bit

1,723314

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id

             ON elbat (user_id,

                       activity_id);

SELECT DISTINCT t1.user_id

       FROM elbat t1

       WHERE EXISTS (SELECT *

                            FROM elbat t2

                            WHERE t2.user_id = t1.user_id

                                  AND t2.activity_id = '1')

             AND EXISTS (SELECT *

                                FROM elbat t2

                                WHERE t2.user_id = t1.user_id

                                      AND t2.activity_id = '2')

             AND NOT EXISTS (SELECT *

                                    FROM elbat t2

                                    WHERE t2.user_id = t1.user_id

                                          AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

answered 9 hours ago

sticky bit

1,723314

answered 9 hours ago

sticky bit

1,723314

answered 9 hours ago

sticky bit

1,723314

answered 9 hours ago

sticky bit

1,723314

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3))

answered 5 hours ago

Evan Carroll

31.4k865209

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3))

answered 5 hours ago

Evan Carroll

31.4k865209

add a comment |

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3))

answered 5 hours ago

Evan Carroll

31.4k865209

You can use bool_or for this.

SELECT user_id

FROM users

GROUP BY user_id

HAVING bool_or(activity_id IN (1,2))  -- assumes '1, 2' mean '1 OR 2'

  AND NOT bool_or(activity_id IN (3))

answered 5 hours ago

Evan Carroll

31.4k865209

answered 5 hours ago

Evan Carroll

31.4k865209

answered 5 hours ago

Evan Carroll

31.4k865209

answered 5 hours ago

Evan Carroll

31.4k865209

add a comment |

Ximik is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Ximik is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

Post as a guest

Name

Required, but never shown

Sign up or log in

Post as a guest

Name

Required, but never shown

Sign up or log in

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Orsdxy