Ticket #1879 (new enhancement)

Opened 2 months ago

Last modified 1 month ago

Identity should use a sql query to get the user's permissions

Reported by: Janzert Assigned to: anonymous
Priority: normal Milestone: 1.5
Component: Identity Version: 1.1 HEAD
Severity: minor Keywords:
Cc:

Description

Currently Identity iterates through all of a user's groups and individually pulling all the permissions for each group and then adding them to a set before returning them.

It should instead use an appropriate query that will return the unique permissions from all the user's groups.

Christoph Zwerschke suggested two SA methods for doing this:

def permissions(self):

return set(Permission.query.distinct().join(['groups',

'users']).filter_by(user_id=self.user_id))

def permissions(self):

return set(Permission.query.filter(Permission.groups.any(

Group.users.any(User.user_id==self.user_id))).all())

The disadvantage of the first is that the join will gather all the duplicate permissions and then strip them out with the distinct clause. This is probably slower than the second method that uses an exist clause.

The problem with the second is that it is not compatible with SA 0.3.x.

Attachments

dbtest.zip (100.1 kB) - added by Janzert on 07/04/08 22:08:40.
quickstarted project with scripts to test query speeds.

Change History

07/04/08 10:38:33 changed by Janzert

  • type changed from defect to enhancement.

07/04/08 15:26:23 changed by faide

07/04/08 22:07:54 changed by Janzert

Before working on trying to get a SA 0.3 compatible statement I decided to check the speed difference of the different options. To my surprise the current method of iterating through the groups in python was much faster.

python -m timeit -s"import perm_test;perm_test.iter()" "perm_test.iter()" 10000 loops, best of 3: 94.9 usec per loop

python -m timeit -s"import perm_test;perm_test.join()" "perm_test.join()" 100 loops, best of 3: 13.1 msec per loop

python -m timeit -s"import perm_test;perm_test.exist()" "perm_test.exist()" 100 loops, best of 3: 14.7 msec per loop

These were done using SA 0.5 and postgresql 8.3.1. I also tested with sqlite and the results were very similiar.

In order to test I initialized the database with one user having 20 groups each group having 11 out of 60 permissions. I will attach the project with the scripts I used to test. To initialize the database first do "tg-admin sql create" then "python init.py". Then the timings can be tested using the commandlines above. You can also vary the number of groups, permissions and permissions per group in the init.py script.

Unless someone else shows why my timing method is wrong or comes out different for them I'll go ahead and close this in a few days.

07/04/08 22:08:40 changed by Janzert

  • attachment dbtest.zip added.

quickstarted project with scripts to test query speeds.

07/06/08 04:59:04 changed by chrisz

I can reproduce the numbers with your test, but as I wrote in the thread above, the test is not realistic because it queries the permissions in a loop in the same database session, where the group.permissions are all kept in memory. In reality, the requests run in different sessions and need to requery the database. If you simulate this using turbogears.database.session.expire_all(), then the results reverse - the old method is then 2-3 times slower. As suggested in the thread above, we may consider some caching to cover the case that permissions are queried multiple times in the same database session.

I can also reproduce that the second method using exists is somewhat slower than the method using join (no matter whether I use SQLite or Postgres 8.3, or whether I add additional indices and ANALYZE the tables), contrary to what we expected. Maybe this is because the database cannot optimize the chained exists clause for using indices.

07/06/08 09:17:18 changed by chrisz

This is the SQL query used in the first method:

SELECT DISTINCT permission_name FROM permission
JOIN group_permission USING (permission_id)
JOIN user_group USING (group_id)
WHERE user_id = $1

And this is the SQL query used in the second method:

SELECT permission_name FROM permission
WHERE EXISTS (SELECT 1 FROM group_permission
WHERE permission_id = permission.permission_id
AND EXISTS (SELECT 1 FROM user_group
WHERE group_id = group_permission.group_id
AND user_id = $1))

(The actual queries issued by SQLAlchemy are a bit more complicated because tg_user and tg_group are joined which is actually not necessary).

I tested these with PostgreSQL 8.3 and even larger numbers of groups and permissions, and it turned out that the second query is indeed much slower (of course I did an analyze on the database so that it can do an index scan).

I then did another test with the following third SQL query:

SELECT permission_name FROM permission
WHERE permission_id IN (SELECT permission_id FROM group_permission
WHERE group_id IN (SELECT group_id FROM user_group
WHERE user_id = $1))

It turned out that this query is even faster than the others, contrary to all we learned in school about how bad "in" is compared with "exists".

Does anybody have a good explanation for this and an idea how to elegantly express the last query with SQLAlchemy?

07/07/08 01:25:53 changed by Janzert

It's certainly not anything close to elegant, but at least to get started here is a method that will use the "in" query to get the permissions.

    @property
    def in_perms(self):
        from sqlalchemy import sql
        g_ids = sql.select([user_group_table.c.group_id],
                user_group_table.c.user_id == self.user_id)
        p_ids = sql.select([group_permission_table.c.permission_id],
                group_permission_table.c.group_id.in_(g_ids))
        return set(Permission.query.filter(
            Permission.permission_id.in_(p_ids)))

Also only tested with 0.5 and guessing that it needs >=0.4.

It does slightly beat out the join query (~14ms to ~15.5ms).

Also as a side note, explicitly caching the permissions results in it being about 10,000 times faster for subsequent checks. I'd be a bit worried about ending up with stale results if someone hangs onto a User object long term though.

07/07/08 15:38:52 changed by chrisz

That doesn't look so bad. I find it a bit more readable if you rewrite it as below (and this also works with SA 0.3.10):

from sqlalchemy.sql import select

...

    @property
    def permissions(self):
        perm = permissions_table.c
        group_perm = group_permission_table.c
        user_group = user_group_table.c
        return set(Permission.query.filter(perm.permission_id.in_(
            select([group_perm.permission_id], group_perm.group_id.in_(
                select([user_group.group_id],
                    user_group.user_id == self.user_id))))))

The database sessions usually last only one request, so I wouldn't worry about stale results. It may be even an advantage if permissions cannot change during one request.

07/07/08 23:46:20 changed by Janzert

Yes, your formulation is certainly better.

I also would like to change to returning a frozenset to emphasize that this is a read-only view of the permissions.

Changing that and adding caching it ends up as:

    @property
    def permissions(self):
        """
        Get the User's permissions.
        This just returns a static view of the permissions and cannot be used to
        change them.
        """
        try:
            return self._permissions
        except AttributeError:
            perm = permissions_table.c
            group_perm = group_permission_table.c
            user_group = user_group_table.c
            p = frozenset(Permission.query.filter(perm.permission_id.in_(
                select([group_perm.permission_id], group_perm.group_id.in_(
                    select([user_group.group_id],
                        user_group.user_id == self.user_id))))))
            self._permissions = p
            return p

If that looks good to everyone I'll work on getting a patch together for the quickstart templates. I'll be out of town for the latter half of the week though so it'll probably be next week before I have anything.

07/08/08 00:08:18 changed by chrisz

Returning a frozenset makes sense (the identity methods already do that).

We'll probably not want to patch this in TG 1.0 (only critical bugfixes there), so you should base your patch on TG 1.1/1.5.

Btw, we should also make a similar change to the SQLObject model.

07/08/08 22:29:36 changed by Janzert

Yep, I'm working with a checkout of the 1.1 branch.

I can't find any way to get access to the intermediate tables in order to use sqlobject's sqlbuilder. Raw sql can be used, which gives the following:

    _get_permissions(self):
        try:
            return self._permissions
        except AttributeError:
            p = frozenset(Permission.select("""permission.id IN (
                SELECT group_permission.permission_id FROM group_permission
                WHERE group_permission.group_id IN (
                    SELECT user_group.group_id FROM user_group, tg_user
                    WHERE user_id = tg_user.id))"""))
            self._permissions = p
            return p

This seems like fairly straight up sql but I wonder if it will run into database compatibility problems?

I also still need to take a look at an Elixir version, but hopefully that can be based off the sqlalchemy one.

07/23/08 13:31:52 changed by Janzert

The following works for Elixir but is quite ugly to get a hold of the intermediary tables.

    @property
    def permissions(self):
        perm = Permission.c
        user_group = self._descriptor.properties['groups'].secondary.c
        group_perm = Permission._descriptor.properties['groups'].secondary.c
        p = frozenset(Permission.query.filter(perm.permission_id.in_(
            select([group_perm.permission_permission_id],
                group_perm.tg_group_group_id.in_(
                    select([user_group.tg_group_group_id],
                        user_group.tg_user_user_id == self.user_id))))))
        self._permissions = p
        return p

I'm thinking at this point that unless someone can find a cleaner way to do this for sqlobject and elixir, maybe this should only be changed in sqlalchemy.