Warning: Can't synchronize with repository "(default)" (Unsupported version control system "svn": No module named svn). Look in the Trac log for more information.

Ticket #2438 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

Authentication of non-ascii user names does not work in TG 2.x

Reported by: chrisz Owned by: Gustavo
Priority: high Milestone: 2.1rc1
Component: TurboGears Version: trunk
Severity: major Keywords: repoze.who identity
Cc:

Description

This actually seems to be a problem of the repoze.who stack, but it affects TurboGears 2.x and we should push for a fix.

The problem is that repoze.who keeps the login as an encoded byte string (usually utf-8), while the default TurboGears user model class stores the user name as unicode.

This results in deprecation warnings from SQLAlchemy for ascii user names and failures for non-ascii user names.

There is a replacement for repoze.who.plugins.sa called  repoze.who.plugins.sqlalchemy which converts the input to unicode before accessing the database, but I'm not sure that this will fix the issue, since the conversion is done with a simple unicode() call without specifying any encoding. This would still fail for any non-ascii input. We shouldn't assume any default input encoding, since both utf-8 and latin-1 are pretty popular.

So I think the conversion to unicode should happen at an earlier stage where the input encoding is known, e.g. in repoze.who.friendlyforms, by replacing the following lines using paste.request in friendlyforms.py

    query = parse_dict_querystring(environ)
    ...
    form = parse_formvars(environ)
    form.update(query)

with the following lines using WebOb

    req = webob.Request(environ)
    if not req.charset:
        req.charset = 'utf-8'
    query = req.GET()
    ...
    form = req.POST()

I also think that it is better to get a unicode value for the environment key repoze.who.identity instead of an encoded byte string, because the application might want to use that value itself for some purposes, and unicode strings are better to compare and handle.

I've already asked for opinions on the repoze-dev mailing list, but so far no response.

Change History

comment:1 follow-up: ↓ 3 Changed 5 years ago by chrisz

  • Severity changed from normal to major

Unfortunately, there is still no feedback on the  repoze-dev mailing list.

I wished we had kept the auth stuff (identity) in TG2 itself, so we could easily fix it here. Should we merge the repoze-who/what packages back to TG2? Or at least some of the not-so-well-maintained plugins?

Just noticed that this has been already brought up on the  turbogears mailing list one year ago!

comment:2 Changed 5 years ago by percious

chrisz

I have access to the repoze stuff. Send me a patch and I'll fix and push a release.

comment:3 in reply to: ↑ 1 Changed 5 years ago by Gustavo

  • Owner set to Gustavo
  • Status changed from new to assigned

Replying to chrisz:

Unfortunately, there is still no feedback on the  repoze-dev mailing list.

From: Gustavo Narea
To: Christoph Zwerschke
Date: 2010-01-10 21:40

Hello, Christoph.

Thanks for letting me know about it and for the patch.

I tried your patch, but it doesn't seem to do the trick. Would you mind giving 
me a hand? I'm going to be busy this week and I don't think I'll have too much 
time to work on this.

I already wrote three tests to reproduce the bug and none of them passes yet. 
If you'd like to try, you can do:
    svn co http://svn.repoze.org/whoplugins/whofriendlyforms/trunk/ fforms
    cd fforms
    python setup.py develop
    easy_install nose
    nosetests

I guess I'll have some time again on Friday or next weekend.

Cheers!

 - Gustavo.

I gave this another try last weekend but with no success. I guess I'll need to get more time to focus on getting this fixed, but any help will be most welcome.

I wished we had kept the auth stuff (identity) in TG2 itself, so we could easily fix it here. Should we merge the repoze-who/what packages back to TG2? Or at least some of the not-so-well-maintained plugins?

First of all, repoze.what is under active development and repoze.who (which was never part of TG) is still maintained by its original author (although in bug fix mode only, AFAIK).

The repoze.who and repoze.what plugins used by TG2 applications, which are all maintained by me, are all stable and with no known issues - Except for this one.

So there's no reason for that suggestion.

comment:4 Changed 5 years ago by chrisz

Thanks for jumping in, guys. Good to know that repoze.what is still active.

Gustavo, I'm terribly sorry! In fact I found your mail in my trash folder - I must have accidentally deleted it. Since there was no response on the list and at that time you also posted about looking for a new maintainer for repoze.who.ldap so I somehow got the impression you had retired completely which obviously was wrong. As long as the project is actively maintained by your or somebody else, there is no need to pull it into TG. Sorry again for the disconcertment.

I'll look into the patch and your tests again tomorrow; it's too late today.

comment:5 Changed 5 years ago by Gustavo

  • Status changed from assigned to closed
  • Resolution set to fixed

chrisz provided a patch for this, which is available in the final release:  http://code.gustavonarea.net/repoze.who-friendlyform/News.html

comment:6 follow-up: ↓ 8 Changed 5 years ago by chrisz

  • Status changed from closed to reopened
  • Resolution fixed deleted

Unfortunately, I need to reopen this because there is still a problem.

My original patch suggested utf-8 as default encoding, but you changed this to iso-8859-1, probably because that actually once was the default encoding for http 1.1. But as a result the patch does not play well together with TG now, which is using utf-8.

You can test this by quickstarting a project and creating a user with a non-ascii name using the admin tool. The browser (I tested this with Firefox) is simply using the encoding of the page containing the form, which in the case of TG is utf-8 by default. The details are explained  here.

As a solution, I suggest making the default charset configurable with an additional parameter in the __init__ method of the FriendlyFormPlugin. The default could be either utf-8 or iso-8859-1 but I'd prefer the former. Then that parameter should be set to utf-8 in the setup_sql_auth method of repoze.what.quickstart. Or this could also be passed in to setup_sql_auth from the TG config, similar to other parameters such as login_handler or logout_handler.

I can attach the necessary patches if you agree that's the way to go.

comment:7 follow-up: ↓ 9 Changed 5 years ago by chrisz

Btw, here is  another reason to make UTF-8 the default ;-)

comment:8 in reply to: ↑ 6 Changed 5 years ago by Gustavo

Hello,

Replying to chrisz:

My original patch suggested utf-8 as default encoding, but you changed this to iso-8859-1, probably because that actually once was the default encoding for http 1.1. But as a result the patch does not play well together with TG now, which is using utf-8.

ISO-8859-1 has always been the default character encoding for HTTP/1.1, as far as I know.

You can test this by quickstarting a project and creating a user with a non-ascii name using the admin tool. The browser (I tested this with Firefox) is simply using the encoding of the page containing the form, which in the case of TG is utf-8 by default. The details are explained  here.

As a solution, I suggest making the default charset configurable with an additional parameter in the __init__ method of the FriendlyFormPlugin. The default could be either utf-8 or iso-8859-1 but I'd prefer the former. Then that parameter should be set to utf-8 in the setup_sql_auth method of repoze.what.quickstart. Or this could also be passed in to setup_sql_auth from the TG config, similar to other parameters such as login_handler or logout_handler.

I can attach the necessary patches if you agree that's the way to go.

I'm afraid that's the only option we have.

In theory, we should put the charset in our forms (in either "enctype" or "accept-charset") and the user agent will submit the form specifying the charset it decided to use in the CONTENT_TYPE header (because we might have given more than one option). But in practice, the browser (at least Firefox and Konqueror) doesn't include the charset used for the submitted form... So WebOb doesn't work the way I was expecting. :/

I've just added this option to repoze.who-friendlyform (which defaults to ISO-8859-1) and I'll release it to PYPI shortly, but it's too late here to make setup_sql_auth() (in repoze.what-quickstart) pass the charset to that plugin, so I'll do it this weekend -- Although if you'd like to do it, please let me know and I'll wait ;)

Cheers!

comment:9 in reply to: ↑ 7 Changed 5 years ago by Gustavo

Replying to chrisz:

Btw, here is  another reason to make UTF-8 the default ;-)

I support the use of UTF-8 and that's in fact the only charset I use, but I really don't want to implement anything against the HTTP spec.

comment:10 Changed 5 years ago by Gustavo

BTW, I'd still recommend to set the "accept-charset" to "utf-8" in the FORM element of the login page, so we don't assume the user agent will re-use the charset used for the web page... Which is not what they are supposed to do.

Other user agents may respect this, even if they are not the widely used ones.

comment:11 follow-up: ↓ 13 Changed 5 years ago by chrisz

Ok, I've added accept-charset="UTF-8" to the login form now.

In fact the HTML specs say: "The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element."

So that behavior is not against the specs and in practice, all browsers seem to do this, but it cannot harm to explictly set the accept-charset attribute.

The problem is now how friendlyforms gets the information that the data is encoded as utf-8. As Ian writes, the browser could send the encoding as Content-Type: application/x-www-form-urlencoded; charset=utf8, but browsers seldom set this. In fact, all browsers I tested sent only Content-Type: application/x-www-form-urlencoded. So we must somehow explicitely tell friendlyforms to use utf-8 as default encoding instead of iso-8859-1 since it does not get this information from the browser.

comment:12 Changed 5 years ago by chrisz

Actually I would even make utf-8 the default encoding for friendlyforms since I don't think iso-8859-1 ever was the official default charset. Maybe it once was a de-facto standard, but the current de-facto-standard is certainly utf-8, and it's also the official standard in several RFCs now. See also  http://htmlpurifier.org/docs/enduser-utf8.html#whyutf8 which says "There is no official way of determining the character encoding of such a request" and gives some more reasons to use utf-8 (mainly the difficulties arising when chars outside the iso-8859-1 charset are used).

comment:13 in reply to: ↑ 11 Changed 5 years ago by Gustavo

Hello, Christoph.

Replying to chrisz:

Ok, I've added accept-charset="UTF-8" to the login form now.

In fact the HTML specs say: "The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element."

So that behavior is not against the specs and in practice, all browsers seem to do this, but it cannot harm to explictly set the accept-charset attribute.

The HTML specification is for user agents (e.g., browsers) only, not for server-side stuff. This is problem is within the scope of HTTP and out of the scope of HTML, and the HTTP says I have to assume the charset is Latin-1.

I don't like that either (I prefer UTF-8 too) and I know that's just theory, but I'm not going to implement something against the HTTP specification.

The problem is now how friendlyforms gets the information that the data is encoded as utf-8. As Ian writes, the browser could send the encoding as Content-Type: application/x-www-form-urlencoded; charset=utf8, but browsers seldom set this. In fact, all browsers I tested sent only Content-Type: application/x-www-form-urlencoded. So we must somehow explicitely tell friendlyforms to use utf-8 as default encoding instead of iso-8859-1 since it does not get this information from the browser.

I was aware of that and I agree with the option to override it, hence I updated repoze.who-friendlyform and now it's also possible to make repoze.what-quickstart pass the charset to that plugin:  http://code.gustavonarea.net/repoze.what-quickstart/News.html

So TG2 projects can be generated with the following line on app_cfg.py:

base_config.sa_auth.charset = "utf-8"

Cheers!

comment:14 Changed 5 years ago by chrisz

  • Status changed from reopened to closed
  • Resolution set to fixed

Thanks a lot for fixing this os quickly, Gustavo.

Concerning the charset issue: It's true that iso-8859-1 is the http default encoding, but only for 'text/...' based MIME types, whereas we're dealing with 'application/x-www-form-urlencoded' here. So I'm not sure that would be against the specs. Anyway, I have now included 'utf-8' as the sa_auth.charset setting for quickstarted TG2 projects so the default charset does not really matter.

This was a big step forward, and I can now log in with a non-ascii user name, but there are still some open problems:

  • The repr() of User objects yielded the non-ascii user name which lead to various problems. Fixed this already in the TG2 trunk.
  • Error when trying to log in with non-ascii password: #2452
  • Can't log out after logging in with non-ascii user name: #2453

But these seem to be different issues so I think we can close this ticket now.

comment:15 Changed 5 years ago by chrisz

#2452 is fixed already. Added #2454 as a reminder that we still need regression tests for all of these fixes.

comment:16 Changed 4 years ago by percious

  • Milestone changed from 2.1 to 2.1rc1
Note: See TracTickets for help on using tickets.