Re: [whatwg] [URL] Starting work on a URL spec

<4C4A8CEC.9030704@yahoo.com>

This is a multi-part message in MIME format.
--------------090006030209060200020101
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

  On 7/24/2010 2:02 PM, Boris Zbarsky wrote:
> On 7/24/10 1:50 AM, Brett Zamir wrote:
>>> I would be particularly interested in data on this last, across
>>> different browsers, operating systems, and locales... There seem to be
>>> servers out there expecting their URIs in UTF-8 and others expecting
>>> them in ISO-8859-1, and it's not clear to me how to make things work
>>> with them all.
>>
>> Seems to me that if they are not in UTF-8, they should be treated as
>> bugs, even if that is not a de jure standard.
>
> Treated as bugs by whom?
>
By the servers and the server-side scripting languages. While it is 
great that the browsers are involved in the process, I think it would be 
reasonable to invite the other stakeholders to join the discussions.

> The scenario is that a user types some non-ASCII text in the url bar. 
> This needs to be url-encoded to actually go on the wire, which raises 
> the question of what encoding.  If the user is using IRIs, the answer 
> is UTF-8.  A number of servers barf if you do this, especially because 
> some server-side scripting languages (PHP, e.g., last I checked) 
> default to URI-unescaping via something other than UTF-8.
>
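To make the mismatch concrete, here is a minimal Python sketch. It is
only an illustration: the sample word "café" is a made-up input, and
nothing here is taken from PHP or from any particular browser or server.
It percent-encodes the text as UTF-8, as an IRI-aware client would, and
then unescapes it once as UTF-8 and once as ISO-8859-1.

```python
from urllib.parse import quote, unquote

typed = "café"  # hypothetical text typed into the URL bar

# What an IRI-style client puts on the wire: UTF-8 bytes, percent-encoded.
on_the_wire = quote(typed, encoding="utf-8")

# A server that unescapes as UTF-8 recovers the original text.
as_utf8 = unquote(on_the_wire, encoding="utf-8")

# A server that unescapes as ISO-8859-1 treats each UTF-8 byte as a
# separate character, so the accented letter turns into two characters.
as_latin1 = unquote(on_the_wire, encoding="iso-8859-1")

print(on_the_wire)
print(as_utf8)
print(as_latin1)
```

The second server does not fail outright; it silently produces different
text, which is arguably the harder case to notice.
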
Hopefully to be fixed in PHP6 with its promise of full Unicode support...

Though per http://www.slideshare.net/kfish/unicode-php6-presentation :

*Slide 34:* Conversions & Encoding, “HTTP Input Encoding”
- With Unicode semantics switch enabled, we need to convert HTTP input
  to Unicode
- GET requests have no encoding at all and POST ones rarely come marked
  with the encoding
- Encoding detection is not reliable
- *Correctly decoding HTTP input is somewhat of an unsolved problem*

*Slide 35:* Conversions & Encoding, “HTTP Input Encoding”
- PHP will perform lazy decoding
- Delays decoding data in $_GET, $_POST, and $_REQUEST until the first
  time you access them
- Allows user to set expected encoding or just rely on a default one
- Allows decoding errors to be handled by the same mechanism
- Applications should also use filter extension to filter incoming data
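
The lazy-decoding idea from slide 35 could look roughly like the
following Python sketch. This is my own analogue, not PHP's design: the
class name LazyQuery and its attributes are hypothetical, and the UTF-8
default is an assumption. The raw query string is kept as-is and only
decoded on first access, using whatever encoding the application has set
by then; a wrong guess surfaces as an ordinary exception.

```python
from urllib.parse import parse_qs


class LazyQuery:
    """Hypothetical holder that decodes a query string on first access."""

    def __init__(self, raw_query, encoding="utf-8"):
        self.raw_query = raw_query
        # Expected encoding; the application may change it before the
        # first call to get().
        self.encoding = encoding
        self._decoded = None

    def get(self, name):
        if self._decoded is None:
            # errors="strict" makes undecodable bytes raise
            # UnicodeDecodeError instead of being silently replaced.
            self._decoded = parse_qs(
                self.raw_query, encoding=self.encoding, errors="strict"
            )
        values = self._decoded.get(name)
        return values[0] if values else None


query = LazyQuery("name=caf%E9")
query.encoding = "iso-8859-1"  # set before the first access
print(query.get("name"))
```

Left at the assumed UTF-8 default, the same input would raise
UnicodeDecodeError on first access, since %E9 on its own is not valid
UTF-8.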

> So some browsers encode the non-query part of the URI as UTF-8 and the 
> query part as ... something (user's default filesystem encoding, say, 
> for lack of a better guess).  Others always use UTF-8 (and end up with 
> some servers not usable).  Others... I have no idea.  That's why I 
> want data.  ;)  In particular, while the "just use UTF-8, and if the 
> user can't access the site sucks to be the user" approach has a 
> certain theoretical-purity appeal, it doesn't seem like something I 
> want to do to my friends and family (always a good criterion for 
> things you'd like to do to users).
>
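For concreteness, the two client behaviors described above can be
sketched in Python. The helper encode_url and the sample path and query
are hypothetical, and ISO-8859-1 merely stands in for "something"; no
real browser's logic is being reproduced here.

```python
from urllib.parse import quote


def encode_url(path, query, query_encoding):
    """Percent-encode the path as UTF-8 and the query in query_encoding."""
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # Keep "=" and "&" literal so the query structure survives.
    encoded_query = quote(query, safe="=&", encoding=query_encoding)
    return encoded_path + "?" + encoded_query


# Path as UTF-8, query in a legacy encoding (the "something" case).
mixed = encode_url("/café", "q=café", "iso-8859-1")

# UTF-8 everywhere.
all_utf8 = encode_url("/café", "q=café", "utf-8")

print(mixed)
print(all_utf8)
```

The same typed text ends up as two different byte sequences in the
query, and the server has nothing in the request that tells it which one
it received.
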
What I meant is to try to get the server systems on board to fix the 
issue, including in the long term. I appreciate you all being admirably 
practical champions of present-day compatibility, though I'd hope there 
is also a vision to make things work better in the future, even if that 
means some inevitable growing pains for a subset of users (the lack of 
standardization no doubt already creates pains for another subset).

Brett


--------------090006030209060200020101--