<AANLkTi=88AtQTJroZUuC5ihX5jqOuj5RL4nop7Cm5eSr@mail.gmail.com>
http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization
lists some interesting cases we've come across on the anti-phishing team at
Google. To the extent you're concerned with or interested in canonicalization,
it may be worth taking a look at (not to suggest you follow that in determining
how to parse/canonicalize URLs, but rather to make sure that you have some
"correct" way of handling the listed URLs).

BTW, are you covering canonicalization?

-Ian

On Fri, Jul 23, 2010 at 9:02 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:
> On 7/23/10 11:59 PM, Silvia Pfeiffer wrote:
>
>> Is that URLs as values of attributes in HTML or is that URLs as pasted
>> into the address bar? I believe their processing differs...
>
> It certainly does in Firefox (the latter have a lot more fixup done to
> them, and there are also differences in terms of how character encodings
> are handled).
>
> I would be particularly interested in data on this last, across different
> browsers, operating systems, and locales... There seem to be servers out
> there expecting their URIs in UTF-8 and others expecting them in
> ISO-8859-1, and it's not clear to me how to make things work with them all.
>
> -Boris
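
To make the encoding problem Boris describes concrete, here is a minimal sketch (not taken from any browser's implementation) of how the same non-ASCII path percent-encodes differently under UTF-8 and ISO-8859-1, and what a server sees if it decodes with the other charset. The helper name `percent_encode` and the sample path `/café` are illustrative assumptions; only Python's standard `urllib.parse` is used.

```python
from urllib.parse import quote, unquote


def percent_encode(path: str, charset: str) -> str:
    """Percent-encode a URL path using the given character encoding.

    Hypothetical helper for illustration; '/' is left unescaped so the
    path structure is preserved.
    """
    return quote(path, safe="/", encoding=charset)


# Sample path containing one non-ASCII character (an assumption for the demo).
path = "/café"

# The same character becomes two bytes in UTF-8 and one byte in ISO-8859-1.
utf8_form = percent_encode(path, "utf-8")
latin1_form = percent_encode(path, "iso-8859-1")

# A server that assumes the wrong charset decodes to a different string.
# UTF-8 bytes read as ISO-8859-1 yield mojibake:
utf8_read_as_latin1 = unquote(utf8_form, encoding="iso-8859-1")
# An ISO-8859-1 byte read as UTF-8 is invalid and becomes U+FFFD:
latin1_read_as_utf8 = unquote(latin1_form, encoding="utf-8", errors="replace")

if __name__ == "__main__":
    print(utf8_form)
    print(latin1_form)
    print(utf8_read_as_latin1)
    print(latin1_read_as_utf8)
```

Neither encoded form is wrong on its own; the mismatch only appears when the client and server disagree on the charset, which is why per-browser and per-locale data would be needed to pick a default.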