This data comes from a small random selection of the 4.5 million http:// URLs listed in the Open Directory Project, collected on . (There is an obvious strong bias towards some major sites, to English-language and European sites, to CNN.com, etc.)

Pages were downloaded with curl, following redirects, and those that returned HTTP status 200 with a text/html content type were analysed further. Each page was passed through the HTML5 tokenisation algorithm, recording details about start tags, their attributes and some other features. Non-PCDATA sections (<title>, <script>, etc.) were handled properly, but none of the rest of the tree-construction algorithm was performed. All data was treated as ISO-8859-1.
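
As a rough illustration of the filtering and counting steps described above, here is a minimal Python sketch. It is not the script used for this survey: it substitutes Python's standard html.parser for a real HTML5 tokeniser (the two differ in many edge cases), it does not reproduce the curl download step, and the names StartTagCounter, should_analyse and survey_page are hypothetical ones chosen for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCounter(HTMLParser):
    """Counts start tags and (tag, attribute) pairs; builds no tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[(tag, name)] += 1


def should_analyse(status, content_type):
    """Keep only HTTP 200 responses served as text/html."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/html"


def survey_page(raw_bytes):
    """Decode as ISO-8859-1 and count start tags and attributes.

    ISO-8859-1 maps every byte to a character, so decoding never fails,
    whatever encoding the page was really written in.
    """
    counter = StartTagCounter()
    counter.feed(raw_bytes.decode("iso-8859-1"))
    counter.close()
    return counter.tags, counter.attrs


if __name__ == "__main__":
    page = (b'<html><head><script>document.write("<b>")</script></head>'
            b'<body><p class="a">hi<p class=b><img src=x.png alt=""></body></html>')
    if should_analyse(200, "text/html; charset=utf-8"):
        tags, attrs = survey_page(page)
        print(tags.most_common())
        print(attrs.most_common())
```

The "<b>" inside the script element is treated as script text rather than as a start tag, which is the kind of non-PCDATA handling the paragraph above refers to.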

It may be interesting to compare against Rene Saarsoo's survey of pages from the same source a year ago, and Google's older survey from an unidentified set of pages.
