This data comes from a small random selection of the 4.5 million
http:// URLs listed in the Open
Directory Project, collected on . (There is an
obvious strong bias towards some major sites, towards English-language and European
sites, towards CNN.com, etc.)
Pages were downloaded with curl, following redirects, and only
those that returned HTTP status code 200 and a content type of text/html were
analysed further.
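
The filtering step can be sketched roughly as follows. This is an illustration only: the survey used curl, whereas this sketch swaps in Python's standard urllib (which also follows redirects by default). The names should_analyse and fetch are hypothetical, and ignoring parameters such as "; charset=..." when comparing the content type is an assumption about how the match was done.

```python
import urllib.request


def should_analyse(status, content_type):
    """Keep only responses with HTTP 200 and a text/html media type.

    Assumption: parameters after ';' (e.g. charset) are ignored.
    """
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/html"


def fetch(url, timeout=30):
    """Download url, following redirects; return body bytes or None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type")
            if should_analyse(response.status, content_type):
                return response.read()
    except OSError:
        # Covers URLError and HTTPError (e.g. 404), plus socket timeouts.
        return None
    return None
```

Keeping the accept/reject decision in its own function makes it possible to check the rule without touching the network.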
Each page was passed through the HTML5 tokenisation algorithm, recording details
about start tags, their attributes, and some other features. Non-PCDATA
sections (<title>, <script>, etc.) were handled
properly, but none of the rest of the tree-construction algorithm was performed.
All data was treated as ISO-8859-1.
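
As a rough illustration of this kind of tokeniser-only counting, here is a minimal sketch. It does not use the HTML5 tokenisation algorithm; it substitutes Python's standard html.parser, which likewise builds no tree and treats the contents of <script> as raw text. The names StartTagCounter and survey are hypothetical, and the counters shown are only an assumption about what "details about start tags and their attributes" might look like.

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCounter(HTMLParser):
    """Count start tags and (tag, attribute) pairs; no tree is built."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[(tag, name)] += 1


def survey(raw_bytes):
    """Decode as ISO-8859-1 (never fails on any byte) and count tags."""
    parser = StartTagCounter()
    parser.feed(raw_bytes.decode("iso-8859-1"))
    parser.close()
    return parser.tags, parser.attrs


page = b'<html><body><p class="x">hi<script>var s = "<b>";</script><p id=y></body></html>'
tags, attrs = survey(page)
```

Here the "<b>" inside the script is text rather than a tag, so it is not counted, and the two unclosed <p> start tags are both counted because no tree construction takes place.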
It may be interesting to compare against Rene Saarsoo's survey of pages from the same source a year ago, and Google's older survey from an unidentified set of pages.