This data comes from a small random selection of the 4.5 million http:// URLs listed in the Open Directory Project, collected on . (The sample is strongly biased towards some major sites and towards English-language and European sites, among others.)
Pages were downloaded with curl, following redirects, and only those that returned an HTTP 200 status code and a text/html content type were analysed further.
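The download-and-filter step could look roughly like the following sketch. It is not the script used for this data: the function names `is_html_ok` and `fetch` are hypothetical, and ignoring Content-Type parameters such as `; charset=...` is an assumption. The curl options used (`-s`, `-L`, `-o`, `-w`) are standard ones.

```python
import subprocess


def is_html_ok(status: int, content_type: str) -> bool:
    """Keep only responses with HTTP 200 and a text/html media type."""
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/html"


def fetch(url: str, out_path: str) -> tuple[int, str]:
    """Download url with curl, following redirects.

    Returns the final (status code, Content-Type) pair as reported by curl.
    """
    result = subprocess.run(
        ["curl", "-s", "-L", "-o", out_path,
         "-w", "%{http_code}\n%{content_type}", url],
        capture_output=True, text=True, check=False,
    )
    status, _, content_type = result.stdout.partition("\n")
    # curl reports 000 when no HTTP response was received.
    return int(status or 0), content_type
```

A page would then be kept for analysis only if `is_html_ok(*fetch(url, path))` is true.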
Each page was passed through the HTML5 tokenisation algorithm, recording details about start tags, their attributes, and some other features. Non-PCDATA sections (`<title>`, `<script>`, etc.) were handled properly, but none of the rest of the tree-construction algorithm was performed. All data was treated as ISO-8859-1.
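As a rough illustration of this kind of counting, the sketch below tallies start tags and attribute names without building a tree. It uses Python's standard `html.parser` as a stand-in, which is not the HTML5 tokenisation algorithm and can differ from it on malformed markup. The names `StartTagCounter`, `count_page`, and `tally` are hypothetical, and reporting both total occurrences and the number of pages using each element is an assumption about how such counts might be aggregated.

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCounter(HTMLParser):
    """Counts start tags and attribute names; no tree is built."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.elements = Counter()
        self.attributes = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        self.elements[tag] += 1
        for name, _value in attrs:
            self.attributes[name] += 1


def count_page(raw: bytes):
    """Decode as ISO-8859-1 (never fails) and count one page."""
    parser = StartTagCounter()
    parser.feed(raw.decode("iso-8859-1"))
    parser.close()
    return parser.elements, parser.attributes


def tally(pages):
    """Return (total occurrences, number of pages using) per element."""
    total = Counter()
    pages_using = Counter()
    for raw in pages:
        elements, _attributes = count_page(raw)
        total.update(elements)
        pages_using.update(elements.keys())
    return total, pages_using
```

Calling `tally` on an iterable of raw page bodies returns two `Counter` objects, and `most_common()` on either gives rows sorted by descending count.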

| Element | Count |
|---|---|
| td | 286575 |
| a | 282772 |
| br | 168479 |
| tr | 155049 |
| img | 146724 |
| font | 123885 |
| div | 95991 |
| p | 92280 |
| table | 60720 |
| b | 56751 |
| span | 56693 |
| li | 53778 |
| option | 35253 |
| meta | 34145 |
| script | 27801 |
| input | 23395 |
| strong | 19791 |
| i | 12096 |
| center | 9362 |
| link | 8476 |
| ul | 8230 |
| head | 8065 |
| body | 7904 |
| title | 7893 |
| html | 7855 |
| u | 7156 |
| area | 7136 |
| param | 6942 |
| hr | 6418 |
| form | 5319 |
| h2 | 4339 |
| em | 4098 |
| h3 | 3704 |
| style | 3561 |
| tbody | 3378 |
| h1 | 3043 |
| small | 2825 |
| noscript | 2798 |
| o:p | 2535 |
| nobr | 2418 |
| th | 2222 |
| h4 | 2101 |
| frame | 2012 |
| dd | 1902 |
| blockquote | 1878 |
| select | 1872 |
| map | 1824 |
| big | 1573 |
| dt | 1439 |
| embed | 1350 |
| object | 1294 |
| label | 1201 |
| h5 | 1180 |
| frameset | 1144 |
| spacer | 1030 |
| sup | 1025 |
| col | 881 |
| iframe | 803 |
| rating | 722 |
| dl | 684 |
| noframes | 580 |
| layer | 501 |
| h6 | 432 |
| style="text-decoration: | 394 |
| marquee | 351 |
| base | 342 |
| pre | 276 |
| applet | 250 |
| code | 241 |
| ol | 238 |
| address | 225 |
| cite | 190 |
| csobj | 176 |
| htpdiv | 171 |
| csaction | 164 |
| v:stroke | 154 |
| o:right | 145 |
| o:bottom | 145 |
| o:top | 145 |
| o:left | 145 |
| o:column | 144 |
| v:textbox | 143 |
| st1:place | 132 |
| v:path | 132 |
| o:lock | 132 |
| colgroup | 129 |
| v:shadow | 123 |
| textarea | 117 |
| tt | 107 |
| acronym | 107 |
| fieldset | 99 |
| bgsound | 97 |
| caption | 94 |
| v:fill | 89 |
| wbr | 86 |
| blink | 85 |
| button | 85 |
| ilayer | 84 |
| abbr | 77 |
| v:rect | 71 |

| Element | Count |
|---|---|
| head | 7608 |
| html | 7589 |
| title | 7578 |
| body | 7447 |
| meta | 7074 |
| a | 6772 |
| img | 6429 |
| br | 5869 |
| table | 5731 |
| tr | 5723 |
| td | 5722 |
| p | 5655 |
| script | 5146 |
| div | 4915 |
| font | 4237 |
| link | 4008 |
| b | 3785 |
| span | 3070 |
| style | 2463 |
| center | 2320 |
| form | 2266 |
| input | 2221 |
| strong | 2085 |
| noscript | 1748 |
| h1 | 1700 |
| hr | 1690 |
| li | 1650 |
| ul | 1570 |
| i | 1512 |
| h2 | 1192 |
| map | 1022 |
| area | 1013 |
| param | 926 |
| h3 | 923 |
| embed | 888 |
| object | 842 |
| select | 831 |
| option | 827 |
| em | 743 |
| frame | 720 |
| frameset | 718 |
| tbody | 647 |
| u | 628 |
| noframes | 574 |
| blockquote | 501 |
| iframe | 493 |
| h4 | 409 |
| small | 351 |
| base | 332 |
| label | 332 |
| th | 316 |
| h5 | 276 |
| nobr | 243 |
| marquee | 240 |
| dl | 200 |
| dd | 182 |
| dt | 179 |
| big | 171 |
| col | 159 |
| sup | 158 |
| ol | 141 |
| o:p | 135 |
| h6 | 121 |
| applet | 96 |
| bgsound | 95 |
| spacer | 92 |
| colgroup | 90 |
| address | 85 |
| textarea | 76 |
| pre | 75 |
| blink | 63 |
| layer | 56 |
| csscriptdict | 54 |
| fieldset | 53 |
| csactiondict | 53 |
| basefont | 51 |
| caption | 49 |
| csobj | 41 |
| legend | 35 |
| ilayer | 32 |
| tt | 30 |
| button | 28 |
| thead | 28 |
| st1:place | 27 |
| sub | 27 |
| noembed | 26 |
| cite | 25 |
| acronym | 25 |
| abbr | 25 |
| csaction | 25 |
| csactions | 25 |
| x-claris-tagview | 22 |
| x-claris-window | 22 |
| nolayer | 21 |
| st1:city | 20 |
| left | 17 |
| code | 16 |
| o:smarttagtype | 15 |
| image | 15 |
| wbr | 14 |

| Attribute | Count |
|---|---|
| align | 60 |
| border | 58 |
| alt | 37 |
| name | 29 |
| style | 27 |
| content | 26 |
| class | 24 |
| valign | 20 |
| width | 18 |
|  | 16 |
| ; | 14 |
| target | 11 |
| hspace | 9 |
| the | 9 |
| height | 8 |
| bgcolor | 8 |
| frameborder | 8 |
| vspace | 8 |
| type | 7 |
| size | 7 |
| onmouseover | 6 |
| cellpadding | 5 |
| value | 5 |
| id | 5 |
| onmouseout | 5 |
| and | 5 |
| cellspacing | 4 |
| href | 4 |
| de | 4 |
| of | 4 |
| color | 3 |
| face | 3 |
| scrolling | 3 |
| marginheight | 3 |
| title | 3 |
| for | 3 |
| <meta | 3 |
| framespacing | 2 |
| marginwidth | 2 |
| rel | 2 |
| colspan | 2 |
| topmargin | 2 |
| leftmargin | 2 |
| text | 2 |
| alink | 2 |
| maxlength | 2 |
| art, | 2 |
| be | 2 |
| to | 2 |
| html | 2 |
| was | 2 |
| yellow | 2 |
| red | 2 |
| rectangle | 2 |
| onload | 1 |
| src | 1 |
| http-equiv | 1 |
| bordercolor | 1 |
| noresize | 1 |
| method | 1 |
| action | 1 |
| background | 1 |
| vlink | 1 |
| link | 1 |
| lang | 1 |
| rightmargin | 1 |
| dir | 1 |
| coords | 1 |
| usemap | 1 |
| xpos | 1 |
| nowrap | 1 |
| bgproperties | 1 |
| , | 1 |
| telescope, | 1 |
| red, | 1 |
| y | 1 |
| så | 1 |
| " | 1 |
| msambientcpg | 1 |
| la | 1 |
| - | 1 |
| arte | 1 |
| mostre | 1 |
| archeologia | 1 |
| cinema | 1 |
| fotografia | 1 |
| siciliani, | 1 |
| musica | 1 |
| libri | 1 |
| di | 1 |
| architettura | 1 |
| turismo | 1 |
| sicilia, | 1 |
| storia | 1 |
| editore | 1 |
| catania, | 1 |
| casa | 1 |
| geografia | 1 |
| editrice | 1 |
| guide | 1 |