This data comes from a small random selection of the 4.5 million http:// URLs listed in the Open Directory Project, collected on . (There is an obvious strong bias towards some major sites, to English-language and European sites, to CNN.com, etc.)

Pages were downloaded with curl, following redirects, and those that returned HTTP status 200 with a text/html content type were analysed further. Each page was passed through the HTML5 tokenisation algorithm, recording details about start tags, their attributes, and some other features. Non-PCDATA sections (<title>, <script>, etc.) were handled properly, but none of the rest of the tree-construction algorithm was performed. All data was treated as ISO-8859-1.
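The counting step can be approximated in a few lines of Python. This is only a sketch, not the tool used for this survey: it substitutes Python's html.parser for the HTML5 tokeniser, so its error handling differs, and the names StartTagCounter and count_tags are invented for illustration. It decodes every page as ISO-8859-1 and keeps two tallies, one of all start tags and one of pages on which a tag appears at least once.

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCounter(HTMLParser):
    """Collects the names of all start tags seen in one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        # Self-closing tags such as <br/> also end up here by default.
        self.tags[tag] += 1


def count_tags(pages):
    """Return (total occurrences, number of pages using the tag) per tag name."""
    total = Counter()
    per_page = Counter()
    for raw in pages:
        parser = StartTagCounter()
        # Every page is decoded as ISO-8859-1, mirroring the survey's assumption.
        parser.feed(raw.decode("iso-8859-1"))
        parser.close()
        total.update(parser.tags)            # add every occurrence
        per_page.update(parser.tags.keys())  # add one per page that uses the tag
    return total, per_page


# Two made-up pages standing in for downloaded documents.
pages = [
    b"<html><body><p>one<br>two<br/>three</p></body></html>",
    b"<HTML><BODY><P>caf\xe9</P><script>var s = '<b>';</script></BODY></HTML>",
]
total, per_page = count_tags(pages)
print(total["br"], per_page["br"])
print(total["p"], per_page["p"])
print(total["b"])
```

html.parser skips markup-like text inside <script> and <style>, but depending on the Python version it may not do the same for <title>, so counts from this sketch would not match an HTML5 tokeniser exactly.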

It may be interesting to compare against Rene Saarsoo's survey of pages from the same source a year ago, and Google's older survey from an unidentified set of pages.

Number of pages downloaded: 8192
Number of pages analysed: 7739

Tag names (total)
td  286575
a  282772
br  168479
tr  155049
img  146724
font  123885
div  95991
p  92280
table  60720
b  56751
span  56693
li  53778
option  35253
meta  34145
script  27801
input  23395
strong  19791
i  12096
center  9362
link  8476
ul  8230
head  8065
body  7904
title  7893
html  7855
u  7156
area  7136
param  6942
hr  6418
form  5319
h2  4339
em  4098
h3  3704
style  3561
tbody  3378
h1  3043
small  2825
noscript  2798
o:p  2535
nobr  2418
th  2222
h4  2101
frame  2012
dd  1902
blockquote  1878
select  1872
map  1824
big  1573
dt  1439
embed  1350
object  1294
label  1201
h5  1180
frameset  1144
spacer  1030
sup  1025
col  881
iframe  803
rating  722
dl  684
noframes  580
layer  501
h6  432
style="text-decoration:  394
marquee  351
base  342
pre  276
applet  250
code  241
ol  238
address  225
cite  190
csobj  176
htpdiv  171
csaction  164
v:stroke  154
o:left  145
o:top  145
o:bottom  145
o:right  145
o:column  144
v:textbox  143
o:lock  132
v:path  132
st1:place  132
colgroup  129
v:shadow  123
textarea  117
tt  107
acronym  107
fieldset  99
bgsound  97
caption  94
v:fill  89
wbr  86
blink  85
button  85
ilayer  84
abbr  77
v:rect  71

Tag names (pages)
head  7608
html  7589
title  7578
body  7447
meta  7074
a  6772
img  6429
br  5869
table  5731
tr  5723
td  5722
p  5655
script  5146
div  4915
font  4237
link  4008
b  3785
span  3070
style  2463
center  2320
form  2266
input  2221
strong  2085
noscript  1748
h1  1700
hr  1690
li  1650
ul  1570
i  1512
h2  1192
map  1022
area  1013
param  926
h3  923
embed  888
object  842
select  831
option  827
em  743
frame  720
frameset  718
tbody  647
u  628
noframes  574
blockquote  501
iframe  493
h4  409
small  351
base  332
label  332
th  316
h5  276
nobr  243
marquee  240
dl  200
dd  182
dt  179
big  171
col  159
sup  158
ol  141
o:p  135
h6  121
applet  96
bgsound  95
spacer  92
colgroup  90
address  85
textarea  76
pre  75
blink  63
layer  56
csscriptdict  54
csactiondict  53
fieldset  53
basefont  51
caption  49
csobj  41
legend  35
ilayer  32
tt  30
thead  28
button  28
st1:place  27
sub  27
noembed  26
csaction  25
csactions  25
acronym  25
cite  25
abbr  25
x-claris-window  22
x-claris-tagview  22
nolayer  21
st1:city  20
left  17
code  16
o:smarttagtype  15
image  15
noindex  14
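The two tag tables answer different questions: the first counts every start tag, the second counts each page at most once per tag name. Dividing one by the other gives the mean number of occurrences on pages that use the tag at all. The snippet below is a small worked example for a handful of rows copied from the tables; the names counts and usage are chosen here for illustration.

```python
# (total occurrences, pages using the tag), copied from the two tag tables.
counts = {
    "td": (286575, 5722),
    "a": (282772, 6772),
    "font": (123885, 4237),
    "title": (7893, 7578),
}
PAGES_ANALYSED = 7739


def usage(tag):
    """Return (mean occurrences per page using the tag, % of analysed pages using it)."""
    total, pages = counts[tag]
    return total / pages, 100 * pages / PAGES_ANALYSED


for tag in counts:
    mean, share = usage(tag)
    print(f"{tag}: {mean:.1f} per page using it, on {share:.1f}% of analysed pages")
```

A mean above 1 for title indicates that some pages contain more than one <title> start tag.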

Parse errors (pages)
None  4324
Unrecognised entity name (in attribute)  2341
Duplicate attribute  428
Substituted value for numeric entity  369
Character '-' in CommentEndState  362
Non-permitted character '/'  352
Unexpected character in CommentEndState  276
Unrecognised entity name (in text)  238
Missing ';' after named entity  220
Character '?' in TagOpenState  209
Unexpected character in MarkupDeclarationOpenState  161
Attribute on end tag  78
Unexpected character in TagOpenState  58
Missing ';' after numeric entity  35
Unexpected character in AfterDoctypePublicIdentifierState  27
EOF in AttributeValueDoubleQuotedState  10
Unexpected character in CloseTagOpenState  9
Found U+0000 in input stream  7
Character '>' in TagOpenState  3
EOF in CommentState  3
Character '>' in CommentStartState  3
Character '>' in BeforeDoctypeNameState  2
Missing number after '&#'  2
Unexpected character in DoctypeState  2
EOF in BeforeAttributeNameState  2
EOF in AttributeValueSingleQuotedState  2
EOF in TagNameState  1
EOF in AttributeNameState  1
Unexpected character in AfterDoctypeSystemIdentifierState  1
Character '>' in BeforeDoctypeSystemIdentifierState  1
Character '>' in CloseTagOpenState  1
Character '>' in BeforeDoctypePublicIdentifierState  1
Unexpected character in AfterDoctypeNameState  1
Unexpected character in BeforeDoctypePublicIdentifierState  1

Duplicate attribute names (pages)
align  60
border  58
alt  37
name  29
style  27
content  26
class  24
valign  20
width  18
\u000D  16
;  14
target  11
the  9
hspace  9
vspace  8
frameborder  8
bgcolor  8
height  8
size  7
type  7
onmouseover  6
id  5
and  5
onmouseout  5
value  5
cellpadding  5
of  4
href  4
de  4
cellspacing  4
<meta  3
title  3
for  3
marginheight  3
scrolling  3
face  3
color  3
yellow  2
be  2
text  2
leftmargin  2
topmargin  2
html  2
art,  2
maxlength  2
colspan  2
rel  2
red  2
marginwidth  2
was  2
framespacing  2
to  2
rectangle  2
alink  2
monkey,  1
recipe,  1
grass,  1
nfl,  1
agapanthus  1
frangipani,  1
disease,  1
flower,  1
ginger,  1
startpagina,  1
storia  1
uniforms,  1
not  1
cotton  1
phormium  1
beach  1
filifera,  1
pictures,  1
white,  1
musa  1
martial  1
plumbago  1
"balance  1
white  1
australis  1
pakket,  1
"white"  1
makelaardij,  1
purple  1
woning,  1
sicilia,  1
martial,  1
frameboarder  1
eden,  1
hess,  1
vereniging,  1
makelaar,  1
radio,  1
grinch,  1
turismo  1
oleander  1
swiss  1
maneuver\u0094,  1
lang  1
acacia  1
world  1
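Many of the rarer names in this table look like ordinary words, often with a trailing comma. One plausible explanation, which is an assumption and not something the data confirms, is an unquoted keyword list: after the first space, each remaining word is tokenised as a separate attribute name, and a repeated word becomes a duplicate attribute. The sketch below detects duplicates with Python's html.parser, which here stands in for the HTML5 tokeniser; DuplicateAttributeFinder and duplicate_attributes are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttributeFinder(HTMLParser):
    """Records attribute names that occur more than once on a single start tag."""

    def __init__(self):
        super().__init__()
        self.duplicates = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept.
        seen = Counter(name for name, _value in attrs)
        self.duplicates.update(name for name, n in seen.items() if n > 1)


def duplicate_attributes(html):
    """Return the set of attribute names duplicated on any start tag in html."""
    finder = DuplicateAttributeFinder()
    finder.feed(html)
    finder.close()
    return finder.duplicates


# A genuine repeated attribute.
print(duplicate_attributes('<img alt="a" src="x.png" alt="b">'))

# An unquoted keyword list: "garden" is the value of content, and the
# remaining words, including "rose," twice, are parsed as attribute names.
print(duplicate_attributes("<meta name=keywords content=garden rose, white rose, red>"))
```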

Doctypes (pages)
None  3938
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  934
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"  855
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"  497
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  445
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"  148
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  117
html PUBLIC "-//w3c//dtd html 4.0 transitional//en"  100
HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"  66
HTML PUBLIC "-//IETF//DTD HTML//EN"  63
HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"  58
HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"  40
HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"  35
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"  32
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"  29
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"  27
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  26
html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"  25
html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"  17
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"  16
HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"  13
html PUBLIC "-//" (incorrect)  11
HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"  9
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd"  8
HTML PUBLIC "-//W3C//DTD HTML 3.2 FINAL//EN"  8
HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"  8
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"  7
html PUBLIC "-//w3c//dtd html 3.2//en"  7
doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en"  6
html PUBLIC "-//IETF//DTD HTML//EN//2.0"  6
html PUBLIC "-//IETF//DTD HTML 2.0//EN"  6
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" (incorrect)  6
html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"  5
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html40/loose.dtd"  4
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html40/strict.dtd"  4
HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//EN" "hmpro4.dtd"  4
HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 5.0::19980907::extensions to HTML 4.0//EN" "hmpro5.dtd"  4
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/transitional.dtd"  4
html PUBLIC "-//w3c//dtd html 4.01 transitional//en"  4
HTML PUBLIC "-//WC3//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"  3
HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" (incorrect)  3
html PUBLIC "-//IETF//DTD HTML 4.0//EN"  3
html PUBLIC "-//w3c//dtd xhtml 1.1//en" "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd"  3
HTML (incorrect)  3
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-transitional.dtd"  3
HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd"  3
HTML PUBLIC "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//EN" "hmpro6.dtd"  3
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd"  3
html PUBLIC "-//IETF//DTD HTML 3.0//EN"  3
html PUBLIC "-//W3C//DTD HTML 3.2//EN"  3
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http:\www.w3.org\TR\html\loose.dtd"  3
html PUBLIC "-//W3C//DTD HTML 4.01//EN"  2
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  2
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"  2
html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"  2
HTML PUBLIC "-//IETF//DTD HTML 4.0//EN"  2
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  2
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/strict.dtd"  2
(incorrect)  2
html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"  2
HTML PUBLIC "-//WC3//DTD HTML 3.2//EN"  2
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html40/loose.dtd"  2
html PUBLIC "-//W3C//DTD HTML 4.0 //EN"  2
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd"  2
HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"  2
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN\u000A\u0009\u0009" (incorrect)  1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" (incorrect)  1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//NL" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"  1
html PUBLIC "-//W3C//Dtd valign=" (incorrect)  1
HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/html4/loose.dtd"  1
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" (incorrect)  1
.php PUBLIC "-//W3C//DTD.php 4.01 Transitional//EN"  1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/REC-html40/"  1
html PUBLIC "-//W3C//DTD xHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-transitional.dtd"  1
HTML PUBLIC "-//IETF//DTD HTML 3.0//EN"  1
HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN" "http://www.w3.org/MarkUp/Wilbur/HTML32.dtd"  1
HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN"  1
html PUBLIC "-//W3C//DTD HTML 4.0//EN"  1
HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN" "hmpro4.dtd"  1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-Transitional.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.0 transitional//EN"  1
html PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"  1
HTML PUBLIC "-//W3C//DTD HTML 4.0 FINAL//EN"  1
html PUBLIC "-//w3c//dtd xhtml 1.0 transitional//en" "http://www.w3.org/tr/xhtml1/dtd/strict.dtd"  1
html PUBLIC "-//w3c/dtd html 4.01 transitional//en" "http://www.w3.org/tr/html4/strict.dtd"  1
HTML PUBLIC "-//Asymetrix//DTD ToolBook II HTML//EN"  1
HTML PUBLIC "-//W3C//DTD HTML 4.1 Transitional//EN"  1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"  1
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//en"  1
HTML PUBLIC "-//SQ//DTD HTML 2.0 HoTMetaL + extensions//EN"  1
HTML PUBLIC "-//W3C//DTD HTML 4.0//DE"  1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.01\u000ATransitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  1
html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"  1
HTML PUBLIC "-//W3C//DTD W3 HTML//EN"  1
HTML PUBLIC "-//W3C/DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"  1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://\u000Awww.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"  1
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//LV"  1
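For anyone who wants to group these rows (for example by public identifier regardless of case), the labels appear to follow one shape: a doctype name, an optional PUBLIC identifier, an optional system identifier, and an optional "(incorrect)" marker. The pattern below is a sketch built on that reading of the table, not a description of how the survey produced the labels; DOCTYPE_ROW and parse_doctype_row are hypothetical names, and the "None" row is not treated specially.

```python
import re

# Hypothetical pattern for one row label of the doctype table (count removed).
DOCTYPE_ROW = re.compile(
    r'^(?P<name>[^\s(]*)'                    # doctype name, may be empty
    r'(?:\s+PUBLIC\s+"(?P<public>[^"]*)")?'  # optional public identifier
    r'(?:\s+"(?P<system>[^"]*)")?'           # optional system identifier
    r'\s*(?P<incorrect>\(incorrect\))?$'     # optional "(incorrect)" marker
)


def parse_doctype_row(label):
    """Split a row label into its parts; returns None if the shape is unexpected."""
    match = DOCTYPE_ROW.match(label)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn the captured marker into a plain boolean.
    parts["incorrect"] = parts["incorrect"] is not None
    return parts


strict = parse_doctype_row(
    'html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"'
)
print(strict["name"], strict["public"], strict["system"])

truncated = parse_doctype_row('html PUBLIC "-//" (incorrect)')
print(truncated["public"], truncated["incorrect"])
```

Lowercasing the public field before counting would merge rows that differ only in case, such as the HTML 4.0 Transitional variants.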