This data comes from a small random selection of the 4.5 million http:// URLs listed in the Open Directory Project, collected on . (The sample is obviously and strongly biased towards some major sites, and towards English-language and European sites, among others.)

Pages were downloaded with curl, following redirects, and those that returned HTTP status code 200 with a text/html content type were analysed further. Each page was passed through the HTML5 tokenisation algorithm, recording details about start tags, their attributes, and some other features. Non-PCDATA sections (<title>, <script>, etc.) were handled properly, but none of the rest of the tree-construction algorithm was performed. All data was treated as ISO-8859-1.
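The filtering and counting steps described above can be sketched roughly as follows. This is only an illustration, not the code used for this survey: it swaps in Python's standard `html.parser` for the HTML5 tokenisation algorithm, and the names `should_analyse`, `StartTagCollector` and `count_tags` are hypothetical. It shows how the "total" and "pages" figures in the tables below differ: the first counts every start tag, the second counts each tag name at most once per page.

```python
from collections import Counter
from html.parser import HTMLParser


def should_analyse(status, content_type):
    """Keep only HTTP 200 responses whose media type is text/html.

    Assumption: parameters such as "; charset=..." are ignored.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/html"


class StartTagCollector(HTMLParser):
    """Collects the name of every start tag seen in one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.tags.append(tag)


def count_tags(pages):
    """Return (total occurrences, number of pages using each tag)."""
    total = Counter()
    per_page = Counter()
    for raw in pages:
        collector = StartTagCollector()
        # Every page is decoded as ISO-8859-1, as stated in the text.
        collector.feed(raw.decode("iso-8859-1"))
        collector.close()
        total.update(collector.tags)
        per_page.update(set(collector.tags))  # once per page
    return total, per_page


pages = [
    b"<html><body><p>one<br>two<br></p></body></html>",
    b"<html><body><p>three</p></body></html>",
]
total, per_page = count_tags(pages)
print(total["br"], per_page["br"])  # occurrences vs. pages for <br>
print(total["p"], per_page["p"])    # occurrences vs. pages for <p>
```

Because `html.parser` is not a conforming HTML5 tokeniser, its handling of malformed markup may differ from the results reported here.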

Number of pages downloaded: 8192
Number of pages analysed: 7739
Tag names (total)
td	286575
a	282772
br	168479
tr	155049
img	146724
font	123885
div	95991
p	92280
table	60720
b	56751
span	56693
li	53778
option	35253
meta	34145
script	27801
input	23395
strong	19791
i	12096
center	9362
link	8476
ul	8230
head	8065
body	7904
title	7893
html	7855
u	7156
area	7136
param	6942
hr	6418
form	5319
h2	4339
em	4098
h3	3704
style	3561
tbody	3378
h1	3043
small	2825
noscript	2798
o:p	2535
nobr	2418
th	2222
h4	2101
frame	2012
dd	1902
blockquote	1878
select	1872
map	1824
big	1573
dt	1439
embed	1350
object	1294
label	1201
h5	1180
frameset	1144
spacer	1030
sup	1025
col	881
iframe	803
rating	722
dl	684
noframes	580
layer	501
h6	432
style="text-decoration:	394
marquee	351
base	342
pre	276
applet	250
code	241
ol	238
address	225
cite	190
csobj	176
htpdiv	171
csaction	164
v:stroke	154
o:right	145
o:bottom	145
o:top	145
o:left	145
o:column	144
v:textbox	143
st1:place	132
v:path	132
o:lock	132
colgroup	129
v:shadow	123
textarea	117
tt	107
acronym	107
fieldset	99
bgsound	97
caption	94
v:fill	89
wbr	86
blink	85
button	85
ilayer	84
abbr	77
v:rect	71
Tag names (pages)
head	7608
html	7589
title	7578
body	7447
meta	7074
a	6772
img	6429
br	5869
table	5731
tr	5723
td	5722
p	5655
script	5146
div	4915
font	4237
link	4008
b	3785
span	3070
style	2463
center	2320
form	2266
input	2221
strong	2085
noscript	1748
h1	1700
hr	1690
li	1650
ul	1570
i	1512
h2	1192
map	1022
area	1013
param	926
h3	923
embed	888
object	842
select	831
option	827
em	743
frame	720
frameset	718
tbody	647
u	628
noframes	574
blockquote	501
iframe	493
h4	409
small	351
base	332
label	332
th	316
h5	276
nobr	243
marquee	240
dl	200
dd	182
dt	179
big	171
col	159
sup	158
ol	141
o:p	135
h6	121
applet	96
bgsound	95
spacer	92
colgroup	90
address	85
textarea	76
pre	75
blink	63
layer	56
csscriptdict	54
fieldset	53
csactiondict	53
basefont	51
caption	49
csobj	41
legend	35
ilayer	32
tt	30
button	28
thead	28
st1:place	27
sub	27
noembed	26
cite	25
acronym	25
abbr	25
csaction	25
csactions	25
x-claris-tagview	22
x-claris-window	22
nolayer	21
st1:city	20
left	17
code	16
o:smarttagtype	15
image	15
wbr	14
Parse errors (pages)
None	4324
Unrecognised entity name (in attribute)	2341
Duplicate attribute	428
Substituted value for numeric entity	369
Character '-' in CommentEndState	362
Non-permitted character '/'	352
Unexpected character in CommentEndState	276
Unrecognised entity name (in text)	238
Missing ';' after named entity	220
Character '?' in TagOpenState	209
Unexpected character in MarkupDeclarationOpenState	161
Attribute on end tag	78
Unexpected character in TagOpenState	58
Missing ';' after numeric entity	35
Unexpected character in AfterDoctypePublicIdentifierState	27
EOF in AttributeValueDoubleQuotedState	10
Unexpected character in CloseTagOpenState	9
Found U+0000 in input stream	7
Character '>' in CommentStartState	3
Character '>' in TagOpenState	3
EOF in CommentState	3
Missing number after '&#'	2
Character '>' in BeforeDoctypeNameState	2
EOF in AttributeValueSingleQuotedState	2
EOF in BeforeAttributeNameState	2
Unexpected character in DoctypeState	2
Character '>' in CloseTagOpenState	1
Unexpected character in BeforeDoctypePublicIdentifierState	1
EOF in AttributeNameState	1
Character '>' in BeforeDoctypePublicIdentifierState	1
Unexpected character in AfterDoctypeNameState	1
Character '>' in BeforeDoctypeSystemIdentifierState	1
Unexpected character in AfterDoctypeSystemIdentifierState	1
EOF in TagNameState	1
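Since the per-page figures are counts out of the 7739 pages analysed, they are often easier to compare as percentages. A minimal sketch of that conversion follows; `share_of_pages` is a hypothetical helper, and only three rows of the parse-error table are copied in as sample input.

```python
PAGES_ANALYSED = 7739  # "Number of pages analysed" from the summary


def share_of_pages(page_count, total=PAGES_ANALYSED):
    """Return page_count as a percentage of the analysed pages."""
    return 100.0 * page_count / total


# A few rows from the parse-error table (pages showing each error).
parse_error_pages = {
    "None": 4324,
    "Unrecognised entity name (in attribute)": 2341,
    "Duplicate attribute": 428,
}

for label, count in parse_error_pages.items():
    print(f"{label}: {share_of_pages(count):.1f}%")
```

A single page can trigger several different parse errors, so these shares are not expected to add up to 100%.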
Duplicate attribute names (pages)
align	60
border	58
alt	37
name	29
style	27
content	26
class	24
valign	20
width	18
	16
;	14
target	11
hspace	9
the	9
height	8
bgcolor	8
frameborder	8
vspace	8
type	7
size	7
onmouseover	6
cellpadding	5
value	5
id	5
onmouseout	5
and	5
cellspacing	4
href	4
de	4
of	4
color	3
face	3
scrolling	3
marginheight	3
title	3
for	3
<meta	3
framespacing	2
marginwidth	2
rel	2
colspan	2
topmargin	2
leftmargin	2
text	2
alink	2
maxlength	2
art,	2
be	2
to	2
html	2
was	2
yellow	2
red	2
rectangle	2
onload	1
src	1
http-equiv	1
bordercolor	1
noresize	1
method	1
action	1
background	1
vlink	1
link	1
lang	1
rightmargin	1
dir	1
coords	1
usemap	1
xpos	1
nowrap	1
bgproperties	1
,	1
telescope,	1
red,	1
y	1
s&aring;	1
"	1
msambientcpg	1
la	1
-	1
arte	1
mostre	1
archeologia	1
cinema	1
fotografia	1
siciliani,	1
musica	1
libri	1
di	1
architettura	1
turismo	1
sicilia,	1
storia	1
editore	1
catania,	1
casa	1
geografia	1
editrice	1
guide	1
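The duplicate-attribute count above can be approximated with a short sketch. It again uses Python's `html.parser` as a stand-in for the HTML5 tokeniser, and `DuplicateAttributeFinder` and `duplicate_attribute_pages` are hypothetical names. Each attribute name is counted once per page, however many tags repeat it. The third sample page suggests one plausible origin for entries such as "the" and "and": presumably, unquoted attribute values containing spaces are split into separate attribute names.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttributeFinder(HTMLParser):
    """Records attribute names that repeat within a single start tag."""

    def __init__(self):
        super().__init__()
        self.duplicates = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs and keeps repeats.
        seen = set()
        for name, _value in attrs:
            if name in seen:
                self.duplicates.add(name)
            seen.add(name)


def duplicate_attribute_pages(pages):
    """Count, per attribute name, the pages that duplicate it."""
    counts = Counter()
    for text in pages:
        finder = DuplicateAttributeFinder()
        finder.feed(text)
        finder.close()
        counts.update(finder.duplicates)  # at most once per page
    return counts


pages = [
    '<table align="left" align="center"><tr><td>x</td></tr></table>',
    '<img src="a.gif" alt="x" border=0 alt="y"><img alt="p" alt="q">',
    # Unquoted value: every word after "guide" becomes an attribute name.
    '<meta name=description content=guide to the art and the history>',
    '<p class="a">no duplicates here</p>',
]
counts = duplicate_attribute_pages(pages)
print(sorted(counts.items()))
```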
Doctypes (pages)
None	3938
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"	934
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"	855
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"	497
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"	445
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"	148
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"	117
html PUBLIC "-//w3c//dtd html 4.0 transitional//en"	100
HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"	66
HTML PUBLIC "-//IETF//DTD HTML//EN"	63
HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"	58
HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"	40
HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"	35
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"	32
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"	29
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"	27
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"	26
html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"	25
html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"	17
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"	16
HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"	13
html PUBLIC "-//" (incorrect)	11
HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"	9
HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"	8
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd"	8
HTML PUBLIC "-//W3C//DTD HTML 3.2 FINAL//EN"	8
html PUBLIC "-//w3c//dtd html 3.2//en"	7
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"	7
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" (incorrect)	6
html PUBLIC "-//IETF//DTD HTML 2.0//EN"	6
doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en"	6
html PUBLIC "-//IETF//DTD HTML//EN//2.0"	6
html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"	5
html PUBLIC "-//w3c//dtd html 4.01 transitional//en"	4
HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 5.0::19980907::extensions to HTML 4.0//EN" "hmpro5.dtd"	4
html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html40/loose.dtd"	4
html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html40/strict.dtd"	4
HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//EN" "hmpro4.dtd"	4
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/transitional.dtd"	4
html PUBLIC "-//IETF//DTD HTML 3.0//EN"	3
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd"	3
HTML PUBLIC "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//EN" "hmpro6.dtd"	3
HTML (incorrect)	3
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-transitional.dtd"	3
html PUBLIC "-//w3c//dtd xhtml 1.1//en" "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd"	3
html PUBLIC "-//W3C//DTD HTML 3.2//EN"	3
HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" (incorrect)	3
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http:\\www.w3.org\TR\html\loose.dtd"	3
HTML PUBLIC "-//WC3//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"	3
HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd"	3
html PUBLIC "-//IETF//DTD HTML 4.0//EN"	3
html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"	2
html PUBLIC "-//W3C//DTD HTML 4.01//EN"	2
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html40/loose.dtd"	2
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"	2
(incorrect)	2
HTML PUBLIC "-//IETF//DTD HTML 4.0//EN"	2
html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"	2
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"	2
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"	2
HTML PUBLIC "-//WC3//DTD HTML 3.2//EN"	2
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd"	2
HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"	2
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/strict.dtd"	2
html PUBLIC "-//W3C//DTD HTML 4.0 //EN"	2
html PUBLIC "-//w3c//dtd html 4.01 frameset//en" "http://www.w3.org/tr/html4/frameset.dtd"	1
html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"	1
html PUBLIC "-//w3c//dtd xhtml 1.0 transitional//en" "http://www.w3.org/tr/xhtml1/dtd/strict.dtd"	1
HTML PUBLIC "-//WC3//DTD HTML 4.01 Transitional//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 transitional//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4//FR"	1
HTML PUBLIC "-//W3C//Dtd HTML 4.0 transitional//EN"	1
HTML PUBLIC "-//SoftQuad Software//DTD HoTMetaL PRO 5.0::19981217::extensions to HTML 4.0//EN" "hmpro5.dtd"	1
HTML PUBLIC "-//W3C//DTD HTML 4.O Frameset//EN" "http://www.w3.org/tr/rec-html40/frameset.dtd"	1
HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN"	1
html PUBLIC "-//W3C//Dtd valign=" (incorrect)	1
HTML PUBLIC "-//w3c//dTD HTML 4.01 Transitional//eN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN"	1
html PUBLIC "-//W3C//DTD html 4.0 Transitional//EN"	1
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1-strict.dtd"	1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//DE" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"	1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd "	1
HTML PUBLIC "Rock Ridge Farm"	1
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http:// www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"	1
html PUBLIC "-//W3C//DTD HTML 4.0//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/PR-html40/frameset.dtd"	1
HTML PUBLIC "-//W3C//DTD W3 HTML//EN"	1
HTML PUBLIC "-//SQ//DTD HTML 2.0 HoTMetaL + extensions//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 3.2 Transitional//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 FINAL//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 FRAMESET//EN"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd"	1
HTML PUBLIC "-//WC//DTD HTML 4.0//EN"	1
HTML PUBLIC "-//SoftQuad//DTD HTML 2.0 + extensions for HoTMetaL Light 3.0 19960703//EN"	1
html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http: //www.w3.org/TR/1998/REC-html40-19980424/loose.dtd"	1
HTML PUBLIC "-//Asymetrix//DTD ToolBook II HTML//EN"	1
HTML PUBLIC "-//SoftQuad//DTD draft HTML 3.2 + extensions for HoTMetaL PRO 3.0 19960923//EN" "hmpro3.dtd"	1
HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" (incorrect)	1
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"	1
HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//SV" "http://www.w3.org/TR/html4/loose.dtd"	1
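Each doctype row above consists of a name, an optional public identifier and an optional system identifier. A rough sketch of splitting a declaration into those three parts is shown below. It uses a regular expression rather than the HTML5 DOCTYPE tokeniser states, so it will not reproduce the "(incorrect)" flag or every malformed case in the table; `DOCTYPE_RE` and `parse_doctype` are hypothetical names.

```python
import re

# Simplified pattern: name, then optionally PUBLIC "public id" "system id".
DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+(?P<name>[^\s>]+)
        (?:\s+PUBLIC\s+(?P<q1>["'])(?P<public>.*?)(?P=q1)
           (?:\s+(?P<q2>["'])(?P<system>.*?)(?P=q2))?
        )?
        \s*>""",
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)


def parse_doctype(html):
    """Return (name, public_id, system_id), or None if no doctype matches."""
    match = DOCTYPE_RE.search(html)
    if match is None:
        return None
    # The name keeps its original case, so "HTML" and "html" stay distinct,
    # as they do in the table.
    return match.group("name"), match.group("public"), match.group("system")


page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    '"http://www.w3.org/TR/html4/loose.dtd"><html></html>'
)
print(parse_doctype(page))
print(parse_doctype("<html><body>no doctype</body></html>"))
```

Grouping the results case-insensitively on the public identifier would merge many of the rows that differ only in capitalisation.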