Just testing some analysis of the data from tokenising ~2500 HTML documents. Various bugs mean the results are not very good:

Start tag names

End tag names

Attribute names

Attributes per start tag

Attribute value lengths (in characters)
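
For reference, here is a minimal sketch of how tallies like the five above could be gathered. It assumes Python's standard `html.parser` as a stand-in tokeniser, which is not necessarily what produced the data here and which does not follow the HTML5 tokenisation algorithm exactly. The names `TokenStats` and `collect_stats` are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TokenStats(HTMLParser):
    """Hypothetical collector for the five tallies listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.start_tags = Counter()           # start tag name -> occurrences
        self.end_tags = Counter()             # end tag name -> occurrences
        self.attr_names = Counter()           # attribute name -> occurrences
        self.attrs_per_start_tag = Counter()  # attribute count -> number of start tags
        self.attr_value_lengths = Counter()   # value length in characters -> occurrences

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.start_tags[tag] += 1
        self.attrs_per_start_tag[len(attrs)] += 1
        for name, value in attrs:
            self.attr_names[name] += 1
            # Attributes without a value (e.g. "hidden") arrive as None;
            # this sketch counts them as length 0.
            self.attr_value_lengths[len(value or "")] += 1

    def handle_startendtag(self, tag, attrs):
        # Treat "<br/>" as a start tag only, so it is not also
        # counted as an end tag (the default would count both).
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.end_tags[tag] += 1


def collect_stats(documents):
    """Feed each document (a str) through one collector and return it."""
    stats = TokenStats()
    for html in documents:
        stats.feed(html)
        stats.close()  # flush any buffered input
        stats.reset()  # clear parser state; the counters are kept
    return stats
```

Calling `collect_stats(docs).start_tags.most_common(20)` would then give the most frequent start tag names, and similarly for the other counters.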