Rubbish HTML stats

Just testing some analysis of the data from tokenising ~2500 HTML documents. Various bugs mean the results are not very reliable: