Tokeniser state transitions, over 2522 documents (~100MB), treating everything as PCDATA:

Format: "from_state to_state: count"

AfterDoctypeSystemIdentifierState BogusDoctypeState: 1
CloseTagOpenState DataState: 1
CommentStartState DataState: 2
DoctypeSystemIdentifierSingleQuotedState AfterDoctypeSystemIdentifierState: 3
AfterDoctypePublicIdentifierState DoctypeSystemIdentifierSingleQuotedState: 3
CommentState DataState: 3
AfterDoctypeSystemIdentifierState AfterDoctypeSystemIdentifierState: 4
BeforeDoctypePublicIdentifierState DoctypePublicIdentifierSingleQuotedState: 6
DoctypePublicIdentifierSingleQuotedState AfterDoctypePublicIdentifierState: 6
AfterDoctypeNameState BogusDoctypeState: 8
CloseTagOpenState BogusCommentState: 38
AfterDoctypeNameState AfterDoctypeNameState: 54
BeforeAttributeValueState DataState: 98
TagOpenState BogusCommentState: 103
AfterDoctypePublicIdentifierState BogusDoctypeState: 112
BogusDoctypeState DataState: 121
DoctypeSystemIdentifierSingleQuotedState DoctypeSystemIdentifierSingleQuotedState: 159
DoctypePublicIdentifierSingleQuotedState DoctypePublicIdentifierSingleQuotedState: 220
AfterAttributeNameState DataState: 335
AfterAttributeNameState BeforeAttributeNameState: 491
CommentStartDashState CommentState: 573
CommentStartDashState CommentEndState: 653
AfterDoctypePublicIdentifierState DataState: 669
AfterDoctypePublicIdentifierState DoctypeSystemIdentifierDoubleQuotedState: 829
DoctypeSystemIdentifierDoubleQuotedState AfterDoctypeSystemIdentifierState: 829
AfterDoctypeSystemIdentifierState DataState: 831
MarkupDeclarationOpenState BogusCommentState: 920
BogusCommentState_Continue DataState: 1061
BogusCommentState BogusCommentState_Continue: 1061
CommentStartState CommentStartDashState: 1226
AfterDoctypePublicIdentifierState AfterDoctypePublicIdentifierState: 1459
CommentEndState CommentState: 1559
BeforeDoctypePublicIdentifierState DoctypePublicIdentifierDoubleQuotedState: 1607
DoctypePublicIdentifierDoubleQuotedState AfterDoctypePublicIdentifierState: 1607
AfterDoctypeNameState BeforeDoctypePublicIdentifierState: 1613
BeforeDoctypePublicIdentifierState BeforeDoctypePublicIdentifierState: 1613
MarkupDeclarationOpenState DoctypeState: 1621
DoctypeState BeforeDoctypeNameState: 1621
DoctypeNameState AfterDoctypeNameState: 1621
BeforeDoctypeNameState DoctypeNameState: 1621
AttributeNameState BeforeAttributeNameState: 1644
EntityInAttributeValueState_Unquoted AttributeValueUnquotedState: 1904
AttributeValueUnquotedState EntityInAttributeValueState_Unquoted: 1904
AfterAttributeNameState BeforeAttributeValueState: 3561
AttributeNameState DataState: 4168
BeforeAttributeValueState BeforeAttributeValueState: 4354
TagOpenState DataState: 4660
DoctypeNameState DoctypeNameState: 4872
BogusDoctypeState BogusDoctypeState: 6313
EntityInAttributeValueState_SingleQuoted AttributeValueSingleQuotedState: 13272
AttributeValueSingleQuotedState EntityInAttributeValueState_SingleQuoted: 13272
AfterAttributeNameState AttributeNameState: 26191
AttributeNameState AfterAttributeNameState: 30578
CommentEndDashState CommentState: 32424
CommentEndState CommentEndState: 34087
DoctypeSystemIdentifierDoubleQuotedState DoctypeSystemIdentifierDoubleQuotedState: 40526
BeforeAttributeValueState AttributeValueSingleQuotedState: 50487
AttributeValueSingleQuotedState BeforeAttributeNameState: 50487
AttributeValueUnquotedState DataState: 53147
CommentStartState CommentState: 58273
DoctypePublicIdentifierDoubleQuotedState DoctypePublicIdentifierDoubleQuotedState: 58693
CommentEndState DataState: 59496
MarkupDeclarationOpenState CommentStartState: 59501
CommentEndDashState CommentEndState: 60402
AfterAttributeNameState AfterAttributeNameState: 60654
TagOpenState MarkupDeclarationOpenState: 62042
AttributeValueUnquotedState BeforeAttributeNameState: 84322
EntityInAttributeValueState_DoubleQuoted AttributeValueDoubleQuotedState: 86084
AttributeValueDoubleQuotedState EntityInAttributeValueState_DoubleQuoted: 86084
CommentState CommentEndDashState: 92826
BeforeAttributeValueState AttributeValueUnquotedState: 137469
DataState EntityDataState: 143852
EntityDataState DataState: 143852
BogusCommentState_Continue BogusCommentState_Continue: 230673
AttributeValueUnquotedState AttributeValueUnquotedState: 694873
BeforeAttributeNameState DataState: 806080
TagNameState BeforeAttributeNameState: 863828
CloseTagOpenState TagNameState: 976495
TagOpenState CloseTagOpenState: 976534
BeforeAttributeNameState BeforeAttributeNameState: 1020627
AttributeValueSingleQuotedState AttributeValueSingleQuotedState: 1136691
TagOpenState TagNameState: 1260933
TagNameState DataState: 1373600
AttributeValueDoubleQuotedState BeforeAttributeNameState: 1505620
BeforeAttributeValueState AttributeValueDoubleQuotedState: 1505620
AttributeNameState BeforeAttributeValueState: 1690113
BeforeAttributeNameState AttributeNameState: 1700312
DataState TagOpenState: 2304272
TagNameState TagNameState: 3365406
CommentState CommentState: 6899456
AttributeNameState AttributeNameState: 7135748
AttributeValueDoubleQuotedState AttributeValueDoubleQuotedState: 27603954
DataState DataState: 32279170
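
The counts above can be cross-checked mechanically. The following is a minimal Python sketch, not the script that produced these figures: it assumes the entries are available one per line in the "from_state to_state: count" format, and the names parse_transitions and outgoing_totals are hypothetical helpers introduced here for illustration. The sample holds only the six entries that involve TagOpenState, copied from the listing, and checks that the transitions leaving TagOpenState add up to the number of times it was entered from DataState.

```python
import re
from collections import defaultdict

# One entry per line: "from_state to_state: count".
# \w also matches the underscore in names like BogusCommentState_Continue.
ENTRY = re.compile(r"^(\w+) (\w+): (\d+)$")


def parse_transitions(text):
    """Return {(from_state, to_state): count}; lines that are not entries are skipped."""
    counts = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] = int(match.group(3))
    return counts


def outgoing_totals(counts):
    """Sum the counts of all transitions leaving each state."""
    totals = defaultdict(int)
    for (src, _dst), n in counts.items():
        totals[src] += n
    return dict(totals)


# Entries copied from the listing; only those involving TagOpenState.
SAMPLE = """\
TagOpenState BogusCommentState: 103
TagOpenState DataState: 4660
TagOpenState MarkupDeclarationOpenState: 62042
TagOpenState CloseTagOpenState: 976534
TagOpenState TagNameState: 1260933
DataState TagOpenState: 2304272
"""

counts = parse_transitions(SAMPLE)
out_totals = outgoing_totals(counts)

# Transitions out of TagOpenState versus transitions into it from DataState.
print(out_totals["TagOpenState"], counts[("DataState", "TagOpenState")])
```

Both printed numbers are 2304272, which is what one would expect for a state that never loops back to itself. The same comparison can be run over the full listing to spot states whose incoming and outgoing totals disagree.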