HtmlTokenizer
Last updated at 7:57 am UTC on 16 September 2017
In category: 'Etoys-Squeakland-Network-HTML-Tokenizer' in Squeak 6.0a.
This class takes a text stream and produces a sequence of HTML tokens.
It requires its source stream to support #peek.
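The #peek requirement can be illustrated with a plain ReadStream, which answers the next character without consuming it. This is a minimal workspace sketch; the sample string is made up for illustration.

```smalltalk
| stream first second third |
stream := ReadStream on: '<p>Hi</p>'.
first := stream peek.    "answers $< and leaves the position unchanged"
second := stream next.   "answers $< again and advances by one"
third := stream peek.    "answers $p, the character after the consumed one"
```

A tokenizer can use this lookahead to decide between a tag and text before consuming any input, which is what the test on the first character in the next method relies on.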
Create an instance with HtmlTokenizer on: aStream. Example:
| tokenizer htmlSource |
htmlSource := '<h1>The title of my report</h1><p>This report is about ...</p>'.
tokenizer := HtmlTokenizer on: htmlSource readStream.
Transcript clear.
[tokenizer atEnd] whileFalse: [Transcript show: tokenizer next printString; cr]
Output on the Transcript:
{HtmlTag:<h1>}
{HtmlText:The title of my report}
{HtmlTag:</h1>}
{HtmlTag:<p>}
{HtmlText:This report is about ...}
{HtmlTag:</p>}
HtmlTokenizer is used by the HtmlParser class.
The token types are:
HtmlToken printHierarchy '
ProtoObject #()
    Object #()
        HtmlToken #(''source'')
            HtmlComment #()
            HtmlTag #(''isNegated'' ''name'' ''attribs'')
            HtmlText #(''text'')'
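As a hedged sketch of how these token types might be told apart in practice, the snippet below collects the names of all opening tags. It uses only the selectors that appear in HtmlTokenizer>>next (isTag, isNegated, name); the sample HTML string and the variable names are assumptions made for illustration.

```smalltalk
| tokenizer token openTags |
tokenizer := HtmlTokenizer on: '<h1>Title</h1><p>Body</p>' readStream.
openTags := OrderedCollection new.
[tokenizer atEnd] whileFalse: [
	token := tokenizer next.
	"Keep opening tags only; closing tags answer true to isNegated"
	(token isTag and: [token isNegated not])
		ifTrue: [openTags add: token name]].
openTags
```

Inspecting openTags afterwards shows which elements were opened, in document order.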
Implementation of HtmlTokenizer next
next
"return the next HtmlToken, or nil if there are no more"
|token|
"branch, depending on what the first character is"
self atEnd ifTrue: [ ^nil ].
self peekChar = $<
ifTrue: [ token := self nextTagOrComment ]
ifFalse: [ token := self nextText ].
"return the token, modulo modifications inside of textarea's"
textAreaLevel > 0 ifTrue: [
(token isTag and: [ token name = 'textarea' ]) ifTrue: [
"textarea tag--change textAreaLevel accordingly"
token isNegated
ifTrue: [ textAreaLevel := textAreaLevel - 1 ]
ifFalse: [ textAreaLevel := textAreaLevel + 1 ].
textAreaLevel > 0
ifTrue: [
"still inside a <textarea>, so convert this tag to text"
^HtmlText forSource: token source ]
ifFalse: [ "end of the textarea; return the tag" ^token ] ].
"end of the textarea"
"inside the text area--return the token as text"
^HtmlText forSource: token source ].
(token isTag and: [ token isNegated not and: [ token name = 'textarea' ]]) ifTrue: [
"beginning of a textarea"
inTextArea := true.
^token ].
^token
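The nesting rule this method aims for can be sketched independently of HtmlTokenizer. The following workspace snippet is an illustration only, not the class's code: it walks a hand-written list of token sources (an assumption) and classifies each one as #tag or #text while tracking a nesting level the way the comments in the method describe.

```smalltalk
| level result |
level := 0.
result := OrderedCollection new.
#('<p>' '<textarea>' '<b>' 'hello' '</textarea>' '</p>') do: [:source |
	| isOpen isClose asText |
	isOpen := source = '<textarea>'.
	isClose := source = '</textarea>'.
	level > 0
		ifTrue: [
			"Inside a textarea: adjust the level on textarea tags"
			isOpen ifTrue: [level := level + 1].
			isClose ifTrue: [level := level - 1].
			"Everything is text until the level drops back to 0"
			asText := level > 0]
		ifFalse: [
			"Outside a textarea: an opening textarea tag starts level 1"
			isOpen ifTrue: [level := 1].
			asText := (source first = $<) not].
	result add: (asText ifTrue: [#text] ifFalse: [#tag]) -> source].
result
```

In this sketch the '<b>' inside the textarea is classified as #text, while the closing '</textarea>' that brings the level back to 0 is classified as #tag.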