All Implemented Interfaces: DTDConstants
Direct Known Subclasses: DocumentParser
public class Parser extends Object implements DTDConstants
Unfortunately there are many badly implemented HTML parsers out there, and as a result there are many badly formatted HTML files. This parser attempts to parse most HTML files. This means that the implementation sometimes deviates from the SGML specification in favor of HTML.
The parser treats \r and \r\n as \n. Newlines after start tags and before end tags are ignored, just as specified in the SGML/HTML specification.
The HTML spec does not specify very well how spaces are to be coalesced. Specifically, the following scenarios are not discussed (note that a space should be used here, but I am using &nbsp; to force the space to be displayed):
'<b>blah&nbsp;<i>&nbsp;<strike>&nbsp;foo' which can be treated as: '<b>blah&nbsp;<i><strike>foo'
as well as: '<p><a href="xx">&nbsp;<em>Using</em></a></p>' which appears to be treated as: '<p><a href="xx"><em>Using</em></a></p>'
If strict is false, then when a tag that breaks flow (TagElement.breaksFlow()) or trailing whitespace is encountered, all whitespace will be ignored until a non-whitespace character is encountered. This appears to give behavior closer to the popular browsers.
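As a rough illustration of how the parser is typically driven, the sketch below subclasses Parser, overrides three of its callbacks, and records the events for a small input. The class name `EventLoggingParser`, the nested `DtdLoader` helper, and the `parseToEvents` method are hypothetical names invented for this example. Loading the bundled HTML 3.2 DTD through the protected static `ParserDelegator.createDTD` is one possible way to obtain a populated DTD and is an assumption of this sketch, not something this page prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

class EventLoggingParser extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    final List<String> events = new ArrayList<>();

    EventLoggingParser(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        events.add("start:" + tag.getElement().getName());
    }

    @Override
    protected void handleEndTag(TagElement tag) {
        events.add("end:" + tag.getElement().getName());
    }

    @Override
    protected void handleText(char[] text) {
        events.add("text:" + new String(text));
    }

    /** Parses the given markup and returns the recorded events in order. */
    static List<String> parseToEvents(String html) {
        EventLoggingParser parser = new EventLoggingParser(DtdLoader.html32());
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.events;
    }

    public static void main(String[] args) {
        // Input mixes \r\n line endings with inline markup.
        parseToEvents("<p>Hello\r\n<b>world</b></p>").forEach(System.out::println);
    }
}
```

The recorded list may also contain start and end events for elements the parser implies on its own, so callers should not assume the list holds only the tags written in the input.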
Modifier and Type | Field | Description |
---|---|---|
protected DTD | dtd | The dtd. |
protected boolean | strict | This flag determines whether or not the Parser will be strict in enforcing SGML compatibility. |
Fields inherited from interface DTDConstants: ANY, CDATA, CONREF, CURRENT, DEFAULT, EMPTY, ENDTAG, ENTITIES, ENTITY, FIXED, GENERAL, ID, IDREF, IDREFS, IMPLIED, MD, MODEL, MS, NAME, NAMES, NMTOKEN, NMTOKENS, NOTATION, NUMBER, NUMBERS, NUTOKEN, NUTOKENS, PARAMETER, PI, PUBLIC, RCDATA, REQUIRED, SDATA, STARTTAG, SYSTEM
Constructor | Description |
---|---|
Parser(DTD dtd) | Creates parser with the specified dtd. |
Modifier and Type | Method | Description |
---|---|---|
protected void | endTag(boolean omitted) | Handle an end tag. |
protected void | error(String err) | Invokes the error handler with the 1st, 2nd and 3rd error message argument "?". |
protected void | error(String err, String arg1) | Invokes the error handler with the 2nd and 3rd error message argument "?". |
protected void | error(String err, String arg1, String arg2) | Invokes the error handler with the 3rd error message argument "?". |
protected void | error(String err, String arg1, String arg2, String arg3) | Invokes the error handler. |
protected void | flushAttributes() | Removes the current attributes. |
protected SimpleAttributeSet | getAttributes() | Returns attributes for the current tag. |
protected int | getCurrentLine() | Returns the line number of the line currently being parsed. |
protected int | getCurrentPos() | Returns the current position. |
protected void | handleComment(char[] text) | Called when an HTML comment is encountered. |
protected void | handleEmptyTag(TagElement tag) | Called when an empty tag is encountered. |
protected void | handleEndTag(TagElement tag) | Called when an end tag is encountered. |
protected void | handleEOFInComment() | Called when the content terminates without closing the HTML comment. |
protected void | handleError(int ln, String msg) | An error has occurred. |
protected void | handleStartTag(TagElement tag) | Called when a start tag is encountered. |
protected void | handleText(char[] text) | Called when PCDATA is encountered. |
protected void | handleTitle(char[] text) | Called when an HTML title tag is encountered. |
protected TagElement | makeTag(Element elem) | Makes a TagElement. |
protected TagElement | makeTag(Element elem, boolean fictional) | Makes a TagElement. |
protected void | markFirstTime(Element elem) | Marks the first time a tag has been seen in a document. |
void | parse(Reader in) | Parse an HTML stream, given a DTD. |
String | parseDTDMarkup() | Parses the Document Type Declaration markup declaration. |
protected boolean | parseMarkupDeclarations(StringBuffer strBuff) | Parse markup declarations. |
protected void | startTag(TagElement tag) | Handle a start tag. |
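The error overloads and handleError listed in the method table can be exercised without parsing anything. The following is a minimal sketch in which `ErrorCollectingParser`, `report`, and `reportOnce` are hypothetical names. It assumes that the shorter error overloads forward to handleError with the unsupplied arguments filled in as "?", as the descriptions in the table suggest. The DTD passed in is an unpopulated placeholder, which is enough here because no input is parsed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;

class ErrorCollectingParser extends Parser {

    final List<String> errors = new ArrayList<>();

    ErrorCollectingParser(DTD dtd) {
        super(dtd);
    }

    /** Collects every reported error together with the line number it was raised on. */
    @Override
    protected void handleError(int ln, String msg) {
        errors.add("line " + ln + ": " + msg);
    }

    /** Hypothetical entry point that reports a problem through the two-argument overload. */
    void report(String type, String detail) {
        error(type, detail);
    }

    /** Creates a parser, reports a single error, and returns what handleError received. */
    static List<String> reportOnce(String type, String detail) {
        try {
            // Placeholder DTD; assumed sufficient because nothing is parsed in this sketch.
            ErrorCollectingParser parser = new ErrorCollectingParser(DTD.getDTD("demo"));
            parser.report(type, detail);
            return parser.errors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        reportOnce("custom.problem", "detail").forEach(System.out::println);
    }
}
```

Routing application-level diagnostics through error keeps them in the same channel as the errors the parser raises itself while reading malformed input.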
protected DTD dtd
protected boolean strict
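Because strict is a protected field, a subclass can set it before parsing. The sketch below is one hedged way to do that: `StrictnessDemoParser`, `DtdLoader`, `isStrict`, and `textRuns` are hypothetical names, and the DTD loading through `ParserDelegator.createDTD` is an assumption of this example. It prints the text runs delivered in each mode for one of the whitespace scenarios described above, without claiming what the difference will be.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

class StrictnessDemoParser extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    private final StringBuilder collected = new StringBuilder();

    StrictnessDemoParser(DTD dtd, boolean strictMode) {
        super(dtd);
        this.strict = strictMode; // protected field inherited from Parser
    }

    boolean isStrict() {
        return strict;
    }

    /** Wraps each delivered text run in brackets so run boundaries are visible. */
    @Override
    protected void handleText(char[] text) {
        collected.append('[').append(text).append(']');
    }

    /** Parses the markup in the requested mode and returns the bracketed text runs. */
    static String textRuns(String html, boolean strictMode) {
        StrictnessDemoParser parser = new StrictnessDemoParser(DtdLoader.html32(), strictMode);
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.collected.toString();
    }

    public static void main(String[] args) {
        String html = "<b>blah <i> <strike> foo";
        System.out.println("lenient: " + textRuns(html, false));
        System.out.println("strict:  " + textRuns(html, true));
    }
}
```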
public Parser(DTD dtd)
Creates parser with the specified dtd.
Parameters:
dtd - the dtd.
protected int getCurrentLine()
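A callback can call getCurrentLine() to attach a location to whatever it records. The sketch below notes the line reported when each element name is first seen as a start tag. `LineTrackingParser`, `DtdLoader`, and `firstLines` are hypothetical names, and the DTD loading through `ParserDelegator.createDTD` is an assumption of this example. The exact line reported for a given tag depends on how far the parser has read when the callback fires, so the values are best treated as approximate.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

class LineTrackingParser extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    /** Element name mapped to the line reported when its first start tag was handled. */
    final Map<String, Integer> firstLine = new LinkedHashMap<>();

    LineTrackingParser(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        firstLine.putIfAbsent(tag.getElement().getName(), getCurrentLine());
    }

    /** Parses the markup and returns the first-seen line per element name. */
    static Map<String, Integer> firstLines(String html) {
        LineTrackingParser parser = new LineTrackingParser(DtdLoader.html32());
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.firstLine;
    }

    public static void main(String[] args) {
        firstLines("<p>one</p>\n<p>two</p>\n<ul><li>x</li></ul>")
                .forEach((name, line) -> System.out.println(name + " first seen near line " + line));
    }
}
```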
protected TagElement makeTag(Element elem, boolean fictional)
Parameters:
elem - the element storing the tag definition
fictional - the value of the flag "fictional" to be set for the tag
Returns:
TagElement
protected TagElement makeTag(Element elem)
Parameters:
elem - the element storing the tag definition
Returns:
TagElement
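Overriding makeTag gives a subclass a single place to observe or customize every TagElement the parser creates. The sketch below records the names of elements created with the fictional flag set. `TagFactoryDemo`, `DtdLoader`, `newTag`, and `impliedIn` are hypothetical names, and the DTD loading through `ParserDelegator.createDTD` is an assumption of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Element;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

class TagFactoryDemo extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    /** Names of elements for which a fictional tag was requested. */
    final List<String> implied = new ArrayList<>();

    TagFactoryDemo(DTD dtd) {
        super(dtd);
    }

    @Override
    protected TagElement makeTag(Element elem, boolean fictional) {
        TagElement tag = super.makeTag(elem, fictional);
        if (fictional) {
            implied.add(elem.getName());
        }
        return tag;
    }

    /** Creates a tag for the named element directly, without parsing any input. */
    static TagElement newTag(String elementName, boolean fictional) {
        TagFactoryDemo parser = new TagFactoryDemo(DtdLoader.html32());
        return parser.makeTag(parser.dtd.getElement(elementName), fictional);
    }

    /** Parses the markup and returns the names recorded by the makeTag override. */
    static List<String> impliedIn(String html) {
        TagFactoryDemo parser = new TagFactoryDemo(DtdLoader.html32());
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.implied;
    }

    public static void main(String[] args) {
        System.out.println("fictional flag: " + newTag("p", true).fictional());
        System.out.println("implied while parsing: " + impliedIn("<p>text"));
    }
}
```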
protected SimpleAttributeSet getAttributes()
Returns:
SimpleAttributeSet containing the attributes
protected void flushAttributes()
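getAttributes() and flushAttributes() are usually used together inside a tag callback: read what is needed for the current tag, then clear the set so it does not carry over to the next one. The sketch below collects href values from anchor tags in that way. `LinkExtractor`, `DtdLoader`, and `hrefsOf` are hypothetical names, and the DTD loading through `ParserDelegator.createDTD` is an assumption of this example. Skipping fictional tags before touching the attributes is a design choice of the sketch, made so that a tag implied by the parser does not consume the attributes of the real tag that triggered it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTML;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

class LinkExtractor extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    final List<String> hrefs = new ArrayList<>();

    LinkExtractor(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        if (tag.fictional()) {
            return; // implied tag: leave the pending attributes for the real tag
        }
        if (tag.getHTMLTag() == HTML.Tag.A) {
            Object href = getAttributes().getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                hrefs.add(href.toString());
            }
        }
        flushAttributes(); // done with this tag's attributes
    }

    @Override
    protected void handleEmptyTag(TagElement tag) {
        if (!tag.fictional()) {
            flushAttributes(); // e.g. <img src=...> must not leak into the next tag
        }
    }

    /** Parses the markup and returns the href values of its anchor tags in order. */
    static List<String> hrefsOf(String html) {
        LinkExtractor parser = new LinkExtractor(DtdLoader.html32());
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.hrefs;
    }

    public static void main(String[] args) {
        hrefsOf("<p><a href=\"https://example.com/\">x</a></p>").forEach(System.out::println);
    }
}
```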
protected void handleText(char[] text)
Parameters:
text - the section text
protected void handleTitle(char[] text)
Parameters:
text - the title text
protected void handleComment(char[] text)
Parameters:
text - the comment being handled
protected void handleEOFInComment()
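The comment callbacks can be combined to gather comments and to notice input that ends inside one. In the sketch below, `CommentCollector`, `DtdLoader`, and `collect` are hypothetical names, and the DTD loading through `ParserDelegator.createDTD` is an assumption of this example. The handleEOFInComment override sets a flag and then delegates to the superclass so that the parser's own handling of the situation is left in place.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

class CommentCollector extends Parser {

    /** Hypothetical helper: a ParserDelegator subclass can call the protected static createDTD. */
    static final class DtdLoader extends ParserDelegator {
        private static DTD cached;

        static synchronized DTD html32() {
            if (cached == null) {
                try {
                    cached = createDTD(DTD.getDTD("html32"), "html32");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return cached;
        }
    }

    final List<String> comments = new ArrayList<>();
    boolean sawUnterminatedComment;

    CommentCollector(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleComment(char[] text) {
        comments.add(new String(text).trim());
    }

    @Override
    protected void handleEOFInComment() {
        sawUnterminatedComment = true;
        super.handleEOFInComment(); // keep the parser's own handling
    }

    /** Parses the markup and returns the collector so callers can inspect both fields. */
    static CommentCollector collect(String html) {
        CommentCollector parser = new CommentCollector(DtdLoader.html32());
        try {
            parser.parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser;
    }

    public static void main(String[] args) {
        CommentCollector result = collect("<!-- note --><p>x</p>");
        System.out.println(result.comments + " unterminated=" + result.sawUnterminatedComment);
    }
}
```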
protected void handleEmptyTag(TagElement tag) throws ChangedCharSetException
Parameters:
tag - the tag being handled
Throws:
ChangedCharSetException - if the document charset was changed
protected void handleStartTag(TagElement tag)
Parameters:
tag - the tag being handled
protected void handleEndTag(TagElement tag)
Parameters:
tag - the tag being handled
protected void handleError(int ln, String msg)
Parameters:
ln - the number of the line containing the error
msg - the error message
protected void error(String err, String arg1, String arg2, String arg3)
Parameters:
err - the error type
arg1 - the 1st error message argument
arg2 - the 2nd error message argument
arg3 - the 3rd error message argument
protected void error(String err, String arg1, String arg2)
Parameters:
err - the error type
arg1 - the 1st error message argument
arg2 - the 2nd error message argument
protected void error(String err, String arg1)
Parameters:
err - the error type
arg1 - the 1st error message argument
protected void error(String err)
Parameters:
err - the error type
protected void startTag(TagElement tag) throws ChangedCharSetException
Parameters:
tag - the tag
Throws:
ChangedCharSetException - if the document charset was changed
protected void endTag(boolean omitted)
Parameters:
omitted - true if the tag is not actually present in the document, but is assumed by the parser
protected void markFirstTime(Element elem)
Parameters:
elem - the element represented by the tag
public String parseDTDMarkup() throws IOException
Throws:
IOException - if an I/O error occurs
protected boolean parseMarkupDeclarations(StringBuffer strBuff) throws IOException
Parameters:
strBuff - the markup declaration
Returns:
true if this is a valid markup declaration; otherwise false
Throws:
IOException - if an I/O error occurs
public void parse(Reader in) throws IOException
Parameters:
in - the reader to read the source from
Throws:
IOException - if an I/O error occurs
protected int getCurrentPos()
© 1993, 2023, Oracle and/or its affiliates. All rights reserved.
Documentation extracted from Debian's OpenJDK Development Kit package.
Licensed under the GNU General Public License, version 2, with the Classpath Exception.
Various third party code in OpenJDK is licensed under different licenses (see Debian package).
Java and OpenJDK are trademarks or registered trademarks of Oracle and/or its affiliates.
https://docs.oracle.com/en/java/javase/21/docs/api/java.desktop/javax/swing/text/html/parser/Parser.html