java.lang.Object
  sunlabs.brazil.util.LexML
public class LexML
This class breaks angle-bracket-separated markup languages like SGML, XML, and HTML into tokens. It understands three types of tokens: comments, tags, and strings (see the COMMENT, TAG, and STRING fields).
This class is intended to parse markup languages, not to validate them. "Malformed" data is interpreted as graciously as possible, in order to extract as much information as possible. For instance: spaces are allowed between the "<" and the tag name, values in tags do not need to be quoted, and unbalanced quotes are accepted.
One type of "malformed" data specifically not handled is a quoted ">" character occurring within the body of a tag. Even if it is quoted, a ">" in the attributes of a tag will be interpreted as the end of the tag. For example, the single tag <img src='foo.jpg' alt='xyz > abc'> will be erroneously broken by this parser into two tokens: <img src='foo.jpg' alt='xyz > and the remaining text abc'>
This class also may not properly parse all well-formed XML tags, such as tags with extended paired delimiters <& and &>, <? and ?>, or <![CDATA[ and ]]>.
Additionally, XML tags that have embedded comments containing the ">" character will not be parsed correctly (for example: <!DOCTYPE foo SYSTEM -- a > b -- foo.dtd>), since the ">" in the comment will be interpreted as the end of the declaration tag, for the same reason mentioned above.
Note: this behavior may be changed on a per-application basis by overriding the findClose method in a subclass.
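The quoted ">" limitation can be illustrated without LexML itself. The sketch below compares a plain search for the first ">" with a quote-aware scan. The class and method names are hypothetical, and the signature of findClose is not given on this page, so this shows only the kind of scanning logic a subclass might use, not the actual override.

```java
final class QuoteAwareClose {
    /**
     * Returns the index of the first '>' at or after 'start' that is not
     * inside a single- or double-quoted run, or -1 if there is none.
     * Hypothetical helper, not part of LexML.
     */
    static int findUnquotedClose(String s, int start) {
        char quote = 0; // 0 means "not currently inside quotes"
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // matching closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String tag = "<img src='foo.jpg' alt='xyz > abc'>";
        int naive = tag.indexOf('>');          // stops at the quoted '>'
        int aware = findUnquotedClose(tag, 0); // skips the quoted '>'
        System.out.println(tag.substring(0, naive + 1));
        System.out.println(tag.substring(0, aware + 1));
    }
}
```

A quote-aware scan has a cost: with unbalanced quotes, which this class otherwise accepts, the scan never leaves the quoted run and finds no closing ">" at all.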
Field Summary | |
---|---|
static int | COMMENT: The value returned by getType for comment tokens. |
static int | STRING: The value returned by getType for string tokens. |
static int | TAG: The value returned by getType for tag tokens. |
Constructor Summary | |
---|---|
LexML(String str) | Create a new ML parser, which can be used to iterate over the tokens in the given string. |
Method Summary | |
---|---|
String | getArgs(): Gets the name/value pairs in the body of the current tag as a string. |
StringMap | getAttributes(): Gets the name/value pairs in the body of the current tag as a table. |
String | getBody(): Gets the string making up the current token, not including the angle brackets or comment delimiters, if appropriate. |
int | getLocation(): Return the current processing location. |
String | getString(): Return the string we are currently processing. |
String | getTag(): Gets the tag name at the beginning of the current tag. |
String | getToken(): Gets the string making up the whole current token, including the brackets or comment delimiters, if appropriate. |
int | getType(): Gets the type of the current token. |
boolean | isSingleton(): A tag is a "singleton" if the closing ">" is preceded by a slash (/). |
boolean | nextToken(): Advances to the next token. |
void | replace(String str): Changes the string that this LexML is parsing. |
String | rest(): Gets the rest of the string that has not yet been parsed. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int COMMENT
The value returned by getType for comment tokens.
public static final int TAG
The value returned by getType for tag tokens.
public static final int STRING
The value returned by getType for string tokens.
Constructor Detail |
---|
public LexML(String str)
Parameters: str - The ML to parse.
Method Detail |
---|
public boolean nextToken()
Returns: true if a token was found, false if there were no more tokens left.
public int getType()
Returns: COMMENT, TAG, or STRING.
public boolean isSingleton()
A tag is a "singleton" if the closing ">" is preceded by a slash (/), as in <br/>.
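public String getToken()
Gets the string making up the whole current token, including the brackets or comment delimiters, if appropriate.
public String getBody()
Gets the string making up the current token, not including the angle brackets or comment delimiters, if appropriate.
public String getString()
Return the string we are currently processing.
public int getLocation()
Return the current processing location.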
public String getTag()
Gets the tag name at the beginning of the current tag. For example, the tag name of <table border=3> is "table". Any surrounding space characters are removed, but the case of the tag is preserved.
For comments, the "tag" is the first word in the comment. This can be used to help parse comments that are structured similarly to regular tags, such as server-side include comments like <!--#include virtual="file.inc">. The tag in this case would be "!--#include".
Returns: null if the current token was a string.
public String getArgs()
Returns: null if the current token was a string.
public StringMap getAttributes()
Any quote marks in the body, either single or double quotes, are left on the values, so that the values can be easily re-emitted and still form a valid body.
For names that have no associated value in the tag, the value is stored as the empty string "". Therefore, the two tags <table border> and <table border=""> cannot be distinguished based on the result of calling getAttributes.
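Because quote marks are left on the values, a caller that wants the bare value has to strip them. The sketch below is one way to do that, assuming the values arrive as plain strings such as "3" or 'foo.jpg' with the quotes still attached. The class and method names are hypothetical, and the example works on ordinary Strings rather than on a StringMap, whose API is not shown on this page.

```java
final class AttributeValues {
    /**
     * Removes one pair of matching single or double quotes around a value.
     * Values without matching surrounding quotes are returned unchanged.
     * Hypothetical helper, not part of LexML.
     */
    static String unquote(String value) {
        if (value == null || value.length() < 2) {
            return value;
        }
        char first = value.charAt(0);
        char last = value.charAt(value.length() - 1);
        if ((first == '"' || first == '\'') && first == last) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"3\""));      // double-quoted value
        System.out.println(unquote("'foo.jpg'"));  // single-quoted value
        System.out.println(unquote("3"));          // unquoted value, unchanged
        System.out.println(unquote("'abc"));       // unbalanced quote, unchanged
    }
}
```

Leaving unbalanced values untouched matches the lenient spirit of the parser: the helper never discards characters it cannot pair up.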
Returns: null if the current token was a string.
public String rest()
Example use: to help the parser in circumstances such as the HTML "<script>" tag, where the script body doesn't obey the rules because it might contain lone "<" or ">" characters, which this parser would interpret as the start or end of funny-looking tags.
See Also: replace(java.lang.String)
public void replace(String str)
Example use: the caller decided to parse part of the body, and now wants this LexML to pick up and parse the rest of it.
Parameters: str - The string that this LexML should now parse. Whatever string this LexML was parsing is forgotten, and it now starts parsing at the beginning of the new string.
See Also: rest()
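The rest() and replace(String) pair is meant for cases like the "<script>" body mentioned above. The sketch below shows only the string handling a caller might do between those two calls: given the unparsed remainder, it separates the script body from the text that should be handed back to the parser. The class and method names are hypothetical, and the calls to rest() and replace() appear only in comments because this example does not depend on the Brazil library.

```java
final class ScriptSkipper {
    /** Case-insensitive indexOf; returns -1 if 'needle' does not occur. */
    static int indexOfIgnoreCase(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Splits 'rest' at the first "</script" (ignoring case).
     * Returns {scriptBody, remainder}. If no closing tag is found,
     * everything is treated as script body and the remainder is empty.
     */
    static String[] splitAtScriptEnd(String rest) {
        int end = indexOfIgnoreCase(rest, "</script");
        if (end < 0) {
            return new String[] { rest, "" };
        }
        return new String[] { rest.substring(0, end), rest.substring(end) };
    }

    public static void main(String[] args) {
        // In real use this string would come from rest(), called right
        // after nextToken() produced the opening <script> tag.
        String rest = "if (a < b && c > d) go();</SCRIPT><p>done</p>";
        String[] parts = splitAtScriptEnd(rest);
        System.out.println(parts[0]); // script body, handled by the caller
        System.out.println(parts[1]); // would be passed to replace(...)
    }
}
```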
Version Kenai-svn-r24, Generated 08/18/09 Copyright (c) 2001-2009, Sun Microsystems. |