Unless stated otherwise, all Perl variables mentioned are within the scope of package dtd.
require "dtd/dtd.pl";
The following routines are defined:
&'DTDread_dtd(FILEHANDLE);
DTDread_dtd parses the SGML DTD specified by FILEHANDLE. Parsing of the DTD stops once the end of the file is reached. Any external entity references will be parsed if an entity-to-filename mapping exists (see DTDread_mapfile).
DTDread_dtd makes the following assumptions when parsing a DTD:
sgmls, or other SGML validator, for such purposes.
$namechars. There is no size limit on name length.
"EN". Others can be added by changing the $pubtl variable.
Once DTDread_dtd is finished, the following associative arrays are filled (remember, all the arrays are within the scope of package dtd):
%ParEntity
%PubParEntity (PUBLIC)
%SysParEntity (SYSTEM)
%ElemCont
%ElemInc
%ElemExc
%ElemTag
%Attribute
To access the data stored in %Attribute, it is best to use DTDget_elem_attr.
%ElemCont, %ElemInc, %ElemExc, %ElemTag, %Attribute arrays.
When trying to locate external parameter entity files, DTDread_dtd uses the environment variable P_SGML_PATH. P_SGML_PATH is a colon-separated string telling DTDread_dtd where to locate external entities. By default, DTDread_dtd will look in the current working directory or the sub-directory called ents.
If DTDread_dtd cannot resolve an external entity reference, it will issue a warning and continue parsing the DTD.
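The lookup described above can be sketched as follows. This is only an illustration, not dtd.pl's own code: the helper name find_entity_file is hypothetical, and searching the P_SGML_PATH directories before the two default locations is an assumption, since the text does not state the order.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of dtd.pl): return the first existing path
# for $file, or nothing if it cannot be found.
# Assumption: P_SGML_PATH directories are tried before the defaults.
sub find_entity_file {
    my ($file) = @_;
    my $env  = defined $ENV{P_SGML_PATH} ? $ENV{P_SGML_PATH} : '';
    my @dirs = grep { length($_) } split(/:/, $env);
    push @dirs, '.', 'ents';          # default locations named in the text
    foreach my $dir (@dirs) {
        my $path = "$dir/$file";
        return $path if -f $path;
    }
    return;                           # caller decides how to report this
}

my $path = find_entity_file('iso-lat1.ent');
if (defined $path) {
    print "found: $path\n";
} else {
    # Mirrors the documented behavior: warn, then carry on.
    warn "Warning: unable to locate iso-lat1.ent\n";
}
```

Returning nothing instead of dying lets the caller issue a warning and keep going, which matches the behavior described for unresolved external entities.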
Current status of DTDread_dtd:
<!DOCTYPE is parsed, but external reference to file is not implemented.
INCLUDE and IGNORE marked sections are processed, with nested marked sections allowed. CDATA and RCDATA marked sections are not recognized and may cause incorrect behavior. However, CDATA and RCDATA marked sections do not normally appear in a DTD.
IGNORE has higher precedence than INCLUDE in case of nested sections.
LINKTYPE, NOTATION, SHORTREF, USEMAP declarations are ignored.
The performance of DTDread_dtd is not the best. DTDread_dtd makes frequent use of Perl's getc function. If SGML did not have such screwy grammar rules, I could have easily avoided getc. I haven't bothered trying to optimize DTDread_dtd's performance. So far it is working, and I do not feel like mucking with it.
DTDread_dtd is meant to process DTDs in separate files. If a document instance is in the file DTDread_dtd is parsing, God only knows what will happen.
&'DTDread_mapfile($filename);
DTDread_mapfile parses the entity map file specified by $filename.
DTDread_mapfile uses the environment variable P_SGML_PATH, as described in section DTDread_dtd, to locate $filename. This way, one can put the map file in the same location as the entity files.
DTDread_mapfile makes the following assumptions when parsing $filename:
$pubtl variable.
SYSTEM entity names).
# DTDread_mapfile will ignore lines beginning with a `#' character.
#####################
# ISO entity files
#
ISO 8879-1986//ENTITIES General Technical//EN iso-tech.ent
ISO 8879-1986//ENTITIES Publishing//EN iso-pub.ent
ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN iso-num.ent
ISO 8879-1986//ENTITIES Greek Letters//EN iso-grk1.ent
ISO 8879-1986//ENTITIES Diacritical Marks//EN iso-dia.ent
ISO 8879-1986//ENTITIES Added Latin 1//EN iso-lat1.ent
ISO 8879-1986//ENTITIES Greek Symbols//EN iso-grk3.ent
ISO 8879-1986//ENTITIES Added Latin 2//EN ISOlat2
ISO 8879-1986//ENTITIES Added Math Symbols: Ordinary//EN ISOamso
#####################
# ArborText entity file
#
-//ArborText//ELEMENTS Math Equation Structures//EN ati-math.elm
#####################
# A sample SYSTEM entities
#
MyGraphics my_graphics.ent
# end of map file
If DTDread_mapfile cannot access $filename, it will issue a warning to that effect.
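For readers who want to see how a file in this shape might be read, here is a minimal sketch. It is not DTDread_mapfile's implementation: the sub name parse_map_lines is hypothetical, and it assumes (from the sample above) that the filename is the last whitespace-delimited field on each line and that everything before it is the entity identifier.

```perl
use strict;
use warnings;

# Hypothetical parser (not dtd.pl's code): map entity identifiers to
# filenames. Assumes the filename is the last whitespace-delimited field.
sub parse_map_lines {
    my (@lines) = @_;
    my %map;
    foreach my $line (@lines) {
        chomp $line;
        next if $line =~ /^#/;       # comment line, ignored
        next if $line !~ /\S/;       # blank line, ignored
        if ($line =~ /^\s*(\S.*?)\s+(\S+)\s*$/) {
            $map{$1} = $2;           # identifier => filename
        }
    }
    return %map;
}

my %map = parse_map_lines(
    "# ISO entity files\n",
    "ISO 8879-1986//ENTITIES Added Latin 1//EN    iso-lat1.ent\n",
    "MyGraphics    my_graphics.ent\n",
);
print "$_ => $map{$_}\n" for sort keys %map;
```

Taking the last field as the filename keeps public identifiers with embedded spaces intact, at the cost of not supporting filenames that contain whitespace.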
@elements = &'DTDget_elements();
DTDget_elements retrieves a sorted array of all elements defined in the DTD.
This function is only useful after DTDread_dtd has been called.
@top_elements = &'DTDget_top_elements();
DTDget_top_elements retrieves a sorted array of all top-most elements defined in the DTD. Top-most elements are those elements that cannot be contained within another element or can only be contained within itself.
This function is only useful after DTDread_dtd has been called.
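The definition of a top-most element can be illustrated with a small stand-alone sketch. The content table, its element names, and the sub top_elements are all hypothetical; dtd.pl keeps its own data in %ElemCont, whose storage format is not described here, so a plain hash of array references is assumed instead.

```perl
use strict;
use warnings;

# Hypothetical content table: element => list of child elements.
my %content = (
    book    => [qw(chapter)],
    chapter => [qw(para list)],
    list    => [qw(list para)],
    para    => [],
    memo    => [qw(memo para)],
);

# An element is top-most if no element other than itself contains it.
sub top_elements {
    my ($content) = @_;
    my %has_other_parent;
    foreach my $parent (keys %$content) {
        foreach my $child (@{ $content->{$parent} }) {
            $has_other_parent{$child} = 1 if $child ne $parent;
        }
    }
    my @top = sort(grep { !$has_other_parent{$_} } keys %$content);
    return @top;
}

print join(' ', top_elements(\%content)), "\n";   # prints "book memo"
```

Here memo qualifies because its only parent is itself, while list does not because chapter also contains it.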
%attribute = &'DTDget_elem_attr($elem);
DTDget_elem_attr returns an associative array containing the attributes of $elem. The keys of the array are the attribute names, and the array values are $; separated strings of the possible values for the attributes. Example of extracting an attribute's values:
@values = split(/$;/, $attribute{'alignment'});
The first value of the array produced by splitting on $; is the default value for the attribute (which may be an SGML reserved word), and the other array values are all possible values for the attribute.
$; is assumed to be the default value assigned by Perl: \034. If $; is changed, unpredictable results may occur.
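The layout of these $; separated strings can be tried out without loading dtd.pl. In the sketch below, the %attribute hash is a hand-built stand-in for what DTDget_elem_attr returns, and the attribute name and values are made up for illustration.

```perl
use strict;
use warnings;

# $; is Perl's subscript separator, "\034" unless someone has changed it.
my $sep = $;;

# Stand-in for the hash returned by DTDget_elem_attr (hypothetical data):
# default value first, followed by all possible values.
my %attribute = (
    'alignment' => join($sep, 'left', 'left', 'center', 'right'),
);

my ($default, @allowed) = split(/\Q$sep\E/, $attribute{'alignment'});
print "default: $default\n";      # prints "default: left"
print "allowed: @allowed\n";      # prints "allowed: left center right"
```

Copying $; into a lexical and quoting it with \Q...\E keeps the split pattern literal even if the separator were ever a regex metacharacter.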
@parent_elements = &'DTDget_parents($elem);
DTDget_parents returns an array of all elements that may be a parent of $elem.
This function is only useful after DTDread_dtd has been called.
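As a rough illustration of what "parent" means here, the following sketch looks up parents in a hand-written content table. The table, element names, and the sub parents_of are hypothetical, and the sketch only considers declared content; whether DTDget_parents also accounts for inclusions and exclusions is not stated in this document.

```perl
use strict;
use warnings;

# Hypothetical content table: element => list of child elements.
my %content = (
    book    => [qw(chapter)],
    chapter => [qw(para list)],
    list    => [qw(list para)],
    para    => [],
);

# Return the sorted list of elements whose content names $elem.
sub parents_of {
    my ($elem, $content) = @_;
    my @parents = sort(grep {
        my $kids = $content->{$_};
        scalar(grep { $_ eq $elem } @$kids);
    } keys %$content);
    return @parents;
}

print "para: ", join(' ', parents_of('para', \%content)), "\n";
print "list: ", join(' ', parents_of('list', \%content)), "\n";
```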
&'DTDis_attr_keyword($word);
DTDis_attr_keyword returns 1 if $word is an attribute content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:
CDATA
ENTITY
ENTITIES
ID
IDREF
IDREFS
NAME
NAMES
NMTOKEN
NMTOKENS
NOTATION
NUMBER
NUMBERS
NUTOKEN
NUTOKENS
&'DTDis_elem_keyword($word);
DTDis_elem_keyword returns 1 if $word is an element content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:
#PCDATA
CDATA
EMPTY
RCDATA
&'DTDprint_tree($elem, $depth, FILEHANDLE);
DTDprint_tree prints the content hierarchy of a single element, $elem, to a maximum depth of $depth to the file specified by FILEHANDLE. If FILEHANDLE is not specified then output goes to STDOUT. A depth of 5 is used if $depth is not specified. The root of the tree has a depth of 1.
The routine cuts at elements that exist at higher (or equal) levels or if $depth has been reached. The string "..." is appended to an element if it has been cut off due to pre-existence at a higher (or equal) level.
Cutting the tree at repeat elements is necessary to avoid a combinatorial explosion with recursive element definitions.
Here's an example of what the output will look like due to pruning of recursive element contents:
htmlplus
|
|_body
| |
| |_address
| | |
| | |_p ...
| |
| |_div1
| | |
| | |_address ...
| | |_div2 ...
| | |_div3 ...
| | |_div4 ...
| | |_div5 ...
| | |_div6 ...
If you see an element with "...", just search through the output until you find the element without the "...".
In order to recognize cousins, a breadth-first search is needed, or a full traversal of the hierarchy before outputting. The above technique currently is sufficient to avoid combinatorial explosions. Plus, it allows the printing of the tree while traversing the element data; there is no need to create a Perl tree data structure before printing (saves time, memory, and debugging).
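The pruning idea can be sketched in a few lines of stand-alone Perl. This is one scheme consistent with the description above, not DTDprint_tree's actual code: the content table and the sub tree_lines are hypothetical, the "|" spacer lines of the real output are left out, and "higher (or equal) level" is interpreted as "an ancestor, a sibling of an ancestor, or a sibling of the current element".

```perl
use strict;
use warnings;

# Hypothetical content table: element => child elements (assumed layout).
my %content = (
    doc     => [qw(body)],
    body    => [qw(address div1 div2 p)],
    address => [qw(p)],
    div1    => [qw(address div2 p)],
    div2    => [qw(p)],
    p       => [],
);

# Return the lines below $elem, pruned at $max_depth and at elements that
# are already open at a higher (or equal) level.
sub tree_lines {
    my ($elem, $max_depth, $depth, $open) = @_;
    $max_depth = 5 unless defined $max_depth;   # default stated in the text
    $depth     = 1 unless defined $depth;       # the root has depth 1
    $open      = { $elem => 1 } unless defined $open;

    my @lines;
    return @lines if $depth >= $max_depth;      # depth limit reached

    my @children = @{ $content{$elem} || [] };
    my @new      = grep { !$open->{$_} } @children;
    my %is_new   = map { $_ => 1 } @new;
    $open->{$_} = 1 for @new;                   # visible to all descendants

    my $prefix = ('| ' x ($depth - 1)) . '|_';
    foreach my $child (@children) {
        if ($is_new{$child}) {
            push @lines, $prefix . $child;
            push @lines, tree_lines($child, $max_depth, $depth + 1, $open);
        } else {
            push @lines, "$prefix$child ...";   # cut: open at a higher or equal level
        }
    }
    delete $open->{$_} for @new;                # cousins are not remembered
    return @lines;
}

print "$_\n" for ('doc', tree_lines('doc', 5));
```

Because the set of open elements is updated during the depth-first walk and cleaned up on the way back, the lines can be emitted as they are found, with no separate tree structure built first.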
The element without the "..." may be the only place of reference to see the content hierarchy of that element. However, the element may occur in multiple contents and have different ancestral inclusion and exclusion elements applied to it.
Have I lost you? Maybe an example may help:
openbook
|
|_d1
| | (I): idx needbegin needend newline
| |
| |_abbrev
| | | (Ia): idx needbegin needend newline
| | | (X): needbegin needend
| | |
| | |_#PCDATA
| | |_acro
| | | | (Ia): idx needbegin needend newline
| | | | (Xa): needbegin needend
| | | |
| | | |_#PCDATA
| | | |_sub ...
| | | |_super ...
| | |
Ignoring the lines starting with ()'s, one gets the content hierarchy of an element as defined by the DTD without concern of where it may occur in the overall structure. The ()'s lines give additional information regarding the element with respect to its existence within a specific context. For example, when an acro element occurs within openbook/d1/abbrev, along with its normal content, it can contain idx and newline elements due to inclusions from ancestors. However, it cannot contain needbegin, needend regardless of its defined content since an ancestor(s) excludes them.
needbegin, needend are excluded from acro.
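(I): inclusion elements defined by the element itself
(Ia): inclusion elements due to ancestors
(X): exclusion elements defined by the element itself
(Xa): exclusion elements due to ancestors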
&'DTDreset();
DTDreset clears all data associated with the DTD read via DTDread_dtd. This routine is useful if multiple DTDs need to be processed.