General Hours XML Parsing

I noted earlier, speaking about the XML parsing for the office information:

For the propers info (looking ahead) it was clear to me that a fully hand-rolled tokenizer would be a pain; if I were to use that model I would want to use flex as a lexer. At one time I used flex fairly heavily, and the pattern/action structure it uses works very well with XML parsing.
However, the data model for the token actions was relatively complicated. A set of antiphons can be represented in full, with (for example) VespersAntiphons having five (or one) Antiphon element as contained elements; or the element could be empty, with a ref attribute either being set to "Lauds" (take the antiphon from Lauds of that day) or a day name, meaning that the antiphons of the normal weekday were to be used. In an object context, it seemed to me that having an element knowing how to expand data so that higher level elements wouldn't need to look at details, and that in that context the element should control its own parse rather than having tokens handed to it from the lexer. So I wound up with something like a recursive descent parser, with every element knowing how to extract its own tags and able to hand the data to be parsed to the elements on the next level down, extracting the information, once parsed, from those elements. (Because no elements are nested, even at extended depth, inside themselves, there is no actual recursion. There is a gradual descent and the reascent through the call tree of the sort that a tokenizer approach, which works at a flat level, avoids.)

Here's what a full XML decription of a complete set of office information looks like:

<collect use="Anglican">Almighty God, give us grace that we may cast away the works of darkness, and put upon us the armour of light, now in the time of this mortal life, in which thy Son. Jesus Christ, came to visit us in great humility ; that in the last day, when He shall come again in His glorious Majesty to judge both tbe quick and dead, we may rise to the life immortal, through Him Who liveth and reigneth with Thee and the Holy Ghost, now and ever. Amen.</collect>

<collect use="Roman">Raise up, we pray Thee, Lord, Thy power, and come : that from the dangers which hang over us by reason of our sins, we may be shielded by Thy protection, and delivered by Thy salvation. Who livest.</collect>

</collects>

<chapter src="Isa. ii">it shall come to pass in the last days, that the mountain of the Lord's house shall be established in the top of the mountains, and shall be exalted above the hills; and all nations shall flow unto it.</chapter>

<versicle>In His days Judah shall be saved, and Israel shall dwell safely.</versicle>

<response>Behold, the days come, saith the Lord, that I will raise unto David a righteous branch, and a King shall reign and prosper, and shall execute judgment and justice in the earth * And this is his Name whereby He shall be called : The Lord our Righteousness</response>

</ChapterResponses>

<hymns>

</hymns>

<versicle>Drop down, ye heavens, from above.</versicle>

<response>And let the skies pour down righteousness : let the earth open, and let them bring forth a Saviour.</response>

</HymnResponses>

<MagnificatAntiphon>Behold, the Name of the Lord cometh from far : and his glory filleth the whole earth.</MagnificatAntiphon>

</Vespers>

<Lauds>

<versicle>Send, O Lord, the Lamb, the ruler of the land.</versicle>

<response>From the rock of the wilderness, unto the mount of the daughter of Sion.</response>

</LaudsResponsory>

<antiphon number="1">In that day the mountains shall drop down new wine : and the hills shall flow with milk and honey, alleluia.</antiphon>

<antiphon number="2">Rejoice greatly, O daughter of Sion : shout, O daughter of Jerusalem, alleluia.</antiphon>

<antiphon number="3">Behold, the Lord shall come, and all His Saints with Him : and there shall be in that day a great light. Alleluia,</antiphon>

<antiphon number="4">Ho, every one that thirsteth, come ye to the waters : seek ye the Lord, while He may be found, alleluia.</antiphon>

<antiphon number="5">Behold, a great Prophet Cometh : and He shall renew Jerusalem, alleluia.</antiphon>

</LaudsAntiphons>

<chapter src="Rom. xiii">Now it is high time to awake out of sleep : for now is our salvation nearer than when we believed.</chapter>

<hymns>

</hymns>

<versicle>A voice crying in the wilderness.</versicle>

<response>Prepare ye the way of the Lord : make straight a highway for our God.</response>

</ChapterResponses>

<BenedictusAntiphon>The Holy Ghost shall come down upon thee, Mary, fear not: thou shalt coceive in thy womb the Son of God, alleluia.</BenedictusAntiphon>

</Lauds>

<Terce>

<versicle>Shew the light of Thy countenance, and we shall be whole.</versicle>

<response>Come and save us, * O Lord God of Hosts.</response>

<versicle>The heathen shall fear thy Name, O Lord.</versicle>

<response>And all the kings of the earth thy majesty.</response>

</ChapterResponses>

</Terce>

<Sext>

<chapter src="Rom. xiii">The night is far spent, the day is at hand: let us therefore cast off the works of darkness, and let us put on the aiTaour of light.</chapter>

<versicle>And grant us Thy salvation.</versicle>

<response>O Lord, shew * thy mercy upon us.</response>

<versicle>Remember me, O Lord, according to the favour that thou bearest unto thy people.</versicle>

<response>O visit me with thy salvation.</response>

</ChapterResponses>

</Sext>

<None>

<chapter src="Rom. xiii">Let us walk honestly as in the day ; not in rioting and drunkenness, not in chambering and wantonness, not in strife and envying. But put ye on the Lord Jesus Christ.</chapter>

<versicle>And His glory shall be seen upon thee.</versicle>

<response>The Lord shall arise upon thee * O Jerusalem</response>

<versicle>Turn us again, O Lord God of Hosts.</versicle>

<response>Shew the light of Thy countenance, and we shall be whole.</response>

</ChapterResponses>

</None>

<versicle>For it is time that Thou have mercy upon her, yea, the time is come.</versicle>

<response>Thou shalt arise, O Lord * And have mercy upon Sion.</response>

</ChapterResponses>

<hymns>

</hymns>

<versicle>Drop down, ye heavens, from above.</versicle>

<response>And let the skies pour down righteousness : let the earth open, and let them bring forth a Saviour.</response>

</HymnResponses>

<MagnificatAntiphon>Fear not, Mary: thou hast found favour with God : behold, thou shalt conceive, and bring forth a Son, alleluia.</MagnificatAntiphon>

</Vespers>

</day>

There are a lot of tags, but the nesting isn't very deep. A schema, if I wrote one, would show that for a day element marked as "complete" almost everything has to be present, but that there are a range of partial elements, ranging from almost-complete ones down to ones which represent saint's days which have a more general common:

<collect>O God, who didst endue thy blessed Saint Vincent with apostolic virtue, to the intent that he should preach the gospel to the poor, and stablish the honour of the priesthood of thy Church: grant, we beseech thee, that we may so hold in reverence his works of righteousness, that we may learn to follow the pattern of his godly conversation, Through.</collect>

</collects>

</day>

Or ones which provide specific Benedictus and Magnificat antiphons, sometimes with a collect:

<collect>Almighty and merciful God, of Whose only gift it cometh that Thy faithful people do unto Thee true and laudable service; grant, we beseech Thee, that we may so faithfully serve Thee in this life, that we fail not finally to attain Thy heavenly promises; through the merits of Jesus Christ our Lord. Amen.</collect>

<Lauds>

<BenedictusAntiphon>A certain man went down from Jerusalem to Jericho, and fell among thieves : which stripped him of his raiment, and wounded him, and departed, leaving him half dead.</BenedictusAntiphon>

</Lauds>

<MagnificatAntiphon>Which now of these three, thinkest thou, was neighbour unto him that fell among the thieves? : And he said. He that shewed mercy on him. Go, and do thou likewise, alleluia.</MagnificatAntiphon>

</Vespers>

</day>

The type attribute on the Day element determines what is to be expected generally.

At any point in a parse there are only a few possible types of values to follow. But for a general lex-based parser the explicit state transition diagram would be somewhat complex.

So I decided to use the other obvious C++ model; element and tag classes which know how to parser themselves and report errors. It means a reasonable number of specific elements, but the logic falls into manageable chunks and using element nesting makes the object model mirror the XML model, which assists in maintainability.

The first thing to do was to provide some basic common logic. This made explicit use of one of the real performance advantages modern C++ provides: you can operate cheaply with std::string_view objects representing the current element and the remaining text to be parsed; this much minimizes the costs associated with this sort of logic.

The other thing that is really useful in XML parsing is the ability to do efficient and clean error reporting. My compiler provides (partial) support for std::expected, from C++23. (Partial because the class generally is supported but not the monodic functions which simplify handling the return values. Well, we can't have everything.)

Elements

There is also a general abstract base class for an element. It requires a known first tag name, but mainly manages the length of the text the element as a whole occupies. This allows it to be queried after a parse to see what part of the head of the remaining text should be dropped.

class HoursElementBase

{

public:

std::string::size_type getLength() const { return m_length; }

const std::string& getEndTag() const { return m_endTag; }

virtual ~HoursElementBase();

protected:

HoursElementBase(std::string_view inText, const std::string& inName):

m_length(GetStartTagLength(inText, inName)),

m_endTag(GetEndTag(inText, inName))

{ }

std::string::size_type incrementLength(std::string::size_type inVal)

{

m_length += inVal;

return inVal;

}

virtual const std::string &getStartTagName() const = 0;

private:

static auto GetStartTagLength(std::string_view inText,

const std::string &inName)

-> decltype(inText.find(inText));

static std::string GetEndTag(std::string_view inText,

const std::string &inName);

std::string::size_type m_length;

std::string m_endTag;

};

The two values set are of general use in all elements derived from the class:

auto HoursElementBase::GetStartTagLength(std::string_view inText,

const std::string &inName)

-> decltype(inText.find(inText))

{

if (!inText.starts_with("<" + inName))

throw OfficeParseException(inName + " element with bad beginning", inText);

auto index = inText.find('>');

if (index == std::string_view::npos)

{

throw OfficeParseException(inName + " element with no end character",

inText);

}

return index + 1;

}

std::string HoursElementBase::GetEndTag(std::string_view inText,

const std::string &inName)

{

auto index = inText.find('>');

if (inText[index - 1] == '/')

return {};

return "</" + inName + ">";

}

There are a couple of intermediate element types. TextElementBase represents an element which has a text content model.

class TextElementBase : public HoursElementBase

{

public:

TextElementBase(std::string_view inText, const std::string& inName):

HoursElementBase(inText, inName)

{ }

virtual ~TextElementBase();

const std::string &getText() const { return m_text; }

protected:

std::string m_text;

void setTextFromBody(std::string_view inText);

auto checkEndTag(std::string_view inText) const -> decltype(inText.length());

};

It encapsulates logic used in every such element:

void TextElementBase::setTextFromBody(std::string_view inText)

{

auto index = inText.find('<');

if (index == std::string::npos)

{

throw OfficeParseException(getStartTagName() + " body with no end");

}

m_text = std::string(inText.substr(0, index));

}

auto TextElementBase::checkEndTag(std::string_view inText) const

-> decltype(inText.length())

{

if (!inText.starts_with(getEndTag()))

{

throw OfficeParseException(

getStartTagName() + " body with unexpected end tag: "

+ std::string(inText.substr(0, getEndTag().length() + 1))

+ " expected *" + getEndTag() + "*");

}

return getEndTag().length();

}

The other general type of element is one which has element-only content. (There are no elements with mixed content).

class MultiElementElement : public HoursElementBase

{

public:

MultiElementElement(std::string_view inText, const std::string& inName):

HoursElementBase(inText, inName)

{ }

~MultiElementElement() override;

protected:

std::string_view::size_type incrementOverWhitespace(std::string_view inStr);

std::string_view::size_type compareActualWithExpectedIndex(

std::string_view inStr, const std::string_view::size_type inIndex1,

const std::string_view::size_type inIndex2) const;

auto getNextTagIndex(std::string_view inStr) const

-> decltype(inStr.find('<'));

};

The defined functions allow for eating up the space between tags, and validating index values into the text

std::string_view::size_type

MultiElementElement::incrementOverWhitespace(std::string_view inStr)

{

std::string_view::size_type index = 0;

while (std::isspace(inStr[index]))

{

++index;

incrementLength(1);

}

return index;

}

std::string_view::size_type

MultiElementElement::compareActualWithExpectedIndex(

std::string_view inStr, const std::string_view::size_type inIndex1,

const std::string_view::size_type inIndex2) const

{

if (inIndex1 != inIndex2)

throw OfficeParseException(getStartTagName() + " element illegal data: "

+ std::string(inStr.substr(0, inIndex2 + 10)));

return inIndex1;

}

auto

MultiElementElement::getNextTagIndex(std::string_view inStr) const

-> decltype(inStr.find('<'))

{

auto nextTagIndex = inStr.find('<');

if (nextTagIndex == std::string::npos)

throw OfficeParseException(getStartTagName()

+ " with no next internal tag or closing tag: "

+ std::string(inStr.substr(0, 200)));

return nextTagIndex;

}

Note that all the parsing in elements occurs inside their constructor. This is why although Tag objects return std::expected to mark an error, element objects throw an exception.

Performance

Recursive descent parsers generally perform worse than tokenizing lexer-based parsers, because the overhead is considerably greater. If we were, say, parsing entire files of all the XML definitions for the breviary, the approach taken here might be considerably too slow. However, the source code searches the data for a string with a day element matching an expected name, and then converts only the one or two elements needed. As this is a display application for interactive use, the overhead is unnoticeable.

Similarly, exceptions get thrown only if the day element being parsed is critical to the user's request, because no other elements are parsed. If the entire file were being parsed, additional logic would be wanted to allow parsing to continue after a badly-formed element was encountered. So we are not using exceptions for flow control: they will be handled by displaying the error and then terminating the program.

Search This Blog

C++ Development: The Breviary Project