XML Psalm Parsing
The PsalmParser is the first of three XML parsers used to support the Breviary.
It would have been possible to simply provide the psalm data in C++ rather than in XML. The domain is closed; only 150 psalms exist, at least as far as the Western Rite is concerned. But hardcoding the data means being stuck with typos, which require recompilation to fix, and it also ties the application to the hardcoded translation (in this case, the original Coverdale text as it appears in the English BCP).
The choice to use XML for this (as well as for hymns and most of the propers) leads to a further decision: how to parse the XML?
I've been familiar with xerces for well over a decade. Using xerces is straightforward, gives a choice between DOM and SAX parsers, and can do additional validation. But as I'm doing this project, in part, to upgrade my skills, using a familiar third-party library -- and one with a C interface, to boot -- teaches me nothing.
I decided to write my own parsers. For the psalms and hymns, both simple formats, I used a fairly classic combination of a tokenizer level (hand-rolled, as the formats were extremely simple) and a state-machine semantic level making use of the tokens.
For the propers info (looking ahead) it was clear to me that a fully hand-rolled tokenizer would be a pain; if I were to use that model I would want to use flex as a lexer. At one time I used flex fairly heavily, and the pattern/action structure it uses works very well with XML parsing.
However, the data model for the token actions was relatively complicated. A set of antiphons can be represented in full, with (for example) a VespersAntiphons element containing five (or one) Antiphon elements; or the element can be empty, with a ref attribute set either to "Lauds" (take the antiphon from Lauds of that day) or to a day name, meaning that the antiphons of the normal weekday are to be used. In an object context, it seemed to me that an element should know how to expand its own data so that higher-level elements need not look at the details, and that in that context the element should control its own parse rather than having tokens handed to it from the lexer. So I wound up with something like a recursive descent parser, with every element knowing how to extract its own tags and able to hand the data to be parsed to the elements on the next level down, extracting the information, once parsed, from those elements. (Because no element is nested, even at extended depth, inside itself, there is no actual recursion. There is a gradual descent and reascent through the call tree of the sort that a tokenizer approach, which works at a flat level, avoids.)
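To make that concrete, here is a purely illustrative sketch -- not the propers code itself, and with invented class and member names -- of an element that owns its own parse (assuming the usual <istream>, <string> and <vector> headers):
// Illustrative sketch only, not the propers code: the point is that an
// element owns its parse and its expansion rules, so whatever contains it
// never has to look at details such as a ref attribute.
class Antiphon
{
public:
    void parse(std::istream &in)
    {
        std::getline(in, m_text);   // stand-in for real <antiphon> tag handling
    }
    const std::string &text() const { return m_text; }
private:
    std::string m_text;
};

class VespersAntiphons
{
public:
    void parse(std::istream &in)
    {
        // Read this element's own opening tag and, if it carries a ref
        // attribute, expand it here; otherwise hand the stream to a contained
        // Antiphon (only one shown; the real element would loop until it
        // reached its closing tag).
        Antiphon a;
        a.parse(in);
        m_antiphons.push_back(a);
    }
private:
    std::vector<Antiphon> m_antiphons;
};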
A tokenizer-based approach manages its own stack of parsed tokens, incrementing and decrementing the depth as elements open and close. For something like the psalms or hymns that depth is very shallow - one or two levels - and requires little complexity in its handling.
In the XML, a psalm is a second-level element, under the top-level psalter element. It can either have a tag attribute, in which case it has no sections other than, so to speak, itself, and its immediate children are all verse elements; or it can have no tag attribute, in which case all of its immediate children are expected to be section elements (and the first tag encountered in parsing after the psalm tag is expected to be a section tag). Sections can have only verse children.
There are substantive as well as formal constraints. All psalms are numbered; all verses are numbered. The numbers of the psalms are expected to increase monotonically -- no omitting entire psalms as the Canadian BCP does Psalm 68. The verse numbers are expected to increase monotonically within psalms, and are not reset across sections.
This is why the Psalm class supports separate, incremental calls to addSection() and addVerse(): in the course of a parse the psalm under construction absorbs the tokens consumed while the parser is inside a psalm element, and once the parser has passed over a section or a verse it can forget about it. Substantive errors are signalled by exceptions.
(The alternative would have been to accumulate psalm subcomponents until they had all been extracted and then create a psalm object all at once, leaving it immutable after construction. The accumulator would look a lot like a psalm object without formatting capabilities, and a psalm would look a lot like an accumulator without extension capabilities. I decided that, given that the psalm is very thoroughly hidden inside the API, I could get away with not creating two different specialized classes over a single shared storage model, but it would in principle have been the Right Thing to do. (It's rather like the difference between the mutable C++ string model and the immutable Java string model, which requires a StringBuilder to construct a string piecewise.))
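For reference, a minimal sketch of the shape of that mutable interface, with hypothetical member names and storage (the real class also carries the formatting logic used elsewhere in the API, and reports problems through the logger):
// A sketch, not the actual Psalm class: member names and storage are guesses
// based on how the parser calls it.
class Psalm
{
public:
    Psalm(uint16_t inNumber, const std::string &inTag)
        : m_number(inNumber), m_tag(inTag) {}

    void addSection(const std::string &inTag)
    {
        // Subsequent verses belong to this section; verse numbering
        // nevertheless continues across the section boundary.
        m_sections.push_back(inTag);
    }

    bool addVerse(uint16_t inNumber, const std::string &inText,
                  const Log::ILogger & /* inLogger */)
    {
        // Verse numbers must increase monotonically within the psalm; the
        // real code would presumably report a violation through the logger.
        if (inNumber <= m_lastVerse)
            return false;
        m_lastVerse = inNumber;
        m_verses.emplace_back(inNumber, inText);
        return true;
    }

private:
    uint16_t m_number = 0;
    std::string m_tag;
    uint16_t m_lastVerse = 0;
    std::vector<std::string> m_sections;
    std::vector<std::pair<uint16_t, std::string>> m_verses;
};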
This renders the actual parsing code, if not simple, then at least focussed on the single task of managing the extraction of tokens from a stream of text.
Here's an example of the beginning of the XML representation:
<psalter>
<psalm number="1" tag="Beatus vir, qui non abiit">
<verse number="1">Blessed is the man that hath not walked in the counsel of the ungodly, nor stood in the way of sinners, and hath not sat in the seat of the scornful.</verse>
<verse number="2">But his delight is in the law of the Lord; and in his law will he exercise himself day and night.</verse>
<verse number="3">And he shall be like a
tree planted by the waterside, that will bring forth his
fruit in due season.</verse>
<verse number="4">His leaf also shall not
wither; and look, whatsoever
he doeth, it shall prosper.</verse>
<verse number="5">As for the ungodly, it is
not so with them; but they
are like the chaff, which the
wind scattereth away from
the face of the earth.</verse>
<verse number="6">Therefore the ungodly
shall not be able to stand
in the judgment, neither the
sinners in the congregation
of the righteous. </verse>
<verse number="7">But the Lord knoweth
the way of the righteous; and
the way of the ungodly shall
perish.</verse>
</psalm>
<psalm number="2" tag="Quare fremuerunt gentes?">
<verse number="1">Why do the heathen so
furiously rage together?
and why do the people imagine a vain thing?</verse>
...
A section tag, the only element not shown above, looks like:
<section tag="Dilexi, quoniam">
On analysis, I determined that there were six possible normal states plus an error state:
1) The beginning state, before the psalter tag is encountered;
2) An in-psalter state, which is like the beginning state except that it sits between psalms and expects a psalm element next, rather than having first to enter the psalter element before meeting a psalm element;
3) An inside-tag state (for either a psalm or a section tag), where the next token expected would be the tag;
4) An in-psalm state, which means between verses;
5) An in-verse state, in which substantive text is being processed but an end-of-verse tag has not yet been seen;
6) An in-section state, which is like state 4 except that it expects an end of section rather than an end of psalm.
Some of these states have simple transitions; some require extra data (whether one is inside a section or just a psalm).
The states can be modelled by active parsers; this means that the parsing code for each state has to handle only the legal transitions out of that state. In the event of an unexpected transition it returns an error token.
The parsers are static; one instance of each type exists. On returning a token, the parser also returns a pointer to the parser to be called next. There is a Null parser, which is handed back on end of input or on error and which will always return a pointer to itself.
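The code below refers to these singletons by name (theNullParser, theTagParser, and so on). Exactly how they are defined is not shown here; one plausible arrangement -- and this is only an assumption about layout on my part -- is a set of file-scope statics in the implementation file, declared after the class definitions but before the member-function definitions that take their addresses:
// Assumed layout (a sketch, not quoted from the source): one static instance
// of each concrete parser at file scope, visible to the member-function
// definitions that return pointers to them.
namespace
{
    PsalmParser::NullParser       theNullParser;
    PsalmParser::InPsalterParser  theInPsalterParser;
    PsalmParser::InterPsalmParser theInterPsalmParser;
    PsalmParser::InSectionParser  theInSectionParser;
    PsalmParser::TagParser        theTagParser;
    PsalmParser::VerseParser      theVerseParser;
}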
A token returned by the parsers (which are really tokenizers, sensu stricto) consists of a type code plus associated text. Some types have no text (no action needs to be taken on some transitions, such as end-of-section).
enum class TokenType
{
    PsalmNumber,
    VerseNumber,
    PsalmTag,
    SectionTag,
    PsalmVerse,
    Error,
    End,
    NoToken,
    EndPsalmToken
};

struct Token
{
    std::string value;
    TokenType type;
};
class IParser
{
public:
    virtual ~IParser();
    virtual std::pair<Token, IParser *>
    getNextToken(std::istream &inStream) const = 0;
};
The simplest parser is the NullParser, which is stateless and for which the implementation of getNextToken() is:
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::NullParser::getNextToken(std::istream &inStream) const
{
    return std::make_pair(Token{ ""s, TokenType::End },
                          const_cast<NullParser *>(this));
}
The parser with the most work is the InPsalterParser, which handles the stretch between psalms: it either recognizes the end of the psalter or processes the opening tag of the next psalm element, extracting its number and, if present, its tag.
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::InPsalterParser::getNextToken(std::istream &inStream) const
{
    std::string str;
    std::getline(inStream, str);
    while (str.empty() && inStream.good())
        std::getline(inStream, str);
    if (!inStream.good())
        return std::make_pair(Token{ ""s, TokenType::End }, &theNullParser);
    else if ((str.length() >= 10) && (str.substr(0, 10) == "</psalter>"))
        return std::make_pair(Token{ ""s, TokenType::End }, &theNullParser);
    else if (str.find("<psalm ") != std::string::npos)
    {
        std::string tag;
        // Extract tag and store it, if there is one
        auto offset = str.find("tag=\"");
        if (offset != std::string::npos)
        {
            std::string temp = str.substr(offset + 5);
            auto offset2 = temp.find("\"");
            if (offset2 != std::string::npos)
                tag = temp.substr(0, offset2);
            else
                tag = temp;
        }
        offset = str.find("number=\"");
        if (offset == std::string::npos)
        {
            std::source_location sl = std::source_location::current();
            return std::make_pair(
                Token{ str + "-" + sl.function_name() + ":"
                           + boost::lexical_cast<std::string>(sl.line()),
                       TokenType::Error },
                &theNullParser);
        }
        std::string temp = str.substr(offset + 8);
        auto offset2 = temp.find("\"");
        if (offset2 != std::string::npos)
        {
            std::string num = temp.substr(0, offset2);
            if (tag.empty())
                return std::make_pair(Token{ num, TokenType::PsalmNumber },
                                      &theInterPsalmParser);
            else
            {
                theTagParser.set(tag, false);
                return std::make_pair(Token{ num, TokenType::PsalmNumber },
                                      &theTagParser);
            }
        }
    }
    std::source_location sl = std::source_location::current();
    return std::make_pair(
        Token{ str + "-" + sl.function_name() + ":"
                   + boost::lexical_cast<std::string>(sl.line()),
               TokenType::Error },
        &theNullParser);
}
Note that text is read in as whole lines and not as a continuous stream of (say) whitespace-delimited words.
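Every parser therefore opens with the same skip-blank-lines preamble. A small helper along these lines (not present in the code shown; purely an illustration) captures what each of the getNextToken() implementations below does inline:
// Hypothetical helper, not in the original code: fetch the next non-empty
// line, returning false once the stream is exhausted. Each getNextToken()
// below repeats an equivalent loop inline.
static bool
readNonEmptyLine(std::istream &inStream, std::string &outStr)
{
    std::getline(inStream, outStr);
    while (outStr.empty() && inStream.good())
        std::getline(inStream, outStr);
    return inStream.good();
}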
Aside from the error returns, the InPsalterParser can return the InterPsalmParser (if no tag needs to be passed back as a separate token) or a tag parser, which is essentially a holding place for the next token to be processed.
The TagParser is the simpler of the two:
class TagParser : public IParser
{
public:
    TagParser() : m_isSection(false) {}
    ~TagParser() override;
    std::pair<Token, IParser *>
    getNextToken(std::istream &inStream) const override;
    void
    set(const std::string &inTag, const bool inIsSection)
    {
        m_tag = inTag;
        m_isSection = inIsSection;
    }
private:
    std::string m_tag;
    bool m_isSection;
};
where getNextToken() is defined as
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::TagParser::getNextToken(std::istream &inStream) const
{
    if (m_isSection)
        return std::make_pair(Token{ m_tag, TokenType::SectionTag },
                              &theInSectionParser);
    else
        return std::make_pair(Token{ m_tag, TokenType::PsalmTag },
                              &theInterPsalmParser);
}
That is, it passes back the tag it was handed, along with the appropriate token type and a pointer to the appropriate following parser.
The InterPsalmParser and the InSectionParser share common logic which is factored out into a base class:
class VerseStartHandler
{
public:
    virtual ~VerseStartHandler();
    std::pair<Token, IParser *>
    handleFirstVerseLine(const std::string &inStr) const;
    bool isValidFirstLine(const std::string &inString) const;
protected:
    virtual bool isForSection() const = 0;
};
handleFirstVerseLine() is a standard GoF Template Method, relying on isForSection():
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::VerseStartHandler::handleFirstVerseLine(
    const std::string &inStr) const
{
    std::string s = inStr.substr(15);
    auto offset = s.find("\"");
    if ((offset > 0) && offset != std::string::npos)
    {
        std::string num = s.substr(0, offset);
        while ((s[offset] == '\"') || (s[offset] == '>')
               || std::isspace(s[offset]))
            ++offset;
        s = s.substr(offset);
        if (!s.empty())
        {
            offset = s.find("</verse>");
            if (offset == std::string::npos)
            {
                theVerseParser.set(s, false, isForSection());
                return std::make_pair(Token{ num, TokenType::VerseNumber },
                                      &theVerseParser);
            }
            else
            {
                theVerseParser.set(s.substr(0, offset), true, isForSection());
                return std::make_pair(Token{ num, TokenType::VerseNumber },
                                      &theVerseParser);
            }
        }
        else
        {
            theVerseParser.set(s, false, isForSection());
            return std::make_pair(Token{ num, TokenType::VerseNumber },
                                  &theVerseParser);
        }
    }
    std::source_location sl = std::source_location::current();
    return std::make_pair(
        Token{ inStr + "-" + sl.function_name() + ":"
                   + boost::lexical_cast<std::string>(sl.line()),
               TokenType::Error },
        &theNullParser);
}
The VerseParser, to which this always hands off while passing back the verse number, is as follows:
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::VerseParser::getNextToken(std::istream &inStream) const
{
    if (!m_verseComplete)
    {
        std::string str;
        std::getline(inStream, str);
        while (!str.empty() && inStream.good())
        {
            if (!inStream.good())
                return std::make_pair(Token{ ""s, TokenType::End },
                                      &theNullParser);
            auto offset = str.find("</verse>");
            if (offset == std::string::npos)
                m_text.append(" ").append(str);
            else
            {
                m_text.append(" ").append(str.substr(0, offset));
                break;
            }
            std::getline(inStream, str);
        }
        if (str.empty()) // Gap with no end of verse
        {
            std::source_location sl = std::source_location::current();
            return std::make_pair(
                Token{ str + "-" + sl.function_name() + ":"
                           + boost::lexical_cast<std::string>(sl.line()),
                       TokenType::Error },
                &theNullParser);
        }
    }
    if (m_isForSection)
        return std::make_pair(Token{ m_text, TokenType::PsalmVerse },
                              &theInSectionParser);
    else
        return std::make_pair(Token{ m_text, TokenType::PsalmVerse },
                              &theInterPsalmParser);
}
-- it reads through to the end of the verse and always passes a PsalmVerse token type back.
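The declaration of the VerseParser is not shown above, but its shape can be inferred from the calls made on it; it is presumably something like the following (a reconstruction, not quoted code):
// Inferred sketch of the declaration: set() stores the first chunk of verse
// text along with the completion and section flags; m_text is presumably
// mutable, since getNextToken() is const yet appends continuation lines to it.
class VerseParser : public IParser
{
public:
    ~VerseParser() override;
    std::pair<Token, IParser *>
    getNextToken(std::istream &inStream) const override;
    void
    set(const std::string &inText, const bool inComplete,
        const bool inIsForSection)
    {
        m_text = inText;
        m_verseComplete = inComplete;
        m_isForSection = inIsForSection;
    }
private:
    mutable std::string m_text;
    bool m_verseComplete = false;
    bool m_isForSection = false;
};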
The InterPsalmParser and InSectionParser have, of course, somewhat different logic, each making use of the template method they get via inheritance:
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::InterPsalmParser::getNextToken(std::istream &inStream) const
{
    std::string str;
    std::getline(inStream, str);
    while (str.empty() && inStream.good())
        std::getline(inStream, str);
    if (!inStream.good())
        return std::make_pair(Token{ ""s, TokenType::End }, &theNullParser);
    if (str == "</psalm>")
    {
        return std::make_pair(Token{ ""s, TokenType::EndPsalmToken },
                              &theInPsalterParser);
    }
    else if ((str.length() > 14) && (str.substr(0, 14) == "<section tag=\""))
    {
        auto offset = str.find("tag=\"");
        std::string temp = str.substr(offset + 5);
        auto offset2 = temp.find("\"");
        std::string tag;
        if (offset2 != std::string::npos)
            tag = temp.substr(0, offset2);
        else
            tag = temp;
        theTagParser.set(tag, true);
        return std::make_pair(Token{ ""s, TokenType::NoToken }, &theTagParser);
    }
    else if (isValidFirstLine(str))
    {
        return handleFirstVerseLine(str);
    }
    std::source_location sl = std::source_location::current();
    return std::make_pair(
        Token{ str + "-" + sl.function_name() + ":"
                   + boost::lexical_cast<std::string>(sl.line()),
               TokenType::Error },
        &theNullParser);
}
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::InSectionParser::getNextToken(std::istream &inStream) const
{
    std::string str;
    std::getline(inStream, str);
    while (str.empty() && inStream.good())
        std::getline(inStream, str);
    if (!inStream.good())
        return std::make_pair(Token{ ""s, TokenType::End }, &theNullParser);
    if (str == "</section>")
    {
        return std::make_pair(Token{ ""s, TokenType::NoToken },
                              &theInterPsalmParser);
    }
    else if (isValidFirstLine(str))
    {
        return handleFirstVerseLine(str);
    }
    std::source_location sl = std::source_location::current();
    return std::make_pair(
        Token{ str + "-" + sl.function_name() + ":"
                   + boost::lexical_cast<std::string>(sl.line()),
               TokenType::Error },
        &theNullParser);
}
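For completeness, the declarations these definitions imply would look roughly like this (inferred rather than quoted): each class inherits from both IParser and VerseStartHandler, and its isForSection() override is what steers the shared template method.
// Inferred sketch of the two declarations; isForSection() is the hook that
// tells handleFirstVerseLine() which parser the VerseParser should return to.
class InterPsalmParser : public IParser, public VerseStartHandler
{
public:
    ~InterPsalmParser() override;
    std::pair<Token, IParser *>
    getNextToken(std::istream &inStream) const override;
protected:
    bool isForSection() const override { return false; }
};

class InSectionParser : public IParser, public VerseStartHandler
{
public:
    ~InSectionParser() override;
    std::pair<Token, IParser *>
    getNextToken(std::istream &inStream) const override;
protected:
    bool isForSection() const override { return true; }
};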
Finally, the BeginningParser is a child of the InPsalterParser which hands off to its inherited implementation once it has negotiated the psalter opening:
std::pair<PsalmParser::Token, PsalmParser::IParser *>
PsalmParser::BeginningParser::getNextToken(std::istream &inStream) const
{
    std::string str;
    std::getline(inStream, str);
    while (str.empty() && inStream.good())
        std::getline(inStream, str);
    if (!inStream.good())
        return std::make_pair(Token{ ""s, TokenType::End }, &theNullParser);
    if (str != "<psalter>")
    {
        std::source_location sl = std::source_location::current();
        return std::make_pair(
            Token{ str + "-" + sl.function_name() + ":"
                       + boost::lexical_cast<std::string>(sl.line()),
                   TokenType::Error },
            &theNullParser);
    }
    return InPsalterParser::getNextToken(inStream);
}
This is all moderately messy; I personally dislike writing parsers of any sort, because a larger proportion of the code than usual is given over to error handling, and the errors are usually such that it is hard to find common code to factor out. It's true generally that there are many more ways to fail than to succeed, and that most serious code can be up to 80% handling of errors and edge cases, but that seems to go for parsers doubled, redoubled, and in spades.
The implementation chosen does mean that the top level itself is fairly straightforward, as all the complexities are hidden in the tokenizers:
void
PsalmParser::parse(Psalter &outPsalter, std::istream &inFile,
                   const Log::ILogger &inLogger)
{
    BeginningParser p;
    std::pair<Token, IParser *> token = p.getNextToken(inFile);
    uint16_t currentNum = 0;
    uint16_t currentVerse = 0;
    std::unique_ptr<Psalm> thePsalm;
    while ((token.first.type != TokenType::Error)
           && (token.first.type != TokenType::End))
    {
        switch (token.first.type)
        {
            using enum TokenType;
        case PsalmNumber:
            currentNum = std::stoi(token.first.value);
            break;
        case VerseNumber:
            currentVerse = std::stoi(token.first.value);
            break;
        case PsalmTag:
            if (!currentNum)
            {
                throw std::runtime_error(
                    "Creating psalm tag with no psalm number");
            }
            thePsalm = std::make_unique<Psalm>(currentNum, token.first.value);
            break;
        case SectionTag:
            if (!currentNum)
            {
                throw std::runtime_error(
                    "Creating section tag with no psalm number");
            }
            if (thePsalm.get() == nullptr)
            {
                thePsalm = std::make_unique<Psalm>(currentNum, ""s);
            }
            thePsalm->addSection(token.first.value);
            break;
        case PsalmVerse:
            if (!thePsalm->addVerse(currentVerse, token.first.value, inLogger))
                throw std::runtime_error("Failed in adding verse to psalm:"
                                         + token.first.value);
            break;
        case NoToken: // End of section, no action to be taken
            break;
        case EndPsalmToken:
            outPsalter.addPsalm(thePsalm);
            currentNum = 0;
            currentVerse = 0;
            break;
        default:
            break;
        }
        token = token.second->getNextToken(inFile);
    }
    if (token.first.type == TokenType::Error)
    {
        throw std::runtime_error("Error in parsing psalm: " + token.first.value);
    }
}
We have the inevitable switch statement, but the number of token types is limited, and the error handling at this level is reasonably simple (though we do guard against some unexpected transitions). If this were much more complex, it would make sense to turn it into a classic Command pattern, with a group of functors stored in a map and keyed by the enum values.
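Purely for illustration, that alternative would look something like the following; the handler table would live inside parse() and capture the same locals the switch uses today:
// Illustrative only, not in the actual code: dispatch through a map of
// functors keyed by TokenType instead of a switch.
using Handler = std::function<void(const std::string &)>;
const std::unordered_map<TokenType, Handler> handlers = {
    { TokenType::PsalmNumber,
      [&](const std::string &v) { currentNum = std::stoi(v); } },
    { TokenType::VerseNumber,
      [&](const std::string &v) { currentVerse = std::stoi(v); } },
    // ... one entry per remaining token type ...
};
if (auto it = handlers.find(token.first.type); it != handlers.end())
    it->second(token.first.value);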
The main use of current language features is the std::source_location code, which allows for better diagnostics when some data goes unexpectedly off the rails, plus the deployment of "using enum TokenType", which allows for slightly cleaner code.
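One further tidying the code above does not do, but which would be in keeping with it: because std::source_location::current() used as a defaulted argument captures the caller's location, the repeated error-return boilerplate could be wrapped in a helper without losing the diagnostics. A sketch, not part of the actual code:
// Hypothetical helper, not in the original code: build an Error token that
// records the offending line plus the calling function's name and line number.
static std::pair<PsalmParser::Token, PsalmParser::IParser *>
makeErrorToken(const std::string &inStr,
               const std::source_location sl = std::source_location::current())
{
    return std::make_pair(
        PsalmParser::Token{ inStr + "-" + sl.function_name() + ":"
                                + std::to_string(sl.line()),
                            PsalmParser::TokenType::Error },
        &theNullParser);
}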