Thinking About Documenting Code

Writing about the interaction of objects in an effective way is not necessarily straightforward.

At a general object-design level, one wants to talk about objects as black-box collaborators, and usually in terms of modelling actions.

At a level of talking about programming details, one usually wants to look inside a specific object and discuss the way in which pieces of data (frequently data members, not black-box objects) are wired together algorithmically.

These two types of interest barely intersect.

Let me take a small concrete example to illustrate.

I have a very small utility class called PathHandler.  If we want to talk about it as a black box, we look at its interface:

  PathHandler(const std::string &inEnvVariable);
  void process(std::function<void(const std::string &)> inFunc) const;
  void processUntilMatch(std::function<bool(const std::string &)> inFunc) const;
  bool empty() const { return m_members.empty(); }

The constructor takes the name of an environment variable.  The two process functions allow for processing the component elements of a path once extracted (one processes all of the elements, the other only until some condition has been met).  Finally, there is an empty() accessor which is a convenience (we could get along without it, but awkwardly and sometimes inefficiently).

If we want to talk about how the class interacts, the main focus is on the two function-taking functions: this is essentially a mechanism for (a) avoiding some replicated code (i.e. following DRY) when handling UNIX-style paths taken from the environment, (b) doing so in a thread-safe manner -- construction is handled properly even if the object is a static local variable shared between threads, and after construction the object is immutable -- and (c) allowing the caller to focus on the activity itself by hiding the mechanics of parsing and processing, i.e. encapsulating the associated logic.

If we want to talk about how it is implemented, the two functions are of minor interest, as they have one-line implementations:

void PathHandler::process(
    std::function<void(const std::string &)> inFunc) const
{
  std::ranges::for_each(m_members, inFunc);
}

void PathHandler::processUntilMatch(
    std::function<bool(const std::string &)> inFunc) const
{
  // find_if stops at the first element for which inFunc returns true;
  // the resulting iterator is deliberately discarded.
  std::ranges::find_if(m_members, inFunc);
}

where m_members is a set of strings.

If we want to talk about implementation, though, the interesting details are those hidden in the constructor.

namespace {
boost::char_separator<char> theSeparator(":", "");
}

PathHandler::PathHandler(const std::string &inEnvVariable)
{
  const char *cp = std::getenv(inEnvVariable.c_str());
  if (cp == nullptr)
    return;
  std::string s(cp);
  if (s.find(':') == std::string::npos)
    m_members.insert(s);
  else
    {
      boost::tokenizer<boost::char_separator<char>> tokens(s, theSeparator);
      std::ranges::copy(tokens, std::inserter(m_members, m_members.end()));
    }
}

There are at least three things of minor interest here:

1) Because the getenv() call has some cost (not a large one: the environment array is part of the program's address space, optionally passed in to main) and that call occurs at the outset of the constructor, a maintainer wants to know that such an object should not be declared in a wide scope -- construction should be deferred until the object is genuinely needed.

Consider a (rather poor) alternative implementation:

class DeferredPathHandler
{
public:
  DeferredPathHandler(const std::string &inEnvVariable) : m_var(inEnvVariable) { }

  void process(std::function<void(const std::string &)> inFunc) const;
  void processUntilMatch(std::function<bool(const std::string &)> inFunc) const;
  bool empty() const;

private:
  std::string m_var;
  mutable std::vector<std::string> m_members;
  mutable bool m_initialized = false;

  void init() const;  // const, so the const accessors can call it
};

void DeferredPathHandler::init() const
{
  const char *cp = std::getenv(m_var.c_str());
  if (cp != nullptr)
  {
    std::string s(cp);
    if (s.find(':') == std::string::npos)
      m_members.push_back(s);
    else
      {
        boost::char_separator<char> theSeparator(":", "");
        boost::tokenizer<boost::char_separator<char>> tokens(s, theSeparator);
        std::ranges::copy(tokens, std::back_inserter(m_members));
      }
  }
  m_initialized = true;
}

void DeferredPathHandler::process(
    std::function<void(const std::string &)> inFunc) const
{
  if (!m_initialized)
    init();
  std::ranges::for_each(m_members, inFunc);
}

void DeferredPathHandler::processUntilMatch(
    std::function<bool(const std::string &)> inFunc) const
{
  if (!m_initialized)
    init();
  std::ranges::find_if(m_members, inFunc);
}

bool DeferredPathHandler::empty() const
{
  if (!m_initialized)
    init();
  return m_members.empty();
}

In this other form, we use lazy initialization.  This allows the object to be declared early on while the expensive part comes later.  It is still safe as a local variable; if it is shared in any way with a longer lifetime, there may be threading issues.  The code is uglier, and as written it is not thread-safe (it would need a mutex at the object level to be made safe).  So we can defer the work, but we have added another expensive element to it.  And it saves relatively little -- it defers costs but does not eliminate them.  In short, a poor alternative.

On the other hand, consider this thread-safe set of implementations, in which m_data is a single string:

OtherDeferredPathHandler::OtherDeferredPathHandler(const std::string &inEnvVariable)
{
  const char *cp = std::getenv(inEnvVariable.c_str());
  if (cp == nullptr)
    return;
  m_data = cp;
}

void OtherDeferredPathHandler::process(
    std::function<void(const std::string &)> inFunc) const
{
  if (m_data.empty())
    return;
  if (m_data.find(':') == std::string::npos)
    inFunc(m_data);
  else
    {
      boost::tokenizer<boost::char_separator<char>> tokens(m_data, theSeparator);
      std::ranges::for_each(tokens, inFunc);
    }
}

void OtherDeferredPathHandler::processUntilMatch(
    std::function<bool(const std::string &)> inFunc) const
{
  if (m_data.empty())
    return;
  if (m_data.find(':') == std::string::npos)
    inFunc(m_data);
  else
    {
      boost::tokenizer<boost::char_separator<char>> tokens(m_data, theSeparator);
      std::ranges::find_if(tokens, inFunc);
    }
}

This does the lookup immediately but defers any parsing until queried.  It's low on memory use, and if the data is used once (certainly) or a small number of times (depends on various factors) it's faster overall as well.  We no longer get the advantage of ordered (and de-duplicated) entries in the path, though.  The bigger downside is repeated logic, but that can be factored out into a private function:

bool OtherDeferredPathHandler::processSmallSet(
    std::function<void(const std::string &)> inFunc) const
{
  if (m_data.empty())
    return true;
  if (m_data.find(':') != std::string::npos)
    return false;
  inFunc(m_data);
  return true;
}

The two implementations support different use cases. If the path information is going to be interned early on and then referenced throughout the program, the first implementation is to be preferred: it optimizes the actual accesses to the data and puts most of the expense up front. But if you need the data in a narrow scope, and then have no more need for it, this implementation, which stores nothing in an STL container beyond a single string, is to be preferred.

2) In the original form we do take some slight steps to minimize cost.  We check whether the path has more than one component before applying the tokenizing logic.  And by making theSeparator a file-level static we avoid both the threading costs associated with local static variables and the repeated construction cost of making it a local variable in each function that uses it.

3) We could have used boost::split() into a vector instead of a tokenizer. However, the decision to use a set for storage -- which lets the find_if-based implementation break off searching on lexical grounds where that is appropriate -- makes a two-step transfer (split to a vector, copy to the set) a poor idea.

In other words, the interesting discussion is about tradeoffs in implementation; a discussion about object usage will, instead, focus on the object's intents: providing a simple mechanism for dealing with UNIX-style path environment variables to follow the DRY principle, and to provide an interface that is generic enough that it could be refactored to descend from a pure interface class should alternative implementations be wanted. (And the alternatives aren't necessarily just with regard to deferring actions: you might want to pass a filter into the constructor to exclude some path members which were known to be irrelevant to the immediate context. Or maybe you use the moderately cheap form given above and just initialize the data collection that you want based on filtering during a process() call instead of doing extra specializations.)

So in general documentation, we need to keep an eye on what the audience wants, or maybe what it should want.

In my experience, most developers want a tool whose interface meets their needs, and rest there.  Occasionally they will follow some additional rules of thumb, e.g. using unordered_map rather than map because it's faster.  Well, it's faster for lookups -- once you get beyond a certain size (a red-black tree of one element is almost certainly faster to search than a hash map of one element), and depending on the cost of hashing and the relative cost of comparison in the actual data sets, that threshold may lie well beyond one element.  (Consider long keys which vary markedly only a few characters in.)  I'm suspicious of rules of thumb, though "use a vector whenever possible, unless you can use an array" is a fairly good one.

Simple forms of documentation (man pages being the obvious example) are flexible enough to contain lots of detail, but they're not suited to the sort of discussion useful to maintainers, or to developers interested in internals or models of design.  They're aimed at what the general developer, above, wants.  I would argue that the general developer should both want more and (when documenting) deliver more. (Documenting need not always be permanent documentation; it could also be discussion around a code review, but archived for later reference.)

Likewise, design patterns are primarily shorthand for intermediate-level solutions which tend to apply inside an interface. If as an external user I pass a parameter which I know to be an action command into an interface, the interface hides from me whether it is handled by a command pattern, or a chain of responsibility pattern, or a broadcast/listener pattern, or is handled by a coroutine, or is passed to another thread by way of a queue.  At the level of design the key piece of information is the calling of the interface function in that particular context. And the more we program to interfaces, the more the implementation questions belong to a different domain.

(That PathHandler as a concrete class is very much about UNIX-style environment variables, but if I extract an interface and call it IPathHandler, and have some implementations which are assembled from other sources (disk files, database queries, user input) and one or two which are also (beyond the interface) mutable, then the interaction between caller and callee takes on a different cast.)

The best book I know on design is the first volume of P.J. Plauger's Programming on Purpose, now sadly out of print. That is in part because it doesn't get lost in the bushes of program details, and because he doesn't get hung up on specialized terms like afferent and efferent. And most of the book is about the flow of control, and how choices there shape programs.

(Decoupling - a term covering programming to interfaces, dependency injection, polymorphism, and a lot of template metaprogramming - is not a design issue in that sense.  Decoupling is about allowing more comprehensibility by setting up clearer boundaries. You still have to make choices about your broad structure which are entirely orthogonal to decoupling techniques.)

So we circle back to the question of what to document, and where, and how. Literate programming was all about documenting the comprehensible unit of processing, defined, usually, by a WEB or CWEB macro which would be expanded in place in a longer function. (Literate programming works really well when properly done: one need only read Knuth's TeX book as an example. We can't all be like Knuth, just as, in my old discipline, we can't all be Hugh Kenner.  In fact, almost none of us can approximate Knuth or Kenner as writers, let alone scholars.) Doxygen and Javadoc and their relatives take the opposite tack: they take the concrete class as the unit for documentation, and do so in a static manner. UML allows for calling-diagram models, but most use of UML also tends to be at the static level and, to be honest, it's not easy to follow beyond the key ability to identify static dependencies. (Swimlane diagrams require you to keep the static classes represented at the top of the lanes in your head to be meaningful.)

The problem is that for 90% of your uses of, say, a library function, all you need is the call signature, the return values, and the domain for which it is valid. But for the other 10%, where you're worried about performance, or reentrancy, or edge cases, you need rather more information. And if you have to maintain it, not just use it, knowing why certain choices were made, and what the alternatives were, can be extremely valuable (and is almost always lacking).

More usefully, perhaps, if a developer writes out that sort of documentation, it will usually trigger the sort of reflection which is likely to lead to better code (cleaner, more bug-free, more efficient: all of those). We are back to pairing discussion and coding in a model which parallels, but does not displace, the discipline of code reviews. Literate programming was based on combining the two processes simultaneously, but required a form which separated the artifact itself from the code most developers were comfortable with.

If we drop the requirement of simultaneity, we can make the documentation reflective, triggered by relevant headings. A few selected headings might be:

- Efficiency (Consider use cases, trade-offs)

- Thread-safety (also immutability, and, again, use cases: if an object is initialized during startup but unchanging after that, thread-safety is a minor issue)

- Edge cases (freeform for now, but design by contract would give it structure)

- Alternative implementations

- Enumerated responsibilities: how long is the list?  Can it be shortened?
