Thoughts on Refactoring and Design: Five Lines

One (very decent) book on refactoring has the title Five Lines of Code in which five lines is taken as the upper limit on a function (where five means "count semicolons and control statements").

In addition, it has two rules which tightly restrict what those five lines can be:

1) Only one if statement per function, at the top of the function. This statement would be the "one thing" that function does.
2) No if plus else, or switch, controls for data one has control over (i.e. part of one's own source and not from a standard or third-party library). Instead, late bindings using polymorphism are prescribed. A corollary of this is the elimination of enums in favour of substantive classes (except in languages like Java, where enums are substantive classes).

The author does note that the number five might vary a bit by domain.

You can certainly do this in C++ -- the TypeScript examples are easily convertible to C++ -- but it may be worth putting forward some concerns about this.

1) Many data domains are inherently messy. They involve neither relatively homogeneous sets of items (which can be managed by collections) nor items which break down in a neat hierarchy of relationships, allowing composition to be by two or three fields at a time as one builds up from smaller to larger structures. Especially when initializing objects in these domains, but also when processing them, cutting up the logic into small parts will tend to be arbitrary and counter-intuitive.

2) In some cases late bindings may not give any particular benefit. If you're modelling a Bridge game, "suits of cards" is highly stable but makes a difference primarily in the bidding process, and in checking for trumps during the play of cards. Modelling the suits as an enum is likely to be as clear, and as maintainable, as modelling them as separate classes, even if it leads to some execution contexts where switch statements are part of the flow of control.

Equally, one can have categories which are so widespread throughout a domain that what they touch is very heterogeneous. Making them into classes and moving logic in means getting really unrelated grab-bags of functionality. The other two alternatives are (1) letting the categories, as enums, be used to generate local state/strategy objects inside specific classes, which reduces the use of switch to one location per class, in the constructor or a factory, or (2) just letting the switch logic operate inside the class - which makes sense if the reference to the type occurs only once, or maybe twice, for the class. This is especially true if the behaviour is not tightly logically coupled to other uses of the type. (The enum value MONDAY generates both default antiphons for Lauds and for Vespers, but these have no inherent relation - if one site required an adjustment it wouldn't affect the other.)

3) The if-as-first-and-only action works without awkwardness only if your methods have a very tight contract in the Design-by-Contract sense. If you need to check that the value you have been given is not null, and that the value it dereferences is greater than 1, and you are not using exceptions to manage conformity to what is required, then you need two tests before you start doing anything substantial. (With a tight contract you can just define those conditions as "undefined" and put the responsibility for meeting them on the calling context.)

You can, of course, combine all the tests into one function, but if that function is useful only for one calling site, it's debatable whether you've gained in simplicity or lost it.

This is, of course, something which will change considerably once we get contracts, maybe in C++26. We can tighten up the contracts and put the tests into requires guards.

There are tricks with monadic operations which avoid having explicit if/else structures, but despite their use in std::optional and std::expected they are not (yet) very mainstream C++.

4) Use of ranges and views, along with a reasonable use of lambdas, can rack up "lines of code" very quickly if we count semicolons. Consider:

std::ranges::for_each(
m_fields | std::views::filter([&inIncludes](const auto &inVal) {
return std::ranges::any_of(inIncludes, [&inVal](const auto &inField) {
return inField == inVal.first;
});
}),
[&](const auto &inRec) {
auto [type, val] = inRec;
format(inFormatter, type, val);
});

We have five semicolons in that one statement. (We've effectively hidden an if inside the filter view, to boot.) One semicolon could be eliminated by extracting the two values in the pair inRec in the subsequent format call:

format(inFormatter, inRec.first, inRec.second);

but in my view extracting the values using structured bindings is clearer than the other option (which is why it is used). Clarity and simplicity are not necessarily equivalent to reducing the number of statements.

Taking out the lambdas and replacing them with previously-defined functors gives you (noting that all three are closures which require taking references):

std::ranges::for_each(
m_fields | std::views::filter([&inIncludes](const auto &inVal) {
return std::ranges::any_of(inIncludes, Functor1(inVal));
}),
Functor2(inFormatter));

which is reducible, by defining the first lambda as a functor (and thus hiding the first functor above) to

std::ranges::for_each(m_fields | std::views::filter(Functor3(inIncludes)),
Functor2(inFormatter));

If you have defined Functors 1, 2 and 3 externally to your function, then you now have a single statement, by extracting and redefining the logic in the lambdas elsewhere. (One cannot say "encapsulate" because the lambdas already encapsulated their logic.) If they are used elsewhere, that's a net benefit. If they aren't, (1) it's an open question as to whether it's more difficult to understand the logic spread across three functors and a calling function and (2) we have effectively banished the use of lambdas in the interest of simplification of calling sites. I'm not sure I want to do that. If you have defined the three functors locally to your function, then you have just racked up the number of statements again.

Again, consider:

std::ranges::for_each(
boost::tokenizer<boost::char_separator<char>> fields(inVal,
commaSeparator)
| std::views::transform([](const auto &inRec) -> std::string {
std::string modifiedName(
boost::algorithm::trim_copy(inRec));
std::ranges::replace(modifiedName, ' ', '_');
return modifiedName;
}),
[&](auto inVal) {
if (!m_topics.contains(inVal))
m_topics.insert(inVal);
});

In principle, the for_each statement is one statement, though if you count semicolons it is 5. The lambdas are, considered independently, in conformity with the general rules; the combination into one statement should not obscure this. Neither is used elsewhere.

Sometimes, increasing the number of statements arguably leads to an increase, rather than decrease, in comprehensibility and clarity.

Instead of

std::ranges::for_each(
m_fields | std::views::filter([&inIncludes](const auto &inVal) {
return std::ranges::any_of(inIncludes, [&inVal](const auto &inField) {
return inField == inVal.first;
});
}),
[&](const auto &inRec) {
auto [type, val] = inRec;
format(inFormatter, type, val);
});

consider, in contrast:

auto filterOnField = [&inIncludes](const auto &inVal) {
return std::ranges::any_of(inIncludes, [&inVal](const auto &inField) {
return inField == inVal.first;
});
};

auto individualRecordFormatter = [&](const auto &inRec) {
auto [type, val] = inRec;
format(inFormatter, type, val);
};

std::ranges::for_each(
m_fields | std::views::filter(filterOnField), individualRecordFormatter);

By breaking out the lambdas, I've increased the number of statements by two, but it's arguably clearer as a result of the addition of names. The compiler should treat them identically.

5) I have a disagreement with "a function should do one thing" being combined with "one thing is an if test".

Look at this naming question to see why:

Assume I have a method "processMessage()" and a function "messageIsValid()".

Then what should be the name of the function which simply wraps (only)

if (messageIsValid(msg))
processMessage(msg);

?

A genuinely descriptive name would be "testAndProcessMessage".

Which does not indicate one responsibility: that "and" means "two".

There are ways to finesse this, but they involve using a different model. I referenced monadic operations: instead of including the processing action you can return an object which, when called, is a no-op on test failure but an actual processing on success:

validateMessage(msg).process(msg);

which does avoid the naming problem.

On the other hand, that if control structure could be part of two or three actions that could be meaningfully called doInitialProcessing() without being restricted to just that one control structure:

if (messageIsValid(msg))
processMessage(msg);
else
{
returnErrorToSender(msg);
return;
}

forwardToNextHandler(msg);

(Just wrapping the if without propagating the test works only if you're handling errors via exception. If you're using std::expected, you can handle failure with monadic functions. Otherwise, you need to propagate the test out.)

I do, in fact, think that Five Lines of Code is a good book. But I think that it is too focussed on having simple rules (like counting lines) stand in for better but more difficult guidelines ("give a function one responsibility" combined with "the body of a function should be understandable as a whole on its face"). It seems like an oversimplified and overly restrictive model compared to measurements of cyclomatic complexity.

It is certainly miles away from most "professional" code I have seen over the past couple of decades. Most "professional" developers seem to think nothing of adding an extra ten lines of code, slightly adjusted, to five different locations in five different functions which are already over 500 lines long, without doing any encapsulation or refactoring, in an inversion of Martin's Boy Scout Principle.

Search This Blog

C++ Development: The Breviary Project

Thoughts on Refactoring and Design: Five Lines

Comments

Post a Comment

Popular posts from this blog

Boundaries

Overview

Considerations on an Optimization