Considerations on an Optimization

An application – actually, a successive pair of applications – is meant to process (in the end) three hundred or so entries, converting a simple markup into complex LaTeX markup. It also enhances the data with information drawn from a flat-file text database (i.e. literally a set of .txt files in a single directory).

This application was originally written in Rust, which is not my favourite language, but one worth exercising to a degree. The Rust application covered a limited subdomain and generally avoided unsafe code, although in some cases it had to use cell mechanisms, which trade compile-time borrow checking for runtime checks.

The initial iteration had the text database as one large file, with some logic allowing for extended seeking through chapters. After extending my dataset to about twenty items, it became apparent that this was going to result in an unacceptable runtime: the lookups were taking longer than the later TeX processing. In theory that would be acceptable, since the finished application would not need to be run often, but the user experience was not ideal, especially for the use case of adding data to the source files, which in some cases means iteratively tweaking the source and re-running the application. So I converted it to a multi-file database, broken into smaller blocks, to reduce the seek times on the data. This dropped the processing time to a second or so, which was acceptable (the LaTeX processing was at that point the major limit on speed).
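To illustrate the shape of that change, here is a minimal sketch of a block-per-file lookup, assuming a hypothetical layout in which records are keyed by their first letter and stored one per line; the actual database layout and record format differ.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

// Sketch only: instead of seeking through one large file, open the small
// block file that can contain the key and scan just that. The per-letter
// file naming and the "key<TAB>value" line format are assumptions.
std::optional<std::string> find_record(const std::filesystem::path& dbdir,
                                       const std::string& key) {
    if (key.empty()) return std::nullopt;
    std::ifstream in(dbdir / (std::string(1, key.front()) + ".txt"));
    for (std::string line; std::getline(in, line); ) {
        if (line.rfind(key + "\t", 0) == 0)        // line starts with "key<TAB>"
            return line.substr(key.size() + 1);    // rest of the line is the record
    }
    return std::nullopt;   // not found (or block file missing)
}
```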

I then, for other reasons, ported the application to C++ in order to extend it. Partly this initial port was to confirm my suspicion regarding the languages — and yes, the resulting code was shorter and clearer than the Rust code — but I was mainly concerned with extending the domain, and I thought C++ a more flexible way to do this. (I also used the opportunity to implement the C++ code in CWEB as an exercise in more extended documentation.)

While I did the port, I did a number of things mainly driven by an aim to produce patterns which were easier to manage and extend than the original Rust code. This mainly involved converting the iterated tests which Rust’s match construct makes easy into Command pattern implementations using std::unordered_map and (where order was important) std::map. In one case I had to match the beginnings of lines in a parser, and I used a map with searches between a lower bound and an upper bound to limit the number of comparisons needed to find the appropriate command to execute; a sketch of the idea follows. (The parser calls take place in what is essentially the main loop of the application.) I was also able to make more extensive use of span and string_view classes where Rust would have complained about passing ownership around between functions, and particularly between closures.
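This is a minimal sketch of that kind of prefix dispatch, with hypothetical prefixes and handlers standing in for the application’s own; the bounds restrict the scan to keys sharing the line’s first character.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <string_view>

using Handler = std::function<void(std::string_view)>;

// Dispatch a line to the first command whose key is a prefix of it.
// The pair of bounds brackets the keys that start with the same
// character as the line, limiting the comparisons made.
bool dispatch(const std::map<std::string, Handler>& commands,
              std::string_view line) {
    if (line.empty()) return false;
    auto first = commands.lower_bound(std::string(1, line.front()));
    auto last  = commands.lower_bound(std::string(1, line.front() + 1));
    for (auto it = first; it != last; ++it) {
        const std::string& key = it->first;
        if (line.substr(0, key.size()) == key) {   // key is a prefix of the line
            it->second(line);
            return true;
        }
    }
    return false;   // no command claims this line
}

int main() {
    // Hypothetical prefixes, standing in for the real markup commands.
    std::map<std::string, Handler> commands{
        {"@chapter", [](std::string_view l) { std::cout << "chapter: " << l << '\n'; }},
        {"@entry",   [](std::string_view l) { std::cout << "entry: " << l << '\n'; }},
        {"%",        [](std::string_view)   { /* comment line: ignore */ }},
    };
    dispatch(commands, "@entry aardvark");
}
```

With a std::less<> comparator the temporary strings could be avoided via heterogeneous lookup, but the simple form shows the bracketing.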

I expanded the scope of the base test set to represent the additional domain. The test set now includes about 50 items, or about one-sixth of the expected ultimate size. I’m still doing some regular file searching, though, and I contemplated reading the database files into memory as a further optimization, to reduce seek times even more.

But then I decided to check the processing time. The application was now taking about 0.08 to 0.09 seconds to run. By comparison, the pdflatex processing for the largest of four output files is about 1.5 seconds. Even if the project expands to 500 items, ten times the current number, the runtime would still be (just) under a second. Note that this is an interactive user application (though it could certainly be run by a shell script in the background). The application’s runtime is essentially a rounding error compared to the formatting runtime.

I was sure that converting the DB lookup to in-memory would lead to some additional savings, but I could see no way of justifying the work, even just to myself. For practical human purposes, the current runtime is instantaneous, and it will be nearly so on the completion of the data set.

The application doesn’t have large setup costs, so it’s reasonable to consider the processing time per record to be a little less than 2/1000 of a second (roughly 0.085 seconds spread across about 50 records).

Context is important. In an ongoing transaction processing application — or at least in any such application I’ve ever worked with — that sort of processing time would be undesirable, and obvious steps to improve the latency would be in order. (Only five hundred transactions per second? In some transaction contexts (some FX environments, for example) that would be fine. In processing equities it’s nowhere near fast enough.)

If shifting to in-memory lookups would also simplify the code, it would still be in order. However, this is another place where the parsing logic would remain unchanged but the complexity of setup would increase, i.e. a change made purely for the purposes of optimization.

If the faster approach were likely to be of benefit to some other application in the future, it would make sense to put that capability in a library now, calibrate the differences, and deploy it here. I can’t see any future application which is likely to need the faster input mechanism, but it might be handy.

Does it make any difference that the core (but only the core) of the additional functionality has another use? I have a couple of test objects which store a fixed-size array in memory to mock file access by line. If I take one of them and convert the storage to a vector, and store the general class in a library, I have an implementation for the storage, and I can simplify the test classes by removing the repeated structural logic. (This means three library classes: the core implementation; a version designed to be used in tests, differing mainly by its constructor but also providing tracking of calls; and a version designed to be constructed from the contents of a file for the optimization functionality.)
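A minimal sketch of what those three classes might look like; the names and the exact line-access interface are assumptions, not the application’s actual code.

```cpp
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Core class: holds lines in memory and hands them out one at a time,
// mimicking sequential access to a text file.
class LineStore {
public:
    explicit LineStore(std::vector<std::string> lines)
        : lines_(std::move(lines)) {}
    virtual ~LineStore() = default;

    // Next line, or nothing once the store is exhausted.
    virtual std::optional<std::string> next_line() {
        if (pos_ >= lines_.size()) return std::nullopt;
        return lines_[pos_++];
    }

private:
    std::vector<std::string> lines_;
    std::size_t pos_ = 0;
};

// Test variant: same storage, but tracks how many reads were made so a
// test can assert on access patterns.
class TrackingLineStore : public LineStore {
public:
    using LineStore::LineStore;
    std::optional<std::string> next_line() override {
        ++reads_;
        return LineStore::next_line();
    }
    std::size_t reads() const { return reads_; }
private:
    std::size_t reads_ = 0;
};

// Variant constructed from the contents of a file, for the optimization:
// read the file once, then serve every subsequent lookup from memory.
class FileLineStore : public LineStore {
public:
    explicit FileLineStore(const std::string& path)
        : LineStore(read_all(path)) {}
private:
    static std::vector<std::string> read_all(const std::string& path) {
        std::ifstream in(path);                 // error handling omitted
        std::vector<std::string> lines;
        for (std::string line; std::getline(in, line); )
            lines.push_back(line);
        return lines;
    }
};
```

The file-backed variant is the piece that turns disk lookups into in-memory lookups; the other two fall out of the same storage.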

Are there functional differences between the in-memory and on-disk versions? Well, for some uses (though not this one) where it’s important that a file be complete and that, where there is an end marker, it is actually present, the in-memory version can be set up with a postcondition in the constructor, allowing bad inputs to fail fast. There are also differences if a file is being written to concurrently by another application, but neither version is designed to handle that well, and it’s not an expected use case. If we make the in-memory version a template, we can store a set of records which can be generated from single lines as structured objects; a sketch follows. Again, this isn’t the present use case, but it’s a possibility. It’s more likely, though, that such a class would want to be another variant making use of a factory rather than simply relying on an explicit constructor from a string; that use counts as "speculative" for now.
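Since it is only a possibility, here is no more than a sketch of the templated form; Record and its from_line function are hypothetical names.

```cpp
#include <string>
#include <vector>

// Speculative: parse each line into a structured record as it is loaded.
// Record::from_line is a stand-in for whatever parsing the record needs;
// a factory-based variant would take that function as a parameter instead.
template <typename Record>
class RecordStore {
public:
    explicit RecordStore(const std::vector<std::string>& lines) {
        for (const auto& line : lines)
            records_.push_back(Record::from_line(line));
    }
    const Record& operator[](std::size_t i) const { return records_[i]; }
    std::size_t size() const { return records_.size(); }
private:
    std::vector<Record> records_;
};
```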

What about culture or work context? For the purposes of this question it’s important to note that the aggregated changes might take a morning’s work at most. This is, in addition, a "safe" change, mainly involving bolting together things that already work. In a typical work context, that would be a point in favour of making the change; in this case, where my primary driver is personal interest, there’s no chance of learning much that is technically new.

What it will tell me is how much this sort of change is worth. It won’t have as great an effect as it would have had a couple of decades ago, because hardware and system software have become far more likely to keep some of that data in their own caches, assuming there isn’t too much of it.

So I made the change out of pure interest. It probably didn’t amount to more than fifty lines, all told, and at least half of that was setting up the general utility classes which are of permanent benefit.

The result? Well, not earth-shattering. Some additional records had been added to the test suite, pushing the elapsed time up to about 1/10 of a second. Putting in the change drops that back down to about 75/1000 of a second: about 3/4 of the time. That’s decent, and worth keeping, given the minimal work involved; at a guess, this was the single major hot spot, with the bulk of the remaining time probably spent on in-memory copying, map lookups, and file output. (And at the current performance level it’s really not worthwhile breaking out valgrind to determine what else would be achievable.) At the level of user perception, right now, that’s invisible. For a projected set of, say, seven times as many records, it means that what would have been about 7/10 of a second (just perceptible to a human at the keyboard) would now be about 1/2 second. As originally noted, there’s no way this will have a significant effect on this application’s perceived speed; the runtime was already small, especially compared with the subsequent pdflatex runs.

However, that’s not really the takeaway from this survey. The real takeaways are these:

   (1) There was an original, substantively important change related to file seek times. Ignoring that problem would have led to a relatively painful update cycle. It was dealt with not by clever coding but by a fairly straightforward change to the way in which the source data was delivered.

   (2) There was a second, unmeasured but certainly noticeable, set of optimizations which came about partly from the language port and partly from simply using standard patterns that improve both maintainability and speed. That’s really quite important, because it’s a reminder that Knuth’s dictum about premature optimization applies to optimizations done for their own sake. Optimizations which fall out of following good practice are a win all around.

This optimization is closer to being done for its own sake, but it has still produced some library facilities which will be of more general use in the future.
