A codebase rescue - the process of cleaning up, refactoring, and modernizing legacy open-source code - has some unique challenges. This essay is going to be about things to do and things not to do. It's based on extensive experience with codebase rescues including GPSD, NTPsec, VMS Empire, OpenAdventure, and many others.
A typical codebase rescue project (the kind I see most often, anyway) is some kind of Unix network service code written in C dating back to the early 1990s or even 1980s. It's been in use forever, it's been hacked over by many hands, and serious compatibility issues would be likely to arise if you tried meeting its functionality from scratch. Often the program's behavioral repertoire is not fully documented, and there may well be no one person who fully understands either the repertoire or the code. There may be an RFC that the codebase does not quite track.
Legacy games like OpenAdventure or Super Star Trek make an interesting contrast. They're not as complex as network service daemons, but the idiom of the code can be even older and weirder. I've done one rescue from FORTRAN that dated back to 1973! Nevertheless, they teach some of the same methodological lessons as more serious rescues, the things you do to keep the Internet running.
So, you're faced with 200KLOC of archaic C - for reasons I don't understand, these ancient projects seem to cluster around that size. Your job is to bring it up to date so that maintaining it will no longer be a nightmare, and it can be extended/improved.
Define your victory condition
Know what you want to accomplish, and state it clearly when you start your rescue. This will help you attract developers and funding. Also, you'll know when you're done - not a trivial matter on a large project.
Probably the single most frequent motivation for these rescues is that some piece of infrastructure software has developed security issues due to neglect and bit-rot. It may have CVEs, or its crypto might be out of date.
Keep track of what you fix so you can brag about it (in the best understated just-the-facts-ma'am style, of course). This also attracts developers and funding.
Get thee to modern version control
If the rescue codebase has a history under Subversion, you should probably move. If it's under CVS or anything proprietary you should certainly move.
The reason is simple: you want to reduce the friction costs of development, and in particular the entry barrier for new developers.
Practically speaking, this means moving to git. Procedure for that is beyond the scope of this essay, but you can read the http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html[DVCS migration HOWTO].
First, write regression tests
The first thing to do on the code itself, if at all possible, is write regression tests. Programs as old as the typical rescue project rarely if ever have them already.
Very occasionally this isn't possible; NTPsec was like this, for reasons too painful and technical to go into here. But usually your program can either be made to run in a filter mode that deterministically transforms inputs to outputs, or it can be made to capture transaction logs that, with more effort, it can be taught to replay.
(The reproducibility is important enough that you shouldn't be afraid to do things like spoof the C library's clock-time calls if you have to. Do what it takes.)
Good tests will save you immense amounts of pain and work, because you'll know quickly when you broke something while changing the code. A practice that goes with this is: every bug report should become another regression test.
If your program's inputs and outputs are binary (say, TCP/IP packets with non-text content) put in the effort to write lossless conversion to and from some textualized format, and then keep your test loads and check files in that form. This will make it easier to hand-compose tests that probe obscure regions of the programs's behavioral space.
This whole section is good practice for any code you write or maintain. The reasons I'm emphasizing it for rescues are (a) the hacked-over-by-many-hands history has often left them lots of odd little behavioral corner cases which users later came to depend on and you must thus preserve, and (b) writing synthetic test loads is generally the only way you have to verify the odd behaviors you think you see in the code.
If you find writing tests irritating and boring, remind yourself of a truth: each hour of test construction tends to save you many hours of boring and frustrating debugging work down the road. And that's not even counting the moments of outright panic and terror you avoid if you do this right.
The combination of good tests with fast version control is especially liberating. It frees you to move fast, try chancy things, and have fun without worrying that your experiments will break the project.
More truth: sufficiently good tests eventually drive your rate of visible errors low enough to make you look like a coding god with precognition and wizard fingers. This is not a sham or an accident.
Second, clean up all compiler warnings
The first thing you should do to the code itself is clean up all compiler warnings. The reason you should do this is dumb and simple; if you don't, they form a visual thicket in which actual error messages can hide.
We applied this policy to NTPsec after a false start during which we noticed what happened when we didn't.
Another advantage of doing this step relatively early is that cleaning up warnings is a good way to motivate shallow little dives into the code, which are preparation for the deeper ones you'll need to do at later steps.
Motivation like that is important - reading code as an activity in itself is hard and most people can't actually make themselves do it at the volume required to wrap their brains around one of these large, crufty old projects.
Accordingly a loop you will repeat many times during any rescue is: find some small thing you want to modify, read just enough code to do it, make the change, check correctness...then look for another thing to modify. If you repeat often enough, the knowledge you acquire from each point change will accumulate and you will start being able to make larger changes.
Continue this at larger and larger scales until you understand everything you need to. Fixing compiler warnings is the first step on that road.
Third, apply code checkers
Static code-auditing programs like Coverity and cppcheck are your friend. Add productions to your project's build recipe to make applying them easy. Treat their warnings like the compiler warnings they resemble.
You'll do one big fix pass with these at the beginning of your cleanup; this, too, is part of your iterative code-learning process. You should also apply them regularly in later stages for the exact same reason you never want to leave a compiler warning hanging out.
Yes, coming up with the right set of suppressions for bogus warnings is fiddly and boring. Do it in the knowledge that every hour of work you put in will pay off in a lot more fiddly and boring work avoided down the road.
Fourth, enforce C99 compliance
I've been assuming the codebase you're writing is written in C under Unix. Thus is a safe bet because almost no other language and operating-system API has been roughly stable long enough to make codebase rescues both practical and necessary.
(One possible exception is very old FORTRAN; this is really annoying but basically tractable, as there are FORTRAN to C source-code translators that work.)
The first non-trivial challenge you will face is bringing the C up to date.
If you are very unlucky you might encounter pre-ANSI C without function prototypes, though such survivals are rare as I write in 2018 and might be extinct by the time you read this. If you trip over one, fix that first; this is a change that is easy to verify.
Over time, C - and the generally accepted style of C programming - has moved towards stricter typing and more enforcement of type safety. In old code you may find uses of type punning, casts, and unions that would be unacceptable today.
There's a reason these old-fashioned practices are now frowned upon; they are defect attractors. You should eliminate them. More type safety means your compiler and auditing tools can do more to warn you of problems, allowing you to lower your defect rate.
In particular, try hard to eliminate casts. These are a sign that the code is violating type safety, and that is always a potential trouble spot. There are two places you may be unable to - casting pointers returned from malloc/calloc is necessary, and dealing with machine-word length variations in printf(3) and friends is another. Any casts outside these contexts should be viewed with deep, deep suspicion and probably terminated with extreme prejudice.
Apply the C89 void type. Functions that don't return anything should get it as a return type (in older code it was generally int). Untyped pointers such as those returned by malloc(3) should be (void ), not (char ) as in older code.
If the codebase does not already use a 'bool', a step that is more
important than it looks at first sight is identifying semantically
boolean variables and lifting them to use bool and the 'true' and
'false' keywords. Old-school practice was to type these as 'init'
and use FALSE and TRUE macros. Get rid of these and use
Yes, you want the slight improvement in type checking to lower downstream defect rates. The less obvious but possibly more important reason is that this booleanizing step is likely to teach you important things about the code even while you only half notice that it's doing so. This is because semantic bools tend to be concentrated in important parts of the program's control flow.
Old code sometimes embeds a tacit assumption that C ints are 32 bits.
This is difficult to audit for systematically, but be aware of it when you're hunting casts and other defect attractors.
Try compiles with C99 mode set. This may produce warnings that will be useful.
Why C99, rather than C89 or C11? The good-practice advice used to be to assume only C89, but I learned by field observation in the late 2000s on the GPSD project that C99 support had become universal and solid in every C compiler. It may have happened sooner without me noticing; like most C programmers who lived through the pre-standardization years I had a strong habit of conservatism in my assumptions in order to avoid nasty surprises.
On the other hand, post-C99 features are seldom relevant to rescues of old code. There are two minor exceptions:
The more important is anonymous unions; these are useful and safe, but only officially standard in C++. When I write coding guidelines for my C projects I allow these and have never had a port problem as a result.
The other might be the atomic-operation and memory-barrier primitives in C11. If you need them you're really going to need them. Better to assume C11 conformance and accept that you can only use a compiler/hardware combination that supports it than do something ad-hoc; these primitives are notoriously hard to get right and the cost of screwing up can be nasty.
The C cleanup step may overlap some with the next one. That's OK.
Fifth, enforce POSIX compliance
Codebases that need rescue tend to be full of preprocessor conditionals to support Unixes not seen since dinosaurs roamed the Earth. This is often true even of code most of which postdates successful Unix standardization; Unix tradition long discouraged throwing out any portability shims at all, because you never knew what weird platform your code might be deployed to.
But it's the 21st century and time to throw that crap out. The standards people won; take advantage. Write to a conformance baseline of POSIX.1-2008 (until recently I might have said POSIX.1-2001, but toolchain progress has continued getting made).
You collect several gains from this. The most obvious is readability/maintainability. Those thickets of ifdefs and variant library calls that time forgot are hard on the eyes and brain. They proliferate test paths.
It is an empirical fact (related to proliferating test paths) that those grotty variant sections tend to be really powerful defect attractors. As elsewhere in software engineering, ugly correlates with bad. Removing them can reduce your downstream bug load significantly.
Finally, getting rid of port shims enables your next step.
Sixth, shoot autoconf through the head
This step is a bit controversial - autoconf still has its advocates. In unusual cases it might be best to leave autoconf in place; by the end of this section I hope you will know how to recognize those.
This is a complicated topic. A bit of history helps in understanding the mess we're in.
Original 1970s Unix practice for software build recipes was to use plain Makefiles. This still works well for small projects shipping highly portable, self-contained code. But Makefiles scale quite badly, and don't at all address the problem of discovering local environment features like variant system calls or which service libraries are installed. Makefiles are particularly weak in coping with divergent Unix dialects, a problem which began to become acute in the early 1980s with the AT&T/BSD split.
Autoconf originated in the early 1990s as an attack on the problems Makefiles didn't solve. It includes feature discovery for the local environment, which in autoconf-speak is confusingly called "configuration".
When autoconf was written it solved a real problem. But the design included some choices that turned into complexity sinkholes. No modern scripting language was yet widely enough deployed to be considered portable, so autoconf was written in a mix of shell and m4 macros - both weak languages for its purposes, resulting in large autoconf files being extremely difficult to understand and modify.
A particularly unfortunate call was making autoconf a "two-phase" builder that generates Makefiles to be run, rather than a one-phase system that processes a build recipe directly. As more discovery features were added, the Makefile generation process became an increasingly crufty pile of kludgy auxiliary programs and mysteriously interlinked processing passes.
Eventually, advancing Unix and C standardization reduced the complexity of the feature-discovery phase enough that autoconf started to become overkill. In theory, at that point, many projects could have dropped back to makefiles. A few actually did. But autoconf recipes have a lot of inertia; they're so difficult to understand as wholes that the expedient tweak, solving today's problem at the cost of adding yet another glob of complexity, is the rule.
The result is you probably have an autoconf recipe in your codebase that has accreted complexity right along with the hacked-by-many-hands accretions to your C. It is more than likely a Lovecraftian horror by now - vast and incomprehensible, yielding dog-slow builds with weird sporadic failure modes that are nigh-impossible to diagnose.
And, due to standardization, much of that complexity may now be unnecessary! Every port-shim ifdef you removed in the previous step probably made part of your autoconf recipe obsolete.
You may be very lucky. Your autoconf recipe size may be small enough to comprehend, and it will get smaller once you've ripped out the OS- and compiler-feature discovery. If you wind up with a recipe that is in the narrow complexity band that is low enough to wrap your brain around but a bit too high for a plain makefile, you're done.
But it is unlikely this will occur. Usually you should shoot autoconf through the head and replace it with a better build engine.
Exactly what to replace it with is beyond the scope of this essay. Drop back to a Makefile if you can; otherwise there are several good build engines out there, most notably SCons and waf. There is also cmake, which I dislike because it repeats the autoconf two-phase mistake, but some people swear by it.
After this point the rescue steps stop being serial; you can do parts of the rest in parallel with each other. I'll express this by no longer numbering them.
I can't say a lot about this extremely important step, because it's semantically driven and what you need to do varies so much by project. But there are a few generalities.
Your primary goal should be to increase readability and maintainability. If you reduce line count, you are probably heading in the right direction.
Remember, you're not actually refactoring unless you can verify that you're not changing the externally-visible behavior of the code. If you can't do that verification, what you're actually doing is "flailing around in a way likely to break stuff".
So...if you didn't already have a strong test suite before going into refactoring, shame on you: build one now. Throughout the refactoring phase, improve your test coverage. Use tools like coverage analyzers to prove that your tests reach it all.
Make it pretty. This doesn't guarantee that it will be right, but it improves your odds. Once you have that test suite, it greatly improves your odds.
Old-school code tended to be sloppier about information-hiding than we are nowadays after decades of learning about OO languages. Try bundling globals into context structures. See if you can narrow the visibility of those structures.
Despite traditional tools like YACC and LEX, generation of code from declarative descriptions is still a sadly underutilized technique. Any time you can reduce the amount of hand-written C in your codebase, you win.
Move code out of C when possible
C was designed for circumstances in which clock cycles were scarce and resource constraints were so tight that memory management had to be optimized by hand. In the early 1970s, that was everywhere. Now almost nothing but operating-system kernels and micro-controller firmware is so constrained.
What C traded for its efficiency was vulnerability to a large class of insidious errors in storage management. Those still bedevil us even after the original rationale for most uses of C is obsolete. C itself, now, is one of the most serious defect attractors.
We have alternatives. Modern scripting languages, with their automatic memory management, can take over for C in most circumstances where software only has to operate at human speed, as opposed to machine speed. They also bring many advantages in maintainability and ease of program modification. Python is especially strong in this regard.
Accordingly, it's good practice to move any code you can out of C into Python - or any other language with automatic memory management; but Python gets another advantage from nowadays being ubiquitous, often used in always-installed system scripts on modern Unixes.
If your code is a 200KLOC network service daemon, translating it to Python by hand might not be practical even if the result would handle machine-to-machine transaction speeds well (which, alas, it might not). But if your daemon has smaller client programs flying with it, those can often be moved to Python relatively easily.
In NTPsec, for example, we moved ntpq - the most-used monitoring tool - from C to Python. You can't tell this by watching it run; the default appearance is identical (which relates to a point I'll get to momentarily). But we now absolutely know that ntpq will never have a buffer-overrun crash. And it is far easier to debug or extend than the C code was.
(Another thing I can say is that as of early 2018 the time for worrying about Python 2 compatibility is basically past. All the major distros seem to be planning to go to 'python' equalling Python 3 within the next six months.)
Moving large, old C programs to a language with automatic memory management is just beginning to look like it might become practical. The enabling technology is Go, the language designed at Google to be about as close as possible to C and still have garbage collection. C to Go might be practical at large scale where C to Python is not.
I say "might" because as I write in early 2018 nobody has actually done this yet, at least not to well-known open-source code in public. I will be rather surprised if this remains true eighteen months from now.
Here, however, is a major caveat about these code moves: do not try to refactor or redesign while you are translating, that way lies doom and madness. I say this because the complexity of translation and the complexity of redesign on the fly are mutually multiplicative and are likely to rapidly take you to a place where you will be utterly lost.
To avoid this, do the stupidest possible 1:1 code transliteration that the impedance mismatch between your languages will let you get away with first; then go nuts with all the cool features of your target language.
(The main reason I expect C to Go to work better than C to Python at large scales is that the impedance mismatch at that junction is much lower. Go appears to have been designed with this as a goal.)
Apprentices are important
A final point about code rescues: it is good to have some apprentices around, and participating in the work, when you do one of these.
It's good for the apprentices. A codebase rescue is near-ideal circumstances to learn the kinds of good practices I've been describing, precisely because nobody is preoccupied by large questions of original design.
It's good for the future of the codebase, too. The usual reason that large old codebases bit-rot enough to need a rescue is that they've been under-maintained for years. Apprentices who learn good chops on a rescue are natural candidates to be future maintainers.