Compile C Faster on Linux

Christopher W. Fraser (Microsoft Research)
David R. Hanson (Princeton University)

(Reprinted with permission of Linux Journal. See the 5/96 issue.)

lcc is a small, fast C compiler now available on Linux. A perfectly good C compiler, gcc, comes with Linux. Why would anyone bother installing a second one? Because the two compilers make different tradeoffs, so they suit different stages of the development cycle. gcc has many targets and users, and it includes an ambitious optimizer. lcc is 75% smaller (more when counting source code), compiles more quickly, and helps prevent some porting bugs.

For those who've always wanted to customize or extend their compiler, our recent book, A Retargetable C Compiler: Design and Implementation, tours lcc's source code in detail and thus offers especially thorough documentation. Pointers to lcc's source code, executables, book, and authors appear at the end of this article.

Speed Tradeoffs

lcc is fast. gcc implements a more ambitious global code optimizer, so it emits better code, particularly with full optimization options, but global optimization takes time and space. lcc implements a few low-cost, high-yield optimizations that collaborate to yield respectable code in a hurry.

For example, lcc compiles itself in 36 seconds on a 90 megahertz Pentium running Linux. gcc takes 68 seconds to compile the same program with the default compiler options and 130 seconds with the highest level of optimization. Code quality varied less. gcc's default code for this input took 36 seconds to reprocess this input, just like lcc's code. gcc's best code for this input (that is, with optimization level 3) runs in 30 seconds, or about 20% faster. This is only a single data point, and both compilers evolve constantly, so your mileage may vary. Naturally, one can save time throughout by using lcc during development and optimizing with gcc for the final binary.

Porting Code

Indeed, compiling code with two different compilers helps expose portability bugs. If a program is useful and if the source code is available, sooner or later someone will try to port it to another machine, or compile it with another compiler, or both. With a new machine or compiler, glitches are not uncommon. Which of the following solutions will net you less unwanted email? For you to find and erase these blots while the code is fresh in your mind? Or for the porter to get diagnostics much later and about foreign source code?

lcc follows the ANSI standard faithfully and implements no extensions, including gcc's extensions. Indeed, an option directs lcc to warn about a variety of C constructs that are valid but give undefined results and thus can behave differently on another machine or with a different compiler. Some programmers use lcc mainly for its strict-ANSI option, which helps them keep their code portable.

Like gcc, lcc can be configured as a cross-compiler that runs on one machine and compiles code for another. Cross-compilers can simplify life for programmers with multiple target platforms. lcc takes this notion a step farther than most cross-compilers: we can and typically do link code generators for several machines into each executable.

For example, we maintain code generators for the MIPS, SPARC, and X86 architectures. That is, we both work on and emit code for multiple platforms, so it's handy to be able to emit code for all targets from any machine. We usually fold all three code generators into all of our executables. A run-time option tells lcc which target is needed at present. If you don't maintain code for multiple targets, you're free to use an executable that includes just one code generator, which saves roughly 50KB for each code generator omitted.

A Compact Compiler

lcc is small. lcc's Linux executable with one code generator is 232 KB, and its text segment is 192 KB. Both figures for the corresponding phase of gcc (namely, cc1) exceed a megabyte. lcc's compactness contributes to its speed, especially on modest systems, and a compact program benefits those who wish to modify the compiler. Most developers will use pre-built executables for lcc and never examine or even recompile the source code, but the Linux community particularly prizes the availability of source code, partly because it allows users to customize their programs or adapt them for other purposes. Compact programs are generally easier to adapt than bigger codes.

lcc is 12,000 lines of C source code if one counts just one code generator, for the Linux PC. gcc's root directory - without the target-specific specification files - holds 240,000 lines; surely some of this material is not part of the compiler proper, but the separation is not immediately apparent those who haven't browsed gcc's source recently. The machine-specific module is the part most often changed, because new target machines come along more often than, say, new source languages. For the Linux PC, this part of lcc is 1200 lines, and half of that repeats boilerplate declarations or supports the debugger, so the actual code generator is under 600 lines. The target-specific modules for gcc average about 3000 lines. These comparisons are among the best illustrations of the fact that the two compilers represent different trade-offs and that neither beats the other at everything: gcc can emit better code and offers many options, which requires more source code, where lcc is easier to comprehend but is otherwise less ambitious.

gcc and lcc use retargetable code generators driven in part by formal specifications of the target machine, just as a parser can be driven by a formal grammar of the input language. gcc's code generator is based in part on techniques that one of us (Fraser) originated in the late 1970's. lcc uses a different technique that is somewhat less flexible but simpler.

A compact source code helps people adapt lcc. An early adaptation by one of us (Hanson) injected profiling code that counts executions for a profiler that now comes with lcc. For example, the lines

for (<1965>r = 0; <15720>r < 8; <15720>r++)
   if (<15720>rows[r] && <5508>up[r-c+7] && <3420>down[r+c]) { ...

annotate two lines from the lcc test suite's implementation of a program that finds all ways to place eight queens on a chessboard such that none of them attacks any other. The numbers in angle brackets report how many times the following code executed, which can differ within one line. This profiler would have been a big project without lcc, but it was modest one with lcc. Other adaptations of lcc include interpreters (http://www.cit.gu.edu.au/~sosic/papers/sigplan92.ps.Z), code generators for multiple targets (ftp://ftp.cs.princeton.edu/pub/lcc/contrib/), a C++ compiler, programmable debuggers that can debug across a network (http://www.eecs.harvard.edu/~nr/ldb/), and a retargetable optimizing linker. A group at Stanford University has adapted lcc for use with a global optimizer (http://suif.stanford.edu/suif/suif.html). At least some of these efforts chose lcc over gcc because lcc's compactness made it seem easier to comprehend and change. Many of these projects were begun before the lcc book was done, so common sense predicts even more adaptations now that extensive documentation is available.

Literate Programming

Most developers will use pre-built executables for lcc and never study the source code. The Linux community, however, expects source, and lcc exceeds this expectation with an annotated version of most of its code, in the form of a book. lcc's annotations, like its small size, are designed to help developers modify lcc when they want to.

lcc is written as what Knuth has termed a "literate program," which interleaves the source code with prose explanations. Two programs process this input. One extracts conventional C source code, which can be compiled with any C compiler. The other program uses the prose too and emits the typescript for the lcc book. We generate the book and the compiler from a single source because it's too easy for multiple sources to get out of sync with one another.

A brief fragment of the chapter on the X86 code generator demonstrates literate programming:


Static locals get a generated name to avoid other static locals of the same name:

<X86 defsymbol>=
if (p->scope >= LOCAL && p->sclass == STATIC)
   p->x.name = stringf("L%d", genlabel(1));

Generated symbols already have a unique numeric name. Defsymbol simply prefixes a letter to make a valid assembler identifier:

<X86 defsymbol>+=
else if (p->generated)
   p->x.name = stringf("L%s",p->name);

Each of the two displays above consists of a "fragment label" in angle brackets and a "fragment" that's just plain code. The fragment label names a piece of the C program that's being described, here the version of the routine defsymbol for the X86. The "+=" in the second fragment says that the second fragment is appended to the first.

This example is necessarily tiny, but it shows how literate programming allows one to build up a complex program a bit at a time, explaining it on the way. The lcc distribution includes conventional C code that can be modified as usual, but when some explanation would help, one can easily get it from the annotated code in the book.

Not shown in this sample are page numbers in each fragment that point to adjoining fragments, and miniature indices in the page margin that point to the page that defines each identifier that's being used. Many readers have identified the mini-indices as especially helpful.

Availability

lcc's C source code and Linux executables are available for anonymous ftp at URL ftp://ftp.cs.princeton.edu/pub/lcc/. It's about a megabyte, so it can be downloaded using even, say, a 14.4kb modem in about 10 minutes. The package includes Dennis Ritchie's preprocessor for ANSI C, but lcc is also used with gcc's preprocessor. Like gcc, lcc emits assembler code for the standard Linux assembler, debugger, and C library, so the package need not include any of these items. A sub-directory collects code generators and other companion software contributed by others. The package describes mailing lists for communicating with others working on and with lcc.

Our book about lcc, A Retargetable C Compiler: Design and Implementation, (ISBN 0-8053-1670-1) is available from Addison Wesley at 800-447-2226 and from other sources listed on lcc's home page (http://www.cs.princeton.edu/software/lcc/).

lcc is free for non-commercial use. The lcc book amounts to a single-user license for lcc, so some have arranged commercial use by simply including a copy of the book with their product (and charging for it); the publisher offers substantial discounts. Other arrangements are possible.

Authors

Chris Fraser has been writing compilers since 1974. He earned a Ph.D. in computer science at Yale in 1977 and did computing research at Bell Laboratories in Murray Hill, New Jersey. He is now with Microsoft Research. Web.

Dave Hanson was Professor of Computer Science at Princeton University. His research interests include programming language design and implementation, software engineering, and programming environments. He is now with Microsoft Research. Web.