Software Engineering II

Inlining Considered Harmful

Feb 28, 2015

Inlining is the root of all evil.

Why do we inline? We inline because function calls are expensive.

This just in: they aren't. (Debunking the "Expensive Procedure Call" Myth, GL Steele, Jr., 1977)

If we expose more code to the compiler, then the compiler can make better optimizing decisions. Wouldn't it be awesome if, instead of providing a builtin memcpy(), the compiler were able to inline a generic definition from the library and then optimize it for the precise calling context?

And if wrappers are literally free, then that means we can have as many layers of abstraction as we care to. It frees us from one of the limits on our expressiveness.

Thus described, I summon thee: C++ Standard Template Library (STL)!

So now nearly our entire C++ library is defined in header files. And thanks to inlining, this is sometimes even faster than the equivalent C code, which would generally incur a function call overhead for each library call.

How much faster? Well, you can contrive an example showing around 10%. But generally, about 0.001-0.1% faster for idealized tests.

And how much more expressive is our code now that we are freed from the tyranny of expensive function calls? Well, you just go look in the <set> header on your system, and you tell me how expressive that looks to you.

And how joyous that we no longer have to implement builtins for memcpy()! Well, actually, we still implement the builtin, because it's even a little bit more clever than we can trust the optimizer to be on a for() loop. But since wrappers are free now, most compilers implement memcpy() as an inline function that calls __builtin_memcpy(). Seriously.
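To make the contrast concrete, here is a minimal sketch of the two approaches: a generic byte loop that the optimizer is free to inline and specialize, and a thin inline wrapper that just forwards to the builtin. The names copy_bytes_loop and copy_bytes_builtin are made up for illustration, and __builtin_memcpy is the GCC/Clang builtin, so this assumes one of those compilers. It is not how any particular libc actually spells its memcpy().

```cpp
#include <cstddef>

// Hypothetical name: the "generic definition" approach. A plain byte loop
// that the optimizer may inline and specialize for each call site.
inline void *copy_bytes_loop(void *dst, const void *src, std::size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = s[i];
    }
    return dst;
}

// Hypothetical name: the "free wrapper" approach. An inline function whose
// whole body is a call to the compiler builtin (GCC/Clang assumed).
inline void *copy_bytes_builtin(void *dst, const void *src, std::size_t n) {
    return __builtin_memcpy(dst, src, n);
}
```

Both functions produce the same bytes; the only difference is who is trusted to pick the clever copy strategy, the optimizer or the builtin.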

So now it takes noticeable wall-clock time to compile any C++ program, because the compiler is parsing the entire STL (and then some) in header files for each and every compilation unit. And free wrappers have convinced the STL designers and implementors to make some really really heinous choices. And our code is sometimes as much as 0.1% faster than C code! Except in fact C++ code is waaay slower than C code in nearly every case.

The thing is, the reason we use a "system level" language like C is that it is sooo much faster than any interpreted language. We're talking 10% to 10000% faster in very common use cases.

0.1% here and there doesn't matter; it doesn't exist. Don't put source code in header files. Don't make wrappers without purpose.

Example

Let me expand on wrappers without a purpose. It is considered pretty slick that in C++ you can code:

for (iterator i = list.begin(); i != list.end(); ++i) {
   ...
}

And after a *lot* of inlining, copy propagation, constant propagation, and dead code elimination, you can wind up with code that is nearly as efficient as:

for (struct list_rec *i = list_head; i; i = i->next) {
   ...
}
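For anyone who wants to compile the two loops side by side, here is a small sketch that fills in the loop bodies with a sum. The value field and the function names sum_stl and sum_c are assumptions made for the example; only the loop shapes come from the snippets above.

```cpp
#include <list>

// Hand-rolled singly linked list node. The "value" field is an assumption.
struct list_rec {
    int value;
    list_rec *next;
};

// Sum via the STL iterator protocol: begin(), end(), operator!=, operator++.
inline long sum_stl(const std::list<int> &items) {
    long total = 0;
    for (std::list<int>::const_iterator i = items.begin(); i != items.end(); ++i) {
        total += *i;
    }
    return total;
}

// Sum by chasing next pointers directly; there are no wrappers to inline away.
inline long sum_c(const list_rec *list_head) {
    long total = 0;
    for (const list_rec *i = list_head; i; i = i->next) {
        total += i->value;
    }
    return total;
}
```

Comparing the unoptimized and optimized assembly for the two functions is a quick way to see how much work the inliner has to do for the first one.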

The funny thing is that the C code is actually no more typing and is easier to read. And don't give me any polymorphic type bullshit, because you know that if you want anything more complicated than a linked list then you're going to need to customize your type to the sorts of visits you intend to do anyway. No STL will lift that burden from your shoulders.

And don't underestimate how much wrapping is involved! Using the STLport library (a popular C++98 STL implementation derived from SGI's STL), the begin() call is:

list<int, allocator<int>>::begin()
_List_iterator<int, _Nonconst_traits<int>>::_List_iterator(_List_node<int> *)
_List_iterator_base::_List_iterator_base(_List_node_base *) [subobject]
_List_iterator_base::_List_iterator_base(_List_node_base *)

The end() check is:

list<int, allocator<int>>::end()
_List_iterator<int, _Nonconst_traits<int>>::_List_iterator(_List_node<int> *)
_List_iterator_base::_List_iterator_base(_List_node_base *) [subobject]
_List_iterator_base::_List_iterator_base(_List_node_base *)
_List_iterator_base::operator !=(const _List_iterator_base&) const

And the increment (traced here as the postfix operator++(int)) is:

_List_iterator<int, _Nonconst_traits<int>>::operator ++(int)
_List_iterator<int, _Nonconst_traits<int>>::_List_iterator(const _List_iterator<int, _Nonconst_traits<int>>&)
_List_iterator_base::_List_iterator_base(_List_node_base *) [subobject]
_List_iterator_base::_List_iterator_base(_List_node_base *)
_List_iterator_base::_M_incr()

And that's with a relatively simple C++98 STL; the C++11 STLs are often much more deeply wrapped.
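To see where those layers come from, here is a heavily simplified sketch of the same shape: an untyped iterator base that only knows about node links, wrapped by a typed iterator. Every name in it (node_base, int_node, iter_base, int_iter) is hypothetical, and it is not STLport's actual source, just an illustration of how one pointer assignment ends up behind several constructors and member calls.

```cpp
// Hypothetical, simplified types; NOT STLport's actual source.
struct node_base {
    node_base *next;
};

struct int_node : node_base {
    int value;
};

// Layer 1: untyped iterator base that only knows about node links.
struct iter_base {
    node_base *node;
    explicit iter_base(node_base *n) : node(n) {}
    void incr() { node = node->next; }
    bool operator!=(const iter_base &other) const { return node != other.node; }
};

// Layer 2: typed iterator that wraps the base and adds dereferencing.
struct int_iter : iter_base {
    explicit int_iter(int_node *n) : iter_base(n) {}
    int &operator*() const { return static_cast<int_node *>(node)->value; }
    // Prefix form: advance in place.
    int_iter &operator++() {
        incr();
        return *this;
    }
    // Postfix form: copy-construct the old position, then advance.
    int_iter operator++(int) {
        int_iter old(*this);
        incr();
        return old;
    }
};
```

Even in this toy version, the postfix increment goes through a copy constructor, a base constructor, and incr() before it touches node->next.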

The crazy thing is the generated code isn't really that bad! It's way worse than C code (though C++11 improves on that a little), but compared to how much source code went into generating it (about 9,000 lines of #include <list>), it is absolutely outstanding. Pretty slick!

The question is: What benefit did we get for this insane cost, a cost we can ignore only so long as we don't care about debugging or compile time?
