Special thanks to Marco Alesiani for many corrections and suggestions.
Anche tu campi a spaghetti e pizza? Leggi l’articolo in italiano.
When we say “efficiency”, we often think “time”. The sooner the code does its job, the more it is efficient.
What about memory? Granted, today even the lousiest laptop comes with “a bucket load” of RAM which… is never enough. My PC “wastes” 1.4GB just to idle. I open a browser, 300 more MB are gone.*.
Adding insult to injury, using memory is one of the slowest operations on current systems*.
Moreover, finding the culprit line among the code is not easy. Was it a “new” we wrote? Some allocation hidden inside a library? Are temporary objects to blame?
How to easily find the part of the code that uses most of the memory?
This post collects some personal experiments. You can “thank” the author for any mistake.
Let’s use some memory
Today’s toy-code is nothing special, but it does many an allocation using operator new.
Compile, run and… almost 42MB (measured “on the cheap” with /usr/bin/time -v).
Who is using all that memory?
The right way: memory profiler
The idea should be familiar: the “classic” profiler tells for how long each function executes. The memory profiler instead tells where and when the program uses memory, and how much.
For example, here is some of the information that Massif * returns about our program.
Should you work on Windows: https://blogs.msdn.microsoft.com/vcblog/2015/10/21/memory-profiling-in-visual-c-2015/
We can start with the memory growth (in ASCII art!) over “time” – actually its growth over the number of executed instructions:
MB 38.23^ ::::::::::::# | : # | : # | : # | : # | ::::::::::::: # | : : # | : : # | : : # | : : # | @@@@@@@@@@@@: : # | @ : : # | @ : : # | @ : : # | @ : : # | ::::::::::::@ : : # | : @ : : # | : @ : : # | : @ : : # | : @ : : # 0 +----------------------------------------------------------------------->Mi 0 6.203
Then we can get detailed snapshots (the “A”, “B” and “C” tags are ours):
-------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- ... 9 4,311,691 30,080,056 30,072,844 7,212 0 99.98% (30,072,844B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->99.73% (30,000,000B) 0x4078E8: __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) (new_allocator.h:104) | ->99.73% (30,000,000B) 0x40785A: std::allocator_traits<std::allocator<char> >::allocate(std::allocator<char>&, unsigned long) (alloc_traits.h:491) | ->99.73% (30,000,000B) 0x407800: std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) (stl_vector.h:170) | ->99.73% (30,000,000B) 0x40777B: std::_Vector_base<char, std::allocator<char> >::_M_create_storage(unsigned long) (stl_vector.h:185) | ->99.73% (30,000,000B) 0x4076A7: std::_Vector_base<char, std::allocator<char> >::_Vector_base(unsigned long, std::allocator<char> const&) (stl_vector.h:136) | ->99.73% (30,000,000B) 0x407636: std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) (stl_vector.h:278) | ->99.73% (30,000,000B) 0x4075C5: SomeClass::SomeClass() (SomeClass.cpp:4) | A ====> ->33.24% (10,000,000B) 0x405F91: main (main.cpp:20) | | | B ====> ->33.24% (10,000,000B) 0x405EC1: h() (main.cpp:10) | | ->33.24% (10,000,000B) 0x405EEF: g() (main.cpp:12) | | ->33.24% (10,000,000B) 0x405EFB: f() (main.cpp:13) | | ->33.24% (10,000,000B) 0x405F07: MakeSomeClass() (main.cpp:14) | | ->33.24% (10,000,000B) 0x405F9A: main (main.cpp:21) | | | C ====> ->33.24% (10,000,000B) 0x4063F2: _ZN5boost11make_sharedI9SomeClassIEEENS_6detail15sp_if_not_arrayIT_E4typeEDpOT0_ (make_shared_object.hpp:254) | ->33.24% (10,000,000B) 0x405FA6: main (main.cpp:23) | ->00.24% (72,844B) in 1+ places, all below ms_print's threshold (01.00%)
We quickly see that line 20 of the main uses one third of the memory (A) where we wrote a new. The next 30% of the memory (B) is allocated in h() – Massif recorded all the call stack at the point of allocation. We can trace it down to the call to MakeSomeClass() in the main. Massif also works with shared pointers (C).
We can’t see the allocation at line 24 because it has not yet been executed and “intercepted” by Massif. We may spot it in a later snapshot. The remaining allocations are “small” and summarized in the last line.
A quick glance at the report tells us to go check the constructor of SomeClass. What the heck is it doing with a std::vector that takes 99% of the memory?
This is already a good result, obtained with little effort. Be aware that Massif can do more. It can measure the memory used “behind the scenes” by the system to make the heap work (extra-heap – 7,212 bytes in the example), track the stack…
The do-it-yourself way: override operator new
C++ allows to replace the operator to create objects (new) with a custom one.*
Almost nobody has a good reason to do so, but we do: I could not figure out how to use the profiler intercept heap allocations.
By and large, all we have to do is define a custom new (and its overloads) in any file of a program.
If the memory profiler is an equivalent of the “time” profiler, then you can compare this trick to the classic snippet cout << endTime - startTime;. Not really detailed or accurate, but simple and useful.
A few lines of code can give us something raw, but usable. You should compile with debug symbols. The code that outputs the stack trace can probably work only on Linux.*.
There is nothing portable when you work at low level.
If you are in the Microsoft world: https://msdn.microsoft.com/en-us/library/windows/desktop/bb204633%28v=vs.85%29.aspx.
That means:
.
Let’s add the “tricked out” operator new to our test program. This is an example of the result – can you guess the line of code behind it?
Allocation, size = 40 at 0x18705b0 ./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c] ./overridenew(_Znwm+0xaf) [0x406879] ./overridenew(_ZN9__gnu_cxx13new_allocatorISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS2_ELNS_12_Lock_policyE2EEE8allocateEmPKv+0x4a) [0x405d9e] ./overridenew(_ZNSt16allocator_traitsISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEE8allocateERS6_m+0x28) [0x405bef] ./overridenew(_ZSt18__allocate_guardedISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEESt15__allocated_ptrIT_ERS8_+0x21) [0x4059e2] ./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0x59) [0x4057e1] ./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae] ./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e] ./overridenew(_ZSt15allocate_sharedI9SomeClassSaIS0_EIEESt10shared_ptrIT_ERKT0_DpOT1_+0x37) [0x405534] ./overridenew(_ZSt11make_sharedI9SomeClassJEESt10shared_ptrIT_EDpOT0_+0x3b) [0x405454] ./overridenew(main+0x9c) [0x4052e8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f83fe991830] ./overridenew(_start+0x29) [0x405079] ----------- Allocation, size = 10000000 at 0x7f83fc9c3010 ./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c] ./overridenew(_Znwm+0xaf) [0x406879] ./overridenew(_ZN9__gnu_cxx13new_allocatorIcE8allocateEmPKv+0x3c) [0x406538] ./overridenew(_ZNSt16allocator_traitsISaIcEE8allocateERS0_m+0x28) [0x4064aa] ./overridenew(_ZNSt12_Vector_baseIcSaIcEE11_M_allocateEm+0x2a) [0x406450] ./overridenew(_ZNSt12_Vector_baseIcSaIcEE17_M_create_storageEm+0x23) [0x4063cb] ./overridenew(_ZNSt12_Vector_baseIcSaIcEEC1EmRKS0_+0x3b) [0x4062f7] ./overridenew(_ZNSt6vectorIcSaIcEEC2EmRKS0_+0x2c) [0x406286] ./overridenew(_ZN9SomeClassC1Ev+0x3d) [0x406215] ./overridenew(_ZN9__gnu_cxx13new_allocatorI9SomeClassE9constructIS1_JEEEvPT_DpOT0_+0x36) [0x405e3a] ./overridenew(_ZNSt16allocator_traitsISaI9SomeClassEE9constructIS0_JEEEvRS1_PT_DpOT0_+0x23) [0x405d51] ./overridenew(_ZNSt23_Sp_counted_ptr_inplaceI9SomeClassSaIS0_ELN9__gnu_cxx12_Lock_policyE2EEC2IJEEES1_DpOT_+0x8c) [0x405b4a] ./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0xaf) [0x405837] ./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae] ./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e] ...
…I can’t. Where is “main+0xa8” in my code? Thankfully in the “gnu/Linux world” there are tools to de-mangle names and find the point in the code that corresponds to a given address. We can use them, for example, in a simple script.
As an alternative, we could to everything at run time, using the compiler’s demangling utilities, such as the gcc one. Personally I prefer to keep the code instrumentation as simple as possible and do the “heavy lifting” off-line. My script returns:
Used 40 dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29 operator new(unsigned long) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48 __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<SomeClass, std::allocator<SomeClass>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/include/c++/5/ext/new_allocator.h:105 ... internal calls of the shared pointer... std::shared_ptr<SomeClass> std::allocate_shared<SomeClass, std::allocator<SomeClass>>(std::allocator<SomeClass> const&) /usr/include/c++/5/bits/shared_ptr.h:620 _ZSt11make_sharedI9SomeClassIEESt10shared_ptrIT_EDpOT0_ /usr/include/c++/5/bits/shared_ptr.h:636 main /home/stefano/projects/code/spy-memory-with-new/main.cpp:25 __libc_start_main ??:0 _start ??:? ----------- Used 10000000 dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29 operator new(unsigned long) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48 __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) /usr/include/c++/5/ext/new_allocator.h:105 ...internal calls of vector... std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) /usr/include/c++/5/bits/stl_vector.h:279 SomeClass::SomeClass() /home/stefano/projects/code/spy-memory-with-new/SomeClass.cpp:4 (discriminator 2) ...
The first allocation are the 40 bytes requested by make_shared. 24 for SomeClass (its only member is a vector – sizeof(vector) is 24), the rest should be the control block of the shared pointer. The second allocation are the 10MB in the notorious constructor of SomeClass.
It takes some effort to navigate the stacks, but it is possible to understand that the mistery line was std::shared_ptr
Homework: how many allocations would there be with std::shared_ptr<SomeClass> notSoSmartPointer(new SomeClass());
?*
In a test I found:
24 bytes for SomeClass’s instance
10 MB to fill the vector
24 bytes for the shared pointer.
Looking at the implementation notes, I believe that the difference is in the content of the shared pointer’s control block.
In the end…
Programmers have been fighting against memory since the dawn of time, because it is slow and too small. As for every bottleneck, one can’t trust his instincts. We saw that there are proper tools (memory profilers) to measure the memory usage. We discovered that, in a pinch, there are “home made” tools we can build ourselves with a “stereotypical C++ hack”, the override of operator new.
You can find the “ready-to-compile” code in the ++It GitHub repo.