Special thanks to Marco Alesiani for many corrections and suggestions.
Anche tu campi a spaghetti e pizza? Leggi l’articolo in italiano.
When we say “efficiency”, we often think “time”. The sooner the code does its job, the more it is efficient.
What about memory? Granted, today even the lousiest laptop comes with “a bucket load” of RAM which… is never enough. My PC “wastes” 1.4GB just to idle. I open a browser, 300 more MB are gone.*.
Adding insult to injury, using memory is one of the slowest operations on current systems*.
Moreover, finding the culprit line among the code is not easy. Was it a “new” we wrote? Some allocation hidden inside a library? Are temporary objects to blame?
How to easily find the part of the code that uses most of the memory?
This post collects some personal experiments. You can “thank” the author for any mistake.
Let’s use some memory
Today’s toy-code is nothing special, but it does many an allocation using operator new.
/* Program that allocates some memory when it feels like.
No delete – today’s essay is not about memory leaks.*/
#include <string>
#include <memory>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include "SomeClass.h"
//
void h() {
SomeClass* t = new SomeClass();
}
void g() { h(); }
void f() { g(); }
void MakeSomeClass() { f(); }
//
int main(int argc, char **argv) {
int * number = new int(89);
std::string * test = new std::string("abc");
//
SomeClass * oggetto = new SomeClass();
MakeSomeClass();
//
boost::shared_ptr<SomeClass> smartPointer = boost::make_shared<SomeClass>();
std::shared_ptr<SomeClass> stdSmartPointer = std::make_shared<SomeClass>();
return 0;
}
Compile, run and… almost 42MB (measured “on the cheap” with /usr/bin/time -v).
Who is using all that memory?
The right way: memory profiler
The idea should be familiar: the “classic” profiler tells for how long each function executes. The memory profiler instead tells where and when the program uses memory, and how much.
For example, here is some of the information that Massif * returns about our program.
Should you work on Windows: https://blogs.msdn.microsoft.com/vcblog/2015/10/21/memory-profiling-in-visual-c-2015/
We can start with the memory growth (in ASCII art!) over “time” – actually its growth over the number of executed instructions:
MB 38.23^ ::::::::::::# | : # | : # | : # | : # | ::::::::::::: # | : : # | : : # | : : # | : : # | @@@@@@@@@@@@: : # | @ : : # | @ : : # | @ : : # | @ : : # | ::::::::::::@ : : # | : @ : : # | : @ : : # | : @ : : # | : @ : : # 0 +----------------------------------------------------------------------->Mi 0 6.203
Then we can get detailed snapshots (the “A”, “B” and “C” tags are ours):
-------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- ... 9 4,311,691 30,080,056 30,072,844 7,212 0 99.98% (30,072,844B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->99.73% (30,000,000B) 0x4078E8: __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) (new_allocator.h:104) | ->99.73% (30,000,000B) 0x40785A: std::allocator_traits<std::allocator<char> >::allocate(std::allocator<char>&, unsigned long) (alloc_traits.h:491) | ->99.73% (30,000,000B) 0x407800: std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) (stl_vector.h:170) | ->99.73% (30,000,000B) 0x40777B: std::_Vector_base<char, std::allocator<char> >::_M_create_storage(unsigned long) (stl_vector.h:185) | ->99.73% (30,000,000B) 0x4076A7: std::_Vector_base<char, std::allocator<char> >::_Vector_base(unsigned long, std::allocator<char> const&) (stl_vector.h:136) | ->99.73% (30,000,000B) 0x407636: std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) (stl_vector.h:278) | ->99.73% (30,000,000B) 0x4075C5: SomeClass::SomeClass() (SomeClass.cpp:4) | A ====> ->33.24% (10,000,000B) 0x405F91: main (main.cpp:20) | | | B ====> ->33.24% (10,000,000B) 0x405EC1: h() (main.cpp:10) | | ->33.24% (10,000,000B) 0x405EEF: g() (main.cpp:12) | | ->33.24% (10,000,000B) 0x405EFB: f() (main.cpp:13) | | ->33.24% (10,000,000B) 0x405F07: MakeSomeClass() (main.cpp:14) | | ->33.24% (10,000,000B) 0x405F9A: main (main.cpp:21) | | | C ====> ->33.24% (10,000,000B) 0x4063F2: _ZN5boost11make_sharedI9SomeClassIEEENS_6detail15sp_if_not_arrayIT_E4typeEDpOT0_ (make_shared_object.hpp:254) | ->33.24% (10,000,000B) 0x405FA6: main (main.cpp:23) | ->00.24% (72,844B) in 1+ places, all below ms_print's threshold (01.00%)
We quickly see that line 20 of the main uses one third of the memory (A) where we wrote a new. The next 30% of the memory (B) is allocated in h() – Massif recorded all the call stack at the point of allocation. We can trace it down to the call to MakeSomeClass() in the main. Massif also works with shared pointers (C).
We can’t see the allocation at line 24 because it has not yet been executed and “intercepted” by Massif. We may spot it in a later snapshot. The remaining allocations are “small” and summarized in the last line.
A quick glance at the report tells us to go check the constructor of SomeClass. What the heck is it doing with a std::vector that takes 99% of the memory?
This is already a good result, obtained with little effort. Be aware that Massif can do more. It can measure the memory used “behind the scenes” by the system to make the heap work (extra-heap – 7,212 bytes in the example), track the stack…
The do-it-yourself way: override operator new
C++ allows to replace the operator to create objects (new) with a custom one.*
Almost nobody has a good reason to do so, but we do: I could not figure out how to use the profiler intercept heap allocations.
By and large, all we have to do is define a custom new (and its overloads) in any file of a program.
If the memory profiler is an equivalent of the “time” profiler, then you can compare this trick to the classic snippet cout << endTime - startTime;. Not really detailed or accurate, but simple and useful.
A few lines of code can give us something raw, but usable. You should compile with debug symbols. The code that outputs the stack trace can probably work only on Linux.*.
There is nothing portable when you work at low level.
If you are in the Microsoft world: https://msdn.microsoft.com/en-us/library/windows/desktop/bb204633%28v=vs.85%29.aspx.
That means:
#include <iostream>
//
#include <Windows.h> // Capture stack traces.
#include <Dbghelp.h> // Read debug symbols.
//
void StackTrace() {
/* Capture the stack trace. */
const ULONG doNotSkipAnyFrame = 0;
const ULONG takeTenFrames = 10;
const PULONG doNotHash = nullptr;
PVOID stackTrace[takeTenFrames];
const USHORT framesCaptured = CaptureStackBackTrace(
doNotSkipAnyFrame,
takeTenFrames,
stackTrace,
doNotHash
);
//
/*Prepare the symbol table to convert from addresses to lines of code. */
const HANDLE thisProcess = GetCurrentProcess();
SymInitialize(thisProcess, NULL, TRUE); // Linkare Dbghelp.lib
//
for (ULONG i = 0; i < framesCaptured; i++) {
/*Estrae il nome della funzione. */
const size_t nameStringSize = 256;
SYMBOL_INFO * functionData = (SYMBOL_INFO*)malloc(sizeof(SYMBOL_INFO) + (nameStringSize + 1) * sizeof(char)); // +1 because there is \0
functionData->MaxNameLen = nameStringSize;
functionData->SizeOfStruct = sizeof(SYMBOL_INFO);
SymFromAddr(thisProcess, (DWORD64)(stackTrace[i]), 0, functionData);
//
/* Find the file matching the function call.*/
DWORD displacementInLine;
IMAGEHLP_LINE64 lineOfCode;
lineOfCode.SizeOfStruct = sizeof(IMAGEHLP_LINE64);
SymGetLineFromAddr64(thisProcess, (DWORD)(stackTrace[i]), &displacementInLine, &lineOfCode);
//
std::cout << functionData->Name << " at "
<< lineOfCode.FileName << ":" << lineOfCode.LineNumber << std::endl;
}
}
.
// Our special new must allocate memory as expected…
#include <cstdio>
#include <cstdlib>
// …but also inspect the stack and print some results.
#include <execinfo.h>
#include <unistd.h>
#include <fstream>
// Import bad_alloc, expected in case of errors.
#include <new>
//
/* Opens (once) and return the file to save the results.. */
static std::ofstream& resultFile() {
static std::ofstream memoryProfile;
static bool open = false; // Init on 1st use, as usual.
if (! open) {
memoryProfile.open ("allocations.txt");
open = true;
}
// Else, handle errors, close the file…
// We won’t do it, to keep the example simple.
return memoryProfile;
}
//
/* This is the "magic" function that inspect the stack and writes it in a file. */
static void dumpStackTrace(std::ofstream& memoryProfile) {
// Record 15 pointers to stack frame - enough for the example program.
const int maximumStackSize = 15;
void *callStack[maximumStackSize];
size_t framesInUse = backtrace(callStack, maximumStackSize);
// Now callStack is full of pointers. Request the names of the functions matching each frame.
char ** mangledFunctionNames = backtrace_symbols(callStack, framesInUse);
// Writes all the function names in the stream.
for (size_t i = 0; i < framesInUse; ++i)
memoryProfile << mangledFunctionNames[i] << std::endl;
// To be fair, we should release mangledFunctionNames with free…
}
//
/* Now we have all the elements to build the custom operator new. */
void* operator new(std::size_t sz) {
// Allocate the requested memory for the caller.
void * requestedMemory = std::malloc(sz);
if (! requestedMemory)
throw std::bad_alloc();
// Share our allocations with the world.
std::ofstream& memoryProfile = resultFile();
memoryProfile << "Allocation, size = " << sz << " at " << static_cast<void*>(requestedMemory) << std::endl;
dumpStackTrace(memoryProfile);
memoryProfile << "-----------" << std::endl; // Poor man’s separator.
return requestedMemory;
}
Let’s add the “tricked out” operator new to our test program. This is an example of the result – can you guess the line of code behind it?
Allocation, size = 40 at 0x18705b0 ./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c] ./overridenew(_Znwm+0xaf) [0x406879] ./overridenew(_ZN9__gnu_cxx13new_allocatorISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS2_ELNS_12_Lock_policyE2EEE8allocateEmPKv+0x4a) [0x405d9e] ./overridenew(_ZNSt16allocator_traitsISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEE8allocateERS6_m+0x28) [0x405bef] ./overridenew(_ZSt18__allocate_guardedISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEESt15__allocated_ptrIT_ERS8_+0x21) [0x4059e2] ./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0x59) [0x4057e1] ./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae] ./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e] ./overridenew(_ZSt15allocate_sharedI9SomeClassSaIS0_EIEESt10shared_ptrIT_ERKT0_DpOT1_+0x37) [0x405534] ./overridenew(_ZSt11make_sharedI9SomeClassJEESt10shared_ptrIT_EDpOT0_+0x3b) [0x405454] ./overridenew(main+0x9c) [0x4052e8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f83fe991830] ./overridenew(_start+0x29) [0x405079] ----------- Allocation, size = 10000000 at 0x7f83fc9c3010 ./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c] ./overridenew(_Znwm+0xaf) [0x406879] ./overridenew(_ZN9__gnu_cxx13new_allocatorIcE8allocateEmPKv+0x3c) [0x406538] ./overridenew(_ZNSt16allocator_traitsISaIcEE8allocateERS0_m+0x28) [0x4064aa] ./overridenew(_ZNSt12_Vector_baseIcSaIcEE11_M_allocateEm+0x2a) [0x406450] ./overridenew(_ZNSt12_Vector_baseIcSaIcEE17_M_create_storageEm+0x23) [0x4063cb] ./overridenew(_ZNSt12_Vector_baseIcSaIcEEC1EmRKS0_+0x3b) [0x4062f7] ./overridenew(_ZNSt6vectorIcSaIcEEC2EmRKS0_+0x2c) [0x406286] ./overridenew(_ZN9SomeClassC1Ev+0x3d) [0x406215] ./overridenew(_ZN9__gnu_cxx13new_allocatorI9SomeClassE9constructIS1_JEEEvPT_DpOT0_+0x36) [0x405e3a] ./overridenew(_ZNSt16allocator_traitsISaI9SomeClassEE9constructIS0_JEEEvRS1_PT_DpOT0_+0x23) [0x405d51] ./overridenew(_ZNSt23_Sp_counted_ptr_inplaceI9SomeClassSaIS0_ELN9__gnu_cxx12_Lock_policyE2EEC2IJEEES1_DpOT_+0x8c) [0x405b4a] ./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0xaf) [0x405837] ./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae] ./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e] ...
…I can’t. Where is “main+0xa8” in my code? Thankfully in the “gnu/Linux world” there are tools to de-mangle names and find the point in the code that corresponds to a given address. We can use them, for example, in a simple script.
#!/usr/bin/python
#
# C++filt demangles names.
#
# addr2line converts code pointers (e. g. functions’ addresses)
# into the file:line couple corresponding to the code (if there are debug symbols).
#
# The python code should be portable, but the called utilities aren’t.
#
import re
import subprocess
#
# Opens a sub-process and passes shell commands to it. Returns the results as a string.
# Not very efficient, but easy.
def run_shell(command):
return subprocess.Popen(command, stdout=subprocess.PIPE).communicate()[0]
#
#
if __name__ == "__main__":
total_size = 0;
#
# There are 2 types of lines in the output: stack frames and allocation sizes.
size_line = re.compile("Allocation, size = (\d+) at (\d+)") # Allocation, size = <bytes> at <pointer somewhere in the heap>
stack_line = re.compile(".*\((.*)\+.*\) \[(.*)\]") # <rubbish>(mangled name) [<code pointer>]
#
allocations_file = open("allocations.txt")
for line in allocations_file:
match_size = size_line.match(line)
match_stack = stack_line.match(line)
#
# For a demo, I compute the sum of all the used memory.
# The things you can do with an overridden new!
if (match_size):
allocation_size = int(match_size.group(1))
total_size += allocation_size
print "Used " + str(allocation_size)
#
elif (match_stack):
mangled_name = match_stack.group(1)
line_address = match_stack.group(2)
demangled_name = run_shell(["c++filt", "-n", mangled_name])
line_number = run_shell(["addr2line", "-e", "./overridenew", line_address])
#
# This is not professional-grade formatting. The -1 cuts away the newlines.
print"\t" + demangled_name[:-1] + "\n\t\t" + line_number,
#
# Copy the separator as they were.
else:
print line
#
print "\n total allocated size " + str(total_size)
As an alternative, we could to everything at run time, using the compiler’s demangling utilities, such as the gcc one. Personally I prefer to keep the code instrumentation as simple as possible and do the “heavy lifting” off-line. My script returns:
Used 40 dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29 operator new(unsigned long) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48 __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<SomeClass, std::allocator<SomeClass>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/include/c++/5/ext/new_allocator.h:105 ... internal calls of the shared pointer... std::shared_ptr<SomeClass> std::allocate_shared<SomeClass, std::allocator<SomeClass>>(std::allocator<SomeClass> const&) /usr/include/c++/5/bits/shared_ptr.h:620 _ZSt11make_sharedI9SomeClassIEESt10shared_ptrIT_EDpOT0_ /usr/include/c++/5/bits/shared_ptr.h:636 main /home/stefano/projects/code/spy-memory-with-new/main.cpp:25 __libc_start_main ??:0 _start ??:? ----------- Used 10000000 dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29 operator new(unsigned long) /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48 __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) /usr/include/c++/5/ext/new_allocator.h:105 ...internal calls of vector... std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) /usr/include/c++/5/bits/stl_vector.h:279 SomeClass::SomeClass() /home/stefano/projects/code/spy-memory-with-new/SomeClass.cpp:4 (discriminator 2) ...
The first allocation are the 40 bytes requested by make_shared. 24 for SomeClass (its only member is a vector – sizeof(vector) is 24), the rest should be the control block of the shared pointer. The second allocation are the 10MB in the notorious constructor of SomeClass.
It takes some effort to navigate the stacks, but it is possible to understand that the mistery line was std::shared_ptr
Homework: how many allocations would there be with std::shared_ptr<SomeClass> notSoSmartPointer(new SomeClass());
?*
In a test I found:
24 bytes for SomeClass’s instance
10 MB to fill the vector
24 bytes for the shared pointer.
Looking at the implementation notes, I believe that the difference is in the content of the shared pointer’s control block.
In the end…
Programmers have been fighting against memory since the dawn of time, because it is slow and too small. As for every bottleneck, one can’t trust his instincts. We saw that there are proper tools (memory profilers) to measure the memory usage. We discovered that, in a pinch, there are “home made” tools we can build ourselves with a “stereotypical C++ hack”, the override of operator new.
You can find the “ready-to-compile” code in the ++It GitHub repo.