stefano – Italian C++ Community

Spying on memory consumption with operator new

stefano — Fri, 27 Jan 2017 18:42:17 +0000

Special thanks to Marco Alesiani for his corrections and suggestions.

International reader? Read the post in English.

When we say “efficiency”, we almost always think “time”. The sooner the code gets its job done, the more efficient it is.

And memory? Sure, today even a dirt-cheap laptop ships with “a bucketload of RAM”… but it is never enough. My PC “squanders” 1.4GB just to stay on. I open a browser, and another 300MB are gone*.

…and we apologize for the “Allowed memory size of … bytes exhausted” errors or the blank pages that you may see now and then on ++It. You can see why the “memory” topic is close to our hearts.

Adding insult to injury, accessing memory is also one of the slowest operations on current systems*.

Daniele Maccioni: Data Oriented Design: alte performance in C++

But it is not easy to tell which line of code is to blame. The new expressions we wrote ourselves? Some allocation hidden in a library? Or are temporary objects at fault?

How can we easily find the part of the code that uses the most memory?

This article collects a few personal experiments. All mistakes are to the author's “credit”.

Let's use some memory

Today's toy program is nothing special, except for a wide variety of memory allocations with operator new.

/* Program that allocates memory haphazardly.
No delete, this is not an article about memory leaks. */
#include <string>
#include <memory>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include "UnaClasseDelProgramma.h"

//
void h() {
    UnaClasseDelProgramma * t = new UnaClasseDelProgramma();
}
void g() { h(); }
void f() { g(); }
void CreaUnaClasseDelProgramma() { f(); }

//
int main(int argc, char **argv) {
    int * numero = new int(89);
    std::string * test = new std::string("abc");
    //
    UnaClasseDelProgramma * oggetto = new UnaClasseDelProgramma();
    CreaUnaClasseDelProgramma();
    //
    boost::shared_ptr<UnaClasseDelProgramma> smartPointer = boost::make_shared<UnaClasseDelProgramma>();
    std::shared_ptr<UnaClasseDelProgramma> stdSmartPointer = std::make_shared<UnaClasseDelProgramma>();
    return 0;
}

Compile, run and… about 42MB (measured “roughly” with /usr/bin/time -v).
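The same rough figure can also be read from inside the process. The following is a minimal sketch, assuming a POSIX system where getrusage is available; the function name peakResidentSetKb is made up for this example. On Linux the ru_maxrss field is expressed in kilobytes, other systems may use a different unit.

```cpp
#include <sys/resource.h>  // getrusage, struct rusage (POSIX)

// Hypothetical helper: returns the peak resident set size of the current
// process as reported by the kernel, or -1 if the query fails.
// Assumption: Linux, where ru_maxrss is measured in kilobytes.
long peakResidentSetKb() {
    struct rusage usage = {};
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}
```

Printing peakResidentSetKb() just before main returns should give a number in the same ballpark as the "Maximum resident set size" line of /usr/bin/time -v.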

Who is using all this memory?

The right way: memory profiler

The idea is familiar: the “classic” profiler tells how long each function runs. The memory profiler instead tells where, when and how much memory the program uses.
For example, here is part of what Massif* says about our program.

http://valgrind.org/docs/manual/ms-manual.html
But if you work on Windows: https://blogs.msdn.microsoft.com/vcblog/2015/10/21/memory-profiling-in-visual-c-2015/

To begin with, we get (in ASCII art!) how memory usage grows over “time”, which really means how it grows with the number of executed instructions:

    MB
38.23^                                                           ::::::::::::#
     |                                                           :           #
     |                                                           :           #
     |                                                           :           #
     |                                                           :           #
     |                                               :::::::::::::           #
     |                                               :           :           #
     |                                               :           :           #
     |                                               :           :           #
     |                                               :           :           #
     |                                   @@@@@@@@@@@@:           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                       ::::::::::::@           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
   0 +----------------------------------------------------------------------->Mi
     0                                                                   6.203

Then come more detailed reports (the “A”, “B” and “C” annotations are ours):

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
...
  9      4,313,116       30,080,056       30,072,844         7,212            0
99.98% (30,072,844B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.73% (30,000,000B) 0x407F68: __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->99.73% (30,000,000B) 0x407EDA: std::allocator_traits<std::allocator<char> >::allocate(std::allocator<char>&, unsigned long) (alloc_traits.h:491)
|   ->99.73% (30,000,000B) 0x407E80: std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) (stl_vector.h:170)
|     ->99.73% (30,000,000B) 0x407DFB: std::_Vector_base<char, std::allocator<char> >::_M_create_storage(unsigned long) (stl_vector.h:185)
|       ->99.73% (30,000,000B) 0x407D27: std::_Vector_base<char, std::allocator<char> >::_Vector_base(unsigned long, std::allocator<char> const&) (stl_vector.h:136)
|         ->99.73% (30,000,000B) 0x407CB6: std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) (stl_vector.h:278)
|           ->99.73% (30,000,000B) 0x407C45: UnaClasseDelProgramma::UnaClasseDelProgramma() (UnaClasseDelProgramma.cpp:4)
|   A ===>   ->33.24% (10,000,000B) 0x406611: main (main.cpp:20)
|             | 
|   B ===>    ->33.24% (10,000,000B) 0x406541: h() (main.cpp:10)
|             | ->33.24% (10,000,000B) 0x40656F: g() (main.cpp:12)
|             |   ->33.24% (10,000,000B) 0x40657B: f() (main.cpp:13)
|             |     ->33.24% (10,000,000B) 0x406587: CreaUnaClasseDelProgramma() (main.cpp:14)
|             |       ->33.24% (10,000,000B) 0x40661A: main (main.cpp:21)
|             |         
|   C ===>    ->33.24% (10,000,000B) 0x406A72: _ZN5boost11make_sharedI21UnaClasseDelProgrammaIEEENS_6detail15sp_if_not_arrayIT_E4typeEDpOT0_ (make_shared_object.hpp:254)
|               ->33.24% (10,000,000B) 0x406626: main (main.cpp:23)
|                 
->00.24% (72,844B) in 1+ places, all below ms_print's threshold (01.00%)

We see right away that one third of the memory is spent at line 20 of main (A), where one of our new expressions lives. Another third (B) is allocated by h(), which Massif shows in the call stack recorded at the time of the allocation. Following it, we reach the call to CreaUnaClasseDelProgramma() in main. Massif also catches the allocations made through shared pointers (C).

The allocation at line 24 does not show up because it has not yet been executed and “intercepted” by Massif. It may appear in a later snapshot. The other allocations in main are “small” and aggregated in the last line.

It is immediately clear that the constructor of UnaClasseDelProgramma deserves a look. What on earth is it doing with a std::vector that takes up 99% of the memory?
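The source of UnaClasseDelProgramma is not shown in the post. Going only by the Massif report (a vector constructor called from UnaClasseDelProgramma.cpp:4, 10,000,000 bytes per instance), a class along these lines would produce the same picture. This is a hypothetical reconstruction: the member name, the constant and the accessor are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reconstruction of the class behind the Massif report.
// Assumption: a single std::vector<char> member sized in the constructor.
class UnaClasseDelProgramma {
public:
    // The sized vector constructor is what triggers the large heap allocation.
    UnaClasseDelProgramma() : buffer_(kBufferSize) {}

    std::size_t bufferSize() const { return buffer_.size(); }

private:
    static constexpr std::size_t kBufferSize = 10000000;  // about 10 MB
    std::vector<char> buffer_;
};
```

The object itself stays small, since it only holds the bookkeeping of the vector; the 10 MB live on the heap, which is why the profiler attributes them to the allocator calls underneath the constructor.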

This is already a great help, for little effort. If needed, Massif can do more. It can measure the memory used “behind the scenes” by the system to manage the heap (extra-heap, 7,212 bytes in the example), measure the stack…

The do-it-yourself method: overriding operator new

In C++ you can replace the function that allocates memory for new objects (operator new) with your own.*

http://en.cppreference.com/w/cpp/memory/new/operator_new

Almost nobody has a good reason to do it, but we do: we do not know how to use the profiler... er, we want to intercept heap allocations.

Simplifying, all it takes is to define our own version of operator new (and of its overloads) in any file of the program.
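As a warm-up, here is a minimal sketch of such a replacement that only adds up the requested bytes. The names g_bytesRequested and bytesRequestedSoFar are invented for this example. It also replaces operator delete, on the assumption that memory obtained from malloc should be handed back with free, and it avoids any stream output inside operator new, because stream code may itself allocate and re-enter the function.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counter: total bytes requested through operator new so far.
static std::atomic<std::size_t> g_bytesRequested{0};

// Replacement for the global allocation function.
// No iostream calls here: they could allocate and call us recursively.
void* operator new(std::size_t size) {
    if (size == 0)
        size = 1;  // malloc(0) is allowed to return a null pointer
    void* memory = std::malloc(size);
    if (!memory)
        throw std::bad_alloc();
    g_bytesRequested.fetch_add(size, std::memory_order_relaxed);
    return memory;
}

// Matching replacements: the memory now comes from malloc, so release it with free.
void operator delete(void* memory) noexcept { std::free(memory); }
void operator delete(void* memory, std::size_t) noexcept { std::free(memory); }

// Hypothetical accessor for the running total.
std::size_t bytesRequestedSoFar() {
    return g_bytesRequested.load(std::memory_order_relaxed);
}
```

With the usual standard library implementations the default operator new[] forwards to operator new, so array allocations should be counted as well; over-aligned allocations (C++17) go through separate overloads that this sketch leaves untouched.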

If the memory profiler is the counterpart of the “time” profiler, this trick is comparable to the classic snippet cout << tempoFine - tempoInizio;. Not wonderfully detailed or accurate, but simple and still useful.

A few lines of code are enough to get something crude but usable. It is better to compile with debug symbols. The code that writes the stack trace probably works on Linux only*.

Nothing is portable at such a low level.

For those who work in the Microsoft world: https://msdn.microsoft.com/en-us/library/windows/desktop/bb204633%28v=vs.85%29.aspx.

That is:

#include <iostream>
//
#include <Windows.h> // Capture stack traces.
#include <DbgHelp.h> // Read debug symbols.

//
void StackTrace() {
    /* Capture the actual stack trace. */
    const ULONG doNotSkipAnyFrame = 0;
    const ULONG takeTenFrames = 10;
    const PULONG doNotHash = nullptr;
    PVOID stackTrace[takeTenFrames];
    const USHORT framesCaptured = CaptureStackBackTrace(
        doNotSkipAnyFrame,
        takeTenFrames,
        stackTrace,
        doNotHash
    );
    //
    /* Prepare the symbol table to translate addresses into lines of code. */
    const HANDLE thisProcess = GetCurrentProcess();
    SymInitialize(thisProcess, NULL, TRUE); // Link against Dbghelp.lib
    //
    for (ULONG i = 0; i < framesCaptured; i++) {
        /* Extract the function name. */
        const size_t nameStringSize = 256;
        SYMBOL_INFO * functionData = (SYMBOL_INFO*)malloc(sizeof(SYMBOL_INFO) + (nameStringSize + 1) * sizeof(char)); // +1 for the \0
        functionData->MaxNameLen = nameStringSize;
        functionData->SizeOfStruct = sizeof(SYMBOL_INFO);
        SymFromAddr(thisProcess, (DWORD64)(stackTrace[i]), 0, functionData);
        //
        /* Look up the file and line matching the call. */
        DWORD displacementInLine;
        IMAGEHLP_LINE64 lineOfCode;
        lineOfCode.SizeOfStruct = sizeof(IMAGEHLP_LINE64);
        // The address must be passed as DWORD64: a DWORD cast would truncate it on 64-bit builds.
        SymGetLineFromAddr64(thisProcess, (DWORD64)(stackTrace[i]), &displacementInLine, &lineOfCode);
        //
        std::cout << functionData->Name << " at "
                  << lineOfCode.FileName << ":" << lineOfCode.LineNumber << std::endl;
        free(functionData); // Release the buffer allocated for this frame.
    }
}

// Our new must be able to allocate memory…
#include <cstddef>
#include <cstdlib>
// …but also inspect the stack and save it to the output.
#include <execinfo.h>
#include <fstream>
#include <iostream>
// Provides std::bad_alloc, to throw in case of errors.
#include <new>
//
/* Opens (only once) and returns the file stream used to save
the stacks. */
std::ofstream& filePerRisultati() {
    static std::ofstream memoryProfile;
    static bool open = false; // Init on 1st use, the classic way.
    if (! open) {
        memoryProfile.open ("allocations.txt");
        open = true;
    }
    // Else, handle errors, close the file…
    // We leave that out for simplicity.
    return memoryProfile;
}
//
/* This function “does the magic” and writes to the file the stack trace at the time of the call
(including its own frame). */
void segnaLoStackTrace(std::ofstream& memoryProfile) {
    // We record 15 pointers to stack frames (enough for the test program).
    const int massimaDimensioneStack = 15;
    void *callStack[massimaDimensioneStack];
    const int frameInUso = backtrace(callStack, massimaDimensioneStack);
    // At this point callStack is full of pointers. We ask for the names of the
    // functions matching each frame.
    char ** nomiFunzioniMangled = backtrace_symbols(callStack, frameInUso);
    if (! nomiFunzioniMangled)
        return;
    // Write all the strings with the function names to the debug stream.
    for (int i = 0; i < frameInUso; ++i)
        memoryProfile << nomiFunzioniMangled[i] << std::endl;
    // backtrace_symbols allocates the array with malloc: release it with free.
    std::free(nomiFunzioniMangled);
}
//
/* At last we have all the pieces to build our operator new. */
void* operator new(std::size_t sz) {
    // Allocate the memory the caller needs.
    void * memoriaRichiesta = std::malloc(sz);
    if (! memoriaRichiesta)
        throw std::bad_alloc();

    // Tell the whole world about our allocations.
    std::ofstream& memoryProfile = filePerRisultati();
    memoryProfile << "Allocation, size = " << sz << " at " << static_cast<void*>(memoriaRichiesta) << std::endl;
    segnaLoStackTrace(memoryProfile);
    memoryProfile << "-----------" << std::endl; // Poor man's separator…
    return memoriaRichiesta;
}

Let's add the “rigged” operator new to our test program. Here is a sample of the result. Can you tell which line of code allocates the memory?

Allocation, size = 40 at 0x18705b0
./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c]
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(_ZN9__gnu_cxx13new_allocatorISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS2_ELNS_12_Lock_policyE2EEE8allocateEmPKv+0x4a) [0x405d9e]
./overridenew(_ZNSt16allocator_traitsISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEE8allocateERS6_m+0x28) [0x405bef]
./overridenew(_ZSt18__allocate_guardedISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEESt15__allocated_ptrIT_ERS8_+0x21) [0x4059e2]
./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0x59) [0x4057e1]
./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae]
./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e]
./overridenew(_ZSt15allocate_sharedI9SomeClassSaIS0_EIEESt10shared_ptrIT_ERKT0_DpOT1_+0x37) [0x405534]
./overridenew(_ZSt11make_sharedI9SomeClassJEESt10shared_ptrIT_EDpOT0_+0x3b) [0x405454]
./overridenew(main+0x9c) [0x4052e8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f83fe991830]
./overridenew(_start+0x29) [0x405079]
-----------
Allocation, size = 10000000 at 0x7f83fc9c3010
./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c]
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(_ZN9__gnu_cxx13new_allocatorIcE8allocateEmPKv+0x3c) [0x406538]
./overridenew(_ZNSt16allocator_traitsISaIcEE8allocateERS0_m+0x28) [0x4064aa]
./overridenew(_ZNSt12_Vector_baseIcSaIcEE11_M_allocateEm+0x2a) [0x406450]
./overridenew(_ZNSt12_Vector_baseIcSaIcEE17_M_create_storageEm+0x23) [0x4063cb]
./overridenew(_ZNSt12_Vector_baseIcSaIcEEC1EmRKS0_+0x3b) [0x4062f7]
./overridenew(_ZNSt6vectorIcSaIcEEC2EmRKS0_+0x2c) [0x406286]
./overridenew(_ZN9SomeClassC1Ev+0x3d) [0x406215]
./overridenew(_ZN9__gnu_cxx13new_allocatorI9SomeClassE9constructIS1_JEEEvPT_DpOT0_+0x36) [0x405e3a]
./overridenew(_ZNSt16allocator_traitsISaI9SomeClassEE9constructIS0_JEEEvRS1_PT_DpOT0_+0x23) [0x405d51]
./overridenew(_ZNSt23_Sp_counted_ptr_inplaceI9SomeClassSaIS0_ELN9__gnu_cxx12_Lock_policyE2EEC2IJEEES1_DpOT_+0x8c) [0x405b4a]
./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0xaf) [0x405837]
./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae]
./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e]

...

…I can't. Where is “main+0x9c” in my program? Luckily, in the “GNU/Linux world” there are tools to demangle names and to find the points in the code that match the addresses. We can use them, for example, in a simple script.

#!/usr/bin/env python3
#
# c++filt demangles the names.
#
# addr2line converts code pointers (e.g. function addresses)
# to the matching file:line pair (if debug symbols are available).
#
# The Python code should be portable, but the command-line utilities are not.
#

import re
import subprocess


# Runs a command in a subprocess and returns its output as a string.
# Not very efficient, but simple.
def run_shell(command):
    process = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True)
    return process.communicate()[0]


if __name__ == "__main__":
    total_size = 0

    # The output has 2 kinds of lines: one with the size of the allocation, one with a stack frame.
    size_line = re.compile(r"Allocation, size = (\d+) at (0x[0-9a-fA-F]+)")  # Allocation, size = <bytes> at <address>
    stack_line = re.compile(r".*\((.*)\+.*\) \[(.*)\]")  # <binary>(<mangled name>+<offset>) [<address>]

    with open("allocations.txt") as allocations_file:
        for line in allocations_file:
            match_size = size_line.match(line)
            match_stack = stack_line.match(line)

            # For demonstration purposes, accumulate the total allocated memory.
            # An example of what you can do when you control new!
            if match_size:
                allocation_size = int(match_size.group(1))
                total_size += allocation_size
                print("Allocati " + str(allocation_size))

            elif match_stack:
                mangled_name = match_stack.group(1)
                line_address = match_stack.group(2)
                demangled_name = run_shell(["c++filt", "-n", mangled_name])
                line_number = run_shell(["addr2line", "-e", "./overridenew", line_address])

                # The formatting is not very professional. The "gratuitous" -1 strips a newline.
                print("\t" + demangled_name[:-1] + "\n\t\t" + line_number, end="")

            # Put the separators back exactly where they were.
            else:
                print(line)

    print("\n total allocated size " + str(total_size))

Alternatively, everything can be done at run time, with the demangling utilities shipped with the compilers, for example the one from gcc. Personally, I prefer to keep the measuring code as simple as possible and to sort things out off-line. With my script I get:

Allocati 40
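    segnaLoStackTrace(std::basic_ofstream<char, std::char_traits<char> >&)
        /home/stefano/projects/overrideNew/InstrumentedNew.cpp:31
    operator new(unsigned long)
        /home/stefano/projects/overrideNew/InstrumentedNew.cpp:51
    __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<UnaClasseDelProgramma, std::allocator<UnaClasseDelProgramma>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*)
        /usr/include/c++/5/ext/new_allocator.h:105

   ... call stack of the shared_ptr "internals"...

    std::shared_ptr<UnaClasseDelProgramma> std::allocate_shared<UnaClasseDelProgramma, std::allocator<UnaClasseDelProgramma>>(std::allocator<UnaClasseDelProgramma> const&)
        /usr/include/c++/5/bits/shared_ptr.h:620
    std::shared_ptr<UnaClasseDelProgramma> std::make_shared<UnaClasseDelProgramma>()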
        /usr/include/c++/5/bits/shared_ptr.h:636
    main
        /home/stefano/projects/overrideNew/main.cpp:25
    __libc_start_main
        ??:0
    _start
        ??:?
-----------

Allocati 10000000
    segnaLoStackTrace(std::basic_ofstream<char, std::char_traits<char> >&)
        /home/stefano/projects/overrideNew/InstrumentedNew.cpp:31
    operator new(unsigned long)
        /home/stefano/projects/overrideNew/InstrumentedNew.cpp:51
    __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*)
        /usr/include/c++/5/ext/new_allocator.h:105

         ... call stack of the vector internals...

    std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&)
        /usr/include/c++/5/bits/stl_vector.h:279
    UnaClasseDelProgramma::UnaClasseDelProgramma()
        /home/stefano/projects/overrideNew/UnaClasseDelProgramma.cpp:4 (discriminator 2)
...
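For readers who prefer the run-time route, the following is a minimal sketch of demangling inside the program with abi::__cxa_demangle, which GCC and Clang provide in <cxxabi.h>. The helper names demangle and mangledNameIn are invented for this example, and the line format is assumed to be the one shown in the raw backtrace_symbols output earlier.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC and Clang)
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the human-readable form of a mangled name,
// or the input unchanged when demangling fails.
std::string demangle(const char* mangledName) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangledName, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr)
        return mangledName;
    std::string result(readable);
    std::free(readable);  // the buffer is allocated with malloc
    return result;
}

// Hypothetical helper: extracts the text between '(' and '+' from a line such as
// "./overridenew(_Znwm+0xaf) [0x406879]". Returns an empty string if not found.
std::string mangledNameIn(const std::string& line) {
    const std::string::size_type open = line.find('(');
    const std::string::size_type plus = line.find('+', open);
    if (open == std::string::npos || plus == std::string::npos)
        return "";
    return line.substr(open + 1, plus - open - 1);
}
```

Note that __cxa_demangle allocates its result with malloc and std::string may use operator new, so calling these helpers from inside a replaced operator new would need some protection against recursion. That is one more reason to keep the measuring code simple and do the decoding off-line.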

The first allocation is the 40 bytes requested by make_shared: 24 for UnaClasseDelProgramma (which has a vector as a member, and sizeof(vector) is 24), while the rest should be the control block of the shared pointer. The second allocation is the 10MB from the infamous constructor of UnaClasseDelProgramma.

It takes some effort to decipher the stacks, but we can work out that the mystery line was std::shared_ptr<UnaClasseDelProgramma> stdSmartPointer = std::make_shared<UnaClasseDelProgramma>(); near the return, at main.cpp:25.

Homework: how many allocations would there be with std::shared_ptr<UnaClasseDelProgramma> notSoSmartPointer(new UnaClasseDelProgramma());?*

Three, and 8 more bytes are used.
In a test I measured:
24 bytes for the instance of UnaClasseDelProgramma
10 MB for the contents of the vector
24 bytes for the shared pointer.

Judging from the implementation notes, I think the difference lies in the contents of the control block of the shared pointer.
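The homework can be checked with a counting operator new. The sketch below is an illustration under assumptions, not the code used for the measurements above: Payload is a made-up 24-byte stand-in without heap members (so the 10 MB vector allocation does not appear), and the counter is not thread-safe. The standard only encourages a single allocation for make_shared; the mainstream library implementations are expected to behave that way.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical counter of calls to operator new (not thread-safe).
static std::size_t g_allocationCount = 0;

void* operator new(std::size_t size) {
    void* memory = std::malloc(size == 0 ? 1 : size);
    if (!memory)
        throw std::bad_alloc();
    ++g_allocationCount;
    return memory;
}
void operator delete(void* memory) noexcept { std::free(memory); }
void operator delete(void* memory, std::size_t) noexcept { std::free(memory); }

// Made-up stand-in for the class of the article: no heap members.
struct Payload { long a, b, c; };

// Keeps the last object alive so the allocations cannot be optimized away.
std::shared_ptr<Payload> g_keepAlive;

// Object and control block are expected to share one allocation.
std::size_t allocationsForMakeShared() {
    const std::size_t before = g_allocationCount;
    g_keepAlive = std::make_shared<Payload>();
    return g_allocationCount - before;
}

// One allocation for the object, one more for the control block.
std::size_t allocationsForNewThenSharedPtr() {
    const std::size_t before = g_allocationCount;
    g_keepAlive = std::shared_ptr<Payload>(new Payload());
    return g_allocationCount - before;
}
```

With the real UnaClasseDelProgramma each variant would show one more allocation, the one for the contents of the vector, which matches the count of three given above for the new-then-shared_ptr form.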

Summing up…

Programmers have always fought with memory, either because it is scarce or because it is slow. As with every bottleneck, instinct cannot be trusted. We have seen that proper tools (memory profilers) exist to measure memory consumption. We have found out that, if worst comes to worst, there are “homemade” tools that we can build ourselves with the “classic C++ hack” of tampering with operator new.

You can find the code of the examples “ready to compile” in the ++It GitHub repo.

Spy your memory usage with operator new

stefano — Fri, 27 Jan 2017 18:18:53 +0000

Special thanks to Marco Alesiani for many corrections and suggestions.

Do you also live on spaghetti and pizza? Read the post in Italian.

When we say “efficiency”, we often think “time”. The sooner the code does its job, the more efficient it is.

What about memory? Granted, today even the lousiest laptop comes with “a bucket load” of RAM which… is never enough. My PC “wastes” 1.4GB just to idle. I open a browser, and 300 more MB are gone*.

…we take the opportunity to apologize for the “Allowed memory size of … bytes exhausted” errors and the white pages that you may occasionally see on ++It. There is a reason why we care so much about memory.

Adding insult to injury, using memory is one of the slowest operations on current systems*.

(Italian only) Daniele Maccioni: Data Oriented Design: alte performance in C++

Moreover, finding the culprit line in the code is not easy. Was it a “new” we wrote? Some allocation hidden inside a library? Are temporary objects to blame?

How can we easily find the part of the code that uses most of the memory?

This post collects some personal experiments. You can “thank” the author for any mistake.

Let’s use some memory

Today’s toy-code is nothing special, but it does many an allocation using operator new.

/* Program that allocates some memory when it feels like it.
No delete: today's essay is not about memory leaks. */
#include <string>
#include <memory>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include "SomeClass.h"
//
void h() {
    SomeClass* t = new SomeClass();
}
void g() { h(); }
void f() { g(); }
void MakeSomeClass() { f(); }
//
int main(int argc, char **argv) {
    int * number = new int(89);
    std::string * test = new std::string("abc");
    //
    SomeClass * oggetto = new SomeClass();
    MakeSomeClass();
    //
    boost::shared_ptr<SomeClass> smartPointer = boost::make_shared<SomeClass>();
    std::shared_ptr<SomeClass> stdSmartPointer = std::make_shared<SomeClass>();
    return 0;
}

Compile, run and… almost 42MB (measured “on the cheap” with /usr/bin/time -v).

Who is using all that memory?

The right way: memory profiler

The idea should be familiar: the “classic” profiler tells how long each function runs. The memory profiler instead tells where and when the program uses memory, and how much.
For example, here is some of the information that Massif* returns about our program.

http://valgrind.org/docs/manual/ms-manual.html
Should you work on Windows: https://blogs.msdn.microsoft.com/vcblog/2015/10/21/memory-profiling-in-visual-c-2015/

We can start with the memory growth (in ASCII art!) over “time” – actually its growth over the number of executed instructions:

    MB
38.23^                                                           ::::::::::::#
     |                                                           :           #
     |                                                           :           #
     |                                                           :           #
     |                                                           :           #
     |                                               :::::::::::::           #
     |                                               :           :           #
     |                                               :           :           #
     |                                               :           :           #
     |                                               :           :           #
     |                                   @@@@@@@@@@@@:           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                                   @           :           :           #
     |                       ::::::::::::@           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
     |                       :           @           :           :           #
   0 +----------------------------------------------------------------------->Mi
     0                                                                   6.203

Then we can get detailed snapshots (the “A”, “B” and “C” tags are ours):

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
...
  9      4,311,691       30,080,056       30,072,844         7,212            0
99.98% (30,072,844B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.73% (30,000,000B) 0x4078E8: __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->99.73% (30,000,000B) 0x40785A: std::allocator_traits<std::allocator<char> >::allocate(std::allocator<char>&, unsigned long) (alloc_traits.h:491)
|   ->99.73% (30,000,000B) 0x407800: std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) (stl_vector.h:170)
|     ->99.73% (30,000,000B) 0x40777B: std::_Vector_base<char, std::allocator<char> >::_M_create_storage(unsigned long) (stl_vector.h:185)
|       ->99.73% (30,000,000B) 0x4076A7: std::_Vector_base<char, std::allocator<char> >::_Vector_base(unsigned long, std::allocator<char> const&) (stl_vector.h:136)
|         ->99.73% (30,000,000B) 0x407636: std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&) (stl_vector.h:278)
|           ->99.73% (30,000,000B) 0x4075C5: SomeClass::SomeClass() (SomeClass.cpp:4)
|  A ====>   ->33.24% (10,000,000B) 0x405F91: main (main.cpp:20)
|             | 
|  B ====>    ->33.24% (10,000,000B) 0x405EC1: h() (main.cpp:10)
|             | ->33.24% (10,000,000B) 0x405EEF: g() (main.cpp:12)
|             |   ->33.24% (10,000,000B) 0x405EFB: f() (main.cpp:13)
|             |     ->33.24% (10,000,000B) 0x405F07: MakeSomeClass() (main.cpp:14)
|             |       ->33.24% (10,000,000B) 0x405F9A: main (main.cpp:21)
|             |         
|  C ====>    ->33.24% (10,000,000B) 0x4063F2: _ZN5boost11make_sharedI9SomeClassIEEENS_6detail15sp_if_not_arrayIT_E4typeEDpOT0_ (make_shared_object.hpp:254)
|               ->33.24% (10,000,000B) 0x405FA6: main (main.cpp:23)
|                 
->00.24% (72,844B) in 1+ places, all below ms_print's threshold (01.00%)

We quickly see that line 20 of main, where we wrote a new, uses one third of the memory (A). Another third (B) is allocated in h(): Massif recorded the whole call stack at the point of allocation, so we can trace it back to the call to MakeSomeClass() in main. Massif also works with shared pointers (C).

We can’t see the allocation at line 24 because it has not yet been executed and “intercepted” by Massif. We may spot it in a later snapshot. The remaining allocations are “small” and summarized in the last line.

A quick glance at the report tells us to go check the constructor of SomeClass. What the heck is it doing with a std::vector that takes 99% of the memory?

This is already a good result, obtained with little effort. Be aware that Massif can do more. It can measure the memory used “behind the scenes” by the system to make the heap work (extra-heap – 7,212 bytes in the example), track the stack…
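Part of what Massif counts as extra-heap is the rounding that the allocator applies to each request. The following sketch gives a rough feel for it, assuming glibc: malloc_usable_size is a glibc extension declared in <malloc.h>, not standard C++, and the function name usableBytesFor is made up for this example. It does not show the administrative bytes that the allocator keeps next to each block.

```cpp
#include <cstddef>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size: glibc extension (assumption: Linux/glibc)

// Hypothetical helper: returns how many bytes the allocator actually makes
// usable for a request of `requested` bytes, or 0 if the allocation fails.
std::size_t usableBytesFor(std::size_t requested) {
    void* block = std::malloc(requested);
    if (block == nullptr)
        return 0;
    const std::size_t usable = malloc_usable_size(block);
    std::free(block);
    return usable;
}
```

Comparing usableBytesFor(n) with n for a few small sizes shows how much a tiny request can be padded; the exact numbers depend on the allocator and the platform.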

The do-it-yourself way: override operator new

C++ lets us replace the function that allocates memory for new objects (operator new) with a custom one.*

http://en.cppreference.com/w/cpp/memory/new/operator_new

Almost nobody has a good reason to do so, but we do: I could not figure out how to use the profiler... er, we want to intercept heap allocations.

By and large, all we have to do is define a custom new (and its overloads) in any file of a program.

If the memory profiler is an equivalent of the “time” profiler, then you can compare this trick to the classic snippet cout << endTime - startTime;. Not really detailed or accurate, but simple and useful.

A few lines of code can give us something raw, but usable. You should compile with debug symbols. The code that outputs the stack trace probably works on Linux only*.

There is nothing portable when you work at low level.

If you are in the Microsoft world: https://msdn.microsoft.com/en-us/library/windows/desktop/bb204633%28v=vs.85%29.aspx.

That means:

#include <iostream>
//
#include <Windows.h> // Capture stack traces.
#include <DbgHelp.h> // Read debug symbols.

//
void StackTrace() {
    /* Capture the stack trace. */
    const ULONG doNotSkipAnyFrame = 0;
    const ULONG takeTenFrames = 10;
    const PULONG doNotHash = nullptr;
    PVOID stackTrace[takeTenFrames];
    const USHORT framesCaptured = CaptureStackBackTrace(
        doNotSkipAnyFrame,
        takeTenFrames,
        stackTrace,
        doNotHash
    );
    //
    /* Prepare the symbol table to convert from addresses to lines of code. */
    const HANDLE thisProcess = GetCurrentProcess();
    SymInitialize(thisProcess, NULL, TRUE); // Link against Dbghelp.lib
    //
    for (ULONG i = 0; i < framesCaptured; i++) {
        /* Extract the function name. */
        const size_t nameStringSize = 256;
        SYMBOL_INFO * functionData = (SYMBOL_INFO*)malloc(sizeof(SYMBOL_INFO) + (nameStringSize + 1) * sizeof(char)); // +1 because there is \0
        functionData->MaxNameLen = nameStringSize;
        functionData->SizeOfStruct = sizeof(SYMBOL_INFO);
        SymFromAddr(thisProcess, (DWORD64)(stackTrace[i]), 0, functionData);
        //
        /* Find the file matching the function call. */
        DWORD displacementInLine;
        IMAGEHLP_LINE64 lineOfCode;
        lineOfCode.SizeOfStruct = sizeof(IMAGEHLP_LINE64);
        // The address must be passed as DWORD64: a DWORD cast would truncate it on 64-bit builds.
        SymGetLineFromAddr64(thisProcess, (DWORD64)(stackTrace[i]), &displacementInLine, &lineOfCode);
        //
        std::cout << functionData->Name << " at "
                  << lineOfCode.FileName << ":" << lineOfCode.LineNumber << std::endl;
        free(functionData); // Release the buffer allocated for this frame.
    }
}

// Our special new must allocate memory as expected…
#include <cstddef>
#include <cstdlib>
// …but also inspect the stack and print some results.
#include <execinfo.h>
#include <fstream>
#include <iostream>
// Import bad_alloc, expected in case of errors.
#include <new>
//
/* Opens (once) and returns the file to save the results. */
static std::ofstream& resultFile() {
    static std::ofstream memoryProfile;
    static bool open = false; // Init on 1st use, as usual.
    if (! open) {
        memoryProfile.open ("allocations.txt");
        open = true;
    }
    // Else, handle errors, close the file…
    // We won't do it, to keep the example simple.
    return memoryProfile;
}
//
/* This is the "magic" function that inspects the stack and writes it to a file. */
static void dumpStackTrace(std::ofstream& memoryProfile) {
    // Record 15 pointers to stack frames - enough for the example program.
    const int maximumStackSize = 15;
    void *callStack[maximumStackSize];
    const int framesInUse = backtrace(callStack, maximumStackSize);
    // Now callStack is full of pointers. Request the names of the functions matching each frame.
    char ** mangledFunctionNames = backtrace_symbols(callStack, framesInUse);
    if (! mangledFunctionNames)
        return;
    // Write all the function names to the stream.
    for (int i = 0; i < framesInUse; ++i)
        memoryProfile << mangledFunctionNames[i] << std::endl;
    // backtrace_symbols allocates the array with malloc: release it with free.
    std::free(mangledFunctionNames);
}
//
/* Now we have all the elements to build the custom operator new. */
void* operator new(std::size_t sz) {
    // Allocate the requested memory for the caller.
    void * requestedMemory = std::malloc(sz);
    if (! requestedMemory)
        throw std::bad_alloc();
    // Share our allocations with the world.
    std::ofstream& memoryProfile = resultFile();
    memoryProfile << "Allocation, size = " << sz << " at " << static_cast<void*>(requestedMemory) << std::endl;
    dumpStackTrace(memoryProfile);
    memoryProfile << "-----------" << std::endl; // Poor man's separator.

    return requestedMemory;
}

Let’s add the “tricked out” operator new to our test program. This is an example of the result – can you guess the line of code behind it?

Allocation, size = 40 at 0x18705b0
./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c]
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(_ZN9__gnu_cxx13new_allocatorISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS2_ELNS_12_Lock_policyE2EEE8allocateEmPKv+0x4a) [0x405d9e]
./overridenew(_ZNSt16allocator_traitsISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEE8allocateERS6_m+0x28) [0x405bef]
./overridenew(_ZSt18__allocate_guardedISaISt23_Sp_counted_ptr_inplaceI9SomeClassSaIS1_ELN9__gnu_cxx12_Lock_policyE2EEEESt15__allocated_ptrIT_ERS8_+0x21) [0x4059e2]
./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0x59) [0x4057e1]
./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae]
./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e]
./overridenew(_ZSt15allocate_sharedI9SomeClassSaIS0_EIEESt10shared_ptrIT_ERKT0_DpOT1_+0x37) [0x405534]
./overridenew(_ZSt11make_sharedI9SomeClassJEESt10shared_ptrIT_EDpOT0_+0x3b) [0x405454]
./overridenew(main+0x9c) [0x4052e8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f83fe991830]
./overridenew(_start+0x29) [0x405079]
-----------
Allocation, size = 10000000 at 0x7f83fc9c3010
./overridenew(_Z14dumpStackTraceRSt14basic_ofstreamIcSt11char_traitsIcEE+0x3c) [0x40672c]
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(_ZN9__gnu_cxx13new_allocatorIcE8allocateEmPKv+0x3c) [0x406538]
./overridenew(_ZNSt16allocator_traitsISaIcEE8allocateERS0_m+0x28) [0x4064aa]
./overridenew(_ZNSt12_Vector_baseIcSaIcEE11_M_allocateEm+0x2a) [0x406450]
./overridenew(_ZNSt12_Vector_baseIcSaIcEE17_M_create_storageEm+0x23) [0x4063cb]
./overridenew(_ZNSt12_Vector_baseIcSaIcEEC1EmRKS0_+0x3b) [0x4062f7]
./overridenew(_ZNSt6vectorIcSaIcEEC2EmRKS0_+0x2c) [0x406286]
./overridenew(_ZN9SomeClassC1Ev+0x3d) [0x406215]
./overridenew(_ZN9__gnu_cxx13new_allocatorI9SomeClassE9constructIS1_JEEEvPT_DpOT0_+0x36) [0x405e3a]
./overridenew(_ZNSt16allocator_traitsISaI9SomeClassEE9constructIS0_JEEEvRS1_PT_DpOT0_+0x23) [0x405d51]
./overridenew(_ZNSt23_Sp_counted_ptr_inplaceI9SomeClassSaIS0_ELN9__gnu_cxx12_Lock_policyE2EEC2IJEEES1_DpOT_+0x8c) [0x405b4a]
./overridenew(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2I9SomeClassSaIS4_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_+0xaf) [0x405837]
./overridenew(_ZNSt12__shared_ptrI9SomeClassLN9__gnu_cxx12_Lock_policyE2EEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x3c) [0x4056ae]
./overridenew(_ZNSt10shared_ptrI9SomeClassEC2ISaIS0_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_+0x28) [0x40560e]

...

…I can’t. Where is “main+0xa8” in my code? Thankfully in the GNU/Linux world there are tools to demangle names and to find the point in the code that corresponds to a given address. We can use them, for example, in a simple script.

#!/usr/bin/python
#
# C++filt demangles names.
#
# addr2line converts code pointers (e. g. functions’ addresses)
# into the file:line couple corresponding to the code (if there are debug symbols).
#
# The python code should be portable, but the called utilities aren’t.
#

import re
import subprocess
#

# Opens a sub-process and passes shell commands to it. Returns the results as a string.
# Not very efficient, but easy.
def run_shell(command):
    return subprocess.Popen(command, stdout=subprocess.PIPE).communicate()[0]
#
#
if __name__ == "__main__":
    total_size = 0
    #
    # There are 2 types of lines in the output: stack frames and allocation sizes.
    size_line = re.compile(r"Allocation, size = (\d+) at (0x[0-9a-fA-F]+)")  # Allocation, size = <size> at <address>
    stack_line = re.compile(r".*\((.*)\+.*\) \[(.*)\]")  # <binary>(<mangled name>+<offset>) [<address>]
    #
    allocations_file = open("allocations.txt")
    for line in allocations_file:
        match_size = size_line.match(line)
        match_stack = stack_line.match(line)
        #
        # For a demo, I compute the sum of all the used memory.
        # The things you can do with an overridden new!
        if match_size:
            allocation_size = int(match_size.group(1))
            total_size += allocation_size
            print "Used " + str(allocation_size)
        #
        elif match_stack:
            mangled_name = match_stack.group(1)
            line_address = match_stack.group(2)
            demangled_name = run_shell(["c++filt", "-n", mangled_name])
            line_number = run_shell(["addr2line", "-e", "./overridenew", line_address])
            #
            # This is not professional-grade formatting. The -1 cuts away the newlines.
            print "\t" + demangled_name[:-1] + "\n\t\t" + line_number,
        #
        # Copy the separators as they are.
        else:
            print line
    #
    print "\n total allocated size " + str(total_size)
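To see what the two regular expressions pick out of allocations.txt without needing c++filt or addr2line, here is a minimal sketch. It is written for Python 3 (unlike the script above) and parses a few lines copied from the sample output. The patterns are modelled on the ones in the script; the function name `summarize` and the embedded sample are assumptions made for illustration.

```python
import re

# Patterns modelled on the ones in the script.
SIZE_LINE = re.compile(r"Allocation, size = (\d+) at (\S+)")
STACK_LINE = re.compile(r".*\((.*)\+.*\) \[(.*)\]")

# A few lines copied from the sample allocations.txt shown earlier.
SAMPLE = """\
Allocation, size = 40 at 0x18705b0
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(main+0x9c) [0x4052e8]
-----------
Allocation, size = 10000000 at 0x7f83fc9c3010
./overridenew(_Znwm+0xaf) [0x406879]
./overridenew(_ZN9SomeClassC1Ev+0x3d) [0x406215]
"""


def summarize(text):
    """Return (total bytes, list of (mangled name, address) frames)."""
    total = 0
    frames = []
    for line in text.splitlines():
        size = SIZE_LINE.match(line)
        stack = STACK_LINE.match(line)
        if size:
            total += int(size.group(1))
        elif stack:
            frames.append((stack.group(1), stack.group(2)))
        # Anything else (the separator) is ignored here.
    return total, frames


if __name__ == "__main__":
    total, frames = summarize(SAMPLE)
    print("total allocated size", total)
    for name, address in frames:
        print(name, address)
```

The mangled name and the address are exactly the two pieces that the script hands to c++filt and addr2line.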


As an alternative, we could do everything at run time, using the compiler’s demangling utilities, such as the one in gcc. Personally I prefer to keep the code instrumentation as simple as possible and do the “heavy lifting” off-line. My script returns:
Used 40
    dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&)
        /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29
    operator new(unsigned long)
        /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48
    __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<SomeClass, std::allocator<SomeClass>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*)
        /usr/include/c++/5/ext/new_allocator.h:105
    
    ... internal calls of the shared pointer...
    
    std::shared_ptr<SomeClass> std::allocate_shared<SomeClass, std::allocator<SomeClass>>(std::allocator<SomeClass> const&)
        /usr/include/c++/5/bits/shared_ptr.h:620
    _ZSt11make_sharedI9SomeClassIEESt10shared_ptrIT_EDpOT0_
        /usr/include/c++/5/bits/shared_ptr.h:636
    main
        /home/stefano/projects/code/spy-memory-with-new/main.cpp:25
    __libc_start_main
        ??:0
    _start
        ??:?
-----------

Used 10000000
    dumpStackTrace(std::basic_ofstream<char, std::char_traits<char> >&)
        /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:29
    operator new(unsigned long)
        /home/stefano/projects/code/spy-memory-with-new/InstrumentedNew.cpp:48
    __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*)
        /usr/include/c++/5/ext/new_allocator.h:105
    
    ...internal calls of vector...
    
    std::vector<char, std::allocator<char> >::vector(unsigned long, std::allocator<char> const&)
        /usr/include/c++/5/bits/stl_vector.h:279
    SomeClass::SomeClass()
        /home/stefano/projects/code/spy-memory-with-new/SomeClass.cpp:4 (discriminator 2)
    ...

The first allocation is the 40 bytes requested by make_shared: 24 for SomeClass (its only member is a vector, and sizeof(vector) is 24), while the remaining 16 should be the control block of the shared pointer. The second allocation is the 10 MB in the notorious constructor of SomeClass.
It takes some effort to navigate the stacks, but it is possible to work out that the mystery line was std::shared_ptr<SomeClass> stdSmartPointer = std::make_shared<SomeClass>(); close to the return at main.cpp:25.
Homework: how many allocations would there be with std::shared_ptr<SomeClass> notSoSmartPointer(new SomeClass()); ?

Three, and using 8 more bytes.

In a test I found:

24 bytes for SomeClass’s instance

10 MB to fill the vector

24 bytes for the shared pointer.
Looking at the implementation notes, I believe that the difference is in the content of the shared pointer’s control block.


In the end…
Programmers have been fighting against memory since the dawn of time, because it is slow and there is never enough of it. As with every bottleneck, one can’t trust one’s instincts. We saw that there are proper tools (memory profilers) to measure memory usage. We discovered that, in a pinch, there are “home made” tools we can build ourselves with a “stereotypical C++ hack”: replacing operator new.
You can find the “ready-to-compile” code in the ++It GitHub repo.



First steps with Boost.Python
stefano — Wed, 02 Dec 2015 18:12:00 +0000
“Finally a more modern and functional language”
Who among us wouldn’t like to program in a multi-paradigm, highly expressive, fast-evolving language with a huge standard library? We are talking, obviously, about… Python.
There are cases where our usual champion (C++11) is not the best choice. For a prototype to be developed in a hurry, a “throwaway” script, the server side of a web application, research code… the complexity of C++ is more a burden than an advantage.
How can we keep exploiting the efficiency of C++, or reuse existing code, without looking like old-fashioned cavemen?
The Python interpreter can load modules written in C and compiled as dynamic libraries. Boost.Python helps us, enormously, to prepare them. Let’s join the power of Boost and C++ with the simplicity of Python.
Warning: even though all the examples compile, run and pass their tests, this is not the definitive guide to Boost.Python. The code is illustrative and only reflects our (limited) experience with Boost.Python. Do not hesitate to report errors.
A speed problem
Let’s look at a (not too) practical case. Some numbers are equal to the sum of their proper divisors (6 = 3 + 2 + 1; perfect numbers). The marketing department smells a bargain, but it is essential to compute as many as possible before the competition. Python’s development speed is the winning weapon: after 5 minutes we release Pefect 1.0®:

	
def trova_divisori(numero):
	divisori = []
	for i in range(1, numero):
		if numero % i == 0:
			divisori.append(i)
	return divisori


def perfetto(numero):
	divisori = trova_divisori(numero)
	return numero == sum(divisori)


def trova_perfetti(quanti_ne_vuoi):
	trovati = 0
	numero_da_provare = 1
	while (trovati < quanti_ne_vuoi):
		if perfetto(numero_da_provare):
			print numero_da_provare
			trovati += 1
		numero_da_provare += 1


if __name__ == "__main__":
	trova_perfetti(4) # Look for more at your own risk.
                        # The wait will be long...




This code is not perfectly “pythonic” (https://www.python.org/dev/peps/pep-0008/), but it really was written, tested and debugged in the time we usually spend reading a compilation error¹.
Too bad the execution time is comparable: 6.5 seconds on my test machine (which is not yours, is not the production server, is not the PC of the Python-boy who runs everything in a picosecond… it’s an example!).
Like good engineers, we look for the bottleneck with the profiler:

	
import cProfile

... same code as before ...

if __name__ == "__main__":
	cProfile.run('trova_perfetti(4)')




And here is the result:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.420    7.420 :1()
     8128    0.709    0.000    7.326    0.001 purePython-profiler.py:15(perfetto)
        1    0.095    0.095    7.420    7.420 purePython-profiler.py:19(trova_perfetti)
     8128    5.190    0.001    6.523    0.001 purePython-profiler.py:8(trova_divisori)
    66318    0.819    0.000    0.819    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     8128    0.514    0.000    0.514    0.000 {range}
     8128    0.094    0.000    0.094    0.000 {sum}

trova_divisori “steals” almost all the time: about 6.5 of the 7.4 seconds measured under the profiler!
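When the program is larger than this toy, it helps to sort the report instead of reading it top to bottom. The following is a minimal sketch, assuming Python 3 and only the standard cProfile and pstats modules; `profile_report` is a helper name invented for this example, and trova_divisori is repeated so that the block runs on its own.

```python
import cProfile
import io
import pstats


def trova_divisori(numero):
    """Same naive algorithm as in the article."""
    divisori = []
    for i in range(1, numero):
        if numero % i == 0:
            divisori.append(i)
    return divisori


def profile_report(func, argument, limit=5):
    """Profile one call of func(argument); return the report as text,
    sorted by the time spent inside each function (tottime)."""
    profiler = cProfile.Profile()
    profiler.runcall(func, argument)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(limit)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_report(trova_divisori, 8128))
```

Sorting by tottime puts the function that burns the most time in its own body at the top, which is the kind of candidate worth moving to C++.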
boost::python
Nobody denies that efficient code can be written in Python (Java, VisualWhatever, this week’s functional language…), but optimizing the algorithm of trova_divisori is out of the question: we want to show Boost.Python, not give an Algebra lesson.
First of all, we get hold of Boost.Python. On a Linux machine it is as simple as running:
sudo apt-get install libboost-all-dev
You may also need to install Python’s “dev” packages. It is not hard to find instructions for every platform on the internet, but installing (and compiling) can be the hardest part. Do not get discouraged.
This is the C++ code:


	
#include "boost/python.hpp"  // (1)

boost::python::list trovaDivisori(uint64_t numero) // (2)
{
	boost::python::list divisori;
	for (uint64_t i = 1; i < numero; ++i)  // (3)
		if (numero % i == 0)
			divisori.append(i);
	return divisori;
}

BOOST_PYTHON_MODULE(divisori)
{
    using namespace boost::python;
    def("trova_divisori", trovaDivisori);  // (4)
}





(1) Include Boost.Python. It must be included before any other header to avoid compilation warnings.
(2) The function equivalent to the one we want to replace in Python. We keep the same signature as the Python original (takes an integer, returns a list) to make the replacement “transparent”.
(3) The algorithm is exactly the same too. Only the syntax changes, and not even by much. In this case all the difference is made, probably, by the C++ runtime.
(4) Declare the function in the Python module with “def” (…just like in Python).

The guide (http://www.boost.org/doc/libs/1_59_0/libs/python/doc/) explains all the details very clearly.
Compiling, unfortunately, is not exactly trivial; you will probably have to adapt it case by case. Let’s see the example one step at a time (it is a single command line, of course):
g++ divisori.cpp			    compile a C++ file, nothing unusual here
 -o divisori.so  			    file name: Python requires it to match the module name
-I /usr/include/python2.7/	            include Python's headers (Boost is already in my path)
-l python2.7 -lboost_python -lboost_system  link Python and Boost
-shared -fPIC -Wl,-export-dynamic           request a dynamic library

stackoverflow.com will do the rest. Note that, for the sake of fairness, we are not using g++’s optimization options.
Once our library is on Python’s module search path (otherwise Python cannot find it) we can import it in the Python code:


	
from divisori import trova_divisori

def perfetto(numero):
	divisori = trova_divisori(int(numero)) # Now calls the C++ one
	return numero == sum(divisori)

… same code as before …




Execution time: a little under one second. We are witnessing the classic “80% of the time is wasted in 20% of the code”. The same algorithm is 6 times faster, yet the only part where we spent time on low-level programming (after all, it is still C++98!) is a single function. For everything else we can still enjoy Python’s practicality.
Some more possibilities
Boost.Python is not limited to converting primitive types and wrapping Python lists in a C++ adapter. Here is a selection of “typical” cases for those who program in “C with classes”:


	
class RiutilizzabileInPython 
{
	public:
		RiutilizzabileInPython() {};
		RiutilizzabileInPython(int x, const std::string& y) {};
		int variabileIstanza;
		static void metodoStatico() {};
		void metodo() {}
};

BOOST_PYTHON_MODULE(oop)
{
    using namespace boost::python;
    class_<RiutilizzabileInPython>("implementata_in_CPP")	//(1)
	.def(init<int, std::string>())				//(2)
	.def_readwrite("variabile_istanza", &RiutilizzabileInPython::variabileIstanza)//(3)
	.def("metodo_statico", &RiutilizzabileInPython::metodoStatico).staticmethod("metodo_statico") //(4)
	.def("metodo", &RiutilizzabileInPython::metodo)		// (5)
    ;
}





(1) Open the class declaration, passing the string with its Python name.
(2) Translation of the constructor into Python (…init, does it ring a bell?).
(3) The Python “tradition” does not disdain public instance variables. Here is one.
(4) Just repeat the Python name to expose a static method.
(5) The classic, simple instance method.

Once compiled (…easier said than done…) we can use the C++ class in Python:

	
from oop import implementata_in_CPP

x = implementata_in_CPP()
y = implementata_in_CPP(3, "ciao")
x.variabile_istanza = 23
implementata_in_CPP.metodo_statico()
x.metodo()




Boost takes care of converting parameters, return types and so on. There are options for “exporting” STL classes directly (and if they are missing they can be defined) and for the policies of returned types (by reference, by copy…). The possibilities are many; rely on the official guide.
When the going gets tough, Boost keeps going. A taste:

	
class Problems
{
	public:
		void stampa() {
			std::cout << "cout continua a funzionare" << std::endl;
		}

		void eccezione() {
			throw std::runtime_error("Oh, no!!!");
		}

		void coreDump() {
			int * nullPointer = 0;
			*nullPointer = 24;
		}
};

BOOST_PYTHON_MODULE(oop)
{
    using namespace boost::python;

     class_<Problems>("Problems")
	.def("stampa", &Problems::stampa)
	.def("eccezione", &Problems::eccezione)
	.def("coreDump", &Problems::coreDump)
    ;
}





The Python “test driver”, with a sample of the output:


	
from oop import Problems
p = Problems()
p.stampa()
try:
	p.eccezione()
except RuntimeError as e:
	print "Il codice C++ non ha funzionato: " + str(e);
p.coreDump()





cout continua a funzionare				(1)
Il codice C++ non ha funzionato: Oh, no!!!	        (2)
Segmentation fault (core dumped)			(3)


(1) Debugging with std::cout is not good practice… but it works!
(2) Exceptions are “forwarded” perfectly to the Python runtime.
(3) …you thought you were safe, eh?

Multithreading
Boost.Python is not the only weapon for tackling problems that demand efficiency. Multithreaded code is a common way to increase performance, for finding divisors as much as for mining Bitcoin or cracking passwords. Here is a C++ class that is about to jump into a Python thread.

	
class JobTrovaDivisori {

	public:
		JobTrovaDivisori(uint64_t numero, uint64_t begin, uint64_t end) :
			numero(numero), begin(begin), end(end) {}
		
		boost::python::list trovaDivisori()
		{
			std::cout << "Start" << std::endl;

			boost::python::list divisori;
			for (uint64_t i = begin; i < end; ++i)
				 if (numero % i == 0)
					divisori.append(i);

			std::cout << "end" << std::endl;
			return divisori;
		}

	private:
		uint64_t numero;
		uint64_t begin;
 		uint64_t end;
};

BOOST_PYTHON_MODULE(fattorizzare)
{
    using namespace boost::python;
    class_<JobTrovaDivisori>("JobTrovaDivisori", init<uint64_t, uint64_t, uint64_t>())
	.def("trova_divisori", &JobTrovaDivisori::trovaDivisori)
    ;
}




The “JobTrovaDivisori” object checks whether the numbers between “begin” and “end” are divisors of “numero”. We parallelize the problem of finding all the divisors into several “jobs”, using each object on a different interval. There is no shared data, so we have no concurrency problems. This is the only good thing about this solution, but once again we leave aside the math (and the software engineering).
The call in Python:


	
from threading import Thread
from fattorizzare import JobTrovaDivisori

class Job():							# (1)
	def __init__(self, numero, begin, end):
		self.cppJob = JobTrovaDivisori(numero, begin, end)
		self.divisori = []
	
	def __call__(self):
		self.divisori = self.cppJob.trova_divisori()

		
def trova_divisori_parallelo(numero):			# (2)
	limite = numero / 2

	job1 = Job(numero, 1, limite)
	job2 = Job(numero, limite, numero)

	t1 = Thread(None, job1)
	t2 = Thread(None, job2)
	
	t1.start()
	t2.start()
	t1.join()
	t2.join()

	return [job1.divisori, job2.divisori]


if __name__ == "__main__":
	print trova_divisori_parallelo(223339244);	#(3)





(1) We wrap the C++ job to avoid “complicating our lives” by trying to export a C++ callable.
(2) This function creates 2 jobs, performs the “fork and join” (or, as they say today, “map and reduce”), then returns the result.
(3) We factor an arbitrary number.

Here is the output: remember the “Start” and “end” printouts in the C++ class? After about eight and a half seconds the computation ends, without any parallelism:
Start
end
Start
end
[[1L, 2L, 4L, 53L, 106L, 212L, 1053487L, 2106974L, 4213948L, 55834811L], [111669622L]]

It is no accident. Python objects are protected by the Global Interpreter Lock (GIL). It is up to the programmer of each thread to release it and give the “green light” to the other threads. The caveat is not to call purely Python code while you do not hold the lock.
As usual in C++ we manage resources with RAII. The idiom for the GIL is (https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_my_function):


	
class ScopedGILRelease
{
public:
    inline ScopedGILRelease(){
        m_thread_state = PyEval_SaveThread();
    }
    inline ~ScopedGILRelease() {
        PyEval_RestoreThread(m_thread_state);
        m_thread_state = NULL;
    }
private:
    PyThreadState * m_thread_state;
};




We release the lock in the C++ class:


	
boost::python::list trovaDivisori() {
	ScopedGILRelease noGil = ScopedGILRelease(); // (1)
	std::cout << "Start" << std::endl;
		
	boost::python::list divisori;
	for (uint64_t i = begin; i < end; ++i)
		 if (numero % i == 0)  
			divisori.append(i); // (2) Possible core dump!
	std::cout << "end" << std::endl;
	return divisori;
}





(1) When this variable goes out of scope, the lock is re-acquired, as if it were a “reverse” smart pointer.
(2) This is where we will get the core dump. But only in production.
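The “reverse smart pointer” idea is not specific to the GIL. As an analogy only, here is a Python 3 sketch of the same pattern written as a context manager, with an ordinary threading.Lock standing in for the GIL (Python code cannot release the real GIL this way). `ScopedRelease` and `demo` are names made up for this illustration.

```python
import threading


class ScopedRelease(object):
    """Releases a lock it is handed on entry, re-acquires it on exit."""

    def __init__(self, lock):
        self._lock = lock

    def __enter__(self):
        self._lock.release()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even when the block raises, like a C++ destructor.
        self._lock.acquire()
        return False  # never swallow exceptions


def demo():
    """Record whether the lock is held before, inside and after the scope."""
    lock = threading.Lock()
    states = []
    lock.acquire()
    states.append(lock.locked())      # held by us
    with ScopedRelease(lock):
        states.append(lock.locked())  # released inside the scope
    states.append(lock.locked())      # held again
    lock.release()
    return states


if __name__ == "__main__":
    print(demo())
```

As in the C++ class, whatever runs inside the scope must not touch the data that the lock protects.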

Remember the clause “the caveat is not to call purely Python code while you do not hold the lock”? Line (2) could do exactly that. Try making the list grow out of all proportion (for example, remove the “if (numero…” and store every number in the list). I believe that, probably (rely on the official guides for the real answer!), the Python interpreter has to allocate a bigger list, but without the lock something gets corrupted.
We enclose the parallelizable section in a separate scope, storing the numbers in a variable not shared with Python:


	
boost::python::list trovaDivisori() {
	std::cout << "Start" << std::endl;
	std::vector<uint64_t> divisoriTemp;
	{
	ScopedGILRelease noGil = ScopedGILRelease();
		for (uint64_t i = begin; i < end; ++i)
			 if (numero % i == 0) 
				divisoriTemp.push_back(i);
		std::cout << "end" << std::endl;
	} // noGil goes out of scope. We take the lock back.
	boost::python::list divisori;
	BOOST_FOREACH(uint64_t n, divisoriTemp) {
		divisori.append(n);
	}
	return divisori;
}





After six and a half seconds (2 fewer than the “accidentally sequential” version) we get the expected interleaving (Start Start – end end). We can spend those 2 seconds thinking about a less makeshift solution.
This concludes the introduction to Boost.Python. We now know a way to “fit” C++ modules into Python applications, both to reuse them and for efficiency. Boost.Python connects the two worlds without sacrificing the simplicity of Python and without limiting what is possible in C++, even if some care is needed. Above all, from now on we will have the last word in the classic “Python vs C++” flame on every forum in the world!

¹ It is true that writing a program in Python takes less time than fixing a single C++ bug.
Try it. Ready, set, go:
/usr/include/c++/4.8/bits/stl_map.h:646:7: note: no known conversion for argument 1 from 
‘int’ to ‘std::map, std::basic_string<
;char> > >::iterator {aka std::_Rb_tree_iterator, std::basic_string > > >}’
/usr/include/c++/4.8/bits/stl_map.h:670:9: note: template void std::map<_Key, _Tp, _Compare, _Alloc>::insert(_InputIterator, 
_InputIterator) [with _InputIterator = _InputIterator; _Key = int; _Tp = 
std::map, std::basic_string >; _Compare 
= std::less; _Alloc = std::allocator, std::basic_string > > >





First steps with Boost.Python
stefano — Wed, 02 Dec 2015 18:11:41 +0000
“Finally a modern, pragmatic language.”
Who among us wouldn’t want to work with a multi-paradigm, highly expressive, fast-evolving language with a huge standard library? We are talking, obviously, about… Python.
There are scenarios where our trusty champion (C++11) doesn’t cut it. For a prototype to rush out in a hurry, a “single use” script, the server side of a web application, research code… the complexity of C++ is more a problem than an asset.
How can we continue to take advantage of C++ efficiency or re-use some already available code without looking like old-fashioned cavemen?
The Python interpreter can load modules written in C, compiled as dynamic libraries. Boost.Python helps, a lot, to prepare them. It joins the power of Boost and C++ with the ease of use of Python.
Danger: even if all the examples compile, run and pass the tests this is not the ultimate guide about Boost.Python. The code is meant to be an example, it mirrors our (minimal) experience with Boost.Python. Do not hesitate to report any error we made.
A speed problem
Let’s see a (not too) practical use case. There are numbers which are equal to the sum of their divisors (6 = 3 + 2 + 1; perfect numbers). The marketing department believes it is something hot, but we must compute as many perfect numbers as possible and release them before our competitors. The development speed enabled by Python is key, after 5 minutes we release Pefect 1.0®:

	
def find_divisors(number):
	divisors = []
	for i in range(1, number):
		if number % i == 0:
			divisors.append(i)
	return divisors


def perfect(number):
	divisors = find_divisors(number)
	return number == sum(divisors)


def find_perfect_numbers(how_many):
	found = 0
	number_to_try = 1
	while (found < how_many):
		if perfect(number_to_try):
			print number_to_try
			found += 1
		number_to_try += 1


if __name__ == "__main__":
	find_perfect_numbers(4)  # Look for more at your own risk.
							 # And prepare for a long wait.




This code is not really “pythonic” (https://www.python.org/dev/peps/pep-0008/), but it really was created, tested and debugged in less time than it takes to read a C++ compilation error¹.
Unfortunately the execution time is similar: 6.5 seconds on my test machine (which is not your test machine, nor the production server, nor the Python fanboy’s PC which can run everything in a picosecond… it’s an example!).
Let’s look for the bottleneck with the profiler, like the savvy engineers we are.

	
import cProfile

... same code as before ...

if __name__ == "__main__":
	cProfile.run("find_perfect_numbers(4)")




Here is the outcome:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.657    5.657 :1()
     8128    0.283    0.000    5.582    0.001 purePython.py:16(perfect)
        1    0.075    0.075    5.657    5.657 purePython.py:21(find_perfect_numbers)
     8128    4.294    0.001    5.229    0.001 purePython.py:8(find_divisors)
    66318    0.528    0.000    0.528    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     8128    0.406    0.000    0.406    0.000 {range}
     8128    0.070    0.000    0.070    0.000 {sum}

find_divisors “steals” almost all of the 5.6 seconds it took to run this test!
boost::python
No-one denies that it is possible to write efficient code in Python (Java, VisualWhatever, this week’s functional language…), but optimizing the algorithm of find_divisors is out of the question: we are here to show off Boost.Python, not to give an Algebra lesson.
First of all, we get our hands on Boost.Python. On a Linux box this is as easy as typing:
sudo apt-get install libboost-all-dev
You may need to install Python’s “dev” packages. It is easy to find instructions for any platform over the web, but installing (and compiling) the library may be the most difficult step. Do not lose heart.
This is the C++ code:


	
#include "boost/python.hpp"  // (1)

boost::python::list findDivisors(uint64_t number) // (2)
{
	boost::python::list divisors;
	for (uint64_t i = 1; i < number; ++i)  // (3)
		if (number % i == 0)
			divisors.append(i);
	return divisors;
}

BOOST_PYTHON_MODULE(divisors)
{
    using namespace boost::python;
    def("find_divisors", findDivisors);  // (4)
}





(1) Include Boost.Python. It must be included before any other header to avoid compilation warnings.
(2) The function corresponding to the one we want to replace in Python. It keeps the same signature (takes an integer, returns a list) as the Python original to achieve a “transparent” replacement.
(3) Even the logic is exactly the same, with just a few syntax differences. The C++ runtime should make the difference in this case.
(4) Declare the function with “def” (…hey, it’s just like Python).

The guide (http://www.boost.org/doc/libs/1_59_0/libs/python/doc/) has a clear explanation with all the details.
Compiling, sadly, is not so easy; you will probably have to adapt the command to your case. Let’s check a step-by-step example (naturally, this is a single line on the console):
g++ divisors.cpp			    compile a C++ file, as usual
 -o divisors.so  			    file name: Python demands it is the same as the module name
-I /usr/include/python2.7/	            to include Python's headers (I already set boost in the path)
-l python2.7 -lboost_python -lboost_system  include python, boost
-shared -fPIC -Wl,-export-dynamic           request to create a dynamic library

stackoverflow.com will cover the rest. Notice that, “to level the playing field”, I do not use optimization options in g++.
Once our library is on Python’s module search path (some place where Python can find it) we can import it in Python:


	
from divisors import find_divisors

def perfect(number):
	divisors = find_divisors(int(number))  # Calls the C++ implementation
	return number == sum(divisors)

… same code as before …





Run time: a bit less than a second. We are witnessing the classic “80% of time is wasted by 20% of the code”. The same algorithm is 6 times faster, but the part where we had to deal with low level programming (yes, still C++98!) is just one function. Everywhere else we can still take advantage of Python’s practicality.
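Because the C++ function keeps the signature of the Python original, the calling code does not have to care which one it gets. A minimal sketch of that idea, assuming Python 3 and a module named divisors as above: if the extension has not been built (or is not on the search path), a pure-Python fallback with the same name takes its place. The fallback arrangement is an illustration, not something the example code above does.

```python
try:
    # The extension module built from the C++ code, if it is importable.
    from divisors import find_divisors
except ImportError:
    # Pure-Python fallback with the same signature.
    def find_divisors(number):
        return [i for i in range(1, number) if number % i == 0]


def perfect(number):
    """True when number equals the sum of its proper divisors."""
    return number == sum(find_divisors(int(number)))


if __name__ == "__main__":
    print([n for n in range(1, 500) if perfect(n)])
```

The rest of the program calls perfect() and never needs to know which implementation answered.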
Some more opportunities
Boost.Python is not limited to primitive types conversion or adapters to pass Python lists in C++. Here is a selection of “common” cases often met when doing “C with classes”:


	
class ReuseInPython 
{
	public:
		ReuseInPython() {};
		ReuseInPython(int x, const std::string& y) {};
		int instanceVariable;
		static void staticMethod() {};
		void method() {}
};

BOOST_PYTHON_MODULE(oop)
{
    using namespace boost::python;
    class_<ReuseInPython>("implemented_in_CPP")		// (1)
	.def(init<int, std::string>())  // (2)
	.def_readwrite("instance_variable", &ReuseInPython::instanceVariable)  // (3)
	.def("static_method", &ReuseInPython::staticMethod).staticmethod("static_method")  // (4)        
	.def("method", &ReuseInPython::method)  // (5)
    ;
}





(1) Open a class declaration, passing a string with its name in Python.
(2) Translate the constructor into Python (…init, does that ring a bell?).
(3) The Python “tradition” won’t balk at public instance variables. Here is one.
(4) Only repeat the Python name to expose a static method.
(5) The run-of-the-mill, basic instance method.

Once it is compiled (…sounds easy, but…) we can use the C++ class in Python:

	
from oop import implemented_in_CPP

x = implemented_in_CPP()
y = implemented_in_CPP(3, "hello")
x.instance_variable = 23
implemented_in_CPP.static_method()
x.method()




Boost takes care of converting parameters, return types etcetera. There are options to “export” directly STL classes (and more can be defined if something is missing) and for the return type policy (by reference, by copy…). There are really many options, trust the official guide.
When the going gets tough, Boost keeps going. A sample:

	
class Problems
{
	public:
		void print() {
			std::cout << "cout still works" << std::endl;
		}

		void exception() {
			throw std::runtime_error("Oh, no!!!");
		}

		void coreDump()	{
			int * nullPointer = 0;
			*nullPointer = 24;
		}
};

BOOST_PYTHON_MODULE(oop)
{
    using namespace boost::python;

    class_<Problems>("Problems")
	.def("print_something", &Problems::print)  // print is a Python keyword.
	.def("exception", &Problems::exception)
	.def("coreDump", &Problems::coreDump)
    ;
}





The Python “test-driver”, with an example of the output:


	
from oop import Problems
p = Problems()
p.print_something()
try:
	p.exception()
except RuntimeError as e:
	print "The C++ code bombed: " + str(e);
p.coreDump()





cout still works	(1)
The C++ code bombed: Oh, no!!!	(2)
Segmentation fault (core dumped)	(3)


(1) Debugging with std::cout is not a recommended practice… but it works!
(2) Exceptions are perfectly “forwarded” to the Python runtime.
(3) …well, what did you expect?

Multithreading
Boost.Python is not the only weapon to tackle problems that demand efficiency. Multithreading is a common way to improve performance, as useful for computing divisors as for mining bitcoins or cracking passwords. Here is a C++ class which is about to jump into a Python thread:

	
class JobFindDivisors {

	public:
		JobFindDivisors(uint64_t number, uint64_t begin, uint64_t end) :
			number(number), begin(begin), end(end) {}
		
		boost::python::list findDivisors()
		{
			std::cout << "Start" << std::endl;

			boost::python::list divisors;
			for (uint64_t i = begin; i < end; ++i)
				 if (number % i == 0)
					divisors.append(i);

			std::cout << "end" << std::endl;
			return divisors;
		}

	private:
		uint64_t number;
		uint64_t begin;
 		uint64_t end;
};

BOOST_PYTHON_MODULE(factor)
{
    using namespace boost::python;
    class_<JobFindDivisors>("JobFindDivisors", init<uint64_t, uint64_t, uint64_t>())
	.def("find_divisors", &JobFindDivisors::findDivisors)
    ;
}




The “JobFindDivisors” object checks which numbers between “begin” (included) and “end” (excluded) are divisors of “number”. We parallelize the problem of finding all the divisors by splitting it into several “jobs”, dedicating each object to a different interval. No data is shared between jobs, so there are no concurrency problems. This is the only advantage of such a solution, but once again let’s forget about math (and proper software engineering).
The Python call:


	
from threading import Thread
from factor import JobFindDivisors

class Job():									# (1)
	def __init__(self, number, begin, end):
		self.cppJob = JobFindDivisors(number, begin, end)
		self.divisors = []
	
	def __call__(self):
		self.divisors = self.cppJob.find_divisors()

		
def find_divisors_in_parallel(number):			# (2)
	limit = number // 2  # integer division: the C++ constructor expects integers

	job1 = Job(number, 1, limit)
	job2 = Job(number, limit, number)

	t1 = Thread(None, job1)
	t2 = Thread(None, job2)
	
	t1.start()
	t2.start()
	t1.join()
	t2.join()

	return [job1.divisors, job2.divisors]


if __name__ == "__main__":
	print(find_divisors_in_parallel(223339244))  # (3)





Encapsulate the C++ Job to “keep it simple”, without exporting a C++ callable.
This function creates 2 jobs, does “fork and join” (or, as they say nowadays, “map and reduce”), then returns the results.
Factoring any number would do.
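The two-job split generalizes to any number of intervals. Here is a minimal sketch in plain Python, so that it runs without the compiled factor module. `split_range` and `find_divisors_in_range` are hypothetical helpers, not part of the original code, and the second one is only a stand-in for `JobFindDivisors.find_divisors`:

```python
def split_range(begin, end, jobs):
    """Split [begin, end) into `jobs` contiguous half-open intervals."""
    size = end - begin
    bounds = [begin + size * k // jobs for k in range(jobs + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def find_divisors_in_range(number, begin, end):
    """Pure-Python stand-in (assumption) for the C++ find_divisors()."""
    return [i for i in range(begin, end) if number % i == 0]


# Three intervals covering 1..12, one list of divisors per interval.
intervals = split_range(1, 13, 3)
parts = [find_divisors_in_range(12, b, e) for b, e in intervals]
```

Each `(begin, end)` pair could be handed to one `Job` instead of the two hard-coded halves.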

The output: do you remember the “Start” and “end” printouts in the C++ class? After around 8 seconds the computation terminates, with no parallelism whatsoever:
Start
end
Start
end
[[1L, 2L, 4L, 53L, 106L, 212L, 1053487L, 2106974L, 4213948L, 55834811L], [111669622L]]

Working as designed. Python’s objects are protected by the Global Interpreter Lock (GIL), and a thread that enters our C++ method keeps holding it until the method returns. It is up to the programmer to release it in each thread to “give way” to the other threads. The trick is to call pure Python code only when holding the lock.
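To see what “giving way” buys, here is a small sketch in plain Python, with no C++ involved. It relies on the fact that CPython releases the GIL during blocking calls such as `time.sleep`, which stands in here for C++ work done without the lock. The names `job` and `events` are made up for the illustration:

```python
import threading
import time

events = []                      # ("Start" | "end", job name), in order
events_lock = threading.Lock()   # protects the shared list


def job(name):
    with events_lock:
        events.append(("Start", name))
    time.sleep(0.3)              # blocking call: the GIL is released here
    with events_lock:
        events.append(("end", name))


threads = [threading.Thread(target=job, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both jobs normally start before either one ends.
labels = [label for label, _ in events]
```

The interleaving recorded in `labels` is the one we want from the C++ jobs as well.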
As usual in C++, we control resources with RAII. The idiom for the GIL, taken from the Python wiki (https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_my_function), is:


	
class ScopedGILRelease
{
public:
    inline ScopedGILRelease(){
        m_thread_state = PyEval_SaveThread();
    }
    inline ~ScopedGILRelease() {
        PyEval_RestoreThread(m_thread_state);
        m_thread_state = NULL;
    }
private:
    PyThreadState * m_thread_state;
};
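For readers more at home in Python, the same “release now, retake on scope exit” shape can be sketched as a context manager. This is only an analogy: an ordinary `threading.Lock` plays the role of the GIL, and `scoped_release` is a hypothetical name, not a Boost.Python or CPython API.

```python
import threading
from contextlib import contextmanager


@contextmanager
def scoped_release(lock):
    """Release `lock` for the duration of the with-block, then retake it."""
    lock.release()
    try:
        yield
    finally:
        lock.acquire()   # runs even if the block raises, like a destructor


big_lock = threading.Lock()   # stand-in for the GIL (assumption)
observed = []

big_lock.acquire()
observed.append(big_lock.locked())       # held on entry
with scoped_release(big_lock):
    observed.append(big_lock.locked())   # released inside the block
observed.append(big_lock.locked())       # held again after the block
big_lock.release()
```

The `finally` clause gives the same guarantee as the C++ destructor: the lock is retaken however the block is left.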




Release the lock in the C++ class:


	
boost::python::list findDivisors() {
	ScopedGILRelease noGil = ScopedGILRelease();  // (1)
	std::cout << "Start" << std::endl;

	boost::python::list divisors;
	for (uint64_t i = begin; i < end; ++i)
		 if (number % i == 0)
			divisors.append(i);  // (2) Possible core dump!

	std::cout << "end" << std::endl;
	return divisors;
}





When this variable goes out of scope, the lock is taken again. Like a “reversed” smart pointer.
Here is where we will certainly take a core dump. But only in production.

Do you remember that “the trick is to call pure Python code only when holding the lock”? Line (2) may do just that, without the lock. You can try to grow the list massively (say, delete the “if (number…” test and save all the numbers in the list). My guess (please read the official documentation for the real answer!) is that the Python interpreter must allocate a bigger list, but without the lock all it gets is corrupted memory.
Let’s encapsulate the parallelizable section in a dedicated scope, saving the numbers in a variable which we do not share with Python:


	
boost::python::list findDivisors()
{
	std::cout << "Start" << std::endl;
	std::vector<uint64_t> divisorsTemp;
	boost::python::list divisors;
	{
		ScopedGILRelease noGil = ScopedGILRelease();
		for (uint64_t i = begin; i < end; ++i)
			if (number % i == 0)
				divisorsTemp.push_back(i);
	} // noGil goes out of scope, we take the lock again.
	BOOST_FOREACH(uint64_t n, divisorsTemp) {
		divisors.append(n);
	}
	std::cout << "end" << std::endl;
	return divisors;
}





After six and a half seconds (about 2 less than the “accidentally sequential” version) we get the expected interleaving (Start Start – end end). We can invest those 2 seconds in thinking about a less duct-tape-and-chewing-gum-oriented solution.
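The shape of the fix is worth remembering on its own: work on private data while the lock is released, then publish the result in one short locked section. A sketch of that pattern in plain Python, assuming an ordinary `threading.Lock` as a stand-in for the GIL; `worker`, `results` and `results_lock` are illustrative names:

```python
import threading

results = []                     # shared between the threads
results_lock = threading.Lock()  # stand-in for the GIL in this analogy


def worker(number, begin, end):
    local = []                   # private buffer: no lock needed to fill it
    for i in range(begin, end):
        if number % i == 0:
            local.append(i)
    with results_lock:           # short critical section: publish once
        results.extend(local)


threads = [threading.Thread(target=worker, args=(360, b, e))
           for b, e in [(1, 180), (180, 361)]]
for t in threads:
    t.start()
for t in threads:
    t.join()

divisors = sorted(results)       # the publishing order depends on scheduling
```

The expensive loop never touches shared state, so the locked section stays as short as the final copy.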
This completes the introduction to Boost.Python. Now we know how to “push” C++ modules into Python applications, either for reuse or for efficiency. Boost.Python connects the two worlds without sacrificing Python’s simplicity and without adding constraints to C++, even if some spots do need care. Above all, from now on we will always have the last word in the unavoidable “Python vs C++” flame war in every forum of the world.

1 It is true: it takes less time to create a whole program in Python than to fix a single bug in C++.
Try it. Ready, steady, go:
/usr/include/c++/4.8/bits/stl_map.h:646:7: note: no known conversion for argument 1 from 
‘int’ to ‘std::map, std::basic_string<
;char> > >::iterator {aka std::_Rb_tree_iterator, std::basic_string > > >}’

/usr/include/c++/4.8/bits/stl_map.h:670:9: note: template void std::map<_Key, _Tp, _Compare, _Alloc>::insert(_InputIterator, _InputIterator) [with _InputIterator = _InputIterator; _Key = int; _Tp = std::map, std::basic_string >; _Compare = std::less; _Alloc = std::allocator, std::basic_string > > >