My problem
I am using PAPI high-level API to measure a C++ tree traversal code. With an abstract example:
retval = PAPI_hl_region_begin("traversal"); if ( retval != PAPI_OK ) printf("Error: %d\n", retval); traverseTree(root); retval = PAPI_hl_region_end("traversal"); if ( retval != PAPI_OK ) printf("Error: %d", retval); Then I proceed to measure the time spent on leaf-processing only.
void traverseTree (Node &node){ if (node.leftChild == NULL && node.rightChild == NULL){ retval = PAPI_hl_region_begin("leaf_proc"); if ( retval != PAPI_OK ) printf("Error: %d\n", retval); /* do leaf processing stuff here */ retval = PAPI_hl_region_end("leaf_proc"); if ( retval != PAPI_OK ) printf("Error: %d", retval); return; } // the traverse code for non-leaf nodes goes on here... } I measure the two different cases separately (commenting one of them at each time), so that the "leaf_proc" mearurement would not interfer in the "traversal" one.
To my surprise, the PAPI_TOT_INS was reported 1.5x higher in the "leaf_proc" than in the whole "traversal", but obviously the leaf-processing is a sub-part of the traversal, so this can't be right.
A reproducible test
My guess was that maybe PAPI was introducing a lot of overhead while measuring, capturing the transcurred instruction of it's own call, which inflated the number of instructions in the leaf processing measurements.
I tried to replicate this scenario of one big region vs many small regions with the following C code. To mimic the behavior, I measure around a for loop vs inside the for loop:
#include <stdio.h> #include "papi.h" int main(){ unsigned long acc = 0; int retval; /// Measuring around the for loop retval = PAPI_hl_region_begin("computation1"); if ( retval != PAPI_OK ) printf("Error: %d\n", retval); /* Do some computation here */ for (int i = 0; i < 1000000; ++i) { acc++; acc += i*2; acc +=acc; acc++; } retval = PAPI_hl_region_end("computation1"); if ( retval != PAPI_OK ) printf("Error: %d", retval); printf("acc = %lu\n", acc); acc = 0; /// Measuring inside the for loop /* Do some computation here */ for (int i = 0; i < 1000000; ++i) { retval = PAPI_hl_region_begin("computation2"); if ( retval != PAPI_OK ) printf("Error: %d", retval); acc++; acc += i*2; acc +=acc; acc++; retval = PAPI_hl_region_end("computation2"); if ( retval != PAPI_OK ) printf("Error: %d", retval); } printf("acc = %lu\n", acc); } After running papi_hl_output_writer.py --source papi_hl_output --type summary --notation raw I got:
{ "computation1": { "region_count": 1, "cycles": 43731546, "real_time_nsec": 20690957, "perf::TASK-CLOCK": 20704055, "PAPI_TOT_INS": 10003720, "PAPI_TOT_CYC": 20346060 }, "computation2": { "region_count": 1000000, "cycles": { "total": 2480211274, "min": 1572, "median": 2156.0, "max": 165442 }, "real_time_nsec": { "total": 1036821474, "min": 706, "median": 975.0, "max": 64073 }, "perf::TASK-CLOCK": { "total": 1173014388, "min": 747, "median": 1017.0, "max": 76794 }, "PAPI_TOT_INS": { "total": 1366279106, "min": 1366, "median": 1366.0, "max": 1569 }, "PAPI_TOT_CYC": { "total": 1060149619, "min": 826, "median": 899.0, "max": 60641 } } } The instructions in both regions should be at least in the same order of magnitude (I now I'm ignoring a few instructions that control the loop). However, PAPI_TOT_INS is reported 100x for "computation2" over "computation1". I'm guessing that PAPI library instructions are increasing the hw counter as well (taking 100x more than the silly instructions inside my for loop), before they are actually read and this is disturbing my code. I also think that a similar thing is happening with my "leaf_proc" measurement.
Is my understanding right or am I just using PAPI the wrong way? Any alternatives to measure small portions of code (such as my "leaf_proc") without inflating the instruction measurements with PAPI code itself?
gcc -pg/gcc -fprofile-arcscould produce data which could be analyzed (taking into account the number of instructions in each basic block). Still, interesting question exactly why this attempt gives this result; kernel insns seems plausible.