
My problem

I am using the PAPI high-level API to measure a C++ tree-traversal code. An abstract example:

```cpp
retval = PAPI_hl_region_begin("traversal");
if (retval != PAPI_OK) printf("Error: %d\n", retval);

traverseTree(root);

retval = PAPI_hl_region_end("traversal");
if (retval != PAPI_OK) printf("Error: %d\n", retval);
```

Then I proceed to measure the time spent on leaf processing only:

```cpp
void traverseTree(Node &node) {
    if (node.leftChild == NULL && node.rightChild == NULL) {
        int retval = PAPI_hl_region_begin("leaf_proc");
        if (retval != PAPI_OK) printf("Error: %d\n", retval);

        /* do leaf processing stuff here */

        retval = PAPI_hl_region_end("leaf_proc");
        if (retval != PAPI_OK) printf("Error: %d\n", retval);
        return;
    }
    // the traverse code for non-leaf nodes goes on here...
}
```

I measure the two cases separately (commenting out one of them each time), so that the "leaf_proc" measurement does not interfere with the "traversal" one.

To my surprise, PAPI_TOT_INS was reported 1.5x higher for "leaf_proc" than for the whole "traversal". Since leaf processing is a sub-part of the traversal, this can't be right.

A reproducible test

My guess was that PAPI was introducing a lot of overhead while measuring, capturing the instructions of its own calls, which inflated the instruction counts of the leaf-processing measurements.

I tried to replicate this scenario of one big region vs many small regions with the following C code. To mimic the behavior, I measure around a for loop vs inside the for loop:

```c
#include <stdio.h>
#include "papi.h"

int main() {
    unsigned long acc = 0;
    int retval;

    /// Measuring around the for loop
    retval = PAPI_hl_region_begin("computation1");
    if (retval != PAPI_OK) printf("Error: %d\n", retval);

    /* Do some computation here */
    for (int i = 0; i < 1000000; ++i) {
        acc++;
        acc += i * 2;
        acc += acc;
        acc++;
    }

    retval = PAPI_hl_region_end("computation1");
    if (retval != PAPI_OK) printf("Error: %d\n", retval);
    printf("acc = %lu\n", acc);

    acc = 0;

    /// Measuring inside the for loop
    /* Do some computation here */
    for (int i = 0; i < 1000000; ++i) {
        retval = PAPI_hl_region_begin("computation2");
        if (retval != PAPI_OK) printf("Error: %d\n", retval);

        acc++;
        acc += i * 2;
        acc += acc;
        acc++;

        retval = PAPI_hl_region_end("computation2");
        if (retval != PAPI_OK) printf("Error: %d\n", retval);
    }
    printf("acc = %lu\n", acc);
    return 0;
}
```

After running `papi_hl_output_writer.py --source papi_hl_output --type summary --notation raw` I got:

```json
{
  "computation1": {
    "region_count": 1,
    "cycles": 43731546,
    "real_time_nsec": 20690957,
    "perf::TASK-CLOCK": 20704055,
    "PAPI_TOT_INS": 10003720,
    "PAPI_TOT_CYC": 20346060
  },
  "computation2": {
    "region_count": 1000000,
    "cycles":           { "total": 2480211274, "min": 1572, "median": 2156.0, "max": 165442 },
    "real_time_nsec":   { "total": 1036821474, "min": 706,  "median": 975.0,  "max": 64073 },
    "perf::TASK-CLOCK": { "total": 1173014388, "min": 747,  "median": 1017.0, "max": 76794 },
    "PAPI_TOT_INS":     { "total": 1366279106, "min": 1366, "median": 1366.0, "max": 1569 },
    "PAPI_TOT_CYC":     { "total": 1060149619, "min": 826,  "median": 899.0,  "max": 60641 }
  }
}
```

The instruction counts in both regions should be at least of the same order of magnitude (I know I'm ignoring the few instructions that control the loop). However, PAPI_TOT_INS is reported over 100x higher for "computation2" than for "computation1". My guess is that the PAPI library's own instructions are incrementing the hardware counter before it is actually read (amounting to 100x more than the trivial instructions inside my for loop), and this is disturbing my measurement. I also think that a similar thing is happening with my "leaf_proc" measurement.

Is my understanding right or am I just using PAPI the wrong way? Any alternatives to measure small portions of code (such as my "leaf_proc") without inflating the instruction measurements with PAPI code itself?

  • For dynamic instruction counts, you might want something like code instrumentation tools (to add counters for each basic block), rather than trying to use HW performance counters. E.g. Intel PIN can probably do this for x86; or maybe stuff like gcc -pg / gcc -fprofile-arcs could produce data which could be analyzed (taking into account the number of instructions in each basic block). Still, it's an interesting question exactly why this attempt gives this result; kernel insns seems plausible. Commented Sep 5, 2023 at 23:45
  • Regardless, if you care about time (cycles) and performance, I'm always highly skeptical of the usefulness of any attempt to measure anything smaller than the out-of-order exec window of your CPU (a couple hundred uops), since real executions will be overlapping work with surrounding code. Microbenchmark chunks of code for throughput or latency with a repeat loop. Commented Sep 5, 2023 at 23:47
  • Thanks for the feedback. I was indeed looking at intel SDE (which is based on PIN I think). It just sounded OK to use PAPI and specifically mark the region of interest and avoid some post-processing to find this region with PIN (it's a somewhat big codebase I'm evaluating). I found this stackoverflow.com/a/64863392/5319497, which is basically low-level accessing the counters. Overheads seem to be steady across executions, so I can probably subtract them and get the analysis I need. If it works I'll try to answer my own question. Otherwise, I'll go with your suggestion! Commented Sep 6, 2023 at 10:43
