Is there an easy way to quickly count the number of x86 instructions executed (which instructions, and how many of each) while running a C program?
I use gcc version 4.7.1 (GCC) on a x86_64 GNU/Linux machine.
Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS
This Linux system call is a cross-architecture wrapper for performance events, covering both hardware performance counters from the CPU and software events from the kernel.
Here's an example adapted from the man perf_event_open page:
perf_event_open.c
```c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\n", count);
    close(fd);
}
```

Compile and run:
```
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out
```

Output:
```
Used 20016 instructions
```

So we see that the result is pretty close to the expected value of 20000: 10k iterations times two instructions per loop in the `__asm__` block (`sub` and `jne`).
If I vary the argument, even to low values such as 100:
```
./perf_event_open.out 100
```

it gives:

```
Used 216 instructions
```

maintaining that constant +16 instruction overhead. So accuracy seems pretty high; those 16 extra instructions must just be the `ioctl` setup instructions around our little loop.
You might also be interested in the other events this system call can measure, listed under `PERF_TYPE_HARDWARE` in `man perf_event_open`, e.g. `PERF_COUNT_HW_CPU_CYCLES` and `PERF_COUNT_HW_BRANCH_MISSES`.
Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.
perf stat CLI utility
The perf CLI utility can print an instruction estimate. Ubuntu 22.04 setup:
```
sudo apt install linux-tools-common linux-tools-generic
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```

Usage:
```
perf stat <mycmd>
```

Let's test it with the following Linux x86 program, which loops 1 million times. Each loop iteration runs 2 instructions, `inc` and `loop`, so we expect about 2 million instructions.
main.S
```
.text
.global _start
_start:
    mov $0, %rax
    mov $1000000, %rcx
.Lloop_label:
    inc %rax
    loop .Lloop_label
    /* exit */
    mov $60, %rax   /* syscall number */
    mov $0, %rdi    /* exit status */
    syscall
```

Assemble and run:
```
as -o main.o main.S
ld -o main.out main.o
perf stat ./main.out
```

Sample output:
```
 Performance counter stats for './main.out':

              1.51 msec task-clock          #    0.802 CPUs utilized
                 0      context-switches    #    0.000 /sec
                 0      cpu-migrations      #    0.000 /sec
                 2      page-faults         #    1.328 K/sec
         5,287,702      cycles              #    3.511 GHz
         2,092,040      instructions        #    0.40  insn per cycle
         1,017,489      branches            #  675.654 M/sec
             1,156      branch-misses       #    0.11% of all branches

       0.001878269 seconds time elapsed

       0.001922000 seconds user
       0.000000000 seconds sys
```

So it says about 2 million instructions, only about 92k off. It is not absolutely precise, but good enough for many applications, and we also get some other fun statistics like branch misses and page faults.
The extra instructions presumably come from imprecise counter start/stop boundaries that end up including kernel or other processes' instructions. If that matters, perf event modifiers can restrict counting, e.g. `perf stat -e instructions:u` counts user-space instructions only (see `man perf-list`).
perf can also do a bunch more advanced things, e.g. here I show how to use it to profile code: How do I profile C++ code running on Linux?
You can use Intel's binary instrumentation tool Pin. I would avoid using a simulator (they are often extremely slow). Pin does most of what you can do with a simulator, without recompiling the binary, and at near-native execution speed (depending on the Pin tool you are using).
To count the number of instructions with Pin:
```
cd pin-root/source/tools/ManualExamples
make all
../../../pin -t obj-intel64/inscount0.so -- your-binary-here
```

The count is written to `inscount.out`; `cat inscount.out` to see it. The output would be something like:
```
➜  ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp       itrace.cpp
buffer_windows.cpp     little_malloc.c
countreps.cpp          makefile
detach.cpp             makefile.rules
divide_by_zero_unix.c  malloc_mt.cpp
isampling.cpp          w_malloctrace.cpp
➜  cat inscount.out
Count 716372
```

You can easily count the number of executed instructions using Hardware Performance Counters (HPC). To access the HPCs, you need an interface to them; I recommend using PAPI (the Performance API).
Probably a duplicate of this question
I say probably because you asked for the assembler instructions, but that question handles the C-level profiling of code.
My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.
I might miss something important here, though.
For example, compare an algorithm that uses FSQRT, and another algorithm which avoids such expensive instructions and maybe uses a few more adds/multiplies: the second may well be faster even though it executes more instructions.

Although not "quick" depending on the program, this may have been answered in this question. There, Mark Plotnick suggests using gdb to watch your program counter register change:
```
# instructioncount.gdb
set pagination off
set $count = 0
while ($pc != 0xyourstoppingaddress)
  stepi
  set $count = $count + 1
end
print $count
quit
```

Then, start gdb on your program:
```
gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments
```

To get the end address `0xyourstoppingaddress`, you can use the following script:
```
# stopaddress.gdb
break main
run
info frame
quit
```

which puts a breakpoint on the function `main`, and gives:
```
$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
 rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
 source language c.
 Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...
```

Here what is important is the `saved rip 0x7ffff7a66d20` part. On my CPU, `rip` is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.
So in this case, the stopping address is 0x7ffff7a66d20, which is the return address of the main function. That is, the end of the program execution.
Compile with `gcc -pg -Wall -O` and use gprof, or perhaps oprofile!

If you only care about a specific operation, you can instead count at the source level, e.g. calls to your `operator*()` implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy-to-detect way (consider the classic tricks played with the `LEA` instruction).