Is there an easy way to quickly count the number of x86 instructions executed (which instructions, and how many of each) while running a C program?
I use gcc version 4.7.1 (GCC) on a x86_64 GNU/Linux machine.
Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS
This Linux system call is a cross-architecture wrapper for performance events, covering both hardware performance counters from the CPU and software events from the kernel.
Here's an example adapted from the man perf_event_open page:
perf_event_open.c
```c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\n", count);
    close(fd);
}
```

Compile and run:
```
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out
```

Output:
```
Used 20016 instructions
```

So we see that the result is pretty close to the expected value of 20000: 10k iterations times two instructions per loop in the `__asm__` block (`sub` and `jne`).
If I vary the argument, even to low values such as 100:
```
./perf_event_open.out 100
```

it gives:

```
Used 216 instructions
```

maintaining that constant +16 instruction overhead. So accuracy seems pretty high; those 16 extra instructions must just be the `ioctl` setup instructions around our little loop.
You might also be interested in the other events this system call can measure, listed under `PERF_TYPE_HARDWARE` in `man perf_event_open`, e.g. `PERF_COUNT_HW_CPU_CYCLES` and `PERF_COUNT_HW_BRANCH_MISSES`.
Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.
perf stat CLI utility
The perf CLI utility can print an instruction estimate. Ubuntu 22.04 setup:
```
sudo apt install linux-tools-common linux-tools-generic
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```

Usage:
```
perf stat <mycmd>
```

Let's test it with the following Linux x86 program, which loops 1 million times. Each loop iteration runs 2 instructions, `inc` and `loop`, so we expect about 2 million instructions.
main.S
```
.text
.global _start
_start:
    mov $0, %rax
    mov $1000000, %rcx
.Lloop_label:
    inc %rax
    loop .Lloop_label
    /* exit */
    mov $60, %rax   /* syscall number */
    mov $0, %rdi    /* exit status */
    syscall
```

Assemble and run:
```
as -o main.o main.S
ld -o main.out main.o
perf stat ./main.out
```

Sample output:
```
 Performance counter stats for './main.out':

              1.51 msec task-clock          #    0.802 CPUs utilized
                 0      context-switches    #    0.000 /sec
                 0      cpu-migrations      #    0.000 /sec
                 2      page-faults         #    1.328 K/sec
         5,287,702      cycles              #    3.511 GHz
         2,092,040      instructions        #    0.40  insn per cycle
         1,017,489      branches            #  675.654 M/sec
             1,156      branch-misses       #    0.11% of all branches

       0.001878269 seconds time elapsed

       0.001922000 seconds user
       0.000000000 seconds sys
```

So it says about 2 million instructions, only about 92k off. It is not absolutely precise, but good enough for many applications, and we also get some other fun statistics like branch misses and page faults.
The extra instructions presumably come from imprecise counter start/stop boundaries that end up including kernel or other processes' instructions. If that matters, perf event modifiers can restrict counting, e.g. `perf stat -e instructions:u` counts user-space instructions only (see `man perf-list`).
perf can also do a bunch more advanced things, e.g. here I show how to use it to profile code: How do I profile C++ code running on Linux?
You can use Intel's binary instrumentation tool Pin. I would avoid using a simulator (they are often extremely slow). Pin does most of what you can do with a simulator, without recompiling the binary, and at near-native execution speed (depending on the Pin tool you are using).
To count the number of instructions with Pin:
```
cd pin-root/source/tools/ManualExamples
make all
../../../pin -t obj-intel64/inscount0.so -- your-binary-here
```

The count is written to `inscount.out`; `cat inscount.out` to see it. The output would be something like:
```
➜  ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp       itrace.cpp
buffer_windows.cpp     little_malloc.c
countreps.cpp          makefile
detach.cpp             makefile.rules
divide_by_zero_unix.c  malloc_mt.cpp
isampling.cpp          w_malloctrace.cpp
➜  cat inscount.out
Count 716372
```

You can easily count the number of executed instructions using Hardware Performance Counters (HPC). To access the HPCs, you need an interface to them; I recommend using PAPI (the Performance API).
Probably a duplicate of this question
I say probably because you asked for the assembler instructions, but that question handles the C-level profiling of code.
My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.
I might miss something important here, though.
For example, compare an algorithm that uses FSQRT, and another algorithm which avoids such expensive instructions and maybe uses a few more adds/multiplies: the second may well be faster even though it executes more instructions.

Although not "quick" depending on the program, this may have been answered in this question. There, Mark Plotnick suggests using gdb to watch your program counter register change:
```
# instructioncount.gdb
set pagination off
set $count = 0
while ($pc != 0xyourstoppingaddress)
  stepi
  set $count = $count + 1
end
print $count
quit
```

Then, start gdb on your program:
```
gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments
```

To get the end address `0xyourstoppingaddress`, you can use the following script:
```
# stopaddress.gdb
break main
run
info frame
quit
```

which puts a breakpoint on the function `main`, and gives:
```
$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
 rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
 source language c.
 Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...
```

Here what is important is the `saved rip 0x7ffff7a66d20` part. On my CPU, `rip` is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.
So in this case, the stopping address is 0x7ffff7a66d20, which is the return address of the main function. That is, the end of the program execution.
Compile with `gcc -pg -Wall -O` and use gprof, or perhaps oprofile!

If you only care about a specific operation, you can instead count at the source level, e.g. calls to your `operator*()` implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy-to-detect way (consider the classic tricks played with the `LEA` instruction).