15

I'd like to statistically profile my C code at the instruction level. I need to know how many additions, multiplications, divisions, etc I'm performing.

This is not your usual run of the mill code profiling requirement. I'm an algorithm developer and I want to estimate the cost of converting my code to hardware implementations. For this, I'm being asked the instruction call breakdown during run-time (parsing the compiled assembly isn't sufficient as it doesn't consider loops in the code).

After looking around, it seems VMware may offer a possible solution, but I still couldn't find the specific feature that will allow me to trace the instruction call stream of my process.

Are you aware of any profiling tools which enable this?

2
  • Seems to me you want to look not only at execution counts of such instructions, but at the overall fraction of time they account for because, for example, suppose they take 50% of the time, and you can reduce that part to negligible, you've only gotten a factor of 2 speedup. Commented Jun 4, 2010 at 16:17
  • Execution time is not important in this specific case. I'm not profiling for code optimization. Commented Jun 6, 2010 at 16:55

5 Answers 5

13

I eventually used a trivial yet effective solution.

  1. Configured GDB to display the disassembly of the next instruction (every time it stops) by invoking:

display/i $pc

  1. Configured a simple gdb script that breaks in the function I need to analyze and proceeds to step instruction by instruction:

    set $i=0 break main run while ($i<100000) si set $i = $i + 1 end quit 

  2. Executed gdb with my script dumping output into a log file:

    gdb -x script a.out > log.txt

  3. Analyzed the log to count specific instruction calls.

Crude, but it works...

Sign up to request clarification or add additional context in comments.

1 Comment

How do you pass a setting like 'display/i $pc' to GDB?
6

You can use pin-instat which is a PIN tool. It's a bit over kill as it records more information than the instruction count. It still should be more efficient than your gdb approach through.

Disclaimer: I'm the author of pin-instat.

3 Comments

As a computer architect, this little piece of software really helps me reason about my designs, Thank you for doing this work!
It is possible to attach pin-instat to an existing process?
@Sush Yes. PINDIR/pin.sh -t obj-intel64/instat.so -p <pid>
6

The Linux tool perf will give you a good deal of profiling information; specifically, perf annotate will give you per-instruction relative counts.

It is possible to drill down to the instruction level with perf annotate. For that, you need to invoke perf annotate with the name of the command to annotate. All the functions with samples will be disassembled and each instruction will have its relative percentage of samples reported:
 perf record ./noploop 5 perf annotate -d ./noploop ------------------------------------------------ Percent | Source code & Disassembly of noploop.noggdb ------------------------------------------------ : : : : Disassembly of section .text: : : 08048484 <main>: 0.00 : 8048484: 55 push %ebp 0.00 : 8048485: 89 e5 mov %esp,%ebp [...] 0.00 : 8048530: eb 0b jmp 804853d <main+0xb9> 15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax 0.00 : 8048536: 83 c0 01 add $0x1,%eax 14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp) 14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax 56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax 0.00 : 8048546: 76 ea jbe 8048532 <main+0xae> [...] 

Comments

4

The valgrind tool cachegrind can be used to get execution counts of each line in the compiled assembly (the Ir value in the first column).

Comments

4

QEMU user mode -d in_asm

Edit: -d in_asm alone shows only translation chunks, so if a chunk gets executed multiple times, it won't show multiple times. There are some extra flags that make it more accurate, see: What instructions does qemu trace? QEMU's exec_tb trace backend might however do it: How to use QEMU's simple trace backend?

This is another simple thing you can do to get an instruction trace:

sudo apt-get install qemu-user qemu-x86_64 -d in_asm main.out 

Let's test it with an x86_64 triple hello world:

main.S

.text .global _start _start: asm_main_after_prologue: mov $3, %rbx write: mov $1, %rax /* syscall number */ mov $1, %rdi /* stdout */ mov $msg, %rsi /* buffer */ mov $len, %rdx /* len */ syscall dec %rbx jne write exit: mov $60, %rax /* syscall number */ mov $0, %rdi /* exit status */ syscall msg: .ascii "hello\n" len = . - msg 

adapted from GitHub upstream.

Assemble and run with:

as -o main.o main.S ld -o main.out main.o ./main.out 

Stdout output:

hello hello hello 

Running it through QEMU outputs the instruction trace to stderr:

warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5] host mmap_min_addr=0x10000 Reserved 0x1000 bytes of guest address space Relocating guest address space from 0x0000000000400000 to 0x400000 guest_base 0x0 start end size prot 0000000000400000-0000000000401000 0000000000001000 r-x 0000004000000000-0000004000001000 0000000000001000 --- 0000004000001000-0000004000801000 0000000000800000 rw- start_brk 0x0000000000000000 end_code 0x00000000004000b8 start_code 0x0000000000400000 start_data 0x00000000004000b8 end_data 0x00000000004000b8 start_stack 0x00000040007fed70 brk 0x00000000004000b8 entry 0x0000000000400078 ---------------- IN: 0x0000000000400078: mov $0x3,%rbx 0x000000000040007f: mov $0x1,%rax 0x0000000000400086: mov $0x1,%rdi 0x000000000040008d: mov $0x4000b2,%rsi 0x0000000000400094: mov $0x6,%rdx 0x000000000040009b: syscall ---------------- IN: 0x000000000040009d: dec %rbx 0x00000000004000a0: jne 0x40007f ---------------- IN: 0x000000000040007f: mov $0x1,%rax 0x0000000000400086: mov $0x1,%rdi 0x000000000040008d: mov $0x4000b2,%rsi 0x0000000000400094: mov $0x6,%rdx 0x000000000040009b: syscall ---------------- IN: 0x00000000004000a2: mov $0x3c,%rax 0x00000000004000a9: mov $0x0,%rdi 0x00000000004000b0: syscall 

I expect this method to be relatively fast. It works by reading input instructions and producing output instructions that the host can run, much like cachegrind which was mentioned at: https://stackoverflow.com/a/2971979/895245

One cool thing about this is that you can also trivially trace executables of other architectures, see for example aarch64: How does native android code written for ARM run on x86?

This method also displays the current symbol of unstripped executables, e.g. the trace of:

main.c

#include <stdio.h> int say_hello() { puts("hello"); } int main(void) { say_hello(); } 

compile and run:

gcc -ggdb3 -O0 -o main.out main.c qemu-x86_64 -d in_asm ./main.out 

contains:

---------------- IN: main 0x0000000000400537: push %rbp 0x0000000000400538: mov %rsp,%rbp 0x000000000040053b: mov $0x0,%eax 0x0000000000400540: callq 0x400526 ---------------- IN: say_hello 0x0000000000400526: push %rbp 0x0000000000400527: mov %rsp,%rbp 0x000000000040052a: mov $0x4005d4,%edi 0x000000000040052f: callq 0x400400 ---------------- IN: 0x0000000000400400: jmpq *0x200c12(%rip) # 0x601018 

It does not however show symbols within shared libraries such as puts.

But you can see them if you compile with -static:

---------------- IN: main 0x00000000004009bf: push %rbp 0x00000000004009c0: mov %rsp,%rbp 0x00000000004009c3: mov $0x0,%eax 0x00000000004009c8: callq 0x4009ae ---------------- IN: say_hello 0x00000000004009ae: push %rbp 0x00000000004009af: mov %rsp,%rbp 0x00000000004009b2: mov $0x4a1064,%edi 0x00000000004009b7: callq 0x40faa0 ---------------- IN: puts 0x000000000040faa0: push %r12 0x000000000040faa2: push %rbp 0x000000000040faa3: mov %rdi,%r12 0x000000000040faa6: push %rbx 0x000000000040faa7: callq 0x423830 

Related: https://unix.stackexchange.com/questions/147343/how-to-determine-what-instructions-a-process-is-executing

Tested in Ubuntu 16.04, QEMU 2.5.0.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.