19

I need to estimate the exact starting location of some hotspot in a program, in terms of x86 machine instruction count (so that it can later be run in some emulator/simulator). Is there a way to use gdb to count the number of machine instructions being executed up to a breakpoint?

There are other alternatives, of course: I could use an emulation / binary instrumentation tool (like Pin) and track the run while counting instructions, but that would require installing the tool on every platform I work on, which is not always possible. I need a tool that's available on pretty much any Linux machine.

With gdb, I guess it's also possible to run stepi X over large strides as a sort of coarse-grained search until we pass the breakpoint, then repeat with a finer resolution, but that would be excruciatingly slow. Is there another way to do this?

8
  • GDB is completely unsuitable for this purpose. Use something like PAPI to accurately measure how your application performs. You should have instrumentation tools everywhere you have an editor too, anyway. Commented Feb 7, 2014 at 12:40
  • @mfukar thanks, but I'm not sure it's as easily available everywhere as GDB is. I also wouldn't say GDB is entirely unsuitable; it seems like a really simple feature to add, as it already knows how to step at machine-instruction resolution. All it needs is to keep track of the instruction count somewhere. Commented Feb 7, 2014 at 12:45
  • ptracing a program in a debugger alters program state which may be vital to performance (cache state, TLB misses, etc). The results you'll get while running a program in a debugger apply only to that situation. Commented Feb 7, 2014 at 12:47
  • @pentadecagon, like I said, I need to run a certain section in a simulator (gem5, for example), which can be triggered to start at a given instruction count. Commented Feb 7, 2014 at 12:49
  • @EmployedRussian, I didn't say it has to be fast, I just wanted to know if there's any such capability. Like I said, Pin also does something similar when it instruments at the instruction level, and that has acceptable performance. Commented Feb 7, 2014 at 19:05

3 Answers

29

Try this:

    set pagination off
    set $count = 0
    while $pc != 0xyourstoppingaddress
      stepi
      set $count++
    end
    print $count

Then go get a cup of coffee. Or a long lunch.


2 Comments

How do I use it when I want to wait for a segfault or any other signal that exits the program?
@12431234123412341234123 Try changing while $pc != 0xyourstoppingaddress to while $pc != 0xyourstoppingaddress && $_siginfo.si_signo != 11 to run the loop until SIGSEGV is received. The stepi command causes the program to get a SIGTRAP signal (5) on every step, so if you want to stop on any signal other than SIGTRAP, try while $pc != 0xyourstoppingaddress && $_siginfo.si_signo == 5
8

This is actually only a slight improvement of the usability of Mark's solution.

We can define a function do_count:

    define do_count
      set $count=0
      while ($pc != $arg0)
        stepi
        set $count=$count+1
      end
      print $count
    end

and then this function can be reused for counting the number of steps over and over again:

    set pagination off
    do_count 0xaddress1
    do_count 0xaddress2

One can even put this definition into .gdbinit in the home folder (that is the name on Linux; on Windows it should be called gdb.ini), so that it becomes available automatically when gdb starts (use show user to check whether the function was loaded).
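If you would rather not touch .gdbinit, the same commands can be written to a script file and run non-interactively. Below is a minimal Python sketch that only generates the command file text and the gdb command line; it does not run gdb itself. The helper names (build_gdb_script, build_gdb_command), the file name count.gdb and the start location main are assumptions made for illustration. It relies on gdb's -batch and -x options plus the break and run commands.

```python
# Sketch: generate a gdb command file that defines do_count and calls it
# for each stop address. Helper names and the example start location are
# hypothetical, chosen only for this illustration.

DO_COUNT_DEFINITION = """\
define do_count
  set $count = 0
  while ($pc != $arg0)
    stepi
    set $count = $count + 1
  end
  print $count
end"""


def build_gdb_script(start_location, stop_addresses):
    """Return the text of a gdb command file.

    start_location: where counting begins (anything 'break' accepts).
    stop_addresses: integer addresses passed to do_count, in order.
    """
    lines = [
        "set pagination off",
        DO_COUNT_DEFINITION,
        f"break {start_location}",
        "run",
    ]
    # Each do_count call single-steps from the current $pc to the address.
    lines.extend(f"do_count {addr:#x}" for addr in stop_addresses)
    return "\n".join(lines) + "\n"


def build_gdb_command(script_path, program):
    """Return the argv list for running the script in gdb batch mode."""
    return ["gdb", "-batch", "-x", script_path, program]


if __name__ == "__main__":
    # Example addresses are placeholders, not taken from a real binary.
    print(build_gdb_script("main", [0x401136, 0x401150]), end="")
    print(" ".join(build_gdb_command("count.gdb", "./a.out")))
```

Save the generated text as count.gdb and run the printed command. Each do_count prints its count as a gdb value ($1, $2, ...). It is still as slow as the interactive version, since every stepi goes through ptrace.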

1 Comment

How can we change the condition while ($pc != $arg0) so that we can count how many times a specific instruction has been executed? As you know, we have 2 concepts, static instruction and dynamic one. A static instruction can be executed several times.
8

If you actually want a cycle count (maybe as an approximation of instruction count with known IPC), and you're running on bare metal ARM, you might be able to read the cycle counter, see for example Cycle counter on ARM Cortex M4 (or M3)?


In your scenario, I would try Process Record and Replay to obtain the elapsed instruction count (available since GDB 7.0 and improved afterwards):

  1. Start measurement: record btrace (or record full if the former is not available).
  2. continue execution (until a breakpoint, or use next or other commands to step through).
  3. Obtain measurement: info record
  4. Clear recorded results: record stop (recommended as the buffer is of limited size).

Example:

    (gdb) record btrace
    (gdb) frame
    #0  __sanitizer::InitTlsSize () at .../lib/sanitizer_common/sanitizer_linux_libcdep.cc:220
    220       void *get_tls_static_info_ptr = dlsym(RTLD_NEXT, "_dl_get_tls_static_info");
    (gdb) info record
    Active record target: record-btrace
    Recording format: Branch Trace Store.
    Buffer size: 64kB.
    Recorded 0 instructions in 0 functions (0 gaps) for thread 1 (Thread 0xf7c92300 (LWP 20579)).
    (gdb) next
    226 ...
    (gdb) info record
    Active record target: record-btrace
    Recording format: Branch Trace Store.
    Buffer size: 64kB.
    Recorded 2859 instructions in 145 functions (0 gaps) for thread 1 (Thread 0xf7c92300 (LWP 20579)).
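If you capture this output in a log (for example from a batch run), the count can be pulled out with a small script. The Python sketch below assumes the "Recorded N instructions in M functions (G gaps)" wording shown in the session above; other gdb versions or record formats may print something different. The function names are hypothetical.

```python
import re

# Matches the summary line printed by "info record" in the session above.
# The exact wording is an assumption based on that sample output only.
_RECORDED_RE = re.compile(
    r"Recorded\s+(\d+)\s+instructions\s+in\s+(\d+)\s+functions\s+\((\d+)\s+gaps\)"
)


def parse_info_record(text):
    """Extract counts from 'info record' output.

    Returns a dict with 'instructions', 'functions' and 'gaps',
    or None if no summary line is found.
    """
    match = _RECORDED_RE.search(text)
    if match is None:
        return None
    instructions, functions, gaps = (int(g) for g in match.groups())
    return {"instructions": instructions, "functions": functions, "gaps": gaps}


def instructions_between(before_text, after_text):
    """Difference in recorded instructions between two 'info record' dumps."""
    before = parse_info_record(before_text)
    after = parse_info_record(after_text)
    if before is None or after is None:
        raise ValueError("no 'Recorded ... instructions' line found")
    return after["instructions"] - before["instructions"]


if __name__ == "__main__":
    sample = (
        "Active record target: record-btrace\n"
        "Recording format: Branch Trace Store.\n"
        "Buffer size: 64kB.\n"
        "Recorded 2859 instructions in 145 functions (0 gaps) "
        "for thread 1 (Thread 0xf7c92300 (LWP 20579)).\n"
    )
    print(parse_info_record(sample))
```

A non-zero gaps value is worth checking: it suggests the trace is incomplete, so the instruction count would undercount what was actually executed.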

Limitations:

  • The record buffer has a limited size (for the BTS format above, this can be increased with set record btrace bts buffer-size <size>; see the documentation for other formats).
  • With record full, not all instructions can be captured. Notably, SSE and AVX instructions are unsupported and will cause gdb to pause execution.
  • Recording every instruction adds some overhead (especially with the full format), though it should not be as bad as the gdb stepi approach described in the other answers (which has to go through ptrace for every step).

3 Comments

Nice answer, but why mention cycle counting? This question is about exact instruction counts, and IPC is variable depending on many microarchitectural factors including memory / cache contention from other cores, so it's not even exactly repeatable for the same code.
That might be true for x86, but if you run bare metal ARM with a single core, they will be the same. That was exactly the case I needed a solution for.
If your ARM doesn't have a cache or branch prediction that can be hot or not, and doesn't have possible contention from DMA, then sure. e.g. Cortex-M3. But faster ARM CPUs aren't necessarily as deterministic, where different input data could give different IPC because of more or less cache locality. Actually, not all instructions on Cortex-M3 cost the same cycles (taken branches cost extra to reload the pipeline), so reading the cycle timer still doesn't answer the question there.
