optimized itoa function

Question

I am thinking about how to implement the conversion of an integer (4-byte, unsigned) to a string with SSE instructions. The usual routine is to divide the number repeatedly and store the digits in a local buffer, then invert the string (the inversion routine is missing in this example):

    char *convert(unsigned int num, int base)
    {
        static char buff[33];
        char *ptr;

        ptr = &buff[sizeof(buff) - 1];
        *ptr = '\0';
        do {
            *--ptr = "0123456789abcdef"[num % base];
            num /= base;
        } while (num != 0);
        return ptr;
    }

But inversion will take extra time. Is there any other algorithm that can be used, preferably with SSE instructions, to parallelize the function?

Just an idea: maybe if you use CPU cache control instructions available in some compilers to preload buff, you'll be able to write buff backwards without any performance impact. — hamstergene
– hamstergene, Commented Oct 25, 2011 at 14:07
actually the buffer will be in cache already, I probably would have to align it to cache line, but what I am looking is that if the algorithm itself can be parallelized somehow ? — Nulik
– Nulik, Commented Oct 25, 2011 at 14:54
unless your number strings are very large, you'll probably see a drop in performance (or at least no increase) if you use SSE. SSE string optimization is generally only viable for large bodies of text (just see the GCC and LLVM source) — Necrolis
– Necrolis, Commented Oct 25, 2011 at 15:52
This doesn't make a lot of sense. Hex is for humans. You'll have to output the string somewhere. Screen, file, etc. The time spent on that will completely bury any nanoseconds you could gain from SSE. — Hans Passant
– Hans Passant, Commented Oct 25, 2011 at 16:09
@HansPassant Didn't you ever see 50MB "human-readable" XML files, which were never supposed to be read? ;) — hamstergene
– hamstergene, Commented Oct 25, 2011 at 16:43

Wayne Uroda · Accepted Answer · 2017-09-07 23:15:47Z

Terje Mathisen invented a very fast itoa() that does not require lookup tables. If you're not interested in the explanation of how it works, skip down to Performance or Implementation.

More than 15 years ago Terje Mathisen came up with a parallelized itoa() for base 10. The idea is to take a 32-bit value and break it into two chunks of 5 digits. (A quick Google search for "Terje Mathisen itoa" gave this post: http://computer-programming-forum.com/46-asm/7aa4b50bce8dd985.htm)

We start like so:

    void itoa(char *buf, uint32_t val)
    {
        uint32_t lo = val % 100000;
        uint32_t hi = val / 100000;
        itoa_half(&buf[0], hi);
        itoa_half(&buf[5], lo);
    }

Now we just need an algorithm that can convert any integer in the domain [0, 99999] to a string. A naive way to do that might be:

    // 0 <= val <= 99999
    void itoa_half(char *buf, uint32_t val)
    {
        // Move all but the first digit to the right of the decimal point.
        float tmp = val / 10000.0;

        for(size_t i = 0; i < 5; i++)
        {
            // Extract the next digit.
            int digit = (int) tmp;

            // Convert to a character.
            buf[i] = '0' + (char) digit;

            // Remove the lead digit and shift left 1 decimal place.
            tmp = (tmp - digit) * 10.0;
        }
    }

Rather than use floating-point, we will use 4.28 fixed-point math because it is significantly faster in our case. That is, we fix the binary point at the 28th bit position such that 1.0 is represented as 2^28. To convert into fixed-point, we simply multiply by 2^28. We can easily round down to the nearest integer by masking with 0xf0000000, and we can extract the fractional portion by masking with 0x0fffffff.
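To make the 4.28 format concrete, here is a minimal sketch of the three operations described above. The helper names (`fix_int`, `fix_frac`, `fix_from_int`) and the macros are made up for illustration and do not appear in Terje's code.

```c
#include <stdint.h>

typedef uint32_t fix4_28;                  /* 4 integer bits, 28 fractional bits */

#define FIX_ONE  ((fix4_28)1 << 28)        /* 1.0 in 4.28 (hypothetical name) */
#define FIX_FRAC ((fix4_28)0x0fffffff)     /* mask selecting the fractional bits */

/* Integer part (0..15) of a 4.28 value: rounds down. */
static inline uint32_t fix_int(fix4_28 x) { return x >> 28; }

/* Fractional part of a 4.28 value, still in 4.28 format. */
static inline fix4_28 fix_frac(fix4_28 x) { return x & FIX_FRAC; }

/* Convert a small integer (0..15) to 4.28 by multiplying by 2^28. */
static inline fix4_28 fix_from_int(uint32_t n) { return n << 28; }
```

For example, 2.5 is stored as 0x28000000: `fix_int` gives 2, `fix_frac` gives 0x08000000 (0.5), and multiplying that fraction by 10 gives 0x50000000, whose integer part is the next decimal digit, 5.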

(Note: Terje's algorithm differs slightly in the choice of fixed-point format.)

So now we have:

    typedef uint32_t fix4_28;

    // 0 <= val <= 99999
    void itoa_half(char *buf, uint32_t val)
    {
        // Convert `val` to fixed-point and divide by 10000 in a single step.
        // N.B. we would overflow a uint32_t if not for the parentheses.
        fix4_28 tmp = val * ((1 << 28) / 10000);

        for(size_t i = 0; i < 5; i++)
        {
            int digit = (int)(tmp >> 28);
            buf[i] = '0' + (char) digit;
            tmp = (tmp & 0x0fffffff) * 10;
        }
    }

The only problem with this code is that 2^28 / 10000 = 26843.5456, which is truncated to 26843. This causes inaccuracies for certain values. For example, itoa_half(buf, 83492) produces the string "83490". If we apply a small correction in our conversion to 4.28 fixed-point, then the algorithm works for all numbers in the domain [0, 99999]:

    // 0 <= val <= 99999
    void itoa_half(char *buf, uint32_t val)
    {
        fix4_28 const f1_10000 = (1 << 28) / 10000;

        // 2^28 / 10000 is 26843.5456, but 26843.75 is sufficiently close.
        fix4_28 tmp = val * (f1_10000 + 1) - (val / 4);

        for(size_t i = 0; i < 5; i++)
        {
            int digit = (int)(tmp >> 28);
            buf[i] = '0' + (char) digit;
            tmp = (tmp & 0x0fffffff) * 10;
        }
    }
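The effect of the correction can be checked by brute force, since the domain has only 100000 values. The sketch below is not part of the original answer; the function names are hypothetical, and `snprintf` is used only as a reference. It counts how many inputs each scale factor gets wrong.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extract five digits from a 4.28 fixed-point value holding val / 10000. */
static void digits_from_fix(char *buf, uint32_t tmp)
{
    for (int i = 0; i < 5; i++) {
        buf[i] = (char)('0' + (tmp >> 28));
        tmp = (tmp & 0x0fffffff) * 10;
    }
    buf[5] = '\0';
}

/* Truncated scale factor: 2^28 / 10000 rounded down to 26843. */
static void half_truncated(char *buf, uint32_t val)
{
    digits_from_fix(buf, val * ((1u << 28) / 10000));
}

/* Corrected scale factor: 26844 - 1/4 = 26843.75. */
static void half_corrected(char *buf, uint32_t val)
{
    digits_from_fix(buf, val * ((1u << 28) / 10000 + 1) - val / 4);
}

/* Count inputs in [0, 99999] whose output differs from snprintf's "%05u". */
static unsigned count_mismatches(void (*conv)(char *, uint32_t))
{
    unsigned bad = 0;
    for (uint32_t v = 0; v <= 99999; v++) {
        char got[6], want[6];
        conv(got, v);
        snprintf(want, sizeof want, "%05u", (unsigned)v);
        bad += (strcmp(got, want) != 0);
    }
    return bad;
}
```

Calling `count_mismatches(half_corrected)` should report zero mismatches, while `count_mismatches(half_truncated)` reports a nonzero count that includes 83492. The corrected factor overshoots the true quotient slightly, and that overshoot stays small enough that it never reaches the fifth digit.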

Terje interleaves the itoa_half part for the low & high halves:

    void itoa(char *buf, uint32_t val)
    {
        fix4_28 const f1_10000 = (1 << 28) / 10000;
        fix4_28 tmplo, tmphi;

        uint32_t lo = val % 100000;
        uint32_t hi = val / 100000;

        tmplo = lo * (f1_10000 + 1) - (lo / 4);
        tmphi = hi * (f1_10000 + 1) - (hi / 4);

        for(size_t i = 0; i < 5; i++)
        {
            buf[i + 0] = '0' + (char)(tmphi >> 28);
            buf[i + 5] = '0' + (char)(tmplo >> 28);
            tmphi = (tmphi & 0x0fffffff) * 10;
            tmplo = (tmplo & 0x0fffffff) * 10;
        }
    }

There is an additional trick that makes the code slightly faster if the loop is fully unrolled. The multiply by 10 is implemented as either a LEA+SHL or LEA+ADD sequence. We can save 1 instruction by multiplying instead by 5, which requires only a single LEA. This has the same effect as shifting tmphi and tmplo right by 1 position each pass through the loop, but we can compensate by adjusting our shift counts and masks like this:

    uint32_t mask = 0x0fffffff;
    uint32_t shift = 28;

    for(size_t i = 0; i < 5; i++)
    {
        buf[i + 0] = '0' + (char)(tmphi >> shift);
        buf[i + 5] = '0' + (char)(tmplo >> shift);
        tmphi = (tmphi & mask) * 5;
        tmplo = (tmplo & mask) * 5;
        mask >>= 1;
        shift--;
    }

This only helps if the loop is fully-unrolled because you can precalculate the value of shift and mask for each iteration.
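As an illustration, a fully unrolled 5-digit version with the shift counts and masks folded into constants might look like the sketch below. The name `itoa_half_mul5` is made up here, and whether the compiler actually emits a single LEA per multiply depends on the compiler and target.

```c
#include <stdint.h>

/* 0 <= val <= 99999; writes exactly five zero-padded digits to buf. */
static void itoa_half_mul5(char *buf, uint32_t val)
{
    /* 4.28 fixed-point value of val / 10000, with the rounding correction. */
    uint32_t tmp = val * ((1u << 28) / 10000 + 1) - val / 4;

    /* Each multiply by 5 (instead of 10) leaves the binary point one bit
       lower, so the shift count and mask shrink by one bit per digit. */
    buf[0] = (char)('0' + (tmp >> 28)); tmp = (tmp & 0x0fffffff) * 5;
    buf[1] = (char)('0' + (tmp >> 27)); tmp = (tmp & 0x07ffffff) * 5;
    buf[2] = (char)('0' + (tmp >> 26)); tmp = (tmp & 0x03ffffff) * 5;
    buf[3] = (char)('0' + (tmp >> 25)); tmp = (tmp & 0x01ffffff) * 5;
    buf[4] = (char)('0' + (tmp >> 24));
}
```

No precision is lost relative to the multiply-by-10 loop: the fraction is never shifted right, it is simply not shifted left, so every intermediate value is the same number in a format with one fewer fractional bit.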

Finally, this routine produces zero-padded results. You can get rid of the padding by returning a pointer to the first character that is not 0 or the last character if val == 0:

    char *itoa_unpadded(char *buf, uint32_t val)
    {
        char *p;
        itoa(buf, val);

        p = buf;

        // Note: will break on GCC, but you can work around it by using memcpy() to dereference p.
        if (*((uint64_t *) p) == 0x3030303030303030)
            p += 8;

        if (*((uint32_t *) p) == 0x30303030)
            p += 4;

        if (*((uint16_t *) p) == 0x3030)
            p += 2;

        if (*((uint8_t *) p) == 0x30)
            p += 1;

        // itoa() writes 10 digits, so the last character is buf[9].
        return p < &buf[9] ? p : &buf[9];
    }
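The memcpy() workaround mentioned in the comment could look roughly like this. It is a sketch under a few assumptions that are not in the original: `itoa_padded10` is a plain stand-in for the zero-padded itoa() above, the caller provides a 16-byte buffer, and the bytes after the digits are cleared so that the wide loads read defined data and the result is null-terminated.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the zero-padded itoa() above: writes exactly 10 digits. */
static void itoa_padded10(char *buf, uint32_t val)
{
    for (int i = 9; i >= 0; i--) {
        buf[i] = (char)('0' + val % 10);
        val /= 10;
    }
}

/* buf must have room for at least 16 bytes. Returns a pointer to the first
   significant digit; the string is null-terminated. */
static char *itoa_unpadded_memcpy(char *buf, uint32_t val)
{
    uint64_t w8;
    uint32_t w4;
    uint16_t w2;
    char *p = buf;

    itoa_padded10(buf, val);
    memset(buf + 10, 0, 6);   /* terminator plus defined bytes for the wide loads */

    /* memcpy into a local avoids the aliasing and alignment problems of
       casting p to a wider pointer type; compilers typically turn each
       copy into a single load. */
    memcpy(&w8, p, sizeof w8);
    if (w8 == UINT64_C(0x3030303030303030)) p += 8;
    memcpy(&w4, p, sizeof w4);
    if (w4 == UINT32_C(0x30303030)) p += 4;
    memcpy(&w2, p, sizeof w2);
    if (w2 == 0x3030) p += 2;
    if (*p == '0') p += 1;

    /* For val == 0 every digit was skipped; back up to the last one. */
    return p < &buf[9] ? p : &buf[9];
}
```

Because each comparison is against a run of identical bytes, the checks do not depend on byte order.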

There is one additional trick applicable to 64-bit (i.e. AMD64) code. The extra, wider registers make it efficient to accumulate each 5-digit group in a register; after the last digit has been calculated, you can smash them together with SHRD, OR them with 0x3030303030303030, and store to memory. This improves performance for me by about 12.3%.
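The register-accumulation idea can be approximated in portable C. The sketch below is not the SHRD-based code itself: it packs eight digit values into one 64-bit word, converts them to ASCII with a single OR, and stores them with one 8-byte write. The function name is hypothetical, and a little-endian target (such as AMD64) is assumed.

```c
#include <stdint.h>
#include <string.h>

/* d[0..7] are digit values 0..9, most significant first.
   Writes eight ASCII characters to buf (no terminator).
   Assumes a little-endian target. */
static void store8_digits(char *buf, const uint8_t d[8])
{
    uint64_t w = 0;

    /* Build the word so that d[0] ends up in the lowest byte, which is
       the first byte in memory on a little-endian machine. */
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | d[i];

    w |= UINT64_C(0x3030303030303030);   /* add '0' to all eight bytes at once */
    memcpy(buf, &w, sizeof w);           /* one 8-byte store */
}
```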

Vectorization

We could execute the above algorithm as-is on the SSE units, but there is almost no gain in performance. However, if we split the value into smaller chunks, we can take advantage of SSE4.1 32-bit multiply instructions. I tried three different splits:

2 groups of 5 digits
3 groups of 4 digits
4 groups of 3 digits

The fastest variant was 4 groups of 3 digits. See below for the results.

Performance

I tested many variants of Terje's algorithm in addition to the algorithms suggested by vitaut and Inge Henriksen. I verified through exhaustive testing of inputs that each algorithm's output matches itoa().

My numbers are taken from a Westmere E5640 running Windows 7 64-bit. I benchmark at real-time priority and locked to core 0. I execute each algorithm 4 times to force everything into the cache. I time 2^24 calls using RDTSCP to remove the effect of any dynamic clock speed changes.

I timed 5 different patterns of inputs:

itoa(0 .. 9) -- nearly best-case performance
itoa(1000 .. 1999) -- longer output, no branch mispredicts
itoa(100000000 .. 999999999) -- longest output, no branch mispredicts
itoa(256 random values) -- varying output length
itoa(65536 random values) -- varying output length and thrashes L1/L2 caches

The data:

ALG        TINY     MEDIUM   LARGE    RND256   RND64K   NOTES
NULL         7 clk    7 clk    7 clk    7 clk    7 clk  Benchmark overhead baseline
TERJE_C     63 clk   62 clk   63 clk   57 clk   56 clk  Best C implementation of Terje's algorithm
TERJE_ASM   48 clk   48 clk   50 clk   45 clk   44 clk  Naive, hand-written AMD64 version of Terje's algorithm
TERJE_SSE   41 clk   42 clk   41 clk   34 clk   35 clk  SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping
INGE_0      12 clk   31 clk   71 clk   72 clk   72 clk  Inge's first algorithm
INGE_1      20 clk   23 clk   45 clk   69 clk   96 clk  Inge's second algorithm
INGE_2      18 clk   19 clk   32 clk   29 clk   36 clk  Improved version of Inge's second algorithm
VITAUT_0     9 clk   16 clk   32 clk   35 clk   35 clk  vitaut's algorithm
VITAUT_1    11 clk   15 clk   33 clk   31 clk   30 clk  Improved version of vitaut's algorithm
LIBC        46 clk  128 clk  329 clk  339 clk  340 clk  MSVCRT12 implementation

My compiler (VS 2013 Update 4) produced surprisingly bad code; the assembly version of Terje's algorithm is just a naive translation, and it's a full 21% faster. I was also surprised at the performance of the SSE implementation, which I expected to be slower. The big surprise was how fast INGE_2, VITAUT_0, and VITAUT_1 were. Bravo to vitaut for coming up with a portable solution that bests even my best effort at the assembly level.

Note: INGE_1 is a modified version of Inge Henriksen's second algorithm because the original has a bug.

INGE_2 is based on the second algorithm that Inge Henriksen gave. Rather than storing pointers to the precalculated strings in a char*[] array, it stores the strings themselves in a char[][5] array. The other big improvement is in how it stores characters in the output buffer. It stores more characters than necessary and uses pointer arithmetic to return a pointer to the first non-zero character. The result is substantially faster -- competitive even with the SSE-optimized version of Terje's algorithm. It should be noted that the microbenchmark favors this algorithm a bit because in real-world applications the 600K data set will constantly blow the caches.

VITAUT_1 is based on vitaut's algorithm with two small changes. The first change is that it copies character pairs in the main loop, reducing the number of store instructions. The second, similar to INGE_2, is that it always copies both final characters and uses pointer arithmetic to return a pointer to the start of the string.

Implementation

Here I give code for the 3 most interesting algorithms.

TERJE_ASM:

    ; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
    ;
    ; *** NOTE ***
    ; buf *must* be 8-byte aligned or this code will break!
    itoa_terje_asm:
        MOV EAX, 0xA7C5AC47
        ADD RDX, 1
        IMUL RAX, RDX
        SHR RAX, 48            ; EAX = val / 100000

        IMUL R11D, EAX, 100000
        ADD EAX, 1
        SUB EDX, R11D          ; EDX = (val % 100000) + 1

        IMUL RAX, 214748       ; RAX = (val / 100000) * 2^31 / 10000
        IMUL RDX, 214748       ; RDX = (val % 100000) * 2^31 / 10000

        ; Extract buf[0] & buf[5]
        MOV R8, RAX
        MOV R9, RDX
        LEA EAX, [RAX+RAX]     ; RAX = (RAX * 2) & 0xFFFFFFFF
        LEA EDX, [RDX+RDX]     ; RDX = (RDX * 2) & 0xFFFFFFFF
        LEA RAX, [RAX+RAX*4]   ; RAX *= 5
        LEA RDX, [RDX+RDX*4]   ; RDX *= 5
        SHR R8, 31             ; R8 = buf[0]
        SHR R9, 31             ; R9 = buf[5]

        ; Extract buf[1] & buf[6]
        MOV R10, RAX
        MOV R11, RDX
        LEA EAX, [RAX+RAX]     ; RAX = (RAX * 2) & 0xFFFFFFFF
        LEA EDX, [RDX+RDX]     ; RDX = (RDX * 2) & 0xFFFFFFFF
        LEA RAX, [RAX+RAX*4]   ; RAX *= 5
        LEA RDX, [RDX+RDX*4]   ; RDX *= 5
        SHR R10, 31 - 8
        SHR R11, 31 - 8
        AND R10D, 0x0000FF00   ; R10 = buf[1] << 8
        AND R11D, 0x0000FF00   ; R11 = buf[6] << 8
        OR R10D, R8D           ; R10 = buf[0] | (buf[1] << 8)
        OR R11D, R9D           ; R11 = buf[5] | (buf[6] << 8)

        ; Extract buf[2] & buf[7]
        MOV R8, RAX
        MOV R9, RDX
        LEA EAX, [RAX+RAX]     ; RAX = (RAX * 2) & 0xFFFFFFFF
        LEA EDX, [RDX+RDX]     ; RDX = (RDX * 2) & 0xFFFFFFFF
        LEA RAX, [RAX+RAX*4]   ; RAX *= 5
        LEA RDX, [RDX+RDX*4]   ; RDX *= 5
        SHR R8, 31 - 16
        SHR R9, 31 - 16
        AND R8D, 0x00FF0000    ; R8 = buf[2] << 16
        AND R9D, 0x00FF0000    ; R9 = buf[7] << 16
        OR R8D, R10D           ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
        OR R9D, R11D           ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)

        ; Extract buf[3], buf[4], buf[8], & buf[9]
        MOV R10, RAX
        MOV R11, RDX
        LEA EAX, [RAX+RAX]     ; RAX = (RAX * 2) & 0xFFFFFFFF
        LEA EDX, [RDX+RDX]     ; RDX = (RDX * 2) & 0xFFFFFFFF
        LEA RAX, [RAX+RAX*4]   ; RAX *= 5
        LEA RDX, [RDX+RDX*4]   ; RDX *= 5
        SHR R10, 31 - 24
        SHR R11, 31 - 24
        AND R10D, 0xFF000000   ; R10 = buf[3] << 24
        AND R11D, 0xFF000000   ; R11 = buf[8] << 24
        AND RAX, 0x80000000    ; RAX = buf[4] << 31
        AND RDX, 0x80000000    ; RDX = buf[9] << 31
        OR R10D, R8D           ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
        OR R11D, R9D           ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
        LEA RAX, [R10+RAX*2]   ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
        LEA RDX, [R11+RDX*2]   ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)

        ; Compact the character strings
        SHL RAX, 24            ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
        MOV R8, 0x3030303030303030
        SHRD RAX, RDX, 24      ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
        SHR RDX, 24            ; RDX = buf[8] | (buf[9] << 8)

        ; Store 12 characters. The last 2 will be null bytes.
        OR R8, RAX
        LEA R9, [RDX+0x3030]
        MOV [RCX], R8
        MOV [RCX+8], R9D

        ; Convert RCX into a bit pointer.
        SHL RCX, 3

        ; Scan the first 8 bytes for a non-zero character.
        OR EDX, 0x00000100
        TEST RAX, RAX
        LEA R10, [RCX+64]
        CMOVZ RAX, RDX
        CMOVZ RCX, R10

        ; Scan the next 4 bytes for a non-zero character.
        TEST EAX, EAX
        LEA R10, [RCX+32]
        CMOVZ RCX, R10
        SHR RAX, CL            ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.

        ; Scan the next 2 bytes for a non-zero character.
        TEST AX, AX
        LEA R10, [RCX+16]
        CMOVZ RCX, R10
        SHR EAX, CL            ; N.B. RAX >>= (RCX % 32)

        ; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
        SAR RCX, 3

        ; Scan the last byte for a non-zero character.
        TEST AL, AL
        MOV RAX, RCX
        LEA R10, [RCX+1]
        CMOVZ RAX, R10
        RETN

INGE_2:

    uint8_t len100K[100000];
    char str100K[100000][5];

    void itoa_inge_2_init()
    {
        memset(str100K, '0', sizeof(str100K));

        for(uint32_t i = 0; i < 100000; i++)
        {
            char buf[6];
            itoa(i, buf, 10);
            len100K[i] = strlen(buf);
            memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
        }
    }

    char *itoa_inge_2(char *buf, uint32_t val)
    {
        char *p = &buf[10];
        uint32_t prevlen;

        *p = '\0';

        do
        {
            uint32_t const old = val;
            uint32_t mod;

            val /= 100000;
            mod = old - (val * 100000);

            prevlen = len100K[mod];
            p -= 5;
            memcpy(p, str100K[mod], 5);
        }
        while(val != 0);

        return &p[5 - prevlen];
    }

VITAUT_1:

    static uint16_t const str100p[100] = {
        0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
        0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
        0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
        0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
        0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
        0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
        0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
        0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
        0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
        0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939,
    };

    char *itoa_vitaut_1(char *buf, uint32_t val)
    {
        char *p = &buf[10];

        *p = '\0';

        while(val >= 100)
        {
            uint32_t const old = val;

            p -= 2;
            val /= 100;
            memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
        }

        p -= 2;
        memcpy(p, &str100p[val], sizeof(uint16_t));

        return &p[val < 10];
    }

Thanks for extensive testing and improvements! Pretty pleased with the results =).
Also, was this ASM version crafted by Terje, or is it your own effort?
I can't speak for Terje, though I don't believe there is any sort of patent encumbrance on his algorithm. I wrote the assembly routines, and I have no interest in asserting rights to them; therefore they are public domain.

R.. GitHub STOP HELPING ICE · Accepted Answer · 2011-10-25 19:54:45Z


The first step to optimizing your code is getting rid of the arbitrary base support. This is because division by a constant is almost surely compiled to a multiplication, whereas division by a runtime base is a real division, and because '0'+n is faster than "0123456789abcdef"[n] (no memory access is involved in the former).
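A minimal sketch of what the base-10-only version of the routine from the question might look like. The name `utoa10` and the caller-supplied buffer are choices made for this example, not something the answer specifies.

```c
#include <stdint.h>

/* buf must hold at least 11 bytes (10 digits plus the terminator).
   Returns a pointer to the first digit inside buf. */
static char *utoa10(uint32_t num, char *buf)
{
    char *p = buf + 10;
    *p = '\0';
    do {
        /* The divisor is a compile-time constant, so compilers typically
           emit a multiply and shift here instead of a DIV instruction. */
        *--p = (char)('0' + num % 10);   /* arithmetic, no table lookup */
        num /= 10;
    } while (num != 0);
    return p;
}
```

Writing the digits from the end of the buffer backwards also means no reversal pass is needed, as in the code from the question.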

If you need to go beyond that, you could make lookup tables for each byte in the base you care about (e.g. 10), then vector-add the (e.g. decimal) results for each byte. As in:

    00 02 00 80 (input)

     0000000000 (place3[0x00])
    +0000131072 (place2[0x02])
    +0000000000 (place1[0x00])
    +0000000128 (place0[0x80])
     ==========
     0000131200 (result)
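One way to read this suggestion is sketched below in scalar C. The table layout, the names `place`, `init_place_tables` and `utoa_bytes`, and the explicit carry pass are assumptions made for this example; the answer itself only outlines the idea. The first loop in `utoa_bytes` is the part that would map onto SIMD byte additions.

```c
#include <stdint.h>

enum { NDIGITS = 10 };

/* place[k][b] holds the decimal digits (most significant first)
   of the value b << (8 * k). */
static uint8_t place[4][256][NDIGITS];

static void init_place_tables(void)
{
    for (int k = 0; k < 4; k++) {
        for (uint32_t b = 0; b < 256; b++) {
            uint32_t v = b << (8 * k);
            for (int d = NDIGITS - 1; d >= 0; d--) {
                place[k][b][d] = (uint8_t)(v % 10);
                v /= 10;
            }
        }
    }
}

/* Writes 10 zero-padded digits plus a terminator; buf needs 11 bytes.
   init_place_tables() must have been called first. */
static void utoa_bytes(uint32_t val, char *buf)
{
    uint8_t sum[NDIGITS];

    /* Digit-wise add of the four table rows; each lane is at most
       4 * 9 = 36, so a byte per lane cannot overflow. */
    for (int d = 0; d < NDIGITS; d++)
        sum[d] = (uint8_t)(place[0][val & 0xff][d]
                         + place[1][(val >> 8) & 0xff][d]
                         + place[2][(val >> 16) & 0xff][d]
                         + place[3][val >> 24][d]);

    /* Propagate carries from the least significant digit upward. */
    unsigned carry = 0;
    for (int d = NDIGITS - 1; d >= 0; d--) {
        unsigned t = sum[d] + carry;
        buf[d] = (char)('0' + t % 10);
        carry = t / 10;
    }
    buf[NDIGITS] = '\0';
}
```

For the input 0x00020080 this produces "0000131200", matching the worked addition above. Note that the carry pass is sequential, so only the table adds benefit directly from vectorization in this form.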

edited Oct 25, 2011 at 19:54

answered Oct 25, 2011 at 15:41



Nulik Over a year ago

Thanks, these are good tips. Just one question: the base is constant and it is actually 10, but how do you replace division by a constant with multiplication? Regarding the lookup tables, I don't quite understand: you mean look up a 'byte', not a 'bit'? What is the purpose of the 'vector add'? And what does the resulting number 131200 mean?

R.. GitHub STOP HELPING ICE Over a year ago

Oops, I messed up my example. Will fix.

R.. GitHub STOP HELPING ICE Over a year ago

My example is converting an integer 0x00020080 to decimal. You can convert each byte (or whatever subunit you like) with a lookup table, then do the addition of these components once it's in decimal (akin to BCD) - and this addition operation could be vectorized.

R.. GitHub STOP HELPING ICE Over a year ago

As for replacing division with multiplication, compile int main(int argc, char **argv) { return argc/10; } with gcc -O2 and look at the assembly it generates.

R.. GitHub STOP HELPING ICE Over a year ago

It's not just multiplying by the reciprocal; that would be floating point math and would be inaccurate because the reciprocal is not exactly representable. The technique is related but somewhat different.
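For unsigned 32-bit division by 10, the integer technique amounts to multiplying by a rounded-up fixed-point reciprocal and shifting the 64-bit product. The sketch below uses the well-known constant 0xCCCCCCCD = ceil(2^35 / 10); the function name is made up, and the exact instruction sequence a compiler generates may differ.

```c
#include <stdint.h>

/* n / 10 for any 32-bit n, computed without a divide instruction:
   multiply by ceil(2^35 / 10) in 64-bit arithmetic, then shift right by 35. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * UINT64_C(0xCCCCCCCD)) >> 35);
}
```

Rounding the reciprocal up rather than down is what keeps the result exact: the small excess in the constant never accumulates enough over the 32-bit input range to push the quotient past the next integer.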


vitaut · Accepted Answer · 2021-06-08 01:13:17Z

This post compares several methods of integer to string conversion aka itoa. The fastest method reported there is fmt::format_int from the {fmt} library which is 5-18 times faster than sprintf/std::stringstream and ~4 times faster than a naive ltoa/itoa implementation (the actual numbers may of course vary depending on platform).

Unlike most other methods fmt::format_int does one pass over the digits. It also minimizes the number of integer divisions using the idea from Alexandrescu's talk Fastware. The implementation is available here.

This is of course if C++ is an option and you are not restricted by the itoa's API.

Disclaimer: I'm the author of this method and the fmt library.

That talk was removed for some reason so I replaced with a link to another one which also discusses this optimization.

Plonk · Accepted Answer · 2011-12-12 11:13:54Z

http://sourceforge.net/projects/itoa/

It uses a big static const array of all 4-digit integers for 32-bit or 64-bit conversion to string.

Portable, no need of a specific instruction set.

The only faster version I could find was in assembly code and limited to 32 bits.

13 revs user152949 · Accepted Answer · 2013-03-18 11:01:35Z

Interesting problem. If you're interested in a radix-10-only itoa(), I have made two examples: one that is 3 times as fast and one that is 10 times as fast as the typical itoa() implementation.

First example (3x performance)

The first, which is 3 times as fast as itoa(), uses a single-pass non-reversal software design pattern and is based on the open source itoa() implementation found in groff.

// itoaSpeedTest.cpp : Defines the entry point for the console application. // #pragma comment(lib, "Winmm.lib") #include "stdafx.h" #include "Windows.h" #include <iostream> #include <time.h> using namespace std; #ifdef _WIN32 /** a signed 32-bit integer value type */ #define _INT32 __int32 #else /** a signed 32-bit integer value type */ #define _INT32 long int // Guess what a 32-bit integer is #endif /** minimum allowed value in a signed 32-bit integer value type */ #define _INT32_MIN -2147483647 /** maximum allowed value in a signed 32-bit integer value type */ #define _INT32_MAX 2147483647 /** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */ #define _INT32_MAX_LENGTH 11 #ifdef _WIN32 /** Use to init the clock */ #define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency); /** Use to start the performance timer */ #define TIMER_START QueryPerformanceCounter(&t1); /** Use to stop the performance timer and output the result to the standard stream */ #define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl; #else /** Use to init the clock */ #define TIMER_INIT /** Use to start the performance timer */ #define TIMER_START clock_t start;double diff;start=clock(); /** Use to stop the performance timer and output the result to the standard stream */ #define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl; #endif /** Array used for fast number character lookup */ const char numbersIn10Radix[10] = {'0','1','2','3','4','5','6','7','8','9'}; /** Array used for fast reverse number character lookup */ const char reverseNumbersIn10Radix[10] = {'9','8','7','6','5','4','3','2','1','0'}; const char *reverseArrayEndPtr = &reverseNumbersIn10Radix[9]; /*! 
\brief Converts a 32-bit signed integer to a string \param i [in] Integer \par Software design pattern Uses a single pass non-reversing algorithm and is 3x as fast as \c itoa(). \returns Integer as a string \copyright GNU General Public License \copyright 1989-1992 Free Software Foundation, Inc. \date 1989-1992, 2013 \author James Clark<[email protected]>, 1989-1992 \author Inge Eivind Henriksen<[email protected]>, 2013 \note Function was originally a part of \a groff, and was refactored & optimized in 2013. \relates itoa() */ const char *Int32ToStr(_INT32 i) { // Make room for a 32-bit signed integers digits and the '\0' char buf[_INT32_MAX_LENGTH + 2]; char *p = buf + _INT32_MAX_LENGTH + 1; *--p = '\0'; if (i >= 0) { do { *--p = numbersIn10Radix[i % 10]; i /= 10; } while (i); } else { // Negative integer do { *--p = reverseArrayEndPtr[i % 10]; i /= 10; } while (i); *--p = '-'; } return p; } int _tmain(int argc, _TCHAR* argv[]) { TIMER_INIT // Make sure we are playing fair here if (sizeof(int) != sizeof(_INT32)) { cerr << "Error: integer size mismatch; test would be invalid." << endl; return -1; } const int steps = 100; { char intBuffer[20]; cout << "itoa() took:" << endl; TIMER_START; for (int i = _INT32_MIN; i < i + steps ; i += steps) itoa(i, intBuffer, 10); TIMER_STOP; } { cout << "Int32ToStr() took:" << endl; TIMER_START; for (int i = _INT32_MIN; i < i + steps ; i += steps) Int32ToStr(i); TIMER_STOP; } cout << "Done" << endl; int wait; cin >> wait; return 0; }

On 64-bit Windows the result from running this example is:

    itoa() took:
    2909.84 ms.
    Int32ToStr() took:
    991.726 ms.
    Done

On 32-bit Windows the result from running this example is:

    itoa() took:
    3119.6 ms.
    Int32ToStr() took:
    1031.61 ms.
    Done

Second example (10x performance)

If you don't mind spending some time initializing some buffers then it's possible to optimize the function above to be 10x faster than the typical itoa() implementation. What you need to do is to create string buffers rather than character buffers, like this:

// itoaSpeedTest.cpp : Defines the entry point for the console application. // #pragma comment(lib, "Winmm.lib") #include "stdafx.h" #include "Windows.h" #include <iostream> #include <time.h> using namespace std; #ifdef _WIN32 /** a signed 32-bit integer value type */ #define _INT32 __int32 /** a signed 8-bit integer value type */ #define _INT8 __int8 /** an unsigned 8-bit integer value type */ #define _UINT8 unsigned _INT8 #else /** a signed 32-bit integer value type */ #define _INT32 long int // Guess what a 32-bit integer is /** a signed 8-bit integer value type */ #define _INT8 char /** an unsigned 8-bit integer value type */ #define _UINT8 unsigned _INT8 #endif /** minimum allowed value in a signed 32-bit integer value type */ #define _INT32_MIN -2147483647 /** maximum allowed value in a signed 32-bit integer value type */ #define _INT32_MAX 2147483647 /** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */ #define _INT32_MAX_LENGTH 11 #ifdef _WIN32 /** Use to init the clock */ #define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency); /** Use to start the performance timer */ #define TIMER_START QueryPerformanceCounter(&t1); /** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */ #define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl; #else /** Use to init the clock to get better precision that 15ms on Windows */ #define TIMER_INIT timeBeginPeriod(10); /** Use to start the performance timer */ #define TIMER_START clock_t start;double diff;start=clock(); /** Use to stop the performance timer and output the result to the standard stream. 
Less verbose than \c TIMER_STOP_VERBOSE */ #define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl; #endif /* Set this as large or small as you want, but has to be in the form 10^n where n >= 1, setting it smaller will make the buffers smaller but the performance slower. If you want to set it larger than 100000 then you must add some more cases to the switch blocks. Try to make it smaller to see the difference in performance. It does however seem to become slower if larger than 100000 */ static const _INT32 numElem10Radix = 100000; /** Array used for fast lookup number character lookup */ const char *numbersIn10Radix[numElem10Radix] = {}; _UINT8 numbersIn10RadixLen[numElem10Radix] = {}; /** Array used for fast lookup number character lookup */ const char *reverseNumbersIn10Radix[numElem10Radix] = {}; _UINT8 reverseNumbersIn10RadixLen[numElem10Radix] = {}; void InitBuffers() { char intBuffer[20]; for (int i = 0; i < numElem10Radix; i++) { itoa(i, intBuffer, 10); size_t numLen = strlen(intBuffer); char *intStr = new char[numLen + 1]; strcpy(intStr, intBuffer); numbersIn10Radix[i] = intStr; numbersIn10RadixLen[i] = numLen; reverseNumbersIn10Radix[numElem10Radix - 1 - i] = intStr; reverseNumbersIn10RadixLen[numElem10Radix - 1 - i] = numLen; } } /*! \brief Converts a 32-bit signed integer to a string \param i [in] Integer \par Software design pattern Uses a single pass non-reversing algorithm with string buffers and is 10x as fast as \c itoa(). \returns Integer as a string \copyright GNU General Public License \copyright 1989-1992 Free Software Foundation, Inc. \date 1989-1992, 2013 \author James Clark<[email protected]>, 1989-1992 \author Inge Eivind Henriksen, 2013 \note This file was originally a part of \a groff, and was refactored & optimized in 2013. 
\relates itoa() */ const char *Int32ToStr(_INT32 i) { /* Room for INT_DIGITS digits, - and '\0' */ char buf[_INT32_MAX_LENGTH + 2]; char *p = buf + _INT32_MAX_LENGTH + 1; _INT32 modVal; *--p = '\0'; if (i >= 0) { do { modVal = i % numElem10Radix; switch(numbersIn10RadixLen[modVal]) { case 5: *--p = numbersIn10Radix[modVal][4]; case 4: *--p = numbersIn10Radix[modVal][3]; case 3: *--p = numbersIn10Radix[modVal][2]; case 2: *--p = numbersIn10Radix[modVal][1]; default: *--p = numbersIn10Radix[modVal][0]; } i /= numElem10Radix; } while (i); } else { // Negative integer const char **reverseArray = &reverseNumbersIn10Radix[numElem10Radix - 1]; const _UINT8 *reverseArrayLen = &reverseNumbersIn10RadixLen[numElem10Radix - 1]; do { modVal = i % numElem10Radix; switch(reverseArrayLen[modVal]) { case 5: *--p = reverseArray[modVal][4]; case 4: *--p = reverseArray[modVal][3]; case 3: *--p = reverseArray[modVal][2]; case 2: *--p = reverseArray[modVal][1]; default: *--p = reverseArray[modVal][0]; } i /= numElem10Radix; } while (i); *--p = '-'; } return p; } int _tmain(int argc, _TCHAR* argv[]) { InitBuffers(); TIMER_INIT // Make sure we are playing fair here if (sizeof(int) != sizeof(_INT32)) { cerr << "Error: integer size mismatch; test would be invalid." << endl; return -1; } const int steps = 100; { char intBuffer[20]; cout << "itoa() took:" << endl; TIMER_START; for (int i = _INT32_MIN; i < i + steps ; i += steps) itoa(i, intBuffer, 10); TIMER_STOP; } { cout << "Int32ToStr() took:" << endl; TIMER_START; for (int i = _INT32_MIN; i < i + steps ; i += steps) Int32ToStr(i); TIMER_STOP; } cout << "Done" << endl; int wait; cin >> wait; return 0; }

On 64-bit Windows the result from running this example is:

    itoa() took:
    2914.12 ms.
    Int32ToStr() took:
    306.637 ms.
    Done

On 32-bit Windows the result from running this example is:

    itoa() took:
    3126.12 ms.
    Int32ToStr() took:
    299.387 ms.
    Done

Why do you use reverse string lookup buffers?

It's possible to do this without the reverse string lookup buffers (thus saving half the internal memory), but this makes it significantly slower (timed at about 850 ms on 64-bit and 380 ms on 32-bit systems). It's not clear to me exactly why it's so much slower, especially on 64-bit systems. To test this yourself, simply change the following code:

    #define _UINT32 unsigned _INT32
    ...
    static const _UINT32 numElem10Radix = 100000;
    ...
    void InitBuffers()
    {
        char intBuffer[20];

        for (int i = 0; i < numElem10Radix; i++)
        {
            _itoa(i, intBuffer, 10);
            size_t numLen = strlen(intBuffer);
            char *intStr = new char[numLen + 1];
            strcpy(intStr, intBuffer);
            numbersIn10Radix[i] = intStr;
            numbersIn10RadixLen[i] = numLen;
        }
    }
    ...
    const char *Int32ToStr(_INT32 i)
    {
        char buf[_INT32_MAX_LENGTH + 2];
        char *p = buf + _INT32_MAX_LENGTH + 1;
        _UINT32 modVal;

        *--p = '\0';

        _UINT32 j = i;

        do
        {
            modVal = j % numElem10Radix;

            switch(numbersIn10RadixLen[modVal])
            {
                case 5:
                    *--p = numbersIn10Radix[modVal][4];
                case 4:
                    *--p = numbersIn10Radix[modVal][3];
                case 3:
                    *--p = numbersIn10Radix[modVal][2];
                case 2:
                    *--p = numbersIn10Radix[modVal][1];
                default:
                    *--p = numbersIn10Radix[modVal][0];
            }

            j /= numElem10Radix;
        } while (j);

        if (i < 0) *--p = '-';

        return p;
    }

There's something seriously wrong in this concept, integers like 10000009 are stringified to "1009" :(

Jakub · Accepted Answer · 2014-07-03 12:20:04Z

That's part of my code in asm. It works only for the range 0-255. It could be made faster, but it shows the direction and the main idea.

4 imuls, 1 memory read, 1 memory write

You can try to eliminate 2 of the imuls by using leas with shifting. However, you won't find anything faster in C/C++/Python ;)

    void itoa_asm(unsigned char inVal, char *str)
    {
        __asm
        {
            // eax=100's -> (some_integer/100) = (some_integer*41) >> 12
            movzx esi,inVal
            mov eax,esi
            mov ecx,41
            imul eax,ecx
            shr eax,12

            mov edx,eax
            imul edx,100
            mov edi,edx

            // ebx=10's -> (some_integer/10) = (some_integer*205) >> 11
            mov ebx,esi
            sub ebx,edx
            mov ecx,205
            imul ebx,ecx
            shr ebx,11

            mov edx,ebx
            imul edx,10

            // ecx = 1
            mov ecx,esi
            sub ecx,edx // -> sub 10's
            sub ecx,edi // -> sub 100's

            add al,'0'
            add bl,'0'
            add cl,'0'

            //shl eax,
            shl ebx,8
            shl ecx,16

            or eax,ebx
            or eax,ecx

            mov edi,str
            mov [edi],eax
        }
    }
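The same two reciprocal constants can be written in portable C, which also makes it easy to check them over the whole 0-255 range. This is a sketch for illustration (the name `u8toa` is made up), not a claim about how it compares in speed with the inline assembly.

```c
#include <stdint.h>

/* Converts 0..255 to three zero-padded digits plus a terminator (4 bytes),
   matching the output layout of the assembly version. */
static void u8toa(uint8_t v, char *str)
{
    uint32_t hundreds = (v * 41u) >> 12;       /* v / 100, exact for v <= 255 */
    uint32_t rest     = v - hundreds * 100;    /* 0..99 */
    uint32_t tens     = (rest * 205u) >> 11;   /* rest / 10, exact for rest <= 99 */
    uint32_t ones     = rest - tens * 10;

    str[0] = (char)('0' + hundreds);
    str[1] = (char)('0' + tens);
    str[2] = (char)('0' + ones);
    str[3] = '\0';
}
```

The constants are only valid for these small ranges: 41/4096 and 205/2048 slightly overestimate 1/100 and 1/10, and the error grows with the input until the quotient is off by one.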

Serguei Fedorov · Accepted Answer · 2014-09-23 15:28:35Z

@Inge Henriksen

I believe your code has a bug:

    IntToStr(2701987) == "2701987" //Correct
    IntToStr(27001987) == "2701987" //Incorrect

Here's why your code is wrong:

    modVal = i % numElem10Radix;

    switch (reverseArrayLen[modVal])
    {
        case 5:
            *--p = reverseArray[modVal][4];
        case 4:
            *--p = reverseArray[modVal][3];
        case 3:
            *--p = reverseArray[modVal][2];
        case 2:
            *--p = reverseArray[modVal][1];
        default:
            *--p = reverseArray[modVal][0];
    }

    i /= numElem10Radix;

There should be a leading 0 before "1987", which is "01987". But after the first iteration, you get 4 digits instead of 5.

So,

IntToStr(27000000) = "2700" //Incorrect
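The rule that fixes this is that every 5-digit chunk except the most significant one must be written with all five digits, including leading zeros. The sketch below shows that rule for unsigned values; it replaces the lookup tables with plain division for brevity, and the function name is hypothetical.

```c
#include <stdint.h>

/* buf must hold at least 11 bytes. Returns a pointer to the first digit. */
static char *u32toa_chunked(uint32_t val, char *buf)
{
    char *p = buf + 10;
    *p = '\0';

    for (;;) {
        uint32_t chunk = val % 100000;
        val /= 100000;

        /* Lower chunks are zero-padded to 5 digits; the top chunk
           (val == 0 after the division) is written without padding. */
        int ndigits = (val != 0) ? 5 : 1;
        do {
            *--p = (char)('0' + chunk % 10);
            chunk /= 10;
        } while (--ndigits > 0 || chunk != 0);

        if (val == 0)
            break;
    }
    return p;
}
```

With this rule, 27001987 is split into the chunks 270 and 1987, and the lower chunk is emitted as "01987", giving "27001987".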

poby · Accepted Answer · 2020-07-07 01:46:29Z

For unsigned 0 to 9,999,999 with terminating null. (99,999,999 without)

    void itoa(uint64_t u, char *out) // up to 9,999,999 with terminating zero
    {
        *out = 0;
        do {
            uint64_t n0 = u;
            *((uint64_t *)out) = (*((uint64_t *)out) << 8) | (n0 + '0' - (u /= 10) * 10);
        } while (u);
    }
