Is there a C compiler that fails to compile this?

Question

I was hanging out in my profiler for a while trying to figure out how to speed up a common log parser which was bottlenecked around the date parsing, and I tried various algorithms to speed things up.

The thing I tried that was fastest for me was also by far the most readable, but potentially non-standard C.

This worked quite well in GCC, icc, and my really old and picky SGI compiler. As it's a quite readable optimization, where doesn't it do what I want?

static int parseMonth(const char *input) { int rv=-1; int inputInt=0; int i=0; for(i=0; i<4 && input[i]; i++) { inputInt = (inputInt << 8) | input[i]; } switch(inputInt) { case 'Jan/': rv=0; break; case 'Feb/': rv=1; break; case 'Mar/': rv=2; break; case 'Apr/': rv=3; break; case 'May/': rv=4; break; case 'Jun/': rv=5; break; case 'Jul/': rv=6; break; case 'Aug/': rv=7; break; case 'Sep/': rv=8; break; case 'Oct/': rv=9; break; case 'Nov/': rv=10; break; case 'Dec/': rv=11; break; } return rv; }

@JaredPar: I wouldn't hold up VS as a good measure of portability. — Robert Gamble
– Robert Gamble, Commented Nov 30, 2008 at 17:49

Jonathan Leffler · Accepted Answer · 2008-11-30 02:01:32Z

Solaris 10 - SPARC - SUN Compiler.

Test code:

#include <stdio.h> static int parseMonth(const char *input) { int rv=-1; int inputInt=0; int i=0; for(i=0; i<4 && input[i]; i++) { inputInt = (inputInt << 8) | input[i]; } switch(inputInt) { case 'Jan/': rv=0; break; case 'Feb/': rv=1; break; case 'Mar/': rv=2; break; case 'Apr/': rv=3; break; case 'May/': rv=4; break; case 'Jun/': rv=5; break; case 'Jul/': rv=6; break; case 'Aug/': rv=7; break; case 'Sep/': rv=8; break; case 'Oct/': rv=9; break; case 'Nov/': rv=10; break; case 'Dec/': rv=11; break; } return rv; } static const struct { char *data; int result; } test_case[] = { { "Jan/", 0 }, { "Feb/", 1 }, { "Mar/", 2 }, { "Apr/", 3 }, { "May/", 4 }, { "Jun/", 5 }, { "Jul/", 6 }, { "Aug/", 7 }, { "Sep/", 8 }, { "Oct/", 9 }, { "Nov/", 10 }, { "Dec/", 11 }, { "aJ/n", -1 }, }; #define DIM(x) (sizeof(x)/sizeof(*(x))) int main(void) { size_t i; int result; for (i = 0; i < DIM(test_case); i++) { result = parseMonth(test_case[i].data); if (result != test_case[i].result) printf("!! FAIL !! %s (got %d, wanted %d)\n", test_case[i].data, result, test_case[i].result); } return(0); }

Results (GCC 3.4.2 and Sun):

$ gcc -O xx.c -o xx xx.c:14:14: warning: multi-character character constant xx.c:15:14: warning: multi-character character constant xx.c:16:14: warning: multi-character character constant xx.c:17:14: warning: multi-character character constant xx.c:18:14: warning: multi-character character constant xx.c:19:14: warning: multi-character character constant xx.c:20:14: warning: multi-character character constant xx.c:21:14: warning: multi-character character constant xx.c:22:14: warning: multi-character character constant xx.c:23:14: warning: multi-character character constant xx.c:24:14: warning: multi-character character constant xx.c:25:14: warning: multi-character character constant $ ./xx $ cc -o xx xx.c $ ./xx !! FAIL !! Jan/ (got -1, wanted 0) !! FAIL !! Feb/ (got -1, wanted 1) !! FAIL !! Mar/ (got -1, wanted 2) !! FAIL !! Apr/ (got -1, wanted 3) !! FAIL !! May/ (got -1, wanted 4) !! FAIL !! Jun/ (got -1, wanted 5) !! FAIL !! Jul/ (got -1, wanted 6) !! FAIL !! Aug/ (got -1, wanted 7) !! FAIL !! Sep/ (got -1, wanted 8) !! FAIL !! Oct/ (got -1, wanted 9) !! FAIL !! Nov/ (got -1, wanted 10) !! FAIL !! Dec/ (got -1, wanted 11) $

Note that the last test case still passed - that is, it generated a -1.

Here's a revised - more verbose - version of parseMonth() which does work the same under both GCC and Sun C compiler:

#include <stdio.h> /* MONTH_CODE("Jan/") does not reduce to an integer constant */ #define MONTH_CODE(x) ((((((x[0]<<8)|x[1])<<8)|x[2])<<8)|x[3]) #define MONTH_JAN (((((('J'<<8)|'a')<<8)|'n')<<8)|'/') #define MONTH_FEB (((((('F'<<8)|'e')<<8)|'b')<<8)|'/') #define MONTH_MAR (((((('M'<<8)|'a')<<8)|'r')<<8)|'/') #define MONTH_APR (((((('A'<<8)|'p')<<8)|'r')<<8)|'/') #define MONTH_MAY (((((('M'<<8)|'a')<<8)|'y')<<8)|'/') #define MONTH_JUN (((((('J'<<8)|'u')<<8)|'n')<<8)|'/') #define MONTH_JUL (((((('J'<<8)|'u')<<8)|'l')<<8)|'/') #define MONTH_AUG (((((('A'<<8)|'u')<<8)|'g')<<8)|'/') #define MONTH_SEP (((((('S'<<8)|'e')<<8)|'p')<<8)|'/') #define MONTH_OCT (((((('O'<<8)|'c')<<8)|'t')<<8)|'/') #define MONTH_NOV (((((('N'<<8)|'o')<<8)|'v')<<8)|'/') #define MONTH_DEC (((((('D'<<8)|'e')<<8)|'c')<<8)|'/') static int parseMonth(const char *input) { int rv=-1; int inputInt=0; int i=0; for(i=0; i<4 && input[i]; i++) { inputInt = (inputInt << 8) | input[i]; } switch(inputInt) { case MONTH_JAN: rv=0; break; case MONTH_FEB: rv=1; break; case MONTH_MAR: rv=2; break; case MONTH_APR: rv=3; break; case MONTH_MAY: rv=4; break; case MONTH_JUN: rv=5; break; case MONTH_JUL: rv=6; break; case MONTH_AUG: rv=7; break; case MONTH_SEP: rv=8; break; case MONTH_OCT: rv=9; break; case MONTH_NOV: rv=10; break; case MONTH_DEC: rv=11; break; } return rv; } static const struct { char *data; int result; } test_case[] = { { "Jan/", 0 }, { "Feb/", 1 }, { "Mar/", 2 }, { "Apr/", 3 }, { "May/", 4 }, { "Jun/", 5 }, { "Jul/", 6 }, { "Aug/", 7 }, { "Sep/", 8 }, { "Oct/", 9 }, { "Nov/", 10 }, { "Dec/", 11 }, { "aJ/n", -1 }, { "/naJ", -1 }, }; #define DIM(x) (sizeof(x)/sizeof(*(x))) int main(void) { size_t i; int result; for (i = 0; i < DIM(test_case); i++) { result = parseMonth(test_case[i].data); if (result != test_case[i].result) printf("!! FAIL !! %s (got %d, wanted %d)\n", test_case[i].data, result, test_case[i].result); } return(0); }

I wanted to use MONTH_CODE() but the compilers did not cooperate.

Is SPARC is big-endian? If so, reversing the characters (or the loop ordering) ought to fix it.
I'm very intrigued at the difference between GCC and Sun C. Granted this is at best implementation defined behaviour, but normally, GCC emulates the native compilers pretty well. I don't regard it as a bug, though - not in either compiler. It is simply 'implementation specific' behaviour and AOK.
Also, I tried: "/naJ" wanting -1 and got back a 'test failure' with the function returning 0. So, 4-byte word reverse ordering lets the test pass. But it hardly resolves the problem - that the parseMonth() code is not reliable across platforms.
@Jonathan - agreed. People have already posted a couple faster solutions, or solutions just as pretty to my problem. I have no excuse to leave this code in, pretty as I think it is, when there are real compilers I might encounter that will fail on it.
@Dustin: see my revision which provides a portable solution.

paxdiablo · Accepted Answer · 2008-11-30 00:31:56Z

13

if ( !input[0] || !input[1] || !input[2] || input[3] != '/' ) return -1; switch ( input[0] ) { case 'F': return 1; // Feb case 'S': return 8; // Sep case 'O': return 9; // Oct case 'N': return 10; // Nov case 'D': return 11; // Dec; case 'A': return input[1] == 'p' ? 3 : 7; // Apr, Aug case 'M': return input[2] == 'r' ? 2 : 4; // Mar, May default: return input[1] == 'a' ? 0 : (input[2] == 'n' ? 5 : 6); // Jan, Jun, Jul }

Slightly less readable and not so much validating, but perhaps even faster, no?

edited Nov 30, 2008 at 0:31

paxdiablo

888k243 gold badges1.6k silver badges2k bronze badges

answered Nov 30, 2008 at 0:14

Vilx-

107k93 gold badges293 silver badges432 bronze badges

7 Comments

Dustin Over a year ago

Typo on the last line ('a' ? 0 -- not 1), but yes, this is slightly over twice as fast on my machine. Where performance matters, you win. I'm wondering where this isn't readable enough, though.

paxdiablo Over a year ago

@Dustin, it's readable if you add comments explaining it - won't affect run speed, will make it as readable as yours for the next bod looking at it. I would make sure the 31-day months are listed first since that would give a tiny bit of extra speed (statistically).

paxdiablo Over a year ago

Assuming the case statements are compiled in the given order, that is.

Jonathan Leffler Over a year ago

I'm not convinced that 'Zax' should be accepted as January; nor that Xxx should be accepted as July.

Vilx- Over a year ago

As I said - it's not validating. If you want validation, then these are note the lines are looking for.

|

tvanfosson · Accepted Answer · 2008-11-30 00:38:53Z

You're just computing a hash of those four characters. Why not predefine some integer constants that compute the hash in the same way and use those? Same readability and you're not depending on any implementation specific idiosyncrasies of the compiler.

uint32_t MONTH_JAN = 'J' << 24 + 'a' << 16 + 'n' << 8 + '/'; uint32_t MONTH_FEB = 'F' << 24 + 'e' << 16 + 'b' << 8 + '/'; ... static uint32_t parseMonth(const char *input) { uint32_t rv=-1; uint32_t inputInt=0; int i=0; for(i=0; i<4 && input[i]; i++) { inputInt = (inputInt << 8) | (input[i] & 0x7f); // clear top bit } switch(inputInt) { case MONTH_JAN: rv=0; break; case MONTH_FEB: rv=1; break; ... } return rv; }

I do like this one aesthetically. I think it looks just as nice as what I made, but can be made to work more consistently and without the occasional compiler warning. Thanks.
Use macros; #define MON_HASH(s) ((((s)[0]) & 0xff << 24) | (((s)[1]) & 0xff << 16) | (((s)[2]) & 0xff << 8) | (((s)[3]) & 0xff)) ... switch (MON_HASH(input)) { ... case MON_HASH("Jan/"). NOTE that using something like gperf to generate an 8-bit (or shorter) hash (Vilx- comes closest to what gperf does) will be MUCH more efficient than the int "hashes" above (a switch on an int will NOT be implemented with a jump table, whereas a switch on a char or __int8 probably will.)

Johannes Schaub - litb · Accepted Answer · 2008-11-30 00:06:08Z

I only know what the C Standard says about this (C99):

The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-deﬁned. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

(6.4.4.4/10 taken from a draft)

So it's implementation defined. Meaning it is not guaranteed it works the same everywhere, but the behavior must be documented by the implementation. For example if int is only 16 bits wide in a particular implementation, then 'Jan/' can't be represented anymore like you intend it (char must be at least 8 bits, while a character literal is always of type int).

It's theoretically wrong, I understand. I'm wondering if there are any C compilers where it doesn't work in practice. It's completely out of curiosity as it worked for me even on an ancient compiler that fails on over half the open source stuff I find.
@Dustin, try compiling for PPC (non-Intel Mac OS X). I think that's big-endian.
@strager Did... I ran this on my G3 as well as my SGI (which is sort of bidirectional, but runs IRIX in big-endian).
@Dustin, I have an old 8051 compiler that fails to compile this - it only had 16-bit ints and it gags on the 'xxxx' constants (it would also fail at runtime with the shift-or code). In fact it also gags on 'xx' constants.
Dustin. you will probably not figure out by compiler, but by their backends. i can change by GCC backend to make int 16bits wide, and then gcc-eco32 will miscompile it.

Scott Evernden · Accepted Answer · 2008-11-30 01:54:08Z

6

char *months = "Jan/Feb/Mar/Apr/May/Jun/Jul/Aug/Sep/Oct/Nov/Dec/"; char *p = strnstr(months, input, 4); return p ? (p - months) / 4 : -1;

answered Nov 30, 2008 at 1:54

Scott Evernden

40.1k15 gold badges81 silver badges84 bronze badges

7 Comments

Dustin Over a year ago

I did that one as well. It's certainly smaller, but wasn't faster as it took considerably longer to parse December dates than January dates. I had one that would adapt the haystack to take advantage of the input being chronological as well. More complicated, but not really faster.

Scott Evernden Over a year ago

that surprising as the strstr() should compile into nearly a single machine instruction on most PCs

Robert Gamble Over a year ago

@Dustin, I ran some benchmarks without optimization using your original method and Scott's method on my 5 year old computer. Scott found Jan ten million times in 1.6s and Dec in 8.3s, your program found both Jan and Dec in 2.6s. So while significantly clearer, Scott's method is a little slower...

Robert Gamble Over a year ago

in some cases. My question is this: what exactly are you doing that only being able to call this function a million times a second causes a bottleneck?

vladr Over a year ago

@Robert, "ran some benchmarks without optimization" <-- the results are meaningless. Turn optimization on when you're benchmarking!

|

Robert Gamble · Accepted Answer · 2008-11-30 00:59:47Z

There are at least 3 things that keep this program from being portable:

Multi-character constants are implementation-defined so different compilers may handle them differently.
A byte can be more than 8 bits, there is plenty of hardware where the smallest addressable unit of memory is 16 or even 32 bits, you often find this in DSPs for example. If a byte is more than 8 bits then so will char since char is by definition one byte long; your program will not function properly on such systems.
Lastly, there are many machines where int is only 16-bits (which is the smallest size allowed for int) including embedded devices and legacy machines, your program will fail on these machines as well.

My 64-bit processor has 32-bit int's, and it works fine. Maybe you meant less than?
Really or the BYTE thing? AFAIK byte is by definition 8 bits
@JaredPar: A byte in C has at least 8 bits but may have more. There are plenty of systems around today (new ones too!) with bytes larger than 8 bits.
Okay, you're using the C definition (notoriously vague). I've gotten to the point now that I use explicit sized types in my code to stop confusing other people. Ex: __int16, __int32, etc ...
@JaredPar: There is plenty of hardware where the smallest addressable unit of memory is 16 or 32 bits, this is not just a C definition thing. On such systems a char will be 16 or 32 bits and an int may be only 1 byte. The world isn't a Windows PC ;)

AShelly · Accepted Answer · 2008-12-02 02:37:54Z

National Instrument's CVI 8.5 for Windows compiler fails on your original code with multiple warnings:

 Warning: Excess characters in multibyte character literal ignored.

and errors of the form:

 Duplicate case label '77'.

It succeeds on Jonathan's code.

strager · Accepted Answer · 2008-11-30 00:01:29Z

2

I get warnings, but no errors (gcc). Seems to compile and operate fine. May not work for big-endian systems, though!

I wouldn't suggest this method, though. Perhaps you can xor instead of or-shift, to create a single byte. Then use the case statement on a byte (or, faster, use a LUT of the first N bits).

answered Nov 30, 2008 at 0:01

strager

90.3k27 gold badges139 silver badges181 bronze badges

5 Comments

Dustin Over a year ago

That's effectively a perfect hash. It doesn't matter too much for this application, but the downside of a perfect hash is that it can return wrong answers for bad input.

strager Over a year ago

@Dustin, then you can do a strcmp once you have your possible answer to check it. Use a LUT for this. Easy.

Dustin Over a year ago

One of my earlier attempts at this was to write a program that took a set of words as input and generated a hierarchical case statement switching on letters before doing a strcmp. For example: pastebin.com/f30e09462 (optionally breaking the Js down further).

strager Over a year ago

@Dustin, you can perhaps break your switch up into further switches. This should be faster than calling memcmp. If you don't want to do that, you can take advantage of the fact that you already know the first character, and don't need to compare it.

Dustin Over a year ago

@strager I wrote this thing to build that for me: github.com/dustin/snippets/tree/master/python/misc/gensearch.py

Alnitak · Accepted Answer · 2008-11-30 00:12:47Z

The fact that a four character constant is equivalent to an particular 32-bit integer is a non-standard feature often seen on compilers for MS Windows and Mac computers (and PalmOS, AFAICR).

On theses systems a four character string is commonly used as a tag for identifying chunks of data files, or as an application / data-type identifier (e.g. "APPL").

It's a convenience then for the developer that they can store such a string into various data-structures without worrying about zero-byte termination, pointers, etc.

Alessandro Jacopson · Accepted Answer · 2008-11-30 10:45:57Z

Comeau compiler

Comeau C/C++ 4.3.10.1 (Oct 6 2008 11:28:09) for ONLINE_EVALUATION_BETA2 Copyright 1988-2008 Comeau Computing. All rights reserved. MODE:strict errors C99 "ComeauTest.c", line 11: warning: multicharacter character literal (potential portability problem) case 'Jan/': rv=0; break; ^ "ComeauTest.c", line 12: warning: multicharacter character literal (potential portability problem) case 'Feb/': rv=1; break; ^ "ComeauTest.c", line 13: warning: multicharacter character literal (potential portability problem) case 'Mar/': rv=2; break; ^ "ComeauTest.c", line 14: warning: multicharacter character literal (potential portability problem) case 'Apr/': rv=3; break; ^ "ComeauTest.c", line 15: warning: multicharacter character literal (potential portability problem) case 'May/': rv=4; break; ^ "ComeauTest.c", line 16: warning: multicharacter character literal (potential portability problem) case 'Jun/': rv=5; break; ^ "ComeauTest.c", line 17: warning: multicharacter character literal (potential portability problem) case 'Jul/': rv=6; break; ^ "ComeauTest.c", line 18: warning: multicharacter character literal (potential portability problem) case 'Aug/': rv=7; break; ^ "ComeauTest.c", line 19: warning: multicharacter character literal (potential portability problem) case 'Sep/': rv=8; break; ^ "ComeauTest.c", line 20: warning: multicharacter character literal (potential portability problem) case 'Oct/': rv=9; break; ^ "ComeauTest.c", line 21: warning: multicharacter character literal (potential portability problem) case 'Nov/': rv=10; break; ^ "ComeauTest.c", line 22: warning: multicharacter character literal (potential portability problem) case 'Dec/': rv=11; break; ^ "ComeauTest.c", line 1: warning: function "parseMonth" was declared but never referenced static int parseMonth(const char *input) { ^

Thanks for the compiler pointer. I'm not worried about compilers complaining about potential portability problems, just compilers producing code that behaves differently.

artificialidiot · Accepted Answer · 2008-11-30 00:13:27Z

Machine word size issues aside, your compiler may promote input[i] to a negative integer which will just set the upper bits of inputInt with or operation, so I suggest you to be explicit about signedness of char variables.

But since in US, no one cares about the 8th bit, it is probably a non-issue for you.

IBM cares a great deal about the 8th bit, they even do Unicode :-)

Charlie Martin · Accepted Answer · 2008-11-30 01:31:04Z

I'd sure love to see the profiling that shows this is your most significant bottleneck, but in any case if you're going to pull something like this, use a union instead of 50 instructions looping and shifting. Here's a little example program, I'll leave it to you to fit it into your program.

/* union -- demonstrate union for characters */ #include <stdio.h> union c4_i { char c4[5]; int i ; } ; union c4_i ex; int main (){ ex.c4[0] = 'a'; ex.c4[1] = 'b'; ex.c4[2] = 'c'; ex.c4[3] = 'd'; ex.c4[4] = '\0'; printf("%s 0x%08x\n", ex.c4, ex.i ); return 0; }

Here's example output:

bash $ ./union abcd 0x64636261 bash $

I did a similar thing by just casting the char* to an int* and then reading from within it. That's where I got into endian problems.
None of these things is going to be immune to problems with endian-ness, etc, plus this is vulnerable to someone mixing case.

HUAGHAGUAH · Accepted Answer · 2008-11-30 18:38:45Z

As mentioned by others, that code throws a bunch of warnings and is probably not endian-safe.

Was your original date parser hand-written as well? Have you tried strptime(3)?

Collectives™ on Stack Overflow

Is there a C compiler that fails to compile this?

13 Answers 13

10 Comments

7 Comments

2 Comments

7 Comments

7 Comments

8 Comments

Comments

5 Comments

Comments

1 Comment

1 Comment

2 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

13 Answers 13

10 Comments

7 Comments

2 Comments

7 Comments

7 Comments

8 Comments

Comments

5 Comments

Comments

1 Comment

1 Comment

2 Comments

Comments

Linked

Related