2

I'm writing a function that is supposed to read a string of numbers, separated by commas. The format of the string is as following:

"1, 2, 3" 

The only "rule" is that the function will tolerate any spaces or tabs, as long as there's one comma between each number.

If the string is valid, the numbers are to be stored in a linked list.

for instance, the following strings are valid:

"1,2,14,2,80" " 250 , 1, 88" 

but the following are not valid:

" 5, 1, 3 ," "51, 60, 5,,9" 

I first tried my luck with strtok() (using delimiters ", \t", but from what i understand at the moment, it's impossible to check for errors. So I wrote my own function, but I'm extremely unhappy with it - I think the code is pretty bad, and although it seems to work, I'd really like to know if there is a cleaner, easier way to implement such a function.

My function is:

void sliceNumbers(char * string) { /*flag which marks if we're expecting a comma or not*/ int comma = FALSE; /*Are we inside a number?*/ int nFlag = TRUE; /*error flag*/ int error = FALSE; /*pointer to string start*/ char * pStart = string; /*pointer to string end*/ char * pEnd = pStart; /*if received string is null*/ if (!string) { /*add error and exit function*/ printf("You must specify numbers"); return; } /*this loop checks if all characters in the string are legal*/ while (*pStart != '\0') { if ((isdigit(*pStart)) || (*pStart == ',') || (*pStart == ' ') || (*pStart == '\t')) { pStart++; } else { char tmp[2]; tmp[0] = *pStart; tmp[1] = 0; printf("Invalid character"); error = TRUE; pStart++; } } if (!error) { pStart = string; if (*pStart == ',') { printf("Cannot start data list with a comma"); return; } pEnd = pStart; while (*pEnd != '\0') { if (comma) { if (*pEnd == ',') { if (!nFlag) { } if (*(pEnd + 1) == '\0') { printf("Too many commas"); return; } *pEnd = '\0'; /*Add the number to the linked list*/ addNumber(pStart, line, DC); comma = FALSE; nFlag = FALSE; pStart = pEnd; pStart++; pEnd = pStart; } else if (isdigit(*pEnd)) { if (!nFlag) { printf("numbers must be seperated by commas"); pEnd++; } else { if (*(pEnd + 1) == '\0') { pEnd++; /*Add the number to the linked list*/ addNumber(pStart); comma = FALSE; nFlag = FALSE; pStart = pEnd; pStart++; pEnd = pStart; } else { pEnd++; } } } else if (*pEnd == '\0') { if (nFlag) { /*Add the number to the linked list*/ addNumber(pStart, line, DC); } else { printf("Too many commas"); } } else if (*pEnd == ' ' || *pEnd == '\t') { nFlag = FALSE; pEnd++; } } else { if (*pEnd == ',') { printf("There must be only 1 comma between numbers"); return; } else if (isdigit(*pEnd)) { if (*(pEnd + 1) == '\0') { pEnd++; /*Add the number to the linked list*/ addnumber(pStart, line, DC); comma = FALSE; nFlag = FALSE; pStart = pEnd; pStart++; pEnd = pStart; } else { pStart = pEnd; pEnd++; nFlag = TRUE; comma = TRUE; } } else if (*pEnd == ' ' || *pEnd == '\t') { if (!nFlag) { pEnd++; } else { pEnd++; } } } } } } 
4
  • Take a look to strsep Commented Mar 13, 2017 at 9:51
  • You tell us what the program should do if the input is valid, but not what to do if it is not valid. I can think of two ways this problem could have been set for you: "Your program must detect invalid input and report an error" or "The input is guaranteed to be valid, so your program doesn't need to handle it". This will affect our answers. Commented Mar 13, 2017 at 10:23
  • it should not be a problem to check the return value of strtok to determine whether a line is valid or not. if the token returned has length zero, it is invalid if not, add to array by doing atoi() on the token. Commented Mar 13, 2017 at 11:19
  • Construct a state machine. You have ~3 token types and ~5 states. Commented Mar 13, 2017 at 11:34

4 Answers 4

3

You have defined a number of booleans (although you've declared them as ints) that keep track of the current state. You could combine these into one state variable, using #define to define possible values:

#define STATE_START 0 #define STATE_IN_NUMBER 1 #define STATE_COMMA 2 #define STATE_FINISHED 3 #define STATE_ERROR 4 int state = STATE_START; 

You can draw a diagram (a bit like a flow chart) showing how each character moves us from one state to another.

enter image description here

(For my image I've kept things simple and only shown the non-error states for input with no spaces)

Or just put it in words:

current state | input | next state| side effect ----------------------------------------------------------------------- START | digit | IN_NUMBER | start storing a number START | other | ERROR | IN_NUMBER | digit | IN_NUMBER | continue storing a number IN_NUMBER | comma | COMMA | complete storing a number IN_NUMBER | null | FINISHED | finalise output IN_NUMBER | other | ERROR | report error COMMA | digit | IN_NUMBER | start storing a number COMMA | comma | ERROR | COMMA | other | ERROR | 

(For my table I've added basic error states, but still haven't accounted for whitespace)

You will need to add some more states and transitions to deal with spaces and tabs, but the principle doesn't change. I'd suggest starting with an implementation that works without spaces, then adding to it.

This allows you to write a finite state machine, one implementation of which looks like this:

int state = STATE_START; while(state != STATE_FINISHED && state != STATE_ERROR) { char c = input[offset++]; switch(state) { case STATE_START: state = handleStateStart(...); break; case STATE_IN_NUMBER: state = handleInNumber(...); break; // etc. default: sprintf(error_message, "Reached unsupported state: %c", state); state = STATE_ERROR; } } 

The parameters to the handling functions need to pass in the data structures it will be reading and modifying. For example:

int handleStateStart( char c, int* current_number, char *error_message) { if( ! isDigit(c)) { sprintf(error_message, "Expected a digit at char %d", *offset); return STATE_ERROR; } *current_number = atoi(c); return STATE_IN_NUMBER; } 

(This is an easy-to-understand way of implementing a state machine, but there are other ways to do it: Is there a typical state machine implementation pattern? )

Your CSV parsing problem lends itself very well to a state machine, and the resulting code will be very neat and clean. State machines are used for more complex parsing tasks, and are heavily used in things like compilers. Later in your study you will encounter regular expressions -- formally, regular expressions are a compact way of expressing finite state machines that consume characters.

Sign up to request clarification or add additional context in comments.

4 Comments

You probably need to handle whitespace, too. I think: 1, 2 3, 4, 5 should not be considered a valid string.
@joop "You will need to add some more states and transitions to deal with spaces and tabs, but the principle doesn't change"
You might simplify handleStateStart() by passing a struct containing all that working data -- but I decided not to assume knowledge of structs.
Also note I've used sprintf. In real code use snprintf.
1

strtok() is the right way to do that. But pass "," (comma) only as delimiter. You can check the resulting string for being zero-length (strlen(tok)==0), which means that you have two consecutive ','. After checking this you simply trim the results, i. e. as described here.

6 Comments

From 'man strtok' A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always non-empty strings.
I assume this is a homework question, so the aim is to write an algorithm, not use a library function.
@slim Then why would the OP mention strtok(), then say the only problem he/she has with it is because of error checking? Don't think OP has a problem with using library functions for his "algorithm".
@dmuir Thanks for the hint. However, checking the first character shouldn't be a big deal.
@AminNegm-Awad My point was that the OP says consecutive delimiters are an error, and that you can't detect this using strtok
|
0

You can use regex lib

1) Validate string

[^\d, ]|,[[:blank:]]+,|,{2,} 

where
[^\d, ] - found all symbols except digits, commas and white spaces
,[[:blank:]]+,|,{2,} - Validate string 2 or more commas with spaces and tabs without digits between commas

2) Process numbers

\d+ 

You can try it online here

Comments

0

A quite efficient straight forward approach:

  1. Remove all the spaces and tabs at one pass. You can avoid the space overhead by doing it in place.
  2. Read the stream of numbers and keep adding them to a linked list. If any invalid number, for example, of length 0, would be detected, simply return a NULL pointer and thus stop processing further.
  3. If pass 2 successfully completed, return the head pointer of that linked list.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.