1

So, I'm writing this simple HTTP client in C and I seem to be stuck on this problem: how do I strip the HTTP headers from the response? After all, if I get a binary file I can't just write the headers out to my output file. I can't seem to go in once the data is already written to a file, because Linux screams when you try to even view the first few lines of a binary file, even if you know they're just text HTTP headers.

Now, here's the rub (well, I suppose the whole thing is a rub). Sometimes the whole header doesn't even come in on the first response packet, so I can't guarantee that we'll have the whole header in our first iteration (that is, the first iteration of receiving the HTTP response; we're using recv() here), which means I need to somehow... well, I don't even know. I can't seem to mess with the data once it's already written to disk, so I need to deal with it as it's coming in, but we can't be sure how it's going to come in, and even if we were sure, strtok() is a nightmare to use.

I guess I'm just hoping someone out there has a better idea. Here's the relevant code. This is really stripped down, I'm going for an MCVE, of course. Also, you can just assume that socket_file_descriptor is already instantiated and get_request contains the text of our GET request. Here it is:

    FILE* fp = fopen("output", "wb");   // Open the file for writing
    char buf[MAXDATASIZE];              // The buffer
    ssize_t numbytes;                   // For the size of the response (recv() returns ssize_t, -1 on error)

    /*
     * Do all the socket programming stuff to get the socket file descriptor that we need
     * ...
     * ...
     */

    send(socket_file_descriptor, get_request, strlen(get_request), 0); // Send the HTTP GET request

    while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
        /* I either need to do something here, to deal with getting rid of the headers before writing to file */

        fwrite(buf, 1, numbytes, fp);   // Write to file
        memset(buf, 0, MAXDATASIZE);    // This just resets the buffer to make room for the next packet
    }

    close(socket_file_descriptor);
    fclose(fp);

    /* Or I need to do something here, to strip the file of its headers after it's been written to disk */

So, I thought about doing something like this. The only thing we know for sure is that the header is going to end in \r\n\r\n (two CRLF pairs, that is, a carriage return plus line feed, twice in a row). So we can use that. This doesn't really work, but hopefully you can figure out where I'm trying to go with it (comments from above removed):

    FILE* fp = fopen("output", "wb");
    char buf[MAXDATASIZE];
    ssize_t numbytes;       // recv() returns ssize_t (-1 on error)
    int header_found = 0;   // Add a flag, here

    /* ...
     * ...
     */

    send(socket_file_descriptor, get_request, strlen(get_request), 0);

    while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
        if (header_found == 1) { // So this won't happen our first pass through
            fwrite(buf, 1, numbytes, fp);
            memset(buf, 0, MAXDATASIZE);
        } else { // This will happen our first pass through, maybe our second or third; the header doesn't always come in in full on the first packet
            /* And this is where I'm stuck.
             * I'm thinking about using strtok() to parse through the lines, but....
             * well I just can't figure it out. I'm hoping someone can at least point
             * me in the right direction.
             *
             * The point here would be to somehow determine when we've seen two carriage returns
             * in a row and then mark header_found as 1. But even if we DID manage to find the
             * two carriage returns, we still need to write the remaining data from this packet to
             * the file before moving on to the next iteration, but WITHOUT including the
             * header information.
             */
        }
    }

    close(socket_file_descriptor);
    fclose(fp);

I've been staring at this code for three days straight and am slowly losing my mind, so I really appreciate any insight anyone is able to provide. To generalize the problem, I guess this really comes down to me just not understanding how to do text parsing in C.

7
  • 2
    Is this "How do I write an HTTP parser in C?" If so, that's a lot to work through in one question. If you've never written a parser before, start with something simple, like a line-delimited parser that splits into lines. From there, parse headers by correctly splitting the header name from header value. Commented Sep 7, 2020 at 20:40
  • 1
    But yes, I think this question really does just come down to "How do I write an HTTP parser in C." Well, not even, I don't care about the information in the header, I just want to lop it off and only take the message body. Couldn't care less what the headers actually say. It's really just "how do I parse this buffer." Commented Sep 7, 2020 at 21:00
  • 2
    In C this generally plays out as simple state machines, where you loop over the content and branch to different states based on character matches inside a switch statement. There are innumerable HTTP parsers out there, many in C, which are open-source and easily obtained for inspiration. There are surely also dozens of well-written examples that walk you through this. Commented Sep 7, 2020 at 21:00
  • 3
    If you're just looking for the end of the headers, use strstr to look for CRLFCRLF and there's your data. The headers terminate with that sequence. Commented Sep 7, 2020 at 21:01
  • 1
    @ALittleHelpFromMyFriends See my answer to "Differ between header and content of http server response (sockets)". Commented Sep 8, 2020 at 17:09
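
The strstr suggestion from the comments could be sketched roughly as follows. This assumes the bytes received so far are accumulated in one buffer that has room for an extra NUL byte; the helper name header_end_offset is hypothetical and not from the thread. It relies on the header itself containing no NUL bytes, so strstr reaches the terminator before it can be stopped by binary body data.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): acc holds the len bytes
 * received so far and must have room for one extra byte at acc[len].
 * Returns the offset of the first body byte, or -1 if the "\r\n\r\n"
 * terminator has not arrived yet.
 */
long header_end_offset(char *acc, size_t len)
{
    acc[len] = '\0';                          /* strstr needs a C string */
    const char *end = strstr(acc, "\r\n\r\n");
    if (end == NULL)
        return -1;                            /* keep calling recv() */
    return (long)(end - acc) + 4;             /* skip the terminator itself */
}
```

Once the offset is known, everything from acc + offset up to acc + len is body data and can be passed to fwrite; later recv() chunks can then be written out unchanged.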

3 Answers

1

The second self-answer is better than the first one, but it still could be made much simpler:

    const char* pattern = "\r\n\r\n";
    const char* patp = pattern;

    while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
        for (int i = 0; i < numbytes; i++) {
            if (*patp == 0) {
                /* Whole terminator matched: everything from here on is body */
                fwrite(buf + i, 1, numbytes - i, fp);
                break;
            }
            else if (buf[i] == *patp) ++patp;
            /* Mismatch: restart, but let a stray '\r' begin a new match */
            else patp = (buf[i] == *pattern) ? pattern + 1 : pattern;
        }
        /* This memset isn't really necessary */
        memset(buf, 0, MAXDATASIZE);
    }

That looks like a general solution, but it's not really: there are values for pattern for which it might fail to see a terminator under particular circumstances. But this particular pattern is not problematic. You might want to think about what sort of pattern would cause a problem before taking a look at the more general solution.
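
One way to see the caveat concretely is the sketch below. The function name naive_find_end and the pattern "aab" are illustrative choices, not taken from the answer. The matcher restarts the pattern on a mismatch (retrying only the current byte against the pattern's first character) and never backs up any further, which is enough for "\r\n\r\n" but not for every pattern.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch of a "reset on mismatch" matcher (hypothetical name).
 * Returns the index just past the first match it detects, or -1 if it
 * never completes the pattern. It is deliberately NOT a full substring
 * search: after a mismatch it only retries the current byte.
 */
long naive_find_end(const char *text, size_t len, const char *pat)
{
    if (*pat == '\0')
        return 0;                      /* empty pattern matches at once */

    const char *p = pat;
    for (size_t i = 0; i < len; i++) {
        if (text[i] == *p) {
            ++p;                       /* next pattern byte matched */
        } else {
            p = pat;                   /* start the pattern over */
            if (text[i] == *p)
                ++p;                   /* current byte may begin a match */
        }
        if (*p == '\0')
            return (long)i + 1;        /* whole pattern seen */
    }
    return -1;
}
```

With the pattern "aab" and the input "aaab", this returns -1 even though strstr finds the match at index 1: after matching "aa" and then seeing a third 'a', the matcher forgets that the last two bytes are still "aa". Patterns in which a proper prefix reappears later in the pattern need a smarter fallback, which is what algorithms such as Knuth-Morris-Pratt provide.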


11 Comments

@ALittleHelpFromMyFriends: when it points at the NUL terminator at the end of pattern. (*patp == 0 doesn't test if patp is 0. It tests whether it points to a 0. Those are very different tests.)
\0 means "the character whose code is 0". So, yes. And not just in this context. C character literals have type int.
@ALittleHelpFromMyFriends: All it is doing is using the termination pattern to contain the individual characters, so that instead of hardcoding the characters into the code (which is complicated), it just steps through the pattern. In effect, the pointer patp could be your counter (and I could have implemented it as a counter, but why?). Doing it this way means that I can use exactly the same code with a different terminator sequence, without even having to know how long the terminator sequence is. (But watch out for the note at the end of the answer: not every pattern works.)
@ALittleHelpFromMyFriends: If you're going to use C for string functions, you need to have a clear idea what a pointer is, what a null-terminator is, and how they work together to avoid having to constantly count the length of a string. None of it is complicated. It's just how you look at the problem.
For example, why did I just drop your boolean header_found? Because the test is so simple (*patp == 0) that nothing is saved by caching a boolean value. Without that boolean, you would just end up doing the same test twice, once before the loop and once at the beginning of the loop. That's pointless, so I could just eliminate the redundant test along with the boolean.
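
The distinction drawn in these comments between testing the pointer and testing what it points at can be sketched in a few lines. The function names here are hypothetical and only serve the illustration.

```c
#include <stddef.h>

/* Returns 1 if p points at the NUL byte that ends a string (*p == 0). */
int points_at_terminator(const char *p)
{
    return *p == '\0';
}

/* Returns 1 if p itself is a null pointer (p == 0); never dereferences p. */
int is_null_pointer(const char *p)
{
    return p == NULL;
}

/* Walks a string using only the terminator test, with no stored length. */
size_t steps_to_terminator(const char *s)
{
    size_t steps = 0;
    for (const char *p = s; !points_at_terminator(p); ++p)
        ++steps;
    return steps;
}
```

For the terminator "\r\n\r\n", a pointer advanced four times from the start points at the NUL byte, so the first test is true while the pointer itself is still a perfectly valid, non-null address.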
0

So, I know this is not the most elegant way to go about this, but... I did get it. For anyone who finds this question and is curious about at least an answer, here it is:

    int count = 0;
    int firstr_found = 0;
    int firstn_found = 0;
    int secondr_found = 0;
    int secondn_found = 0;

    FILE* fp = fopen("output", "wb");
    char buf[MAXDATASIZE];
    ssize_t numbytes;       // recv() returns ssize_t (-1 on error)
    int header_found = 0;

    /* ...
     * ...
     */

    send(socket_file_descriptor, get_request, strlen(get_request), 0);

    while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
        if (header_found == 1) {
            fwrite(buf, 1, numbytes, fp);
        } else {
            // These buf[i]'s are going to return as integers (ASCII)
            // \r is 13 and \n is 10, so we're looking for 13 10 13 10
            // This also needs to be agnostic of which packet # we're on; sometimes the header is split up.
            for (int i = 0; i < numbytes; i++) {
                if (firstr_found == 1 && firstn_found == 1 && secondr_found == 1 && secondn_found == 1) { // WE FOUND IT!
                    header_found = 1;
                    // We want to skip the parts of the buffer we've already looked at, that's header, and our numbytes will be decreased by that many
                    fwrite(buf + i, 1, numbytes - i, fp);
                    break;
                }

                if (buf[i] == 13 && firstr_found == 0) { // We found our first \r, mark it and move on to next iteration
                    firstr_found = 1;
                    continue;
                }

                if (buf[i] == 10 && firstr_found == 1 && firstn_found == 0) { // We found our first \n, mark it and move on
                    firstn_found = 1;
                    continue;
                } else if (buf[i] != 13 && buf[i] != 10) { // Think about the second r, it'll ignore the first if, but fail on the second if, but we don't want to jump into this else block
                    firstr_found = 0;
                    firstn_found = 0;
                    continue;
                }

                if (buf[i] == 13 && firstr_found == 1 && firstn_found == 1 && secondr_found == 0) {
                    secondr_found = 1;
                    continue;
                } else if (buf[i] != 10) {
                    firstr_found = 0;
                    firstn_found = 0;
                    secondr_found = 0;
                    continue;
                }

                if (buf[i] == 10 && firstr_found == 1 && firstn_found == 1 && secondr_found == 1 && secondn_found == 0) {
                    secondn_found = 1;
                    continue;
                }
            }
        }

        memset(buf, 0, MAXDATASIZE);
        count++;
    }

    close(socket_file_descriptor);
    fclose(fp);

2 Comments

Tip: Instead of first/second, just keep a counter.
Ahhh r_found = 0, n_found = 0 and just increment it when we find one. That's a good idea.
0

Adding another answer because, well I suppose I think I'm clever. Thanks to @tadman for the idea of a counter. Look here (I'm going to shave off a lot of the bloat and just do the while loop, if you've looked at my other code blocks you should be able to see what I mean here) ...

    /* ...
     * ...
     */

    int consec_success = 0;

    while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
        if (header_found == 1) {
            fwrite(buf, 1, numbytes, fp);
        } else {
            for (int i = 0; i < numbytes; i++) {
                if (consec_success == 4) {
                    header_found = 1;
                    fwrite(buf + i, 1, numbytes - i, fp);
                    break;
                }

                if (buf[i] == 13 && consec_success % 2 == 0) {
                    consec_success++;
                } else if (buf[i] == 10 && consec_success % 2 == 1) {
                    consec_success++;
                } else {
                    // Mismatch: start over, but a '\r' here may itself begin the terminator
                    consec_success = (buf[i] == 13) ? 1 : 0;
                }
            }
        }

        memset(buf, 0, MAXDATASIZE);
    }

    /* ...
     * ...
     */
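
The counter idea can also be packaged as a small function that is fed one recv() chunk at a time, which makes the "header split across packets" case easy to check without a socket. This is only a sketch: the struct header_stripper and the function skip_header_bytes are hypothetical names, not part of the code above.

```c
#include <stddef.h>

/* Hypothetical state carried across recv() calls. */
struct header_stripper {
    int matched;    /* bytes of "\r\n\r\n" matched so far (0..4) */
};

/*
 * Feed one received chunk of n bytes. Returns how many leading bytes of
 * this chunk still belong to the header (including the terminator), so
 * the body, if any, starts at buf + return value. Once the terminator
 * has been seen, every later chunk yields 0.
 */
size_t skip_header_bytes(struct header_stripper *st, const char *buf, size_t n)
{
    static const char term[] = "\r\n\r\n";
    size_t i = 0;

    while (i < n && st->matched < 4) {
        if (buf[i] == term[st->matched])
            st->matched++;                          /* next terminator byte */
        else
            st->matched = (buf[i] == '\r') ? 1 : 0; /* restart the count */
        i++;
    }
    return i;
}
```

Inside the receive loop this would be used along the lines of `size_t off = skip_header_bytes(&st, buf, numbytes); fwrite(buf + off, 1, numbytes - off, fp);`, with `st` initialised to `{0}` before the loop. Because the count lives in the struct, it does not matter where the chunk boundaries fall.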
