How are textual data files parsed in modern C++?

Question

I am (too) often confronted with the task of having to parse textual data files -- the kind of textual structured data representation you used before "everyone" used XML -- that are some kind of industry standard. (There are too many of these.)

Anyway, the basic task is always to take a text file and stuff its contents into some kind of data structure so that our C++ code can do something with the info.

Now, I have implemented a few simple (and oh so buggy) parsers by hand, and there is little I despise more. :-)

So - I was wondering what the current state of the art is when I want to "parse" structured textual data into an in-memory representation (think: XML data binding for an arbitrary language).

What I found so far was "What parser generator do you recommend", but I'm not so sure I'm after a parser generator (like ANTLR).

Obvious candidates seem to be pegtl and Boost.Spirit but they both seem rather complicated (but at least they're in-language) and last time I tried Spirit, the compiler errors drove me nuts. (And pegtl needs a C++11 compatible compiler which is still a problem here (VC++ 2005).)

So am I missing a simpler solution for just getting something like

/begin COMPU_METHOD DEC " Decimal value" RAT_FUNC "%3.0" "dec" COEFFS 0 1.000000 0.000000 0 0.000000 1.000000 /end COMPU_METHOD

into a C++ data structure? (This is just an arbitrary example of what part of such a file may look like. For this format I could (and probably should) buy a library to parse it, as it is widespread enough -- which is not the case for all formats I encounter.)

-- or should I just go for the complexity of, say, Boost.Spirit?

Boost.Spirit isn't really much more complicated than any other (E)BNF-based tool. The question here is probably more whether you want to reflect the full grammar of the file (if it has one) or just "grab out" some of the information. The full grammar has the advantage that you will have fewer logic bugs caused by hand-coded parsing code misinterpreting the file format.
– PlasmaHH, Commented Nov 4, 2011 at 11:09
Is the format of the text files already defined and unchangeable, or are you allowed to define or change the format of the text files?
– rve, Commented Nov 4, 2011 at 12:08
@rve - No, the format is always fixed, coming from 3rd parties.
– Martin Ba, Commented Nov 4, 2011 at 12:31
4 years passed and, in case you are interested, we "just" released a new and greatly improved version of the PEGTL at github.com/ColinH/PEGTL which should work with VS2015.
– Daniel Frey, Commented Apr 16, 2015 at 6:32
@Daniel - great! We're on VS2015 ... no. wait 2005: 20_0_5 ... sigh :-)
– Martin Ba, Commented Apr 16, 2015 at 20:35

Community · Accepted Answer · 2017-05-23 12:10:50Z

Boost Spirit

See
- my answer here for a demo that resembles your sample;
- a more advanced, shorter demo here that parses into a tree structure;
- more samples (search).

Coco/R (C++)

I have had good results with this very pragmatic parser generator, which supports many languages/platforms using a common grammar format. The speed of parsing is comparable to Boost Spirit (although the processing of parsed data may be more efficient using generic programming).

Edit: To make things perfectly clear, there has never been anything that I wasn't able to do with Coco/R.

However, I'm really addicted to the ease with which Spirit deduces attribute type (conversions) for me generically. That is the main timesaver. There is a cost involved though:

- learning curve, maintenance
- compile time (but parsers don't often change)

Greyson · Accepted Answer · 2011-11-11 09:59:40Z

I highly recommend biting the bullet and using Boost.Spirit. Although the error messages can be enough to put one out of one's skull, it's been worth it for me. I have used it to implement parsers for under- (or un-) documented custom file formats in a matter of hours, instead of days.

I found that the best way to approach it was to view it as a "std::istream on steroids", since it uses the same double-angle (>>) notation to chain the elements being read.
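
To make the analogy concrete, here is a minimal sketch of the plain std::istream side of it. This is not Spirit code: it only shows fields being chained with >> and pulled out of a line in order. The Record struct, the read_record function and the simplified one-line layout (name, quoted description, a COEFFS keyword, then numbers) are assumptions made up for illustration, and std::quoted needs C++14.

```cpp
#include <cassert>
#include <iomanip>   // std::quoted (C++14)
#include <sstream>
#include <string>
#include <vector>

// Hypothetical target structure for a simplified record.
struct Record {
    std::string name;
    std::string description;
    std::vector<double> coeffs;
};

// Reads: NAME "quoted description" COEFFS n1 n2 ...
// Returns true only if the whole line matched that layout.
bool read_record(const std::string& line, Record& out) {
    // Precondition of this sketch: the caller passes in an empty Record.
    assert(out.coeffs.empty());

    std::istringstream in(line);
    std::string keyword;

    // Each >> extracts the next whitespace-separated field;
    // std::quoted keeps the spaces inside the double quotes.
    in >> out.name >> std::quoted(out.description) >> keyword;
    if (!in || keyword != "COEFFS") return false;

    double value = 0.0;
    while (in >> value) out.coeffs.push_back(value);

    // The loop must have stopped at end of input, not at a stray token.
    return in.eof();
}
```

Calling read_record on a line such as DEC " Decimal value" COEFFS 0 1.5 0 fills the three fields in order. In Spirit the >> operator plays a similar chaining role between parser components, with the grammar handling alternatives and nesting that this stream-based sketch cannot express.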

jszpilewski · Accepted Answer · 2011-11-04 10:15:17Z

You do not mention how sophisticated the parsers you created by hand were. But I believe such simple files can definitely be parsed by hand-crafted routines, as long as you split the work into lexical and syntactic parsing, each performed by a dedicated state machine. The first one recognizes tokens (in your example: keywords, numbers and strings) and feeds them to the second, which tries to recognize longer sentences and create the corresponding data structures. With simple files that follow regular grammars, with no ambiguities or other conflicts, it should be really simple and manageable.
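
The two-stage split described above can be sketched roughly as follows, using only the standard library (C++11 or later). This is an illustration under stated assumptions, not a parser for the real file format: the field names in CompuMethod are guesses based on the single sample line in the question, quoted strings have no escape sequences, comments are not handled, and the stages are written as a simple scanner and a sequential parser rather than as explicit state-machine tables. All type and function names here are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stage 1: the lexer turns characters into tokens.
enum class TokenKind { Word, String, Number };

struct Token {
    TokenKind kind;
    std::string text;  // for String tokens: the contents without the quotes
};

bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (is_space(input[i])) { ++i; continue; }
        if (input[i] == '"') {  // quoted string, no escapes in this sketch
            const std::size_t close = input.find('"', i + 1);
            if (close == std::string::npos)
                throw std::runtime_error("unterminated string literal");
            tokens.push_back({TokenKind::String, input.substr(i + 1, close - i - 1)});
            i = close + 1;
            continue;
        }
        std::size_t end = i;  // a bare token runs to the next space or quote
        while (end < input.size() && !is_space(input[end]) && input[end] != '"') ++end;
        assert(end > i);      // a bare token is never empty here
        const std::string text = input.substr(i, end - i);
        // A Number starts like one and is consumed completely by strtod.
        const char first = text[0];
        bool is_number = std::isdigit(static_cast<unsigned char>(first)) ||
                         first == '+' || first == '-' || first == '.';
        if (is_number) {
            char* stop = nullptr;
            (void)std::strtod(text.c_str(), &stop);
            is_number = (stop == text.c_str() + text.size());
        }
        tokens.push_back({is_number ? TokenKind::Number : TokenKind::Word, text});
        i = end;
    }
    return tokens;
}

// Stage 2: the parser turns tokens into a data structure.
struct CompuMethod {  // field names are guesses from the sample line
    std::string name, description, conversion, format, unit;
    std::vector<double> coeffs;
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    CompuMethod compu_method() {
        expect_word("/begin");
        expect_word("COMPU_METHOD");
        CompuMethod m;
        m.name = take(TokenKind::Word).text;
        m.description = take(TokenKind::String).text;
        m.conversion = take(TokenKind::Word).text;
        m.format = take(TokenKind::String).text;
        m.unit = take(TokenKind::String).text;
        expect_word("COEFFS");
        while (pos_ < tokens_.size() && tokens_[pos_].kind == TokenKind::Number)
            m.coeffs.push_back(std::strtod(take(TokenKind::Number).text.c_str(), nullptr));
        expect_word("/end");
        expect_word("COMPU_METHOD");
        return m;
    }

private:
    // Consumes the next token, which must have the given kind.
    const Token& take(TokenKind kind) {
        if (pos_ >= tokens_.size()) throw std::runtime_error("unexpected end of input");
        if (tokens_[pos_].kind != kind)
            throw std::runtime_error("unexpected token '" + tokens_[pos_].text + "'");
        return tokens_[pos_++];
    }
    void expect_word(const std::string& word) {
        const Token& t = take(TokenKind::Word);
        if (t.text != word)
            throw std::runtime_error("expected '" + word + "', got '" + t.text + "'");
    }

    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};

CompuMethod parse_compu_method(const std::string& text) {
    return Parser(tokenize(text)).compu_method();
}
```

Passing the sample record from the question to parse_compu_method yields the name DEC and six coefficients. Whitespace and string handling live entirely in tokenize, so the parser only deals with whole tokens, and every mismatch raises an exception that names the offending token.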

5 Comments

Martin Ba Over a year ago

They were not sophisticated. They were buggy (didn't fully parse all possible inputs), had crappy error messages (if at all), often were slow (too many manual dynamic allocations), messed up whitespace and string handling, got confused by comments -- and I could continue here. Writing a parser by hand (by reading file, iterating over a buffer, etc.) is the most horrible, error prone and unmaintainable thing I can think of. I know this is "just me", and there are people that take delight in it, but I'd rather not. Thank you. :-)

Matthieu M. Over a year ago

@Martin: automated parsers (like Spirit or YACC) don't generally have a pretty recovery either. The only parsers with good error recovery I have seen are hand-crafted, and they use heuristics to "guess" what the input should have been.

jszpilewski Over a year ago

I suppose you tried to accomplish too much in a single step. I think you could look at lex or a derived utility. It generates a C function from a regular grammar description, and that function deals with the problems you described, like matching tokens and eliminating comments and whitespace. It is an external utility, but you will need to run it only once per grammar modification.

Martin Ba Over a year ago

@jszpilewski - Are we going in circles? :-) I stated in the original question that I'm unsure whether a parser generator is the right tool / then you answered that "creating by hand" might be good enough / then you seem to recommend lex which is a parser generator as far as I can see ... ?

jszpilewski Over a year ago

@Martin Not a full parser but just a lexer that creates tokens that you still need to combine into grammatical statements (like config lines). But it is a single function and it could solve the problems you mentioned with comments etc.
