How to parse escape element '\' and unicode character '\u' using boost regex in C++

Question

I am parsing a text file using Boost.Regex in C++ and looking for '\' characters in it. The file also contains Unicode '\u' escapes. Is there a way to tell '\' and '\u' apart? The following is the content of test.txt, the file I am parsing:

    "ID": "\u01FE234DA - this is id ",
    "speed": "96\/78",
    "avg": "\u01FE234DA avg\83"

Here is my attempt:

    #include <boost/regex.hpp>
    #include <string>
    #include <iostream>
    #include <fstream>

    using namespace std;
    const int BUFSIZE = 500;

    int main(int argc, char** argv) {
        if (argc < 2) {
            cout << "Pass the input file" << endl;
            exit(0);
        }

        boost::regex re("\\\\+");
        string file(argv[1]);
        char buf[BUFSIZE];
        boost::regex uni("\\\\u+");
        ifstream in(file.c_str());
        while (!in.eof()) {
            in.getline(buf, BUFSIZE-1);
            if (boost::regex_search(buf, re)) {
                cout << buf << endl;
                cout << "(\) found" << endl;
                if (boost::regex_search(buf, uni)) {
                    cout << buf << endl;
                    cout << "unicode found" << endl;
                }
            }
        }
    }

When I run the above code, it prints the following:

    "ID": "\u01FE234DA - this is id ",
    (\) found
    "ID": "\u01FE234DA - this is id ",
    unicode found
    "speed": "96\/78",
    (\) found
    "avg": "\u01FE234DA avg\83"
    (\) found
    "avg": "\u01FE234DA avg\83"
    unicode found

Instead, I want the following:

    "ID": "\u01FE234DA - this is id ",
    unicode found
    "speed": "96\/78",
    (\) found
    "avg": "\u01FE234DA avg\83"
    (\) and unicode found

I think the code cannot distinguish '\' from '\u', but I am not sure what to change.

Your current code does not produce the output you show due to the commented-out statements. Also, what is wrong with running both checks? (It's a flawed method anyway. Probably better would be to not use regex and inspect one backslash at a time, going from first to last. "Let's use a regex - now you have two problems.")
– Jongware, Commented Apr 5, 2016 at 21:59
If we keep this code as is, the "ID" field shows up twice, i.e. "ID" is reported both as unicode and as a (\) match.
– kkard, Commented Apr 5, 2016 at 22:07
But I bet it does not work on \\\u123 testing (well - give or take a few more backslashes). Is there a particular reason to do this with regexes? As I said, iterating over the backslashes ought to be simple, straightforward, and robust.
– Jongware, Commented Apr 5, 2016 at 22:21
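
The backslash-by-backslash scan suggested in the comments above might look roughly like the sketch below. The names `BackslashScan` and `scan_backslashes` are hypothetical, and the rule that a backslash consumes the character after it (so `\\` is read as an escaped backslash and the `u` in `\\u` is an ordinary letter) is an assumption about the file format, not something the question states.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: which kinds of backslash sequences a line contains.
struct BackslashScan {
    bool plain = false;    // a backslash not followed by 'u'
    bool unicode = false;  // a backslash followed by 'u'
};

// Walk the line from one backslash to the next, going from first to last.
BackslashScan scan_backslashes(const std::string& line) {
    BackslashScan result;
    std::size_t pos = line.find('\\');
    while (pos != std::string::npos) {
        const bool next_is_u = pos + 1 < line.size() && line[pos + 1] == 'u';
        if (next_is_u) {
            result.unicode = true;
        } else {
            result.plain = true;
        }
        // Assumption: the backslash escapes the next character, so skip both.
        pos = line.find('\\', pos + 2);
    }
    return result;
}
```

Under that assumption, the `\\\u123` case from the comment is read as an escaped backslash followed by a `\u` escape, so both flags end up set.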

Jerome Devost · Accepted Answer · 2016-04-06 13:03:20Z

Try using [^u] in your first regex to match any character that is not u.

    boost::regex re("\\\\[^u]"); // matches \ not followed by u
    boost::regex uni("\\\\u");   // matches \u
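
As a rough sketch of how these two patterns could produce the three messages the question asks for: the helper name `classify_two_regex` is hypothetical, and `std::regex` is swapped in for `boost::regex` only so the snippet compiles without Boost. The same patterns and the `regex_search` call should carry over by replacing `std::` with `boost::`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the message to print for one line of input,
// or an empty string if the line contains no backslash sequence.
std::string classify_two_regex(const std::string& line) {
    // Same patterns as above, using std::regex instead of boost::regex.
    static const std::regex plain("\\\\[^u]");  // backslash followed by anything but 'u'
    static const std::regex unicode("\\\\u");   // backslash followed by 'u'

    // Run both checks independently instead of nesting one inside the other.
    const bool has_plain = std::regex_search(line, plain);
    const bool has_unicode = std::regex_search(line, unicode);

    if (has_plain && has_unicode) return "(\\) and unicode found";
    if (has_unicode) return "unicode found";
    if (has_plain) return "(\\) found";
    return "";
}
```

One limitation of `[^u]` is that it needs a character to consume, so a backslash at the very end of a line is not reported. A negative lookahead such as `\\\\(?!u)` is one way around that.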

It's probably best to use a single regular expression.

    boost::regex re("\\\\(u)?"); // matches \ with or without u

Then check whether the sub-match m[1] (the optional u) took part in the match:

    boost::cmatch m;
    if (boost::regex_search(buf, m, re)) {
        if (m[1].matched) {
            // unicode: the optional (u) group matched
        } else {
            // plain backslash, not followed by u
        }
    }
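
A single `regex_search` only looks at the first backslash on a line, so a line such as the "avg" one, which contains both kinds, needs every match inspected. A minimal sketch of that, assuming a regex iterator is acceptable: `EscapeKinds` and `scan_escapes` are hypothetical names, and `std::regex` stands in for `boost::regex` (Boost.Regex provides `boost::sregex_iterator` with the same usage).

```cpp
#include <regex>
#include <string>

// Hypothetical result type: which kinds of backslash sequences were seen.
struct EscapeKinds {
    bool plain = false;    // backslash without a following 'u'
    bool unicode = false;  // backslash followed by 'u'
};

// Visit every match of \(u)? in the line and record which form each one took.
EscapeKinds scan_escapes(const std::string& line) {
    static const std::regex re("\\\\(u)?");  // same pattern as above
    EscapeKinds kinds;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end; it != end; ++it) {
        if ((*it)[1].matched) {
            kinds.unicode = true;  // the optional group captured 'u'
        } else {
            kinds.plain = true;    // backslash followed by something else, or by nothing
        }
    }
    return kinds;
}
```

The caller can then print "unicode found", "(\) found", or "(\) and unicode found" depending on which flags are set.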

It's better to use regex for pattern matching. They seem more complex but they are actually easier to maintain once you get used to them and less bug-prone than iterating over strings one character at a time.
