
I am parsing a text file using Boost.Regex in C++. I am looking for '\' characters in the file. The file also contains some Unicode escapes ('\u'). Is there a way to tell '\' and '\u' apart? Below is the content of test.txt, the file I am parsing:

"ID": "\u01FE234DA - this is id ", "speed": "96\/78", "avg": "\u01FE234DA avg\83" 

Here is my attempt:

    #include <boost/regex.hpp>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>
    using namespace std;

    const int BUFSIZE = 500;

    int main(int argc, char** argv) {
        if (argc < 2) {
            cout << "Pass the input file" << endl;
            exit(0);
        }
        boost::regex re("\\\\+");    // one or more backslashes
        boost::regex uni("\\\\u+");  // a backslash followed by one or more 'u'
        string file(argv[1]);
        char buf[BUFSIZE];
        ifstream in(file.c_str());
        // Loop on the read itself rather than on eof().
        while (in.getline(buf, BUFSIZE - 1)) {
            if (boost::regex_search(buf, re)) {
                cout << buf << endl;
                cout << "(\\) found" << endl;
                if (boost::regex_search(buf, uni)) {
                    cout << buf << endl;
                    cout << "unicode found" << endl;
                }
            }
        }
    }

When I run the above code, it prints the following:

    "ID": "\u01FE234DA - this is id ",
    (\) found
    "ID": "\u01FE234DA - this is id ",
    unicode found
    "speed": "96\/78",
    (\) found
    "avg": "\u01FE234DA avg\83"
    (\) found
    "avg": "\u01FE234DA avg\83"
    unicode found

Instead, I want the following:

    "ID": "\u01FE234DA - this is id ",
    unicode found
    "speed": "96\/78",
    (\) found
    "avg": "\u01FE234DA avg\83"
    (\) and unicode found

I think the code cannot distinguish '\' from '\u', but I am not sure what to change.

  • Your current code does not produce the output you show due to the commented-out statements. Also, what is wrong with running both checks? (It's a flawed method anyway. Probably better would be to not use regex and inspect one backslash at a time, going from first to last. "Let's use a regex - now you have two problems.") Commented Apr 5, 2016 at 21:59
  • If we keep this code as is, the "ID" field shows up twice, i.e. "ID" is reported both as unicode and as (\) found. Commented Apr 5, 2016 at 22:07
  • I have removed the comments now; I think the code should work. Commented Apr 5, 2016 at 22:18
  • But I bet it does not work on \\\u123 testing (well - give or take a few more backslashes). Is there a particular reason to do this with regexes? As I said, iterating over the backslashes ought to be simple, straightforward, and robust. Commented Apr 5, 2016 at 22:21

1 Answer

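Try using [^u] in your first regex to match any character that is not u. Run the two checks independently rather than nesting one inside the other, because a line that contains only \u no longer matches the first regex. Also note that [^u] needs a character to consume, so a backslash at the very end of a line is not matched.

    boost::regex re("\\\\[^u]"); // matches \ followed by any character other than u
    boost::regex uni("\\\\u");   // matches \u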

It's probably best to use a single regular expression.

    boost::regex re("\\\\(u)?"); // matches \ with or without a following u

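Then check whether the capture group m[1] took part in the match. Note that regex_search reports only the first match in the line:

    boost::cmatch m;
    if (boost::regex_search(buf, m, re)) {
        if (m[1].matched) {
            // "\u" found: unicode
        } else {
            // "\" not followed by u: not unicode
        }
    }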
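To get the three labels from the question, every backslash in a line has to be examined, not only the first one. Below is a minimal sketch of that idea. It assumes std::regex instead of boost::regex so that it compiles without Boost (the pattern is the same), and classify is a hypothetical helper name, not something from the question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the label for one line, or an empty string
// if the line contains no backslash at all.
std::string classify(const std::string& line) {
    // One backslash, optionally followed by 'u' (captured in group 1).
    static const std::regex re(R"(\\(u)?)");

    bool plain = false;    // saw a backslash not followed by 'u'
    bool unicode = false;  // saw "\u"

    // Walk over every match in the line, not just the first one.
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        if ((*it)[1].matched) {
            unicode = true;
        } else {
            plain = true;
        }
    }

    if (plain && unicode) return "(\\) and unicode found";
    if (unicode) return "unicode found";
    if (plain) return "(\\) found";
    return "";
}
```

In the reading loop, printing buf followed by classify(buf) whenever the result is non-empty should give one label per line. With Boost, boost::cregex_iterator over buf plays the same role as std::sregex_iterator here.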

It's better to use regex for pattern matching. They seem more complex but they are actually easier to maintain once you get used to them and less bug-prone than iterating over strings one character at a time.
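For comparison, the character-by-character approach suggested in the comments can be sketched as follows. This is only an illustration under the same rules as the regex (a backslash followed by u counts as unicode, any other backslash counts as plain); classify_scan is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: same labels as the regex version, but found by
// scanning the line one character at a time.
std::string classify_scan(const std::string& line) {
    bool plain = false;    // saw a backslash not followed by 'u'
    bool unicode = false;  // saw "\u"

    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] != '\\') continue;
        if (i + 1 < line.size() && line[i + 1] == 'u') {
            unicode = true;
            ++i;  // skip the 'u' that belongs to this escape
        } else {
            plain = true;
        }
    }

    if (plain && unicode) return "(\\) and unicode found";
    if (unicode) return "unicode found";
    if (plain) return "(\\) found";
    return "";
}
```

Neither version treats a doubled backslash as an escaped backslash. If the input format requires that, it is an extra rule that would have to be added to either approach.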
