1

I want to find the strings inside double quotes or single quotes in a multiline text file.

for example:

I have a test "foo bar1" test2 "foo\"bar2", "foo 'bar3", 'foo bar4', 'foo \'bar5', 'foo "bar6', 

The expected output is:

foo bar1
foo\"bar2
foo 'bar3
foo bar4
foo \'bar5
foo "bar6

The tricky parts are:

  1. The text file is multiline.
  2. There may be escaped double or single quotes inside the quotes.
  3. Double-quoted strings may contain single quotes.
  4. Single-quoted strings may contain double quotes.
  5. Opening and closing quotes must match as a pair.
  • What have you tried so far? Commented Jun 21, 2020 at 23:54
  • , and , can be replaced by newline? And after this replacement there can be only one element per line? Commented Jun 22, 2020 at 0:03
  • The separator between words may not be a comma; it can be something else. Commented Jun 22, 2020 at 0:09
  • You should make clear what can be relied on. Obviously there can be quoted " and '. Can there be quoted \ , too? I guess no one will feel like writing a parser for all possible kinds of input. Commented Jun 22, 2020 at 0:14
  • For now, just consider " and '. Commented Jun 22, 2020 at 0:32

3 Answers

4

We can use Perl's match-time code interpolation feature, (??{ match-time regex }), to tackle this. Based on which quote character matched, it builds the corresponding regex for that quote, so that the regex engine matches up to the paired closing quote.

$ perl -lne ' print substr($&, 1, -2+length($&)) while /(?:(["'\''])(??{q<(?:[^\\\\>.$1.q<]|\\\\.)*>.$1}))/gx; ' file 

Results:

foo bar1
foo\"bar2
foo 'bar3
foo bar4
foo \'bar5
foo "bar6

A smoother rewrite of the above is as follows:

$ perl -lne '
  BEGIN {
    $genRE = sub {
      my $openingQ = shift;
      # look in the Notes below for why
      qq<(?:[^\\\\${openingQ}]|\\\\.)*>
    };
  }
  print $2 while /
    (["'\''])                (?#: opening quote)
    ((??{ $genRE->($1) }))   (?#: run of in between quote pair stuff)
    \1                       (?#: corresponding closing quote)
  /gx;
' file

Notes:

  • "........" is matched by /"[^"]*"/
  • "...... \"......" is matched by /"(?:[^\\"]|\\.)*"/
  • Similarly for single quotes.
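To see the difference between the two patterns in the notes, here is a small sketch in Python rather than Perl (my choice for illustration, not part of the answer above). The names naive, escaped and text are made up for this example.

```python
import re

# Pattern from the first note: no handling of escaped quotes.
naive = re.compile(r'"[^"]*"')
# Pattern from the second note: a backslash always consumes the next character.
escaped = re.compile(r'"(?:[^\\"]|\\.)*"')

# Hypothetical sample containing one escaped double quote.
text = r'say "foo\"bar2" now'

naive_hits = naive.findall(text)
escaped_hits = escaped.findall(text)

print(naive_hits)    # stops at the escaped quote
print(escaped_hits)  # runs to the real closing quote
```

The first pattern ends the match at the escaped quote, while the second treats the backslash and the character after it as a unit and continues to the actual closing quote.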
3

Another perl approach:

perl -lne 'print $2 while m{(["'\''])((?:\\.|(?!\1).)*+)\1}g' 

This uses a negative lookahead, (?!\1)., to match any character other than the quote captured by the first group. You could also simply cover the '...' and "..." cases separately with:

perl -lne 'print $1 while m{(?|"((?:\\.|[^"])*+)"|'"'((?:\\\.|[^'])*+)')}g" 
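
For comparison, the same lookahead-plus-backreference idea can be sketched in Python. This is an adaptation, not a translation: the names QUOTED, quoted_strings and sample are hypothetical, the possessive *+ is replaced by excluding the backslash from the second alternative (so the sketch does not depend on possessive quantifier support), and re.DOTALL is my assumption for letting a quoted string span lines when the whole file is read at once.

```python
import re

# (["'])                 opening quote, captured as group 1
# \\.                    a backslash followed by any character
# (?!\1)[^\\]            any non-backslash character that is not the opening quote
# \1                     the matching closing quote
QUOTED = re.compile(r"""(["'])((?:\\.|(?!\1)[^\\])*)\1""", re.DOTALL)


def quoted_strings(text):
    """Return the contents of every quoted string in text, without the quotes."""
    return [m.group(2) for m in QUOTED.finditer(text)]


# The example line from the question.
sample = r"""I have a test "foo bar1" test2 "foo\"bar2", "foo 'bar3", 'foo bar4', 'foo \'bar5', 'foo "bar6', """

for item in quoted_strings(sample):
    print(item)
```

Unlike perl -n, which processes one line at a time, this would be applied to the whole file contents, e.g. quoted_strings(open(path).read()) for some path of your own.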
0

This is difficult. I do not have a solution. I am not even sure what the best tool for this task is.

I have come close:

$ grep -oP '((?<!\\)"\K.*?(?=(?<!\\)"))|'"((?<!\\\\)'\K.*?(?=(?<!\\\\)'))" input
foo bar1
foo\"bar2
foo 'bar3
foo bar4
, 
foo \'bar5
, 
foo "bar6

The problem with multiple matches per line is that the closing quote of the earlier string is matched as the starting quote of the in-between text. And I cannot block that with a positive look-behind for an even number of quotes, because the look-behind must have a fixed length, at least in grep.

Furthermore the matching of several ' within " (or the other way round) is interesting, to say the least.

Maybe awk is the better tool for this. With it you can check which quote type comes first, jump to the next and check if it is preceded by a backslash.
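
The scanning idea described above can be sketched roughly as follows. This is in Python rather than awk, purely as an illustration, and the names scan_quoted and sample are made up. One deliberate difference: instead of looking back for a preceding backslash, the sketch walks forward and skips the character after each backslash. It also assumes an unterminated quote should simply end the scan.

```python
def scan_quoted(text):
    """Collect the contents of quoted strings with a hand-written scanner."""
    results = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "\"'":
            quote = ch              # whichever quote type comes first opens the string
            j = i + 1
            while j < n and text[j] != quote:
                # A backslash protects the next character, including a quote.
                j += 2 if text[j] == "\\" else 1
            if j >= n:
                break               # assumption: unterminated quote, emit nothing more
            results.append(text[i + 1:j])
            i = j + 1               # resume after the closing quote
        else:
            i += 1
    return results


# The example line from the question.
sample = r"""I have a test "foo bar1" test2 "foo\"bar2", "foo 'bar3", 'foo bar4', 'foo \'bar5', 'foo "bar6', """

for item in scan_quoted(sample):
    print(item)
```

Because the scan resumes after each closing quote, that quote cannot be picked up again as the opening quote of the in-between text, which is the issue the grep attempt runs into.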
