I want to only remove grouped commas from this csv to change the number of variables to two

Question

I have a csv where the first few rows look like this

c("4288", "57534"),MIB1
c("2272", "2385"),FHIT
c("5550", "10531", "56239"),PREP
c("25809", "23669"),TTLL1

I want to reduce the number of variables to two, so that everything grouped inside parentheses counts as one variable. Unfortunately my document has several entries like line 3, where more than one comma separates the values inside the parentheses.

Is there a sed expression capable of manipulating only the commas inside the parentheses?

The expected output would be something like this:

c("4288" "57534"), MIB1
c("2272" "2385"),FHIT
c("5550" "10531" "56239"),PREP
c("25809" "23669"),TTLL1

Cheers.

For this to be an actual CSV file, the fields containing commas would be quoted.
– Kusalananda ♦, Commented Apr 22, 2020 at 8:41

Shawn · Accepted Answer · 2020-04-22 08:26:33Z

Using perl instead of sed to get more advanced regular expressions:

perl -pe 's/(?:\G[^,)]*|\([^,)]*)\K,(?=.*?\))//g' input.csv

c("4288" "57534"),MIB1
c("2272" "2385"),FHIT
c("5550" "10531" "56239"),PREP
c("25809" "23669"),TTLL1

This will remove all commas that appear inside parentheses.

αғsнιη · Accepted Answer · 2020-04-22 11:13:39Z

This is the same solution I gave in another answer; with a small modification it also applies to your question:

sed -E ':loop s/(\([^)]*),([^)]*\))/\1\2/; t loop' infile

Breaking down:

Note: an unescaped ( or ) outside a character class [...] is used for grouping; an escaped \( or \), or a parenthesis inside a character class [...], matches a literal ( or ). Inside a character class a leading ^ negates the match, so [^)] matches "any single character except )".

then we have:

(\([^)]*): the first capture group, which back-reference \1 refers to.
,: matches a single comma.
([^)]*\)): the second capture group, which back-reference \2 refers to.

Consider the sample line below to see how the match works:

c(("4288", "57534", "somtoher")),d("f1", "f2", "f3"),MIB1

The expression (\([^)]*),([^)]*\)) matches as follows:

1. The match starts at the very first opening parenthesis ( and takes everything that is not a ) up to, but not including, the last comma before the first closing parenthesis ). So the first group, \1, holds the (("4288", "57534" part of the sample line above.
2. Everything after that last comma, up to and including the first closing parenthesis ), goes into the second group, \2; here that is the "somtoher") part of the sample line above (with its leading space).
3. The replacement \1\2 puts both matched groups back but drops the comma between them.
4. :loop s///; t loop repeats steps 1 to 3 until all commas between ( and ) are removed: t jumps back to the label (named loop here) whenever the substitution changed something.
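A quick way to see what the two groups capture is to run the substitution once, without the loop, and wrap each group in braces instead of joining them. This is only an illustrative sketch; the variable names line and marked are not part of the answer.

```shell
# Sample line from the walkthrough.
line='c(("4288", "57534", "somtoher")),d("f1", "f2", "f3"),MIB1'

# Run the substitution once (no loop) and wrap each captured group in braces,
# so the boundaries of \1 and \2 become visible. The matched comma is dropped.
marked=$(printf '%s\n' "$line" | sed -E 's/(\([^)]*),([^)]*\))/{\1}{\2}/')

printf '%s\n' "$marked"
```

This prints c{(("4288", "57534"}{ "somtoher")}),d("f1", "f2", "f3"),MIB1, showing that \1 ends just before the last comma and \2 ends at the first closing parenthesis.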

After the first pass, our sample line changes to:
```
c(("4288", "57534" "somtoher")),d("f1", "f2", "f3"),MIB1 
```
After the second pass it becomes:
```
c(("4288" "57534" "somtoher")),d("f1", "f2", "f3"),MIB1 
```
After the third pass it becomes:
```
c(("4288" "57534" "somtoher")),d("f1", "f2" "f3"),MIB1 
```
and so on.
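The passes can be reproduced by applying the substitution one pass at a time from a shell loop. This is a sketch that imitates what :loop and t loop do inside sed, not how sed runs it internally; the function one_pass and the variables pass and final are hypothetical names used for illustration.

```shell
# Sample line from the walkthrough.
line='c(("4288", "57534", "somtoher")),d("f1", "f2", "f3"),MIB1'

# One pass: a single application of the substitution, without the sed loop.
one_pass() {
  printf '%s\n' "$1" | sed -E 's/(\([^)]*),([^)]*\))/\1\2/'
}

pass=0
while :; do
  next=$(one_pass "$line")
  # Stop when nothing changed; inside sed this is the point where "t" no longer branches.
  if [ "$next" = "$line" ]; then
    break
  fi
  pass=$((pass + 1))
  line=$next
  printf 'pass %d: %s\n' "$pass" "$line"
done

final=$line
```

For the sample line this takes four passes and ends with c(("4288" "57534" "somtoher")),d("f1" "f2" "f3"),MIB1, leaving the commas outside the parentheses untouched.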
