edited Dec 16, 2023 at 5:04

jubilatious1




edited Dec 16, 2023 at 4:58


answered Dec 16, 2023 at 4:44


Using Raku (formerly known as Perl_6)

    ~$ raku -ne 'BEGIN put get; \
                 my @a = .split(:skip-empty, / \t /, 3); \
                 @a[2] = (@a[2] // "").comb(/ GO\: \d+ /).join(","); \
                 @a.join("\t").trim-trailing.put;' file

Here's an answer coded in Raku, a member of the Perl family of programming languages. Assuming \t (tab) as the column separator, and going line by line:

The BEGIN statement outputs the header line (it can be omitted if the header line is \t tab-separated like the body rows).
The body lines (rows) are split on \t into at most three columns and saved into the @a array. Note that it might be possible (but risky) to split on \s**4, i.e. four consecutive whitespace characters, or even \h**4 (four consecutive horizontal whitespace characters), if the third column does not contain that pattern.
The third column (i.e. @a[2]) is replaced by its own text after it has been combed (i.e. positively selected) for one or more matches to the regex GO\: \d+. Think of comb as the converse of split: comb keeps the matches, whereas split discards them and keeps the text in between. The selected GO IDs are then joined with commas.
Finally, the columns are joined back together on \t tabs, trailing whitespace is trimmed, and the line is output.
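
To make the comb/split contrast concrete, here is a minimal sketch that walks one row of the sample input through the same steps as the one-liner. It assumes the columns are tab-separated; the variable names $row, @ids and @rest are made up for illustration and do not appear in the one-liner.

```raku
# One body row from the sample input, written with explicit tabs (an assumption
# about the real file's separator).
my $row = "MA_10002157g0010\tMA_10002157g0010\t"
        ~ "GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity";

# Split on tabs into at most three columns, as the one-liner does.
my @a = $row.split(/ \t /, 3, :skip-empty);

# comb returns the substrings that match the regex (the GO IDs).
my @ids = (@a[2] // "").comb(/ GO\: \d+ /);

# split with the same regex returns what lies between the matches instead.
my @rest = (@a[2] // "").split(/ GO\: \d+ /, :skip-empty);

say @ids.elems;    # number of GO IDs found
say @rest.elems;   # number of leftover description fragments

# Replace the third column with the comma-joined IDs and print the row.
@a[2] = @ids.join(",");
put @a.join("\t").trim-trailing;
# prints: MA_10002157g0010<TAB>MA_10002157g0010<TAB>GO:0006468,GO:0005524,GO:0004672
```

The `// ""` fallback mirrors the one-liner: rows without a third column yield no GO IDs, and trim-trailing then removes the dangling tab.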

Sample Input:

    ID    transcript_id    go_description
    MA_10000213g0010    MA_10000213g0010
    MA_10000405g0010    MA_10000405g0010    GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
    MA_1000049g0010    MA_1000049g0010
    MA_10000516g0010    MA_10000516g0010    GO:0005515-protein binding
    MA_10001015g0010    MA_10001015g0010
    MA_10001337g0010    MA_10001337g0010
    MA_10001425g0010    MA_10001425g0010
    MA_10001478g0010    MA_10001478g0010
    MA_10001558g0010    MA_10001558g0010
    MA_10001g0010    MA_10001g0010
    MA_10002030g0010    MA_10002030g0010    GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
    MA_10002157g0010    MA_10002157g0010    GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
    MA_10002549g0010    MA_10002549g0010
    MA_10002583g0010    MA_10002583g0010    GO:0008168-methyltransferase activity
    MA_10002614g0010    MA_10002614g0010
    MA_10002643g0010    MA_10002643g0010    GO:0055114-oxidation-reduction process

Sample Output:

    ID    transcript_id    go_description
    MA_10000213g0010    MA_10000213g0010
    MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
    MA_1000049g0010    MA_1000049g0010
    MA_10000516g0010    MA_10000516g0010    GO:0005515
    MA_10001015g0010    MA_10001015g0010
    MA_10001337g0010    MA_10001337g0010
    MA_10001425g0010    MA_10001425g0010
    MA_10001478g0010    MA_10001478g0010
    MA_10001558g0010    MA_10001558g0010
    MA_10001g0010    MA_10001g0010
    MA_10002030g0010    MA_10002030g0010    GO:0005737,GO:0000184,GO:0004386,GO:0008270,GO:0003677,GO:0005524
    MA_10002157g0010    MA_10002157g0010    GO:0006468,GO:0005524,GO:0004672
    MA_10002549g0010    MA_10002549g0010
    MA_10002583g0010    MA_10002583g0010    GO:0008168
    MA_10002614g0010    MA_10002614g0010
    MA_10002643g0010    MA_10002643g0010    GO:0055114

https://docs.raku.org
https://raku.org