Using Raku (formerly known as Perl_6)
~$ raku -ne 'BEGIN put get;  \
             my @a = .split(:skip-empty, / \t /, 3);  \
             @a[2] = (@a[2] // "").comb(/ GO\: \d+ /).join(",");  \
             @a.join("\t").trim-trailing.put;'  file
Here's an answer coded in Raku, a member of the Perl-family of programming languages. Going line by line:
The BEGIN statement reads the header line with get and outputs it unchanged with put, before any body rows are processed. It can be omitted if the header line is \t tab-separated like the body rows.
The body rows (lines) are split on \t into at most three columns. If the file is not tab-separated, it might be possible to split on \s**4 (i.e. four consecutive whitespace characters), or even \h**4 (four consecutive horizontal whitespace characters).
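As a minimal sketch of the split step, the snippet below applies the same call to two made-up rows (the row contents and the variable names $row, $spaced and @b are illustrative assumptions, not taken from the original file). It suggests why the limit of 3 matters: the third column may itself contain spaces.

```raku
# Hypothetical row: three tab-separated columns; the third contains a space.
my $row = "MA_10002583g0010\tMA_10002583g0010\tGO:0008168-methyltransferase activity";

# A limit of 3 returns at most three pieces, so the third column stays whole.
my @a = $row.split(:skip-empty, / \t /, 3);
say @a.elems;   # 3
say @a[2];      # GO:0008168-methyltransferase activity

# Hypothetical variant separated by four spaces instead of tabs.
my $spaced = "id1    id2    GO:0005515-protein binding";
my @b = $spaced.split(/ \h ** 4 /, 3);
say @b[2];      # GO:0005515-protein binding
```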
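The third column (i.e. @a[2]) is replaced by the @a[2] column text that has been combed (i.e. positively selected) for matches to GO\: \d+. If a row has no third column, // "" substitutes an empty string. Think of comb as the inverse of split: split keeps what lies between matches, while comb keeps the matches themselves. These selected GO ids are then joined with commas.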
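To illustrate the comb step on its own, here is a small sketch using a shortened description string (the strings and the names $desc, $ids, @short and $empty are assumptions made up for this example):

```raku
# Hypothetical third-column text with two GO terms.
my $desc = "GO:0006468-protein phosphorylation;GO:0005524-ATP binding";

# split keeps the text between separators...
say $desc.split(";").elems;    # 2

# ...while comb keeps only the substrings that match the pattern.
my $ids = $desc.comb(/ GO\: \d+ /).join(",");
say $ids;                      # GO:0006468,GO:0005524

# A row without a third column: @short[2] is undefined, so // "" supplies
# an empty string and comb simply finds nothing.
my @short = "id1\tid2".split(/ \t /, 3);
my $empty = (@short[2] // "").comb(/ GO\: \d+ /).join(",");
say $empty.chars;              # 0
```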
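Finally, the split columns are joined back together on \t tabs, any trailing whitespace (such as the tab left behind by an empty third column) is removed with trim-trailing, and the line is output.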
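The effect of trim-trailing can be sketched as follows, assuming a row whose third column ended up empty (the array contents and the name $joined are illustrative):

```raku
# Hypothetical row after the comb step: the third column is an empty string.
my @a = "MA_10000213g0010", "MA_10000213g0010", "";

# Joining three columns leaves a dangling tab after the second one.
my $joined = @a.join("\t");
say $joined.ends-with("\t");                 # True

# trim-trailing strips trailing whitespace, including that tab.
say $joined.trim-trailing.ends-with("\t");   # False
```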
Sample Input:
ID transcript_id go_description
MA_10000213g0010 MA_10000213g0010
MA_10000405g0010 MA_10000405g0010 GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010 MA_10000516g0010 GO:0005515-protein binding
MA_10001015g0010 MA_10001015g0010
MA_10001337g0010 MA_10001337g0010
MA_10001425g0010 MA_10001425g0010
MA_10001478g0010 MA_10001478g0010
MA_10001558g0010 MA_10001558g0010
MA_10001g0010 MA_10001g0010
MA_10002030g0010 MA_10002030g0010 GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
MA_10002157g0010 MA_10002157g0010 GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_10002549g0010 MA_10002549g0010
MA_10002583g0010 MA_10002583g0010 GO:0008168-methyltransferase activity
MA_10002614g0010 MA_10002614g0010
MA_10002643g0010 MA_10002643g0010 GO:0055114-oxidation-reduction process
Sample Output:
ID transcript_id go_description
MA_10000213g0010 MA_10000213g0010
MA_10000405g0010 MA_10000405g0010 GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010 MA_10000516g0010 GO:0005515
MA_10001015g0010 MA_10001015g0010
MA_10001337g0010 MA_10001337g0010
MA_10001425g0010 MA_10001425g0010
MA_10001478g0010 MA_10001478g0010
MA_10001558g0010 MA_10001558g0010
MA_10001g0010 MA_10001g0010
MA_10002030g0010 MA_10002030g0010 GO:0005737,GO:0000184,GO:0004386,GO:0008270,GO:0003677,GO:0005524
MA_10002157g0010 MA_10002157g0010 GO:0006468,GO:0005524,GO:0004672
MA_10002549g0010 MA_10002549g0010
MA_10002583g0010 MA_10002583g0010 GO:0008168
MA_10002614g0010 MA_10002614g0010
MA_10002643g0010 MA_10002643g0010 GO:0055114
https://docs.raku.org
https://raku.org