Join lines with the same value in the first column

Question

I have a tab-delimited file with three columns (excerpt):

AC147602.5_FG004	IPR000146	Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004	IPR023079	Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001	IPR002110	Ankyrin repeat
AC148152.3_FG001	IPR026961	PGG domain

and I'd like to get this using bash:

AC147602.5_FG004	IPR000146	Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase	IPR023079	Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001	IPR002110	Ankyrin repeat	IPR026961	PGG domain

So if the ID in the first column is the same on several lines, the output should contain one line per ID, with the remaining fields of those lines joined. For the example above, this gives a two-row file.

@oberlies, it is sometimes OK to add tags to a question that cover technologies used in answers, but not mentioned in the question. This would be one of those cases, especially when the alternative is creating new meta tags.
– Charles, Commented Mar 29, 2014 at 3:22
@close-voters: How can this question be too broad? The answer is a one-line awk script.
– oberlies, Commented Apr 18, 2014 at 14:57

Kent · Accepted Answer · 2013-11-06 22:37:05Z

11

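Give this one-liner a try:

 awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x a[x]}' file

Assigning "" to $1 rebuilds $0 with a leading tab, so every stored chunk already begins with a separator and the key is concatenated directly with a[x] when printing.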
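The order in which `for (x in a)` visits keys is unspecified in awk, so the joined lines may come out in any order. Here is a minimal sketch that keeps the IDs in first-seen order, assuming the same tab-delimited input. The function name `join_by_first_column` and the arrays `seen`, `order` and `merged` are illustrative and not part of the one-liner above.

```shell
# Hypothetical helper: group tab-delimited lines by column 1,
# keeping keys in the order in which they first appear.
join_by_first_column() {
  awk -F'\t' '
    {
      key = $1
      # Remember each key the first time it is seen.
      if (!(key in seen)) { seen[key] = 1; order[++n] = key }
      # Remainder of the line, including its leading tab (empty if there is none).
      rest = substr($0, length(key) + 1)
      merged[key] = merged[key] rest
    }
    END {
      for (i = 1; i <= n; i++) print order[i] merged[order[i]]
    }
  ' "$@"
}

# Example call on a small made-up input (reads stdin when no file is given):
printf 'id2\tB\tsome text\nid1\tA\nid2\tC\n' | join_by_first_column
```

This version still holds every group in memory, like the one-liner. It only adds an index array so that the output order is predictable.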

Dakusan · Accepted Answer · 2017-01-20 14:03:08Z

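For whatever reason, the awk solution does not work for me in Cygwin, so I used Perl instead. It groups the lines by their first field, joins the rest of each line in a group with a tab character, and ends each output line with \n.

cat FILENAME | perl -e '
  # Collect the rest of each line under its first whitespace-delimited field.
  foreach $Line (<STDIN>) {
    (@Cols = ($Line =~ /^\s*(\S+)\s+(.*?)\s*$/)) or next;  # skip lines that do not match
    push(@{$Link{$Cols[0]}}, $Cols[1]);
  }
  # Print each key followed by its values (hash order is arbitrary).
  foreach $Key (keys %Link) {
    print join("\t", $Key, @{$Link{$Key}})."\n";
  }
'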

NeronLeVelu · Accepted Answer · 2017-01-20 14:38:19Z

This depends on the file size (and on awk's memory limits).

If the file is too big, sorting it first reduces what awk has to hold: only one label and its pending output stay in memory until they are printed.

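A classic version that prints each group once it is complete, modifying the whole line:

sort YourFile \
| awk '
    # Same label as the previous line: strip the label and append the rest.
    Last == $1 { sub(/^[^[:blank:]]*[[:blank:]]+/, ""); C = C "\t" $0; next }
    # New label: print the finished group, then start a new one.
    NR > 1 { print C }
    { Last = $1; C = $0 }
    END { if (NR) print C }
  '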

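Another version that works field by field and prints as it goes, but is less "human readable":

sort YourFile \
| awk -F'\t' '
    # New label: end the previous output line (if any) and print the label.
    Last != $1 { printf("%s%s", (NR > 1 ? "\n" : ""), Last = $1) }
    # Every line: append its remaining fields to the current output line.
    { for (i = 2; i <= NF; i++) printf("\t%s", $i) }
    END { if (NR) print "" }
  '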
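Plain `sort YourFile` compares whole lines, so rows that share a label can come out in a different order than in the input. Here is a sketch that sorts on the first tab-separated column only and keeps the input order within each label. It assumes a `sort` that supports the `-s` (stable) option, as GNU and BSD sort do. The function name `sort_by_label` is illustrative.

```shell
# Hypothetical helper: sort tab-delimited lines by column 1 only.
sort_by_label() {
  tab=$(printf '\t')
  # -t sets the field separator, -k1,1 limits the sort key to column 1,
  # -s (stable; supported by GNU and BSD sort) keeps equal keys in input order.
  sort -s -t "$tab" -k1,1 "$@"
}

# Example call on a small made-up input; the output can be piped into
# either awk program above in place of `sort YourFile`.
printf 'b\tz\na\ty\nb\tx\n' | sort_by_label
```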

Thomas · Accepted Answer · 2021-07-01 07:32:47Z

A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.

All on one line:

{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv

Readable and commented equivalent:

{
  # Define `merged` as an empty associative array.
  declare -A merged
  merged=()

  # Read tab-separated lines. Any leftover fields also end up in `value`.
  while IFS=$'\t' read -r key value
  do
    # Append to any value that's already there, separated by a tab.
    merged[$key]="${merged[$key]}"$'\t'"$value"
  done

  # Loop over the input keys. Note that the order is arbitrary;
  # pipe through `sort` if you want a predictable order.
  for key in "${!merged[@]}"
  do
    # Each value is prefixed with a tab, so no need for a tab here.
    echo "$key${merged[$key]}"
  done
} < INPUT_FILE.tsv
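Because this approach relies on associative arrays, it can help to check the shell version before running it, since some systems still ship an older bash as the default. This is a minimal sketch of such a guard. The function name `has_assoc_arrays` is hypothetical, and the check assumes that `BASH_VERSION` starts with the major version number.

```shell
# Hypothetical guard: succeed only when running under bash 4.0 or newer.
has_assoc_arrays() {
  # BASH_VERSION is unset in other shells; the text before the first dot
  # is the major version number.
  [ -n "${BASH_VERSION:-}" ] && [ "${BASH_VERSION%%.*}" -ge 4 ]
}

if has_assoc_arrays; then
  echo "bash ${BASH_VERSION}: associative arrays are available"
else
  echo "bash 4.0+ not detected; consider the awk or Perl approach instead" >&2
fi
```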
