9

I have a tab-delimited file with three columns (excerpt):

AC147602.5_FG004	IPR000146	Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004	IPR023079	Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001	IPR002110	Ankyrin repeat
AC148152.3_FG001	IPR026961	PGG domain

and I'd like to get this using bash:

AC147602.5_FG004	IPR000146	Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase	IPR023079	Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001	IPR002110	Ankyrin repeat	IPR026961	PGG domain

So if the ID in the first column is the same on several lines, the output should contain one line per ID, with the remaining fields of those lines joined. For the example above, this gives a two-row file.

  • @oberlies, it is sometimes OK to add tags to a question that cover technologies used in answers, but not mentioned in the question. This would be one of those cases, especially when the alternative is creating new meta tags. Commented Mar 29, 2014 at 3:22
  • @close-voters: How can this question be too broad? The answer is a one-line awk script. Commented Apr 18, 2014 at 14:57

4 Answers

11

Give this one-liner a try:

 awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x a[x]}' file
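For anyone who wants to try the idea on the sample rows from the question, here is a minimal sketch of the same grouping approach. The function name `merge_by_key` and the `order` array are my own additions, not part of the answer above; the array is only there so that IDs come out in first-seen order rather than awk's arbitrary `for (x in a)` order.

```shell
#!/usr/bin/env bash
# merge_by_key is a hypothetical helper name, not taken from the answer.
# It reads tab-separated lines on stdin and prints one line per first-column ID.
merge_by_key() {
  awk -F'\t' '
    {
      key  = $1
      rest = substr($0, length(key) + 2)        # everything after the first tab
      if (!(key in merged)) order[++n] = key    # remember first-seen order
      merged[key] = merged[key] "\t" rest       # each chunk carries a leading tab
    }
    END { for (i = 1; i <= n; i++) print order[i] merged[order[i]] }
  '
}

# Sample rows from the question, three fields per row.
printf '%s\t%s\t%s\n' \
  AC147602.5_FG004 IPR000146 'Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase' \
  AC147602.5_FG004 IPR023079 'Sedoheptulose-1,7-bisphosphatase' \
  AC148152.3_FG001 IPR002110 'Ankyrin repeat' \
  AC148152.3_FG001 IPR026961 'PGG domain' |
  merge_by_key
```

Like the one-liner, this holds every group in memory until the end of input, so it assumes the file fits in RAM.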

0

For whatever reason, the awk solution does not work for me in Cygwin, so I used Perl instead. It keys on the first whitespace-delimited field, joins the remaining text of each line with a tab character, and ends each output line with \n.

perl -e 'foreach $Line (<STDIN>) { @Cols=($Line=~/^\s*(\S+)\s*(.*?)\s*$/) or next; push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $Key (sort keys %Link) { print join("\t", $Key, @{$Link{$Key}})."\n"; }' < FILENAME


0

This depends on the file size (and on awk's memory limits).

If the file is too big to hold in memory, sorting it first reduces what awk has to keep: only one label at a time is held in memory before it is printed.

A classic version that prints after each group is complete, modifying the whole line:

    sort YourFile |
    awk -F'\t' '
      $1 == Last { sub(/^[^\t]*\t/, ""); C = C "\t" $0; next }   # same ID: append the rest
      NR > 1     { print Last C }                                # new ID: flush the previous group
                 { Last = $1; sub(/^[^\t]*\t/, ""); C = "\t" $0 }
      END        { if (NR) print Last C }
    '

Another version that works on fields and prints as it goes, though it is less human-readable:

    sort YourFile |
    awk -F'\t' '
      $1 != Last { printf("%s%s", (NR > 1 ? "\n" : ""), Last = $1) }
                 { for (i = 2; i <= NF; i++) printf("\t%s", $i) }
      END        { if (NR) print "" }
    '


0

A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.

All on one line:

{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv 

Readable and commented equivalent:

    {
        # Define `merged` as an empty associative array.
        declare -A merged
        merged=()

        # Read tab-separated lines. Any leftover fields also end up in `value`.
        while IFS=$'\t' read -r key value
        do
            # Append to any value that's already there, separated by a tab.
            merged[$key]="${merged[$key]}"$'\t'"$value"
        done

        # Loop over the input keys. Note that the order is arbitrary;
        # pipe through `sort` if you want a predictable order.
        for key in "${!merged[@]}"
        do
            # Each value is prefixed with a tab, so no need for a tab here.
            echo "$key${merged[$key]}"
        done
    } < INPUT_FILE.tsv
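Since the key order of an associative array is arbitrary, one possible variation is to record each key the first time it appears and print in that order. The sketch below is my own adaptation, not part of the answer as posted: the function name `merge_in_order` and the `order` array are hypothetical, and it assumes keys are non-empty and contain no characters that are special inside an array subscript.

```shell
#!/usr/bin/env bash
# merge_in_order is a hypothetical name; needs bash 4.0+ for associative arrays.
merge_in_order() {
  local -A merged=()   # ID -> tab-prefixed, tab-joined remainder of its lines
  local -a order=()    # IDs in the order they were first seen
  local key value

  # `|| [[ -n $key ]]` also picks up a final line that lacks a trailing newline.
  while IFS=$'\t' read -r key value || [[ -n $key ]]; do
    [[ -n $key ]] || continue                          # skip blank lines
    [[ -n ${merged[$key]+set} ]] || order+=("$key")    # first time we see this ID
    merged[$key]+=$'\t'"$value"
  done

  for key in "${order[@]}"; do
    printf '%s%s\n' "$key" "${merged[$key]}"
  done
}

# Small demonstration with made-up rows.
printf 'b\t2\ty\na\t1\tx\nb\t3\tz\n' | merge_in_order
```

To run it on a real file, redirect the file into the function, for example `merge_in_order < INPUT_FILE.tsv`.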
