removing extensions in a column

Question

I have a file like this

ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG NM_001402.5
ILMN_1343295 CTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTG NM_002046.3
ILMN_1651209 TCACGGCGTACGCCCTCATGGGGAAAATCTCCCCGGTGACTTTCAGGTCC NM_182838.1

I want to remove the numeric extensions from the end of the 3rd column so that my output file looks like this

ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG NM_001402
ILMN_1343295 CTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTG NM_002046
ILMN_1651209 TCACGGCGTACGCCCTCATGGGGAAAATCTCCCCGGTGACTTTCAGGTCC NM_182838

How can I do it on the command line, preferably using awk? I can do this in Perl, but I am pretty sure there is a one-line command for it.

αғsнιη · Accepted Answer · 2014-11-18 05:54:05Z

With awk:

awk -F'.' '{print $1}' file

The -F option changes the field separator from the default (whitespace) to a dot (.).
$1 is the first field, that is, everything before the first dot on the line.

{ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG NM_001402}.{5}
The first pair of braces marks field $1, the second marks field $2.
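To make the splitting concrete, here is a minimal sketch that runs the same awk program on one line from the question. The function name strip_after_first_dot is made up for illustration, and the second input ('a.b c.1') is an invented line showing the limitation: $1 stops at the first dot, so this approach assumes the extension is the only dot on the line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the text before the first dot on each line.
strip_after_first_dot() {
    awk -F'.' '{ print $1 }'
}

# One record taken from the question.
sample='ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG NM_001402.5'
printf '%s\n' "$sample" | strip_after_first_dot

# Invented line with two dots: everything after the first dot is lost.
printf '%s\n' 'a.b c.1' | strip_after_first_dot
```

If a dot can appear in an earlier column, a pattern anchored to the end of the line is the safer choice.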

With rev and awk:

rev file | awk -F'.' '{print $2}' | rev

This reverses the characters of each line, prints field number 2 using the dot (.) as separator, and reverses the result again.

The rev utility copies the specified files to standard output, reversing the order of characters in every line. If no files are specified, standard input is read.

With sed:

sed 's/\.[0-9]*$//' file
sed 's/\.[^.]*$//' file

$ anchors the match at the end of the line. The first sed command searches for a dot (.) followed by zero or more digits at the end of the line and replaces the match with nothing, which deletes it. The dot is escaped (\.) so that it matches a literal dot rather than any character.

The second sed command removes the last dot (.) on the line together with everything after it.
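In a sed pattern a bare . matches any character, so the dot needs a backslash to match literally. The sketch below compares the two spellings on a line that has no extension at all; the shortened sequence column (TGTG) is a stand-in for the real data.

```shell
#!/usr/bin/env bash
# A line with no ".N" extension (sequence column shortened for readability).
line='ILMN_1343291 TGTG NM_001402'

# Unescaped dot: "." matches any character, so "_001402" is removed.
unescaped=$(printf '%s\n' "$line" | sed 's/.[0-9]*$//')

# Escaped dot: only a literal "." followed by digits matches, so nothing changes.
escaped=$(printf '%s\n' "$line" | sed 's/\.[0-9]*$//')

printf 'unescaped: %s\nescaped:   %s\n' "$unescaped" "$escaped"
```

On the question's data both spellings give the same result because every line ends in a dot and a digit; the difference only shows up on lines without an extension.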

With rev and sed:

rev file| sed 's/.*[.]//' |rev

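After rev, the extension sits at the start of the line. The sed command deletes everything up to and including the last dot (.) on the reversed line, and the second rev restores the original character order.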

With grep:

grep -oP '.*(?=\.[0-9])' file

-o, --only-matching: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-P, --perl-regexp: Interpret PATTERN as a Perl compatible regular expression (PCRE)

(?=pattern): Positive Lookahead: The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

.*(?=\.[0-9]): matches everything (.*) that is followed by a dot (.) and a digit, without making the lookahead pattern (\.[0-9]) part of the match.

With rev and grep:

rev file | grep -oP '(?<=[0-9]\.).*' | rev
rev file | grep -oP '[0-9]\.\K.*' | rev

(?<=pattern): Positive Lookbehind. A pair of parentheses, with the opening parenthesis followed by a question mark, "less than" symbol, and an equals sign.

(?<=[0-9]\.).* matches everything that is preceded by a digit and a dot (.). Because the line has been reversed, that digit and dot are the extension.

The second grep command uses the nifty \K in place of the lookbehind assertion; \K drops everything matched so far from the reported match.

With cut:

cut -f1 -d. file
cut -c 1-77 file # Print the first 77 characters of each line; adjust the count to the width of your lines.

cut - remove sections from each line of files
-d, --delimiter=DELIM: use DELIM instead of TAB for field delimiter
-f, --fields=LIST: select only these fields
-c, --characters=LIST: select only these characters

With while loop:

while read line; do echo "${line::-2}";done <file

This works only if every line ends in a dot followed by a single digit, so that the extension is always exactly two characters long. The command above removes the last two characters of every line in the input file (the negative length needs bash 4.2 or later). An alternative is ${line%??}.
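When the extension length varies, a suffix pattern can replace the fixed character count. The sketch below uses ${line%.*}, which removes the shortest trailing match of "a dot followed by anything", that is, the text from the last dot onward. The helper name strip_last_ext, the shortened sequence columns, and the two-digit version .12 are all invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the last ".something" from each input line.
strip_last_ext() {
    local line
    # "|| [ -n "$line" ]" also handles a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # ${line%.*} removes the shortest suffix that starts with a dot;
        # a line containing no dot is printed unchanged.
        printf '%s\n' "${line%.*}"
    done
}

printf '%s\n' 'ILMN_1343291 TGTG NM_001402.5' 'ILMN_1651209 TCAC NM_182838.12' | strip_last_ext
```

Unlike the fixed-width forms, this does not check that the suffix is numeric, so it assumes the last dot on each line really starts the extension.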

muru · Accepted Answer · 2014-11-17 18:42:17Z

Assuming the extensions are all-digit:

perl -pi -e 's/\.\d+$//' /path/to/file

-i does in-place editing (like in sed). \d means digits, and $ denotes the end of the line.

With awk:

awk 'gsub(/\.[0-9]+$/,"")' /path/to/file

gawk has an in-place editing option in newer versions, but I am not sure how portable that is. gsub supports an optional parameter, specifying the target column:

awk 'gsub(/\.[0-9]+$/,"",$3)' /path/to/file

The last form has the undesired side-effect of separating each column by a single space in its output, as if you'd done print $1,..,$NF. I do not know why.
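The single-space output is standard awk behaviour: whenever a field is modified, awk rebuilds $0 by joining the fields with OFS, which defaults to one space. A second thing to be aware of is that using gsub(...) as the whole program makes its return value the pattern, so lines where nothing was replaced are not printed. The sketch below is one way to print every line regardless; the helper name strip_col3 and the short sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip a numeric ".N" suffix from column 3 only,
# and print every line whether or not a substitution happened.
strip_col3() {
    # The trailing "1" is an always-true pattern whose default action is print.
    awk '{ sub(/\.[0-9]+$/, "", $3) } 1'
}

# The first line uses two spaces between columns; modifying $3 makes awk
# rebuild it with single spaces. The second line has no extension.
printf '%s\n' 'id1  SEQ  NM_001402.5' 'id2 SEQ NM_002046' | strip_col3
```

To keep tabs or another delimiter in the output, OFS can be set to match the input separator, for example with -v OFS='\t' on tab-separated input.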

jasonwryan · Accepted Answer · 2014-11-17 20:26:25Z

Using awk it is straightforward, just set your field separator as .:

awk -F. '{print $1}' file

Another approach, using the shell (in this case bash):

while IFS=. read -r lines _; do line+=("$lines"); done <file
printf "%s\n" "${line[@]}"

ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG NM_001402
ILMN_1343295 CTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTG NM_002046
ILMN_1651209 TCACGGCGTACGCCCTCATGGGGAAAATCTCCCCGGTGACTTTCAGGTCC NM_182838

Ketan · Accepted Answer · 2014-11-17 18:36:57Z

With sed, you can do:

sed 's/\.[0-9][0-9]*$//' x.txt

This assumes the file name is x.txt. If you want to modify the file in place, use sed's -i switch:

sed -i 's/\.[0-9][0-9]*$//' x.txt

If you want to preserve the contents of the original file, redirect the output to a new file instead:

sed 's/\.[0-9][0-9]*$//' x.txt > newfile.txt
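A middle ground between the two is to edit in place while keeping a backup. The sketch below assumes a sed that accepts a suffix attached to -i (GNU and BSD sed both do, but -i is not part of POSIX). It works in a temporary directory so that no real file is touched; the shortened sample line is a stand-in for the real data.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing real is modified.
tmpdir=$(mktemp -d)
printf '%s\n' 'ILMN_1343291 TGTG NM_001402.5' > "$tmpdir/x.txt"

# Edit in place; the original is kept as x.txt.bak.
sed -i.bak 's/\.[0-9][0-9]*$//' "$tmpdir/x.txt"

cat "$tmpdir/x.txt"      # edited copy
cat "$tmpdir/x.txt.bak"  # untouched original
```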

edited Nov 17, 2014 at 18:36

answered Nov 17, 2014 at 18:27

Comments:

The OP said the extensions were numeric. While this works, you might want to adjust it so that .* does not end up matching more than the extension should a dot appear somewhere else. – John WH Smith, 2014-11-17 18:32

There is no need for the g flag on sed's s command: you substitute only one occurrence of the pattern per line. – Costas, 2014-11-17 18:33

Thanks @JohnWHSmith, made the change. @Costas, sorry, I habitually add the g. – Ketan, 2014-11-17 18:37

Barmar · Accepted Answer · 2014-11-17 18:28:03Z

This removes everything starting with the first dot on each line:

sed 's/\..*//'
