Revisions to Bash: Nested while loop to detect duplicates and number the duplicates

deleted 8 characters in body

edited Nov 5, 2020 at 10:44

I have a text file with headers for genes, and there are different gene sequences under the same species. I have extracted the headers into headers.txt and copied them into another file, uniqueheaders.txt, from which I removed all the duplicates.

I am trying to read uniqueheaders.txt line by line and, for each line, loop over headers.txt to check for duplicates. The if statement detects a duplicate and increments a counter, which is appended to the header. This should number all the headers in headers.txt so that I can insert them back into my FASTA file. My code is here:

    while IFS= read -r uniqueline
    do
        counter=0
        while IFS= read headline
        do
            if [ "$uniqueline" == "$headline" ]
            then
                let "counter++"
                #append counter to the headline variable to number it.
                sed "$headline s/$/$counter/" -i headers
            fi
        done < headers.txt
    done < uniqueheaders.txt

The issue is that the terminal keeps spitting out the error

sed: -e expression #1, char 1: unknown command: 'M'

and

sed: -e expression #1, char 2: extra characters after command

Both files contain unique header names:

    Mus musculus
    Homo sapiens
    Rattus norvegicus

How do I modify the sed command to prevent this error? Is there a better way of doing this in bash?
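One plausible reading of the two errors, offered as a sketch rather than a definitive diagnosis: after variable expansion, sed receives something like `Mus musculus s/$/1/` as its script. Anything in front of the `s` command has to be an address (a line number or a `/regex/`), so sed tries to parse the leading `M` as a command name. For `Homo sapiens`, `H` happens to be a valid sed command, so the complaint moves to the character after it. Wrapping the header text in `/.../` turns it into a regex address. The function name `append_counter` below is hypothetical, and the sketch assumes the header contains no `/` or regex metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: append a counter to every input line that equals the header text.
# Assumption: the header contains no "/" and no regex metacharacters.
append_counter() {
    local header=$1 counter=$2
    # After expansion the sed script reads:  /^Mus musculus$/ s/$/1/
    sed "/^${header}\$/ s/\$/${counter}/"
}

# Only the matching line is changed; other lines pass through untouched.
printf '%s\n' 'Mus musculus' 'Homo sapiens' | append_counter 'Mus musculus' 1
```

Note that this appends the same counter to every matching line, so on its own it does not give each occurrence a different number.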

Example input (note that the gene sequences don't follow any pattern in how many lines each one takes up; all gene sequences are in one file):

    Mus musculus
    MDFJSGHDFSBGKJBDFSGKJBDFS
    NGBJDFSBGKJDFSHNGKJDFSGHG

    Rattus norvegicus
    SNOFBDSFNLSFSFSFSJFJSDFSD

    Mus musculus
    NJALDJASJDLAJSJAPOJPOASDJG
    DSFHBDSFHSDFHDFSHJDFSJKSSF

Desired output:

    Mus musculus1
    MDFJSGHDFSBGKJBDFSGKJBDFS
    NGBJDFSBGKJDFSHNGKJDFSGHG

    Rattus norvegicus
    SNOFBDSFNLSFSFSFSJFJSDFSD

    Mus musculus2
    NJALDJASJDLAJSJAPOJPOASDJG
    DSFHBDSFHSDFHDFSHJDFSJKSSF
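The transformation from the example input to the desired output could be sketched in a single awk call instead of nested loops. This is a minimal sketch under stated assumptions, not the asker's method: the function name `number_duplicate_headers` and the file name `sequences.fasta` are hypothetical, header lines are assumed to match the entries of headers.txt exactly (no trailing whitespace or carriage returns), and headers.txt is assumed to be non-empty. Headers that occur only once are left unnumbered, as in the desired output.

```shell
#!/usr/bin/env bash
# Sketch: number only those headers that occur more than once.
# Arguments (both names are placeholders):
#   $1  file with one header per line, duplicates included (e.g. headers.txt)
#   $2  the combined sequence file
number_duplicate_headers() {
    awk '
        # First file: count how often each header occurs in total.
        NR == FNR { if (NF) total[$0]++; next }

        # Second file: a line is treated as a header only if it matches
        # a counted header exactly; repeated headers get a running number.
        ($0 in total) && total[$0] > 1 { n = ++seen[$0]; print $0 n; next }

        # Everything else (unique headers, sequence lines) passes through.
        { print }
    ' "$1" "$2"
}

# Hypothetical usage:
#   number_duplicate_headers headers.txt sequences.fasta > numbered.fasta
```

Reading the header list first lets awk know the total count of each header before it starts printing, which is what allows a header that appears only once to stay unnumbered.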

Formatting and tags

edited Nov 5, 2020 at 10:19

AdminBee

added 541 characters in body

edited Nov 5, 2020 at 10:14

Jerry

asked Nov 5, 2020 at 9:22

Jerry
