How to remove four random characters before the .extension from various files using a for loop?

Question

I am working on an undergraduate research project that is heavy in bioinformatics, and I am going down a pipeline of file processing. Some background: I am working with shotgun metagenomic data which is very large swatches of A,T,G,C (nucleotides in a DNA sample), and from what I gather, some qualifiers. I have gone through a few steps of the pipeline already which trimmed and cleaned up the files some, along with adding some qualifiers. The important thing is that these reads are mostly paired end reads, meaning two files reading the nucleotides right to left and left to right.

Prior to this, I had crammed my head with basically only biology and ecology so I don't really have any context for coding or how/why to do things or common practices/functions, etc. You get the point.

That said, I taught myself very basic for loops and string manipulation in UNIX, making some bash files that ran through different folders using different modules and functions. Here is example code:

cd ~/ncbi/public/sra/indian for forward_read_file in *_1.fastq do rev=_2 reverse_read_file=${forward_read_file/_1/$rev} perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i ${forward_read_file} -irev ${reverse_read_file} -c 1 -t5 -t3 rm ${forward_read_file} ${reverse_read_file} done #CAMEROON cd ~/ncbi/public/sra/cameroon for forward_read_file in *_1.fastq do rev=_2 reverse_read_file=${forward_read_file/_1/$rev} perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i ${forward_read_file} -irev ${reverse_read_file} -c 1 -t5 -t3 rm ${forward_read_file} ${reverse_read_file} done

and so on for many folders. I used string manipulation to get each iteration of the for loop to call the paired end files, and then some arguments and parameters for the module I'm using.

The big issue I am running into now is that I can't think of a way to pair the paired end files for my next step in the pipeline as they have four random characters right before the extension, and I can't predict them. They don't contain meaningful data, so my plan is to delete them from the filenames and proceed as I have been.

Here are examples of the problem files; the issue is the four characters at the end of the string. If I get rid of those I can do the string manipulation as usual.

SRR5898908_1_prinseq_good_ZsSX.fastq SRR5898928_2_prinseq_good_VygO.fastq SRR5898979_1_prinseq_good_CRzI.fastq SRR6166642_2_prinseq_good_nqVP.fastq SRR6166693_2_prinseq_good_y_OD.fastq SRR5898908_2_prinseq_good_HPTU.fastq SRR5898929_1_prinseq_good_p2mS.fastq SRR5898979_2_prinseq_good_vYcE.fastq SRR6166643_1_prinseq_good_fc8y.fastq SRR6166694_1_prinseq_good_Ka1C.fastq SRR5898909_1_prinseq_good_X41r.fastq SRR5898929_2_prinseq_good_uO8g.fastq SRR5898980_1_prinseq_good_WuPS.fastq SRR6166643_2_prinseq_good_QUUK.fastq SRR6166694_2_prinseq_good_ZlNk.fastq SRR5898909_2_prinseq_good_GbmA.fastq SRR5898930_1_prinseq_good_3qyA.fastq

Where the beginning SRRxxxxx is the sample, and the 1 or 2 are forward and reverse reads respectively, hence my string manipulation. The issue is the four characters at the end of the string. If I get rid of those I can do the string manipulation as usual. My mentor recommended I use the FIND or CUT functions somehow, and talked about using the return of the find as a variable to manipulate, but I feel like that would still run into the same issue.

How can I remove these characters safely using a for loop? Or whatever you think would work best.

Thank you!

It seems that for most of the SRR* sample IDs, you have _1 and _2 files...but some of them have only _1 or _2 files. How should those the exceptions be handled? — cas
– cas, Commented May 27, 2021 at 2:21
Hi cas, this was just a few files I copy pasted. They all have paired files, but I didn't see the point in copying all of the files in the folder, I just gave a few examples. I do have other samples from a different population with only forward read files (_1), but these aren't an issue at this step because I don't need to match them with their respective reverse reads to execute the next step. — TreaHugger
– TreaHugger, Commented May 27, 2021 at 2:31

cas · Accepted Answer · 2021-05-27 03:03:19Z

Try something like this:

for forward_read_file in *_1*.fastq; do srr=$(echo "$forward_read_file" | cut -d_ -f1) rrf_array=( $(find . -name "${srr}_2_*.fastq") ) case "${#rrf_array[@]}" in 0) echo "Warning: No reverse read file found for $forward_read_file" > /dev/stderr ;; 1) reverse_read_file="${rrf_array[1]}" perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i "$forward_read_file" -irev "$reverse_read_file" -c 1 -t5 -t3 ;; *) echo "Error: multiple reverse read files found for $forward_read_file" > /dev/stderr ;; esac done

This iterates over all the _1 files. It uses cut to extract the SRR sample id, and then uses that with the find command to find any matching _2 files. find's output is stored in an array because we don't know how many results might be returned.

It handles three possible outcomes - no matches (not good), exactly 1 match (good, that's what we want), and more than 1 match (again, not good).

If there's only one result, extract the matching file from the array and process it with your perl script.

If there is zero or more than one result, print a warning message to stderr and continue on to the next _1 filename. You could add ; exit 1 (or other code to handle the error) before the ;; for those cases if you wanted to.

This ignores all parts of the filenames except for the SRR sample id at the beginning and the _1 or _2 that identifies it as a forward or reverse pairing file.

BTW, this could have been done with an if; then; else instead of a case statement but I thought it was useful to handle zero and more-than-one cases differently. e.g.

if [ "${#rrf_array[@]}" == 1 ]; reverse_read_file="${rrf_array[1]}" perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i "$forward_read_file" -irev "$reverse_read_file" -c 1 -t5 -t3 else echo "Warning: unknown problem with reverse read file for $forward_read_file" > /dev/stderr fi

If you just want to ignore the "problem" files, delete the else block.

BTW, to make your script more readable, I suggest doing something like this near the top of your script:

AFilter='/home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl'

and, later:

perl "$AFilter" -i "$forward_read_file" -irev "$reverse_read_file" -c 1 -t5 -t3

Alternatively, if the perl script(s) are executable (i.e. with a #!/usr/bin/perl or similar shebang line, and with the executable flag set with chmod +x), you can just add /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/ to your $PATH:

PATH="$PATH:/home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming"

and run the script as:

AmbiguityFiltering.pl -i "$forward_read_file" -irev "$reverse_read_file" -c 1 -t5 -t3

No problem. BTW, something I should have pointed out in my answer was that you need to double quote your variables when you use them (i.e. use "$var" or "${var}" instead of just $var or ${var}) and that the only time you need to use curly braces around a variable name is to disambiguate it from other text. e.g. "$var_1_xyz.txt" will be treated as one long variable named var_1_xyz without curly braces because underscore is a valid character for variable names, so needs to be typed as "${var}_1_xyz.txt". It's not wrong to use {} at other times, it's just not necessary. — cas
– cas, Commented May 28, 2021 at 23:10

tansy · Accepted Answer · 2021-05-27 22:23:57Z

Is renaming from title what you mean?

Like this:

cat a2 | sed -e 's|\(.*\)\(good_\)\(.*\)\(.fastq\)|mv \1\2\3\4 \1\2\4|' mv SRR5898908_1_prinseq_good_ZsSX.fastq SRR5898908_1_prinseq_good_.fastq mv SRR5898928_2_prinseq_good_VygO.fastq SRR5898928_2_prinseq_good_.fastq

Stack Exchange Network

How to remove four random characters before the .extension from various files using a for loop?

2 Answers 2

You must log in to answer this question.

Hot Network Questions

How to remove four random characters before the .extension from various files using a for loop?

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions