I am working on an undergraduate research project that is heavy in bioinformatics, and I am partway through a file-processing pipeline. Some background: I am working with shotgun metagenomic data, which is very large swaths of A, T, G, C (nucleotides in a DNA sample) and, from what I gather, some qualifiers. I have already gone through a few steps of the pipeline, which trimmed and cleaned up the files somewhat and added some qualifiers. The important thing is that these reads are mostly paired-end reads, meaning each sample comes as two files, one reading the nucleotides left to right and the other right to left.
Prior to this, I had crammed my head with basically only biology and ecology so I don't really have any context for coding or how/why to do things or common practices/functions, etc. You get the point.
That said, I taught myself very basic for loops and string manipulation in UNIX, and wrote some bash scripts that run through different folders using different modules and functions. Here is some example code:
    cd ~/ncbi/public/sra/indian
    for forward_read_file in *_1.fastq
    do
        rev=_2
        reverse_read_file=${forward_read_file/_1/$rev}
        perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i ${forward_read_file} -irev ${reverse_read_file} -c 1 -t5 -t3
        rm ${forward_read_file} ${reverse_read_file}
    done

    #CAMEROON
    cd ~/ncbi/public/sra/cameroon
    for forward_read_file in *_1.fastq
    do
        rev=_2
        reverse_read_file=${forward_read_file/_1/$rev}
        perl /home/gomeza/shared/sharm646-2021-02-24-09_22/Softwares/NGSQCToolkit_v2.3.3/Trimming/AmbiguityFiltering.pl -i ${forward_read_file} -irev ${reverse_read_file} -c 1 -t5 -t3
        rm ${forward_read_file} ${reverse_read_file}
    done

And so on for many folders. I used string manipulation to get each iteration of the for loop to call the paired-end files, followed by some arguments and parameters for the module I'm using.
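One thing I noticed about the substitution: as far as I understand, `${forward_read_file/_1/$rev}` replaces the first `_1` found anywhere in the name, not necessarily the one before the extension. A minimal sketch of a stricter version, assuming every forward file ends in `_1.fastq` (the function name `reverse_name` is my own invention for illustration):

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the reverse-read filename from a forward-read one.
# Only the trailing "_1.fastq" is rewritten, so an "_1" earlier in the name
# is left alone. Assumes the argument really ends in "_1.fastq".
reverse_name() {
    local forward=$1
    printf '%s\n' "${forward%_1.fastq}_2.fastq"
}

reverse_name "SRR5898908_1.fastq"   # prints SRR5898908_2.fastq
```

Inside the loop this would be used as `reverse_read_file=$(reverse_name "$forward_read_file")`.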
The big issue I am running into now is that I can't think of a way to pair the paired-end files for my next step in the pipeline, because each filename has four random characters right before the extension and I can't predict them. They don't contain meaningful data, so my plan is to delete them from the filenames and proceed as I have been.
Here are examples of the problem files; the issue is the four characters at the end of the string. If I get rid of those I can do the string manipulation as usual.
    SRR5898908_1_prinseq_good_ZsSX.fastq
    SRR5898928_2_prinseq_good_VygO.fastq
    SRR5898979_1_prinseq_good_CRzI.fastq
    SRR6166642_2_prinseq_good_nqVP.fastq
    SRR6166693_2_prinseq_good_y_OD.fastq
    SRR5898908_2_prinseq_good_HPTU.fastq
    SRR5898929_1_prinseq_good_p2mS.fastq
    SRR5898979_2_prinseq_good_vYcE.fastq
    SRR6166643_1_prinseq_good_fc8y.fastq
    SRR6166694_1_prinseq_good_Ka1C.fastq
    SRR5898909_1_prinseq_good_X41r.fastq
    SRR5898929_2_prinseq_good_uO8g.fastq
    SRR5898980_1_prinseq_good_WuPS.fastq
    SRR6166643_2_prinseq_good_QUUK.fastq
    SRR6166694_2_prinseq_good_ZlNk.fastq
    SRR5898909_2_prinseq_good_GbmA.fastq
    SRR5898930_1_prinseq_good_3qyA.fastq

The beginning SRRxxxxx is the sample, and the 1 or 2 marks the forward or reverse reads respectively, hence my string manipulation. My mentor recommended I use the find or cut commands somehow, and talked about using the output of find as a variable to manipulate, but I feel like that would still run into the same issue.
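Here is a rough sketch of the renaming I have in mind, in case it helps to see it. It assumes every problem file ends in `_prinseq_good_` followed by the random characters and `.fastq`. I anchor on `_prinseq_good_` rather than on the last underscore because one of the random tags above (`y_OD`) contains an underscore itself. The function names `clean_name` and `strip_tags` are made up for illustration, and I have not run this on my real data.

```shell
#!/usr/bin/env bash

# Hypothetical helper: drop the random tag from one filename, e.g.
# SRR5898908_1_prinseq_good_ZsSX.fastq -> SRR5898908_1_prinseq_good.fastq
# Assumes the name ends in "_prinseq_good_<tag>.fastq".
clean_name() {
    local name=$1
    printf '%s\n' "${name%_prinseq_good_*.fastq}_prinseq_good.fastq"
}

# Hypothetical helper: rename every tagged file in a directory
# (defaults to the current directory). Existing targets are never overwritten.
strip_tags() {
    local dir=${1:-.} file target
    for file in "$dir"/*_prinseq_good_*.fastq; do
        [ -e "$file" ] || continue      # the glob matched nothing
        target=$(clean_name "$file")
        if [ -e "$target" ]; then
            echo "skipping $file: $target already exists" >&2
            continue
        fi
        mv -- "$file" "$target"
    done
}
```

Running `strip_tags ~/ncbi/public/sra/indian` would then leave names like `SRR5898908_1_prinseq_good.fastq`, where the `_1` to `_2` substitution works as before. Testing on a copy of one folder first seems wise, since the rename cannot be undone automatically.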
How can I remove these characters safely using a for loop, or by whatever approach you think would work best?
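For comparison, here is my attempt at the other idea, which leaves the filenames alone and looks up each mate with a wildcard instead. It assumes names of the form `<sample>_1_prinseq_good_<tag>.fastq` and exactly one reverse file per sample. The names `find_mate` and `list_pairs` are hypothetical, and I am not sure this is the idiomatic way to do it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the reverse-read mate of a forward-read file.
# Fails (returns 1) unless exactly one matching reverse file exists.
find_mate() {
    local forward=$1 sample
    local -a matches
    sample=${forward%_1_prinseq_good_*.fastq}        # e.g. SRR5898908
    matches=( "${sample}"_2_prinseq_good_*.fastq )   # wildcard stands in for the tag
    if [ "${#matches[@]}" -eq 1 ] && [ -e "${matches[0]}" ]; then
        printf '%s\n' "${matches[0]}"
    else
        return 1
    fi
}

# Hypothetical helper: print "forward<TAB>reverse" for each complete pair
# in a directory (defaults to the current directory).
list_pairs() {
    local dir=${1:-.} forward reverse
    for forward in "$dir"/*_1_prinseq_good_*.fastq; do
        [ -e "$forward" ] || continue   # the glob matched nothing
        if reverse=$(find_mate "$forward"); then
            printf '%s\t%s\n' "$forward" "$reverse"
        else
            echo "no unique mate for $forward" >&2
        fi
    done
}
```

In my real loop, the `printf` line would be replaced by the call to the next pipeline step with `"$forward"` and `"$reverse"` as its two inputs.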
Thank you!