
@Tixii Tixii commented Feb 22, 2021

This adds a prefixsplit command that reads a FASTA/FASTQ file and splits it into separate files based on the prefix of each read, to enable faster sorting of read sets.

Usage: seqtk prefixsplit [options] <output_filename> <in.fa>
Options:
  -p INT   length of prefix
  -A       force FASTA output (discard quality)
  -C       drop comments at the header lines

It creates one file for each prefix of the specified length, e.g.
output_filename.AA.fa
output_filename.AC.fa
...
plus a single file that collects the reads with an N at any position in the prefix:
output_filename.N.fa
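To make the naming scheme above concrete, here is a minimal sketch of how a read could be mapped to its output file. It is not the code from this PR: prefix_label and prefix_filename are hypothetical helper names, and sending lowercase bases to the uppercase bin, and non-ACGT characters or reads shorter than the prefix to the N file, are assumptions rather than documented behaviour.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PR's code): write the bin label for a read
 * into `label`, which must hold at least plen + 1 bytes.
 * Returns 0 on success, -1 if plen is outside the supported range 1..3. */
static int prefix_label(const char *seq, int plen, char *label)
{
    if (plen < 1 || plen > 3) return -1;
    for (int i = 0; i < plen; ++i) {
        char c = (char)toupper((unsigned char)seq[i]);
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            /* Assumption: N, any other non-ACGT character, and a read that
             * ends before the prefix is complete all go to the single N bin. */
            strcpy(label, "N");
            return 0;
        }
        label[i] = c;
    }
    label[plen] = '\0';
    return 0;
}

/* Hypothetical helper: build "<base>.<label>.<ext>", e.g. "out.AC.fa".
 * Returns 0 on success, -1 if the name does not fit in buf. */
static int prefix_filename(char *buf, size_t n, const char *base,
                           const char *label, const char *ext)
{
    int len = snprintf(buf, n, "%s.%s.%s", base, label, ext);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```

With -p 2, a read starting with ACGT would then be written to output_filename.AC.fa, and a read starting with AN would go to output_filename.N.fa.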

Currently only prefix lengths of 1, 2, or 3 are supported (at most 4^3 = 64 prefix files, plus the N file), as I felt that creating more files than that wouldn't be useful.
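The file count follows from 2-bit base encoding: a prefix of length p has 4^p possible values, plus one extra bin for N. A rough sketch of that arithmetic, which could also serve as an index into an array of open file handles, is shown below. The names num_bins and prefix_index are hypothetical and not taken from the PR, and the caller is assumed to have already checked that plen is 1, 2, or 3.

```c
/* Number of output files for a given prefix length: 4^plen prefix bins
 * plus one shared bin for prefixes containing N. */
static int num_bins(int plen)
{
    return (1 << (2 * plen)) + 1;
}

/* Hypothetical helper: map the first plen bases of seq to a bin index in
 * 0 .. 4^plen - 1 using A=0, C=1, G=2, T=3. Any other character, including
 * the terminating '\0' of a read shorter than plen, maps to the last bin
 * (index 4^plen), which is treated here as the N bin by assumption. */
static int prefix_index(const char *seq, int plen)
{
    int idx = 0;
    for (int i = 0; i < plen; ++i) {
        int code;
        switch (seq[i]) {
        case 'A': case 'a': code = 0; break;
        case 'C': case 'c': code = 1; break;
        case 'G': case 'g': code = 2; break;
        case 'T': case 't': code = 3; break;
        default: return 1 << (2 * plen); /* N bin */
        }
        idx = (idx << 2) | code;
    }
    return idx;
}
```

For p = 3 this gives 65 bins in total, which is where the limit of 64 prefix files comes from; p = 4 would already need 257.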

The -A and -C options remove the quality scores and drop header comments, using the same methods as the seqtk seq command.

I have tried to follow the coding style of the rest of the file; however, this is my first time coding in C, so I am sure there are improvements that could be made.

Unknown added 4 commits February 22, 2021 13:04