Here's a revamped version of your code with comments:
    #include <stdio.h>
    #include <string.h>

    int main( void ) {
        char str[ 32 + 1 ]; // Up to 32 bases (plus terminator)
        char xtr[ 64 + 1 ] = ""; // Expands to 64

        if( scanf( "%32s", str ) != 1 ) // Limit user entry
            return 1;
        int obe = (int)strlen( str ); // Number of bases actually entered

        // Pad (with 'A') to multiple of 4; adds 0 when already a multiple
        for( int i = ( 4 - obe % 4 ) % 4; i > 0; i-- )
            strcat( str, "A" );

        // Convert bases to binary values in a string
        for( int j = 0; str[ j ]; j++ )
            if ( str[j] == 'A' ) strcat( xtr, "00" );
            else if ( str[j] == 'T' ) strcat( xtr, "01" );
            else if ( str[j] == 'C' ) strcat( xtr, "10" );
            else if ( str[j] == 'G' ) strcat( xtr, "11" );

        // Output in blocks of 8 digits.
        for( int k = 0, len = strlen( xtr ); k < len; k += 8 )
            printf( "%d - %.8s\n", k, xtr + k );

        return 0;
    }
    ATTCGG
    0 - 00010110
    8 - 11110000
Converting a DNA sequence to an intermediary string is unnecessary.
Fortuitously, the ASCII codes for the letters 'A', 'C', 'G' and 'T' differ from one another in bit 1 and bit 2, so those two bits alone are enough to tell the bases apart. Note: this encoding differs from yours; it assigns a different bit pattern to each base.
    'A' = 0bxxxxx00x ==> 0 // 'x' == "don't care"
    'C' = 0bxxxxx01x ==> 2
    'G' = 0bxxxxx11x ==> 6
    'T' = 0bxxxxx10x ==> 4
The downside is that, compared with the conventional "ACGT" order, these raw values swap the last two bases: 'G' lands above 'T'.
This 'swap' can be 'unswapped' with a translation using a crafted 8-bit hexadecimal constant.
Explore the following code and study the demonstration strings below:
    #include <stdio.h>

    void demo( const char *p ) { // chunks of bases into registers
        puts( p );
        while( *p ) {
            // unsigned char asBits = 0; // 4 bases/chunk
            // unsigned short asBits = 0; // 8 bases/chunk
            unsigned int asBits = 0; // 16 bases/chunk
            // unsigned long asBits = 0; // 32 bases/chunk
            const int pack = sizeof(asBits) * 4;

            // The ASCII for each of ACGT is pretty fortunate; can be hashed to two bits 0-3.
            // 0xB4: (0b10110100) 4 pairs of bits crafted to correspond to "GTCA" (reversed for shifting.)
            // Note that T&G are swapped by that 'magic byte' to conform to conventional "ACGT"
            // "AND"ing with 6 masks for the two fortunate bits,
            // "0xB4" is right shifted 0, 2, 6 or 4 bits,
            // that is then masked (3&) for its lowest two bits.
            // 'A'->0b00, 'C'->0b01, 'G'->0b10 and 'T'->0b11
            // The accumulator is shifted and this pair OR'd where they belong.
            int i;
            for( i = pack; *p && i; p++, i-- )
                asBits = asBits<<2 | (3 & (0xB4>>(*p&6))); // using one of several mapping functions

            // Sequence length may not be a multiple of 16, so tack on extra 0b00 to pad as needed
            asBits <<= i+i; // padding for stragglers

            // Playback for verification
            printf( "%0*X - ", pack/2, asBits );
            for( int j = pack+pack-2; j >= 0; j -= 2 )
                putchar( "ACGT"[(asBits>>j)&3] );
            putchar( '\n' );
        }
    }

    int main( void ) {
        /* Some bonus alternative translation functions
        char *cp;
    #   define M1 "\0\1\3\2"[*cp>>1&3]
    #   define M2 "\0\0\0\1\3\0\0\2"[*cp&7]
    #   define M3 3&0x8340>>(*cp<<1&0xF)
    #   define M4 3&0xB4>>(*cp&6)
        char *n = "0123";
        for( cp = "ACGT"; *cp; cp++ )
            printf( "%c %c%c%c%c\n", *cp, n[M1], n[M2], n[M3], n[M4] );
        */

        demo( "TGCTTGCCTGCATGCA" ); // 16 bases
        demo( "TTGCTTGCCTGCATGCT" ); // 17 bases

        demo( "T" ); // 1-4 bases
        demo( "AT" );
        demo( "AAT" );
        demo( "AAAT" );

        // lots of bases
        demo( "CATCATCATCATCATCATCATCATCATCATCATCATCATCATCAT" );

        return 0;
    }
Output demonstration:
    TGCTTGCCTGCATGCA
    E7E5E4E4 - TGCTTGCCTGCATGCA
    TTGCTTGCCTGCATGCT
    F9F97939 - TTGCTTGCCTGCATGC
    C0000000 - TAAAAAAAAAAAAAAA
    T
    C0000000 - TAAAAAAAAAAAAAAA
    AT
    30000000 - ATAAAAAAAAAAAAAA
    AAT
    0C000000 - AATAAAAAAAAAAAAA
    AAAT
    03000000 - AAATAAAAAAAAAAAA
    CATCATCATCATCATCATCATCATCATCATCATCATCATCATCAT
    4D34D34D - CATCATCATCATCATC
    34D34D34 - ATCATCATCATCATCA
    D34D34C0 - TCATCATCATCATAAA
Play around with this for a while.
EDIT:
Here's another version of the core processing that simultaneously converts batches of 4 bases to both a string of 1's and 0's, and shows a decimal equivalent.
    unsigned char four = 0;

    // Convert bases to binary values in a string
    int j = 0;
    while( str[ j ] ) {
        if ( str[j] == 'A' ) strcat( xtr, "00" ), four = (four << 2) | 0;
        else if ( str[j] == 'T' ) strcat( xtr, "01" ), four = (four << 2) | 1;
        else if ( str[j] == 'C' ) strcat( xtr, "10" ), four = (four << 2) | 2;
        else if ( str[j] == 'G' ) strcat( xtr, "11" ), four = (four << 2) | 3;

        if( ++j % 4 == 0 ) {
            printf( "%s - %3d\n", xtr, four );
            xtr[0] = '\0';
            four = 0;
        }
    }
    ATTCGG
    00010110 -  22
    11110000 - 240
ret[i] writes to an uninitialized pointer, which is obviously wrong. Apart from that, what "pointer casts"? There are no casts in the code posted. Please post complete code and exact compiler messages.

atoi(): a) you must terminate the string you pass, b) it doesn't convert binary; for that you need strtol().