Revisions to Sandbox for Proposed Challenges

edited body

edited Jun 8, 2020 at 7:03

Because you can use any 4 distinct characters you want, it's acceptable to use, for example, lower-case instead of upper-case, or to use T instead of U, or to use 0123 instead of ACGU, or even to output the complementary strand (with A and U switched, and C and G switched).

added 3 characters in body

edited Jun 8, 2020 at 4:51

Mitchell Spector

The challenge is to output that sequence using as few bytes in your program as possible (code golf). You can write either a full program or a function.

added 16 characters in body

edited Jun 8, 2020 at 4:42

Mitchell Spector

##Background
As you probably learned in biology class, DNA and RNA are composed of strands of nucleotides; each nucleotide consists of a chemical called a base together with a sugar and a phosphate group. The information stored in the DNA or RNA is coded as a sequence of bases. DNA uses the bases A, C, G, and T (standing for adenine, cytosine, guanine, and thymine), while RNA uses A, C, G, and U (with uracil replacing thymine).

##Challenge
The genome of SARS-CoV-2, the virus that causes COVID-19, has been fully sequenced. This genome is a sequence of 29,903 bases, each base being one of A, C, G, or U, since it's an RNA virus.

Because the names A, C, G, and U are arbitrary, you can use any 4 characters you want instead:

You must use exactly 4 characters (they must be pairwise distinct--two or more can't be equal).
Each one of the 4 characters must be a printable ASCII character in the range from '!' to '~', inclusive (ASCII 33 to 126). In particular, this does not include the space character or the newline character.
Each of the 4 characters you use must always represent the same one of A, C, G, and U -- no changing in the middle!

Click to see the required output. (Including all 29,903 characters here would cause this to exceed a StackExchange maximum size.)

Because you can use any 4 distinct characters you want, it's acceptable to use, for example, lower-case instead of upper-case, or to use T instead of U, or to use 0123 instead of ACGT, or even to output the complementary strand (with A and U switched, and C and G switched).

##Restrictions
Standard loopholes are prohibited as usual. In particular, it's not allowed to retrieve information online or from any source other than your program. You also can't use any built-in which yields genomic data or protein data (these would generally retrieve data from the Internet so they wouldn't be allowed anyway, but some languages may have this facility built in internally; use of such functionality is prohibited whether implemented internally or externally).

##Verifying Your Program
I've set up a way to check that your program's output is correct. Just copy and paste your program's output into the argument in this verification program on TIO and run it.

##Other Info
Some facts that may or may not be of help:

There are 29,903 bases in the sequence. The counts for the individual bases are:

If you simply code each of the 4 bases in 2 bits, that would get you down to 7476 bytes (plus program overhead), so any competitive answer is likely to be shorter than that.

The source for the data can be found at this web page at NIH; scroll down to ORIGIN. The data is written there in lower-case letters, and 't' is used instead of 'u', apparently because DNA sequencing techniques were used.

There are variant strains of SARS-CoV-2 known (the base sequences are slightly different, and the length varies a bit); I believe the one here is the first one sequenced, from Wuhan.

Groups of 3 consecutive bases code for particular amino acids, so it might be useful to analyze the data in groups of 3. But there are non-coding areas where the number of bases isn't necessarily a multiple of 3, so you may not want to just divide the data into groups of 3 starting at the beginning. If it might be useful, you can find more info on the structure of the virus RNA here (but this probably isn't needed).

Disclaimer: I'm not a biologist. If anyone has any corrections or improvements on the underlying biology (or anything else, of course), please let me know!

As you probably learned in biology class, DNA and RNA are composed of strands of nucleotides; each nucleotide consists of a chemical called a base together with a sugar and a phosphate group. The information stored in the DNA or RNA is coded as a sequence of bases. DNA uses the bases A, C, G, and T (standing for adenine, cytosine, guanine, and thymine), while RNA uses A, C, G, and U (with uracil replacing thymine).

The genome of SARS-CoV-2, the virus that causes COVID-19, has been fully sequenced. This genome is a sequence of 29,903 bases, each base being one of A, C, G, or U, since it's an RNA virus.

Because the names A, C, G, and U are arbitrary, you can use any four characters you want instead:

You must use exactly 4 characters (they must be pairwise distinct -- two or more of them can't be equal).
Each one of the 4 characters must be a printable ASCII character in the range from '!' to '~', inclusive (ASCII 33 to 126). In particular, this does not include the space character or the newline character.
Each of the 4 characters you use must always represent the same one of A, C, G, and U -- no changing in the middle!

Click to see the required output. (Unfortunately including all 29,903 characters here would cause this to exceed a StackExchange maximum size.)

Because you can use any four distinct characters you want, it's acceptable to use, for example, lower-case instead of upper-case, or to use T instead of U, or to use 0123 instead of ACGT, or even to output the complementary strand (with A and U switched, and C and G switched).

I've set up a way to check that your program's output is correct. Just copy and paste your program's output into the argument in this verification program on TIO and run it.

Standard loopholes are prohibited as usual. In particular, it's not allowed to retrieve information online or from any source other than your program. You also can't use any built-in which yields genomic data or protein data (these would generally retrieve data from the Internet so they wouldn't be allowed anyway, but some language may have this facility built in internally -- use of such a functionality is prohibited whether it's implemented internally or externally).

Some facts that may or may not be of help:

There are 29,903 bases in the sequence.
The counts for the individual bases are:

If you simply code each of the 4 bases in 2 bits, that would get you down to 7476 bytes (plus program overhead), so any competitive answer is likely to be shorter than that.

If you're interested in my source for the data, you can find it at this web page at NIH; scroll down to ORIGIN. The data is written there in lower-case letters, and 't' is used instead of 'u', apparently because DNA sequencing techniques were used.

There are variant strains of SARS-CoV-2 known (the base sequences are slightly different, and the length varies a bit); I believe the one here is the first one sequenced, from Wuhan.

Groups of 3 consecutive bases code for particular amino acids, so it might be useful to analyze the data in groups of 3. But there are non-coding areas where the number of bases isn't necessarily a multiple of 3, so you may not want to just divide the data into groups of 3 starting at the beginning. If it might be useful, you can find more info on the structure of the virus RNA here (but this probably isn't needed).

Disclaimer: I'm not a biologist. If anyone has any corrections or improvements on the underlying biology (or anything else, of course), please let me know!

##Background
As you probably learned in biology class, DNA and RNA are composed of strands of nucleotides; each nucleotide consists of a chemical called a base together with a sugar and a phosphate group. The information stored in the DNA or RNA is coded as a sequence of bases. DNA uses the bases A, C, G, and T (standing for adenine, cytosine, guanine, and thymine), while RNA uses A, C, G, and U (with uracil replacing thymine).

##Challenge
The genome of SARS-CoV-2, the virus that causes COVID-19, has been fully sequenced. This genome is a sequence of 29,903 bases, each base being one of A, C, G, or U, since it's an RNA virus.

Because the names A, C, G, and U are arbitrary, you can use any 4 characters you want instead:

You must use exactly 4 characters (they must be pairwise distinct--two or more can't be equal).
Each one of the 4 characters must be a printable ASCII character in the range from '!' to '~', inclusive (ASCII 33 to 126). In particular, this does not include the space character or the newline character.
Each of the 4 characters you use must always represent the same one of A, C, G, and U -- no changing in the middle!
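
The character rules above can be pre-checked locally. The following is only a rough sketch, not the verification program linked from this post: the function name `check_alphabet` is made up for illustration, and it cannot tell whether each character consistently stands for the same base or whether the sequence itself is right.

```python
GENOME_LENGTH = 29_903  # number of bases stated in the challenge


def check_alphabet(output: str, expected_length: int = GENOME_LENGTH) -> list[str]:
    """Return a list of rule violations (empty if none were found).

    Hypothetical helper: it checks only the length and the character set,
    not the actual base sequence.
    """
    problems = []
    if len(output) != expected_length:
        problems.append(f"expected {expected_length} characters, got {len(output)}")
    alphabet = set(output)
    if len(alphabet) != 4:
        problems.append(f"expected exactly 4 distinct characters, got {len(alphabet)}")
    # Printable ASCII from '!' (33) to '~' (126); space and newline are excluded.
    bad = sorted(c for c in alphabet if not 33 <= ord(c) <= 126)
    if bad:
        problems.append(f"characters outside '!'..'~': {bad!r}")
    return problems
```

An empty list means only that the length and alphabet look plausible; the output still has to pass the real verifier.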

Click to see the required output. (Including all 29,903 characters here would cause this to exceed a StackExchange maximum size.)

Because you can use any 4 distinct characters you want, it's acceptable to use, for example, lower-case instead of upper-case, or to use T instead of U, or to use 0123 instead of ACGT, or even to output the complementary strand (with A and U switched, and C and G switched).
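
To illustrate why these variants are all equivalent, here is a small sketch in which each one is just a fixed one-to-one renaming of the 4 characters. The helper `remap` and the sample fragment are assumptions made for this example only.

```python
def remap(sequence: str, source: str = "ACGU", target: str = "0123") -> str:
    """Rename each base in `sequence` using a fixed one-to-one mapping."""
    return sequence.translate(str.maketrans(source, target))


sample = "AUUAAAGG"  # illustrative fragment only

print(remap(sample))                 # digits instead of letters: 03300022
print(remap(sample, target="acgt"))  # lower-case, T for U:       attaaagg
print(remap(sample, target="UGCA"))  # complementary strand:      UAAUUUCC
```

Because every mapping is a bijection on the 4 symbols, any of these outputs can be converted back to the A/C/G/U form without loss.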

##Restrictions
Standard loopholes are prohibited as usual. In particular, it's not allowed to retrieve information online or from any source other than your program. You also can't use any built-in which yields genomic data or protein data (these would generally retrieve data from the Internet so they wouldn't be allowed anyway, but some languages may have this facility built in internally; use of such functionality is prohibited whether implemented internally or externally).

##Verifying Your Program
I've set up a way to check that your program's output is correct. Just copy and paste your program's output into the argument in this verification program on TIO and run it.

##Other Info
Some facts that may or may not be of help:

There are 29,903 bases in the sequence. The counts for the individual bases are:

If you simply code each of the 4 bases in 2 bits, that would get you down to 7476 bytes (plus program overhead), so any competitive answer is likely to be shorter than that.
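
The 7476 figure is just 29,903 × 2 bits rounded up to whole bytes. As a rough sketch of that baseline (the function names and the A=0, C=1, G=2, U=3 assignment are arbitrary choices for this example, not part of the challenge):

```python
import math


def pack_2bit(sequence: str, alphabet: str = "ACGU") -> bytes:
    """Pack 4 bases per byte, 2 bits each, first base in the high bits."""
    code = {base: i for i, base in enumerate(alphabet)}
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int, alphabet: str = "ACGU") -> str:
    """Inverse of pack_2bit; `length` drops the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(alphabet[(byte >> shift) & 3])
    return "".join(bases[:length])


print(math.ceil(29_903 * 2 / 8))  # 7476
```

The decoder needs the length (or some other terminator) because the last byte carries padding bits; that bookkeeping is part of the program overhead mentioned above.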

The source for the data can be found at this web page at NIH; scroll down to ORIGIN. The data is written there in lower-case letters, and 't' is used instead of 'u', apparently because DNA sequencing techniques were used.

There are variant strains of SARS-CoV-2 known (the base sequences are slightly different, and the length varies a bit); I believe the one here is the first one sequenced, from Wuhan.

Groups of 3 consecutive bases code for particular amino acids, so it might be useful to analyze the data in groups of 3. But there are non-coding areas where the number of bases isn't necessarily a multiple of 3, so you may not want to just divide the data into groups of 3 starting at the beginning. If it might be useful, you can find more info on the structure of the virus RNA here (but this probably isn't needed).
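
One way to explore the grouping idea is to split the sequence into triples at each of the three possible starting offsets and compare the results. This is only a sketch; the helper name `codons` and the sample string are assumptions for illustration.

```python
def codons(sequence: str, offset: int = 0) -> list[str]:
    """Split `sequence` into complete groups of 3, starting at `offset`.

    Bases before the offset and any incomplete trailing group are dropped.
    """
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


sample = "AUGGCUUA"  # illustrative fragment only
for offset in range(3):
    print(offset, codons(sample, offset))
# 0 ['AUG', 'GCU']
# 1 ['UGG', 'CUU']
# 2 ['GGC', 'UUA']
```

Which offset lines up with the actual coding regions may differ from one part of the genome to another, so a single global offset is not necessarily the best choice.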

Disclaimer: I'm not a biologist. If anyone has any corrections or improvements on the underlying biology (or anything else, of course), please let me know!