Adventures in Data Science - Chapter I

DNA to binary and back again

Possibly one of the worst ways to start teaching data science is to hit people right off the bat with a big segment on binary code. Unfortunately, you found my page and that's exactly what's gonna happen now. I am, however, going to include some DNA swagger for some added kick just to make sure you're on the right track, because I have no intention of teaching you how to code. Nope, for that you're on your own. My intention is to show you a certain type of data science, and what you can do without needing to code at all. In fact, having to code all of this up while you just sit there and do no coding is going to ruffle my feathers pretty darn quickly, so if I take it out on you, you'll know who is really to blame. Let me show you the shoddy function I used to convert from a DNA sequence to binary once I got my bronze swimming certificate equivalent in programming. It looks something like this...
    
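    // Convert a DNA sequence to a string of binary digits, two bits per base:
    // A = 00, C = 01, G = 10, T = 11. Any other character ( N, spaces, line breaks ) is skipped.
    function dna2binary( sequence )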
    {
    	var str = '';
    	for ( let char of sequence )
    	{
    		switch( char )
    		{
    			case 'A':
    			case 'a':
    				str += '00';
    				break;
    			case 'C':
    			case 'c':
    				str += '01';
    				break;
    			case 'G':
    			case 'g':
    				str += '10';
    				break;
    			case 'T':
    			case 't':
    				str += '11';
    				break;
    		}
    	}
    	return str;
    }
  
In this code we are replacing A, C, G, T with what can be known as words, in this case 00, 01, 10, and 11. In both data science and molecular biology, it matters which word goes with which letter. For example, if I were looking at the sequence ATGCCTTCTC ( it's the start of a delphinium flower gene! ), but converted it to binary and back with somebody else's set of binary functions, I'd have a good chance of getting something else. Something wrong. And it's gonna pee all in my cornflakes. ATGCCTTCTC might become CATGGAAGAG and then we're all done for if I really am a scientist ( I'm not ). But with this in mind, and as long as we stick to one set of functions, the sequence is fine and we can work with a DNA sequence and the same sequence in its primordial form of binary magic.

Yes, binary. Boooooriiiiiinggg!!! I know. Of course, my objective in data science is to bring these sequences alive in a way which we tend not to do yet, and we have to hit this binary problem head on because I don't want to deal with it any more than you do. In fact, if you want to be a scientist and you don't want to be a "data" scientist, you can instead be a life sciences scientist and spend all day trying to avoid ACGT sequences which are, well, not that far away from binary. The truth of it is, data science is more fun than that, and I'll show you, but not yet.

The above sequence beginning ATGCCTTCTC ( hint: if you click the FASTA link on the NCBI website you get a compacted sequence you can use in your data science ) is actually much longer. Genetic sequences can be many millions of base pairs long, so you're going to be looking at a lot of lines like this:
  
  ATGCCTTCTCTATACTTTCTACTCACCACCCTATTCATAGCCACCCTTGTCTTCCTCCTCCTTAACCTGC
  GCGGCTTCTTCTCTAAGCGTCACGGCCCCCTCCCCCTCCCTCCCGGTCCCAAGCCGTGGTCCGTCGTCGG
  AAACCTCCCCCACCTGGGACCCGTCCCCCACCATGCACTCGCGTCCCTCTCCCACATCTACGGACCGTTG
  ATGTACCTGAGGGTTGGATACGTGGACGTGGTGGTGGCAGCGTCGGCGGGCGTCGCGGCCAAGTTCTTGA
  AGGTCCATGACCTCAACTTCGCGAGCCGTCCCCCGAACTCCGGGGCTAAGTATATTGCTTATAATTATCA
  TCTTTTCTGAGAGGGATTAGATTCATATTCCCAGACTCTATATGGATATAGAAAAAGGATAGTGTCTTTT
  
Here! I just copy-pasted a little chunk from the start of the gene. This is mostly what goes on in molecular science and it's an important skill. You might not want to see it, but bear with me, here it is in binary too...
  
  0011100101111101110111001100011111110111000111010001010001010111001111010011
  0010010100010101111110110111110101110101110101111100000101111001100110100111
  1101111101110111000010011011010001101001010101011101010101011101010111010101
  1010110101010000100101101110101101011011011011011010000000010111010101010100
  0101111010100001010110110101010101000101001110010001110110011011010101110111
  0101010001001101110001101000010110111110001110110001011110001010101111101000
  1100011011101000011011101011101011101001001001101101101001101010011011011001
  1010010100001011110111111000001010110101001110000101110100000111110110011000
  1001011011010101010110000001110101101010100111000010110011001111100111110011
  0000111100110100101000110111101011101111111001110101011100111010001000010000
  1011101000101000111011111000101000001000111111101110011101000001001101110111
  1101110111110101000010100111001110100011100011111101010010010011101111011011
  000111011100110011101000110011001000000000001010001100101110110111111111
  
Yiiiikes!!! Believe me, not every programmer wants to be staring at binary code like this all day long, and I'm for sure one of the ones who doesn't. In fact, turning binary code into something else is one of the most rewarding experiences of coding, and you can put your feet up and forget all the nitty-gritty when you're done. But in the meantime it's gonna screw you up. In chapter 2 I'm going to show you that we can actually make sequences like binary come alive in a way that gets us looking at data differently, and this is the realm of data science which I will explore.
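Since the title promises "and back again", here is a rough sketch of the return trip, plus the cornflakes disaster from earlier. Treat it as an illustration only: the names `binary2dna`, `MY_TABLE` and `OTHER_TABLE` are made up for this page and may not match what is in my repository. The only thing carried over from `dna2binary` is the scheme A = 00, C = 01, G = 10, T = 11, and the "somebody else's" table is an imagined one that simply disagrees with it.

```javascript
// Assumed tables for illustration. MY_TABLE mirrors dna2binary: A=00, C=01, G=10, T=11.
const MY_TABLE = { '00': 'A', '01': 'C', '10': 'G', '11': 'T' };
// A hypothetical "somebody else's" table that gives each word a different letter.
const OTHER_TABLE = { '00': 'C', '01': 'G', '10': 'T', '11': 'A' };

// Read the binary string two digits at a time and look each word up in the table.
function binary2dna( bits, table = MY_TABLE )
{
	if ( bits.length % 2 !== 0 )
	{
		throw new Error( 'binary string must have an even number of digits' );
	}
	let str = '';
	for ( let i = 0; i < bits.length; i += 2 )
	{
		const word = bits.slice( i, i + 2 );
		const base = table[ word ];
		if ( base === undefined )
		{
			throw new Error( 'not a 2-bit word: ' + word );
		}
		str += base;
	}
	return str;
}

// ATGCCTTCTC written out by hand under the A=00, C=01, G=10, T=11 scheme.
const bits = '00111001011111011101';
console.log( binary2dna( bits ) );              // ATGCCTTCTC
console.log( binary2dna( bits, OTHER_TABLE ) ); // CATGGAAGAG
```

Same binary, two different tables, two different genes. Unlike `dna2binary`, which quietly skips anything that isn't A, C, G or T, this sketch throws an error on bad input, which is a design choice rather than a rule.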

Anyway, before I get to the good chapters I'm going to drop the link to my functions on GitHub so you can nick all my stuff and continue to lurk endlessly.

backwardmachine at GitHub

Here, have some delphiniums:

LC441150.1 Delphinium grandiflorum F3'H gene for flavonoid 3'-hydroxylase, complete cds https://www.ncbi.nlm.nih.gov/nuccore/LC441150.1?report=fasta

See that name? They're all boring like that. You can make them up yourself too. If you discover them. ( You might, really! ).
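If you do grab the FASTA text from the link above, you'll notice it isn't one long string: there's a description line starting with `>` and then the sequence chopped into short lines. Here is a minimal sketch of squashing that into a single sequence. The helper name `fasta2sequence` is hypothetical, the sample header is hand-built from the record title rather than copied from the site, and the sketch assumes the text holds just one record.

```javascript
// Hypothetical helper: turn FASTA text into one continuous sequence string.
// Assumes a single record; with several records the sequences would simply be joined together.
function fasta2sequence( fasta )
{
	return fasta
		.split( /\r?\n/ )                            // one entry per line
		.filter( line => !line.startsWith( '>' ) )   // drop the description line
		.map( line => line.trim() )                  // strip stray whitespace
		.join( '' );                                 // glue the sequence lines together
}

// Hand-built sample: a header plus the first two sequence lines shown earlier on this page.
const sample = [
	'>LC441150.1 Delphinium grandiflorum F3\'H gene',
	'ATGCCTTCTCTATACTTTCTACTCACCACCCTATTCATAGCCACCCTTGTCTTCCTCCTCCTTAACCTGC',
	'GCGGCTTCTTCTCTAAGCGTCACGGCCCCCTCCCCCTCCCTCCCGGTCCCAAGCCGTGGTCCGTCGTCGG'
].join( '\n' );

const sequence = fasta2sequence( sample );
console.log( sequence.length );         // 140
console.log( sequence.slice( 0, 10 ) ); // ATGCCTTCTC
```

Because `dna2binary` skips line breaks anyway, the joining matters less than dropping the header: a description line is full of letters like A, C, G and T that would otherwise sneak into your binary.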