Adventures in Data Science - Chapter I

DNA to binary and back again

Possibly one of the worst ways to start teaching data science is to hit people right off the bat with a big segment on binary code. Unfortunately, you found my page and that's exactly what's gonna happen now. I am, however, going to include some DNA swagger for some added kick just to make sure you're on the right track, because I have no intention of teaching you how to code. Nope, for that you're on your own. My intention is to show you a certain type of data science, and what you can do without needing to code at all. In fact, me having to code all of this up while you just sit there doing no coding is going to ruffle my feathers pretty darn quickly, so if I take it out on you, you'll know who is really to blame. Let me show you the shoddy function I used to convert a DNA sequence to binary once I got my bronze swimming certificate equivalent in programming. It looks something like this...
    function dna2binary( sequence ) {
    	var str = '';
    	for ( let char of sequence ) {
    		// Each base becomes a fixed two-bit word; anything else is skipped.
    		switch( char ) {
    			case 'A':
    			case 'a':
    				str += '00';
    				break;
    			case 'C':
    			case 'c':
    				str += '01';
    				break;
    			case 'G':
    			case 'g':
    				str += '10';
    				break;
    			case 'T':
    			case 't':
    				str += '11';
    				break;
    		}
    	}
    	return str;
    }
In this code we are replacing A, C, G, T with what are known as words, in this case 00, 01, 10, and 11. In both data science and molecular biology, the order of these matters too. For example, if I were looking at the sequence ATGCCTTCTC ( it's the start of a delphinium flower gene! ), but converted it to binary with my functions and back with somebody else's set of binary functions, I have a good chance of getting something else. Something wrong. And it's gonna pee all in my cornflakes. ATGCCTTCTC might become CATGGAAGAG and then we're all done for if I really am a scientist ( I'm not ). But with this in mind the sequence is fine, and we can work with a DNA sequence and the same sequence in its primordial form of binary magic.

Yes, binary. Boooooriiiiiinggg!!! I know. Of course, my objective in data science is to bring these sequences alive in a way which we tend not to do yet, and we have to hit this binary problem head on because I don't want to deal with it any more than you do. In fact, if you want to be a scientist and you don't want to be a "data" scientist, you can instead be a life sciences scientist and spend all day trying to avoid ACGT sequences which are, well, not that far away from binary. The truth of it is, data science is more fun than that, and I'll show you, but not yet.

The above sequence beginning ATGCCTTCTC ( hint: if you click the FASTA link on the NCBI website you get a compacted sequence you can use in your data science ) is actually much longer. Genetic sequences can be many millions of base pairs long, so you're going to be looking at a lot of lines like this:
Here! I just copy-pasted a little chunk from the start of the gene. This is mostly what goes on in molecular science and it's an important skill. You might not want to see it, but bear with me, here it is in binary too...
Yiiiikes!!! Believe me, not every programmer wants to be staring at binary code like this all day long, and I sure am one of them. In fact, turning binary code into something else is one of the most rewarding experiences of coding, and you can put your feet up and forget all the nitty-gritty when you're done. But in the meantime it's gonna screw you up. In chapter 2 I'm going to show you that we can actually make sequences like binary come alive in a way which gets us looking at data differently, and this is the realm of data science which I will explore.
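
To make the earlier point about word order concrete, here is a minimal sketch of the return trip from binary to DNA. It is not necessarily the code in the repository: `encodeDna`, `binary2dna`, and the three lookup tables are names assumed for illustration, and `OTHER_TABLE` stands in for somebody else's decoder that happens to use a different word order.

```javascript
// Hypothetical lookup tables; the A, C, G, T order matches dna2binary above.
const BASE_TO_WORD = { A: '00', C: '01', G: '10', T: '11' };
const WORD_TO_BASE = { '00': 'A', '01': 'C', '10': 'G', '11': 'T' };

// Encode a DNA string as two bits per base.
function encodeDna( sequence, table = BASE_TO_WORD ) {
	let bits = '';
	for ( const char of sequence.toUpperCase() ) {
		if ( !( char in table ) ) throw new Error( 'Unexpected base: ' + char );
		bits += table[ char ];
	}
	return bits;
}

// Decode two bits at a time back into bases.
function binary2dna( bits, table = WORD_TO_BASE ) {
	if ( bits.length % 2 !== 0 ) throw new Error( 'Odd number of bits' );
	let sequence = '';
	for ( let i = 0; i < bits.length; i += 2 ) {
		const word = bits.slice( i, i + 2 );
		if ( !( word in table ) ) throw new Error( 'Unexpected word: ' + word );
		sequence += table[ word ];
	}
	return sequence;
}

const bits = encodeDna( 'ATGCCTTCTC' );
console.log( bits );                            // 00111001011111011101
console.log( binary2dna( bits ) );              // ATGCCTTCTC

// Somebody else's decoder, same words but assigned to different bases.
const OTHER_TABLE = { '00': 'C', '01': 'G', '10': 'T', '11': 'A' };
console.log( binary2dna( bits, OTHER_TABLE ) ); // CATGGAAGAG
```

The same twenty bits decode to two different sequences depending on which table you hand over, which is why the encoder and decoder have to agree on the word order before anything else.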

Anyway, before I get to the good chapters I'm going to drop the link to my functions on GitHub so you can nick all my stuff and continue to lurk endlessly.

backwardmachine at GitHub

Here, have some delphiniums:

LC441150.1 Delphinium grandiflorum F3'H gene for flavonoid 3'-hydroxylase, complete cds

See that name? They're all boring like that. You can make them up yourself too. If you discover them. ( You might, really! ).