DNA to binary and back again
Adventures in Data Science - Chapter I
Possibly one of the worst ways to start teaching data science is to hit people right off the bat
with a big segment on binary code. Unfortunately, you found my page and
that's exactly what's gonna happen now.
I am, however, going to include some DNA swagger for some added kick just to make sure you're on the
right tracks, because I have no intention of
teaching you how to code. Nope, for that you're on your own. My intention is to show you a
certain type of data science, and what you can do without needing to code at all.
In fact, having to code all of this up while you just sit there and do no coding is
going to ruffle my feathers pretty darn quickly, so if I take it out on you, you'll know who is
really to blame. Let me show you the shoddy function I used to convert from a DNA sequence
to binary once I got my bronze swimming certificate equivalent in programming. It looks something
like this:
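    function dna2binary( sequence )
    {
        var str = '';
        // Swap each base for its two binary digits; anything else is skipped.
        for ( let char of sequence )
        {
            switch( char )
            {
                case 'A': str += '00'; break;
                case 'C': str += '01'; break;
                case 'G': str += '10'; break;
                case 'T': str += '11'; break;
            }
        }
        return str;
    }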
In this code we are replacing A, C, G and T with a pair of binary digits each:
in this case 00, 01, 10 and 11.
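The title promises a trip back again, so here is a minimal sketch of the reverse direction. The name binary2dna and the lookup object are assumptions made for illustration and may not match what lives in the repository; the only thing taken from the text is the pairing A = 00, C = 01, G = 10, T = 11.

```javascript
// Hypothetical inverse of dna2binary; the name binary2dna is an assumption.
function binary2dna( binary )
{
    // Same pairing as the text: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
    const bases = { '00': 'A', '01': 'C', '10': 'G', '11': 'T' };
    let str = '';
    // Walk the string two digits at a time; a leftover single digit is ignored.
    for ( let i = 0; i + 1 < binary.length; i += 2 )
    {
        // Pairs that are not valid binary are skipped rather than guessed at.
        str += bases[ binary.slice( i, i + 2 ) ] || '';
    }
    return str;
}

console.log( binary2dna( '00111001' ) ); // ATGC
```

Reading two digits at a time is what makes the trip back possible: every base costs exactly two digits, so nothing needs to mark where one letter ends and the next begins.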
In both data science and molecular biology, the order of these matters too: which letter
gets which code. For example, if
I were looking at the sequence ATGCCTTCTC (it's the start of a
delphinium flower gene)
but converted it to binary and back with somebody else's set of binary functions, I'd have a good chance
of getting something else. Something wrong.
And it's gonna pee all in my cornflakes.
ATGCCTTCTC might become CATGGAAGAG, and then we're all done for if I really am a scientist (I'm not).
But with this in mind the sequence is fine and we can work with a DNA sequence and the same sequence
in its primordial form of binary magic. Yes, binary.
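To see how the mix-up happens, here is a rough sketch with two lookup tables. Both tables and the helper names encode and decode are made up for this illustration (neither comes from the repository); the second table is simply one pairing that happens to turn ATGCCTTCTC into CATGGAAGAG.

```javascript
// Two hypothetical lookup tables. "mine" follows the pairing in the text;
// "theirs" is an invented alternative that assigns the codes differently.
const mine   = { A: '00', C: '01', G: '10', T: '11' };
const theirs = { C: '00', G: '01', T: '10', A: '11' };

// Encode a DNA string with whichever table is passed in.
function encode( sequence, table )
{
    let str = '';
    for ( const char of sequence ) str += table[ char ] || '';
    return str;
}

// Decode by flipping the table around and reading two digits at a time.
function decode( binary, table )
{
    const lookup = {};
    for ( const base in table ) lookup[ table[ base ] ] = base;
    let str = '';
    for ( let i = 0; i + 1 < binary.length; i += 2 )
    {
        str += lookup[ binary.slice( i, i + 2 ) ] || '';
    }
    return str;
}

const binary = encode( 'ATGCCTTCTC', mine );
console.log( decode( binary, mine ) );   // ATGCCTTCTC
console.log( decode( binary, theirs ) ); // CATGGAAGAG
```

The binary in the middle is identical in both cases. Only the table used on the way back changes, which is why the pairing has to travel with the data.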
Boooooriiiiiinggg!!! I know. Of course, my objective in data science is to bring these sequences alive
in a way which we tend not to do yet, and we have to hit this binary problem head on because I
don't want to deal with it any more than you do. In fact, if you want to be a scientist and you don't
want to be a "data" scientist, you can instead be a life sciences scientist and spend all
day trying to avoid ACGT sequences which are, well, not that far away from binary.
The truth of it is, data science is more fun than that, and I'll show you, but not yet.
The above sequence beginning ATGCCTTCTC (hint: if you click the FASTA link on the NCBI website you
get a compacted sequence you can use in your data science) is actually much longer. Genetic sequences
can be many millions of base pairs long, so you're going to be looking at a lot of lines like
Here! I just copy pasted a little chunk from the start of the gene. This is mostly
what goes on in molecular science and it's an important skill. You might not
want to see it, but bear with me, here it is in binary too...
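If you'd rather cook up the binary yourself, here is a rough sketch of going from FASTA text to binary. It assumes the usual FASTA layout (one header line starting with ">", then the sequence wrapped over several lines). The helper name fastaToSequence is hypothetical, and the record below is cut down to the ten bases quoted in this chapter, split over two lines only to mimic the wrapping.

```javascript
// Same pairing as the text: A = 00, C = 01, G = 10, T = 11.
function dna2binary( sequence )
{
    const codes = { A: '00', C: '01', G: '10', T: '11' };
    let str = '';
    for ( const char of sequence ) str += codes[ char ] || '';
    return str;
}

// Hypothetical helper: drop header lines and glue the sequence lines together.
function fastaToSequence( text )
{
    return text
        .split( /\r?\n/ )
        .map( line => line.trim() )
        .filter( line => line !== '' && !line.startsWith( '>' ) )
        .join( '' )
        .toUpperCase();
}

// Only the first ten bases, as quoted in the chapter; the real record is much longer.
const fasta = [
    ">LC441150.1 Delphinium grandiflorum F3'H gene for flavonoid 3'-hydroxylase, complete cds",
    'ATGCC',
    'TTCTC'
].join( '\n' );

const sequence = fastaToSequence( fasta );
console.log( sequence );               // ATGCCTTCTC
console.log( dna2binary( sequence ) ); // 00111001011111011101
```

Ten letters already turn into twenty digits, so a gene that is thousands of bases long fills the screen very quickly.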
Yiiiikes!!! Believe me, not every programmer wants to be staring at binary code like this all
day long, and I'm sure one of them. Infact, turning binary code into something else
is one of the most rewarding experiences of coding, and you can put your feet up
and forget all the nitty gritty when you're done. But in the meantime it's gonna screw you up.
In chapter 2 I'm going to show
you that we can actually make sequences like binary come alive in a way which
starts us looking at data in a different way, and this is the realm of data science
which I will explore.
Anyway, before I get to the good chapters I'm going to drop the link to my
functions on GitHub so you can nick all my stuff and continue to lurk endlessly.
backwardmachine at Github
Here have some delphiniums:
LC441150.1 Delphinium grandiflorum F3'H gene for flavonoid 3'-hydroxylase, complete cds
See that name? They're all boring like that. You can make them up yourself too. If you discover them.
( You might, really! ).