Introduction to Genes and DNA
The DNA code is a sequence of chemicals that carries the information controlling how humans are built and how they function. It is a digital code, but quaternary rather than binary: the information is encoded in an ordered sequence of 4 different symbols called "bases", typically denoted A, C, G, and T.
- A: adenine
- C: cytosine
- G: guanine
- T: thymine
These 4 substances are the fundamental "bits" of information in the genetic code. They are often called "base pairs" because each position actually holds 2 substances, one on each strand, as discussed later. Everything else is built on top of this basis of 4 DNA digits.
The entirety of human DNA code, called the "human genome", is about 3 billion bases in total. Every human being has 2 copies of this code, one copy from each parent, so a human's cell DNA contains a total of around 6 billion bases. In computer terms, this is around 6 Gigabytes if each symbol is stored in one byte, or more like 1.5 Gigabytes if compacted, since each A/C/G/T base pair carries only 2 binary bits of information. DNA molecules are linear, twisted into a double helix, with a start and an end, and do not contain any cycles.
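The storage arithmetic can be checked with a short Python sketch. The 2-bit assignment for each base and the function names here are illustrative assumptions, not a standard file format; the point is only that 4 symbols fit in 2 bits, so 4 bases fit in one byte.

```python
# Map each base to a 2-bit value (this particular assignment is arbitrary).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(sequence):
    """Pack a string of A/C/G/T into bytes, 4 bases per byte."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def storage_bytes(num_bases, bits_per_base):
    """Bytes needed to store num_bases at the given bits per base."""
    return num_bases * bits_per_base // 8


bases_per_cell = 6_000_000_000
print(storage_bytes(bases_per_cell, 8))  # 6000000000 (one byte per symbol)
print(storage_bytes(bases_per_cell, 2))  # 1500000000 (two bits per symbol)
```

So the 6 billion bases in a cell come to roughly 1.5 billion bytes when packed at 2 bits each.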
Chromosomes: These 6 billion odd base pairs are split amongst 46 chromosomes. Each person gets 2 sets of chromosomes, 23 from each parent, to total 46 chromosomes per human cell. A chromosome is the largest form of a DNA molecule, holding a long sequence of DNA codes. Chromosomes differ in length, from tens of millions to hundreds of millions of base pairs each. Chromosomes are independent molecules of DNA, with the typical double helix, a start and an end, but no cycles. Chromosomes are physically large enough to be seen with high-power microscopes.
Genes: Each chromosome has subsequences of DNA bases that encode particular features, and these are called "genes". Thus genes are not independent molecules, but are subsequences within chromosomes. Genes vary widely in length. Genes are too small to be seen on a microscope, and are instead analyzed using indirect chemical, molecular, and computational methods. The Human Genome Project initially estimated the total number of distinct genes in the human genome at around 30,000, and later estimates of protein-coding genes are closer to 20,000.
So the hierarchy of terminology for genetic components is something like:
- Base pair: the smallest element, a single base-4 DNA digit A, C, G, or T.
- Gene: a medium-size sequence, typically thousands to hundreds of thousands of DNA base pairs, like a sub-module
- Chromosome: a large sequence of up to hundreds of millions of DNA base pairs, like a computer program file
- Human genome: the entirety of human DNA program code: 2 sets of 23 distinct chromosomes, adding to around 6 billion DNA base pairs
Every individual has a unique genetic program, though all human DNA shares much common code too. A lot of genes and other DNA subsequences are modified or move around within the DNA of a species, such as when they are inherited from parents at conception. DNA does not usually change within a particular individual's body, though this can occur rarely from cell mutations (e.g. some cancer cells) and also genetic damage such as from radiation or toxic chemical exposure.
Each person has 46 chromosomes, arranged in 23 pairs, with one chromosome of each pair from each parent. So there are really 23 distinct chromosomes, and each body cell effectively has 2 different copies of the DNA code, one from each parent.
Each chromosome is distinct and whole. The sequence within each is ordered, with a clear start and end. In a sense, a chromosome is like a file of computer code.
The 23 distinct chromosomes are known and named and have a common structure in every human. The first 22 chromosomes are simply numbered, chromosome 1 through chromosome 22. Each of these 22 chromosomes is called an "autosome".
The 23rd chromosome is the sex chromosome, which is called either "X" or "Y". Every person has a pair of sex chromosomes, one from each parent. However, unlike the other 22 pairs of chromosomes, the 2 sex chromosomes are not necessarily similar. A male has a pair of different chromosomes, an X and a Y, usually written as XY. A female has two X chromosomes, written as XX.
The key to understanding chromosomes is their role in reproduction. Firstly, let's make some observations about reproduction:
- Children are similar to both parents, with similar traits, but are not identical to either parent.
- Siblings look different, despite sharing the same parents.
- Male and female children occur in about a 50-50 split.
To understand these features, we have to understand how chromosomes are distributed during reproduction. Every person has 46 chromosomes, 23 from the father and 23 from the mother. But the father has 46 chromosomes himself, and so does the mother. Each sperm cell from the father gets only 23 of his chromosomes, and similarly each egg cell gets 23 chromosomes from the mother's set of 46. For each autosome 1..22, the gamete (sperm or egg) gets one chromosome of the pair, chosen randomly, without regard to which grandparent the chromosome originally came from. For the sex chromosomes, the egg cell gets one of the mother's X chromosomes, and the sperm gets either the X or the Y from the father, which is why male and female children occur in about a 50-50 split. Hence, the number of possible combinations of chromosomes in a father's sperm cell is 2^23 (about 8 million), and similarly the number of egg chromosome combinations is also 2^23. So even with the same parents, and even with only entire chromosomes inherited, the number of genetically distinct children that can be created is about 2^23 x 2^23 = 2^46, or roughly 70 trillion.
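The counting argument above can be reproduced in a few lines of Python. This is only a sketch of the whole-chromosome model described in the text; the function name and the 0/1 labels for the two copies of each chromosome are assumptions made for illustration.

```python
import random

NUM_CHROMOSOME_PAIRS = 23


def make_gamete(rng):
    """Pick one chromosome from each of the 23 pairs, independently."""
    # 0 = the copy this parent got from their mother, 1 = from their father
    return tuple(rng.randint(0, 1) for _ in range(NUM_CHROMOSOME_PAIRS))


# Two choices for each of 23 pairs, for each parent.
gametes_per_parent = 2 ** NUM_CHROMOSOME_PAIRS
sibling_combinations = gametes_per_parent ** 2

print(gametes_per_parent)    # 8388608
print(sibling_combinations)  # 70368744177664
```

Calling make_gamete(random.Random()) returns one random selection of 23 chromosomes, and two such selections (one per parent) describe one possible child under this simplified model.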
However, chromosomes are also changed during reproduction, and these changes are a natural part of the process. Small or large chunks of chromosome material are swapped between the two chromosomes of a pair during reproductive cell creation. This is called crossover. Thus, the total number of possibilities is far greater than the number from simply choosing whole chromosomes.
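As a toy illustration of crossover, the Python sketch below swaps the tails of two equal-length sequences at a single point. Real crossover can happen at several points along a chromosome and is a physical process in the cell; the function name, the single crossover point, and the example sequences are simplifying assumptions.

```python
def crossover(maternal, paternal, point):
    """Swap the tails of two equal-length sequences at index `point`."""
    if len(maternal) != len(paternal):
        raise ValueError("sequences must have equal length")
    first = maternal[:point] + paternal[point:]
    second = paternal[:point] + maternal[point:]
    return first, second


a, b = crossover("AAAAAAAA", "CCCCCCCC", 3)
print(a)  # AAACCCCC
print(b)  # CCCAAAAA
```

Because the crossover point can fall almost anywhere, each pair of chromosomes can yield far more than 2 distinct results.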
Non-Gene DNA Sequences
Genes are the best understood subsequence of DNA code. Most genes clearly encode the data sequence representing a particular protein. However, all of the genes together are only a small part of DNA code. The 30,000 odd genes in human DNA might only make up 4% of human DNA.
So what is the other DNA code for? These DNA sequences are the least understood of all genetic issues. The main theory is that these DNA sequences are the control mechanisms, that control when particular genes are activated. If the genes are the data sequences for proteins, the remainder must be the real code. This code presumably controls when the genes are activated, so that human growth follows its normal timetables. It probably also controls how much a gene is activated, controlling how much of each protein is produced by a gene.
DNA and RNA
There are actually 2 main types of nucleic acids within cells that process information. DNA is the basic form within chromosomes, and is hard-coded into every cell. RNA is a more temporary form that is used to process subsequences of DNA messages. RNA is an intermediate form used to execute the portions of DNA that a cell is using. For example, in the synthesis of proteins, DNA is copied to RNA, which is then used to create proteins: DNA->RNA->Proteins.
The structures of DNA and RNA are very similar. They are both ordered sequences of 4 types of substances: ACGT for DNA, and ACGU for RNA. Thus RNA uses the same three ACG substances, but uses U (uracil) instead of T (thymine). The molecules uracil and thymine are only slightly different chemically. In DNA, there is pairing between AT and CG, and in RNA, the pairings are AU and CG, but since RNA is not double-stranded, this pairing is much rarer. Hence, RNA has the 4 substances:
- A: adenine
- C: cytosine
- G: guanine
- U: uracil
Typically, RNA is created from DNA, and this is done by faithfully copying the sequence of bases, with the only change being that T is converted to U. Hence, an RNA copy of a DNA sequence encodes the identical information, though it uses a slightly different set of 4 substances.
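At the level of symbols, this copying step is just a substitution of U for T. A minimal Python sketch follows, with a hypothetical function name; it ignores the chemical detail that the cell builds the RNA by pairing against the opposite DNA strand, and models only the resulting sequence.

```python
def transcribe(dna):
    """Return the RNA spelling of a DNA sequence (T becomes U)."""
    dna = dna.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```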
There are also several differences between DNA and RNA. The underlying sugar molecule that carries the 4 bases is different: deoxyribose in DNA, ribose in RNA. DNA is two strands wrapped in a double helix, but RNA is usually a single strand.
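Because DNA bases pair A with T and C with G, one strand of the double helix fully determines the other. The Python sketch below, with illustrative function names, computes the partner strand; the second function also reverses it, since the two strands run in opposite directions.

```python
# DNA pairing rule: A pairs with T, C pairs with G.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand):
    """Return the base that pairs with each position, in the same order."""
    return "".join(DNA_PAIR[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand read in its own start-to-end direction."""
    return complement_strand(strand[::-1])


print(complement_strand("ATGC"))   # TACG
print(reverse_complement("ATGC"))  # GCAT
```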
Genes: Protein Data Sequences in the DNA Code
Some parts of DNA sequences are known to be purely data. These are the "genes". The best understood aspect of DNA coding is the encoding of amino acid information in genes that is used by the body to synthesize proteins. These are data blocks that represent protein structures.
All proteins are substances made up of only 20 basic building blocks called amino acids. Proteins are ordered sequences of these 20 amino acids. In another terminology, a short chain of amino acids is a "peptide" and a protein is a long chain called a "polypeptide".
So how does DNA encode the structure of a protein? It uses triplets of base pairs. There are 4x4x4=64 possible combinations in a base pair triplet, and only 20 amino acids. Some extra codes are used as start and stop signal markers at each end of the data sequence. Other triplets are mapped so that more than one triplet can represent a particular amino acid. However, the mapping is unambiguous: each triplet maps to exactly one of the 20 amino acids. The triplets below are written in their RNA form, with U in place of T:
- 1. Phenylalanine (Phe): UUU, UUC
- 2. Leucine (Leu): UUA, UUG, CUU, CUC, CUA, CUG
- 3. Isoleucine (Ile): AUU, AUC, AUA
- 4. Methionine (Met): AUG
- 5. Valine (Val): GUU, GUC, GUA, GUG
- 6. Serine (Ser): UCU, UCC, UCA, UCG, AGU, AGC
- 7. Proline (Pro): CCU, CCC, CCA, CCG
- 8. Threonine (Thr): ACU, ACC, ACA, ACG
- 9. Alanine (Ala): GCU, GCC, GCA, GCG
- 10. Tyrosine (Tyr): UAU, UAC
- 11. Histidine (His): CAU, CAC
- 12. Glutamine (Gln): CAA, CAG
- 13. Asparagine (Asn): AAU, AAC
- 14. Lysine (Lys): AAA, AAG
- 15. Aspartic acid (Asp): GAU, GAC
- 16. Glutamic acid (Glu): GAA, GAG
- 17. Cysteine (Cys): UGU, UGC
- 18. Tryptophan (Trp): UGG
- 19. Arginine (Arg): CGU, CGC, CGA, CGG, AGA, AGG
- 20. Glycine (Gly): GGU, GGC, GGA, GGG
In addition, the following triplet codes are special:
- STOP: UAA, UAG, UGA
- START: AUG (same code as the Methionine amino acid)
Clearly, there are not unique 1-1 mappings of triplets to amino acids. However, although there is redundancy, it is not ambiguous. Any triplet can represent only 1 amino acid.
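The "redundant but not ambiguous" property can be demonstrated by writing the standard codon assignments as a lookup table in Python. This is a sketch: the dictionary layout and the translate function are illustrative choices, and the simple rule of starting at the first AUG and reading until a STOP ignores many details of real cells.

```python
# Amino acid (three-letter abbreviation) -> RNA triplets, standard genetic code.
CODONS_BY_AMINO_ACID = {
    "Phe": "UUU UUC",
    "Leu": "UUA UUG CUU CUC CUA CUG",
    "Ile": "AUU AUC AUA",
    "Met": "AUG",
    "Val": "GUU GUC GUA GUG",
    "Ser": "UCU UCC UCA UCG AGU AGC",
    "Pro": "CCU CCC CCA CCG",
    "Thr": "ACU ACC ACA ACG",
    "Ala": "GCU GCC GCA GCG",
    "Tyr": "UAU UAC",
    "His": "CAU CAC",
    "Gln": "CAA CAG",
    "Asn": "AAU AAC",
    "Lys": "AAA AAG",
    "Asp": "GAU GAC",
    "Glu": "GAA GAG",
    "Cys": "UGU UGC",
    "Trp": "UGG",
    "Arg": "CGU CGC CGA CGG AGA AGG",
    "Gly": "GGU GGC GGA GGG",
    "STOP": "UAA UAG UGA",
}

# Invert to triplet -> amino acid. Every triplet appears exactly once,
# so the inverted table has all 64 keys with no collisions.
CODON_TABLE = {
    codon: amino
    for amino, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}
assert len(CODON_TABLE) == 64


def translate(mrna):
    """Translate from the first AUG up to (not including) the first STOP."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "STOP":
            break
        protein.append(amino)
    return protein


print(translate("GGAUGUUUGCCUGGUAAGG"))  # ['Met', 'Phe', 'Ala', 'Trp']
```

Several keys in CODON_TABLE share the same value (redundancy), but no key has two values (no ambiguity).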
Why this redundancy? Perhaps there is some meaning to it? Perhaps simply a primitive form of error prevention? Perhaps it is simply an accident of nature that occurs because 3 digits were needed, since 2 DNA digits could only encode 4x4=16 codes, which is not enough to represent the 20 amino acids and start/stop codes.
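The arithmetic behind the last suggestion can be checked directly. In this Python sketch the function name is hypothetical, and the count of 21 meanings assumes 20 amino acids plus at least one stop signal.

```python
def min_codon_length(num_symbols, num_meanings):
    """Smallest word length whose number of codes covers num_meanings."""
    length = 1
    while num_symbols ** length < num_meanings:
        length += 1
    return length


# 20 amino acids plus at least one stop signal
print(min_codon_length(4, 21))  # 3
```

With 4 symbols, words of length 2 give only 16 codes, so length 3 (64 codes) is the shortest that fits, leaving spare codes that show up as redundancy.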
This DNA encoding appears to be almost the same for all genetics on the planet. A few species of single-celled protists have slightly different codes.
The DNA data sequences are of varying length depending on the size of the protein. Proteins can range from tiny proteins with about 50 amino acids to huge proteins with 5,000 amino acids.
The DNA start and stop sequences are not the same as the RNA start and stop triplets. DNA has a promoter sequence to show where RNA should start to be copied, and a terminator sequence to tell RNA where to stop. The RNA then uses only a single triplet as the start and stop markers. The DNA promoter and terminator sequences are more complex.
Introns: Surprisingly, not all of the DNA code within a gene is used. Certain sequences called "introns" are simply ignored. These are like comments in protein coding sequences. They are transcribed to mRNA along with the rest of the gene, but are then excised to produce the final mRNA. The resulting mRNA has the same order and codes as the original mRNA, but with the intron sequences removed.
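Removing introns is much like stripping comments from source code. The Python sketch below assumes the intron positions are already known and passed in as index ranges; the function name and the example sequence are made up for illustration, and the way a real cell finds intron boundaries is not modeled.

```python
def splice(pre_mrna, introns):
    """Remove intron spans given as (start, end) index pairs, end exclusive."""
    kept = []
    position = 0
    for start, end in sorted(introns):
        if start < position or start > end or end > len(pre_mrna):
            raise ValueError("intron spans must be ordered and non-overlapping")
        kept.append(pre_mrna[position:start])
        position = end
    kept.append(pre_mrna[position:])
    return "".join(kept)


# The intron is written in lowercase here only to make it visible.
print(splice("AUGGCCguaaguuuagUGGUAA", [(6, 16)]))  # AUGGCCUGGUAA
```

The remaining pieces keep their original order, which matches the description above.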
RNA Data Sequences in DNA
Proteins are not the only substances that are synthesized directly from data within the DNA. Some forms of RNA are specialized, and also have their formula encoded directly in digital DNA formulae.
Not all types of RNA are temporary intermediate forms with their form depending on whatever DNA they are copying. There are certain forms of RNA that have a particular form that is the same across all individuals. Some of these special-purpose RNA forms are:
- tRNA: transfer RNA
- rRNA: ribosomal RNA
There is at least one form of tRNA for each of the 20 amino acids, each transferring that particular amino acid. tRNA molecules contain about 75-80 bases. A tRNA recognizes triplets among the 64 possibilities, and matches them to one of the 20 amino acids. Since there are fewer tRNA types than triplets, some tRNA molecules have to recognize more than one triplet as a match.
The DNA code contains multiple repetitions of the codes for tRNA and rRNA. About 280 copies are spread over 5 chromosomes. Presumably, this allows each cell to make multiple copies of tRNA and rRNA molecules at once from its single copy of the DNA.
Executing the DNA Program: Parallel Execution
Every cell has a full copy of the entire DNA, complete with around 6 billion DNA base pairs jammed into the cell's nucleus. Whenever cells divide to replicate, they duplicate the entire DNA code so that each cell retains a full DNA copy.
The only cells that do not have the entire DNA code are reproductive sperm or egg cells that have only 23 chromosomes each, and thus only about a half copy of DNA.
- DNA is digital, but is quaternary, not binary.
- DNA is a base-4 code using the digits A, C, G and T.
- Proteins are a base-20 code using the 20 amino acids.
- DNA represents a protein as an ordered sequence of base-4 triplets, mapping 64 possible values to 20 amino acids.
- Comments: Some DNA sequences are ignored: introns