definitions - DNA SEQUENCE
1. (MeSH) The sequence of PURINES and PYRIMIDINES in nucleic acids and polynucleotides. It is also called nucleotide sequence.
synonyms - DNA SEQUENCE
DNA sequence
Electropherogram printout from automated sequencer for determining part of a DNA sequence
A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology.
The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand (adenine, cytosine, guanine, and thymine) covalently linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. Short sequences of nucleotides are referred to as oligonucleotides and are used in a range of laboratory applications in molecular biology. With regard to biological function, a DNA sequence may be considered sense or antisense, and either coding or noncoding. DNA sequences can also contain "junk DNA."
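The letter convention described above can be sketched in a few lines of Python. This is only an illustration: the names `COMPLEMENT`, `is_dna` and `reverse_complement` are hypothetical helpers, not part of any library. It checks that a string uses only the four letters and derives the opposite (antisense) strand, again written 5' to 3', using the A-T and G-C pairing.

```python
# Hypothetical helper names; only the A/C/G/T alphabet and the 5'-to-3'
# reading convention are taken from the text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_dna(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the letters A, C, G, T."""
    return bool(seq) and set(seq.upper()) <= set(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, also written 5' to 3'."""
    if not is_dna(seq):
        raise ValueError(f"not a plain DNA sequence: {seq!r}")
    # Walk the strand backwards and swap each base for its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("AAAGTCTGAC"))  # GTCAGACTTT
```

Reversing is needed because the two strands run in opposite directions, so the complement read left to right would otherwise be 3' to 5'.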
Sequences can be derived from the biological raw material through a process called DNA sequencing.
In some special cases, letters besides A, T, C, and G are present in a sequence. These letters represent ambiguity: of all the molecules sampled, there is more than one kind of nucleotide at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are as follows: [1]
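As an illustration of how ambiguity letters can be handled in practice, the sketch below encodes the standard IUPAC nucleotide codes as a lookup table and tests whether a concrete sequence is compatible with an ambiguous one. The table reflects the standard IUPAC assignments; the names `IUPAC` and `matches` are hypothetical.

```python
# Standard IUPAC nucleotide codes: each letter maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def matches(pattern: str, seq: str) -> bool:
    """True if every base of seq is allowed by the IUPAC code at that position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), seq.upper()))


print(matches("ARN", "AGT"))  # True: R allows G, N allows any base
print(matches("ARN", "ACT"))  # False: R does not allow C
```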
Published: 23rd March, 2015 Last Edited: 23rd March, 2015
This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
The entire human genome was sequenced by the Human Genome Project, an international effort begun in 1990 and completed in 2003. Although the project successfully sequenced the whole human genome and identified 20,000-25,000 genes, a large fraction of the human genome still has no identified function. These regions are referred to as non-coding DNA sequences. Few studies have been conducted on the non-coding DNA sequences of the human genome because they were formerly considered "junk DNA". This is the motivation for carrying out research on them. Genomic comparisons have identified many non-coding sequences shared across species. It can thus be conjectured that there is some reason for non-coding regions to have been maintained over long evolutionary periods, and it has been suggested that conserved non-coding sequences may be functional. This is the main motivation for selecting conserved non-coding sequences of the human genome for this research.
A large portion of eukaryotic genomes consists of non-coding DNA sequences, in contrast to prokaryotic genomes. For this reason the non-coding DNA region has recently attracted a lot of attention. A significant factor is that non-coding sequences are found predominantly in eukaryotes, with their prevalence increasing significantly in vertebrates. There is no general agreement about why most eukaryotic cells contain more DNA than is required for their proteins. These non-coding DNA sequences do not contain instructions to make proteins, and their function is not yet known.
Besides this, a very interesting finding about the human genome is that coding sequences account for only about 2% of it, while the remaining 98% is non-coding DNA. Two different explanations for the variability in the amount of non-coding DNA have been proposed.
One explanation is that non-coding DNA has no function and exists due to mutation pressure.
A second explanation, following the functional theories, is that non-coding DNA has a sequence-independent function.
The amount of non-coding DNA varies greatly between genomes, even between closely related species. In prokaryotes the total length of non-coding DNA increases linearly with the total length of protein-coding DNA. Why so much non-coding DNA exists, and why its amount per genome varies so much, are central biological problems. This is why further studies in this area need to be carried out.
This dissertation is an attempt to look at the conserved non-coding DNA sequences in the human genome, which have not been analyzed much so far. The objective of the research is to analyze the distribution of conserved non-coding DNA sequences in the human genome. There must be some reason for these sequences to have been conserved over millions of years. DNA sequence comparison between different organisms provides a higher probability of identifying common signatures that may have functional implications. Mammals and fish, being the most evolutionarily distant extant vertebrates for which whole-genome information is available, provide highly valuable information for gathering conserved non-coding sequences. It is important to use orthologous sequences when performing cross-species DNA comparisons to recognize functional elements that are evolutionarily conserved. Conserved non-coding sequences found through a whole-genome comparison between human and zebrafish (Danio rerio) are used in this research. Zebrafish is an important model organism for the analysis of vertebrate development; the gene order and content of its mitochondrial DNA are similar to the common vertebrate form. There is extensive similarity between the zebrafish and human genomes, so many human development and disease genes have counterparts in the zebrafish. The findings of the research are expected to be significant since they will help us understand more about the human genome. They will also be valuable for biological experiments and further research conducted on the human genome.
The study could contribute to medical treatment, because drug designers need to explore the human genome to know how the body reacts to diseases. This is the key to designing a successful drug. When a disease attacks the body, the coding sequences may not be the only portion that works against it. There may be some interaction between the cellular immune system and non-coding sequences. Conserved blocks of DNA likely result from the sequence-specific constraints of DNA-protein interactions, although the relation between conserved sequences and functionally characterized binding sites is not exact. This is notable because some proteins produced from our coding sequences cause certain diseases, such as cancer. To learn how the body works, we need to understand the whole genome. Although we know the functionality of the coding DNA sequences, we know very little about the functionality of non-coding DNA sequences. Since they make up the majority of our genome, our knowledge of the human genome is poor. That is why studying the non-coding DNA sequences of the human genome is as important as analyzing its coding DNA sequences.
Aims and Objectives
The main objectives of the research are to:
Analyze the distribution of conserved non-coding sequences (CNSs) over the human genome to identify distribution patterns of the blocks, if any exist.
Identify whether there is any correlation between the conserved non-coding blocks and the closest gene.
A persistent execution environment is needed so that the algorithms can be tested without blocking access.
The Structure of the Report
The dissertation is organized as follows. Chapter 2 discusses the literature on the research area: the background needed to understand the problem and the details of the research done so far in the area.
Human Genome, Chromosome and Genes
The qualities that differentiate humans from other primates originated in specific DNA sequence changes in the human genome. Therefore the entire human genome has to be sequenced and analyzed to understand the characteristics of specific DNA sequences. The Human Genome Project, which sequenced the entire human genome for the first time, was completed in 2003 after a 13-year effort. According to its findings, more than 50% of the human genome shows sequence similarity to genes in other organisms. The human genome contains around 20,000-25,000 genes, which are located on DNA strands and distributed among the 23 pairs of chromosomes in a cell.
Humans are eukaryotes: organisms whose cells contain complex structures enclosed within membranes. Eukaryotes have more complex gene and genome structures than prokaryotes. The chromosomes of each eukaryotic cell are made up of:
DNA (Deoxyribonucleic acid)
Small amount of RNA (ribonucleic acid).
Chromosomes are made up of two DNA strands, and DNA sequences are made up of four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). The two strands are held together by hydrogen bonds between adenine and thymine and between guanine and cytosine.
A gene can be defined as a stretch of DNA, part of which is transcribed, that includes a promoter, a coding sequence and a signal for the RNA polymerase to stop. There need not be a unique promoter for each gene, because one promoter may act on several genes. Our body is able to produce different tissues because each cell expresses only a subset of its genes. These genes have different functionalities according to the functions of the organisms. So the same gene may appear on different chromosomes, but the function it provides may vary. Most genes are capable of making more than one protein, so there can be more proteins than genes in a genome.
Figure 1: Organization of the Human Genome 
Coding DNA Sequences
There have been many studies conducted on the human genome, but the main focus of most genomic studies is on genes that code for proteins or RNAs. DNA sequences that code for a protein or RNA are considered coding DNA sequences. Even though coding DNA sequences are considered the main part of the genome, they account for only a small proportion of the human genome, about 2%. ATG and AUG are the start codon (initiation codon) in DNA and RNA respectively. TAG, TAA and TGA denote the stop codons of a coding DNA sequence; in RNA sequences they are UAG, UAA and UGA.
Gene expression means that the information in a gene is used in the synthesis of a functional gene product. These products can be proteins, but in non-protein-coding genes such as rRNA genes or tRNA genes the product is a functional RNA. So in genetics, gene expression is the most fundamental level at which genotype gives rise to phenotype.
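The role of start and stop codons can be illustrated with a small open-reading-frame scan. This is a minimal sketch on a single strand, not a gene finder: `find_orfs` is a hypothetical name, the test sequence is made up, and TGA, the third stop codon of the standard genetic code, is included alongside TAG and TAA.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) pairs of open reading frames on the given strand.

    start is the index of the A in ATG; end is one past the stop codon,
    so seq[start:end] is the ORF including both start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # the three reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                       # first ATG opens the frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ATG
    return sorted(orfs)


toy = "CCATGAAATAGCC"        # made-up sequence for illustration
print(find_orfs(toy))        # [(2, 11)]
print(toy[2:11])             # ATGAAATAG
```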
There are two types of genes that code for polypeptides.
Structural genes - Code for functional proteins such as enzymes, hormones, antibodies, storage proteins and fibres.
Regulatory genes - Control the activities of other genes.
Non-coding DNA Sequences
Previously the protein-coding sequences were considered the only important fraction of the genome. But as studies on genomic sequences have been carried out, it has been discovered that coding sequences make up only a small fraction of the whole genome. A large portion of the genome consists of non-coding sequences, which were previously considered "junk DNA". Because non-coding sequences make up such a large fraction of the genome, researchers have become interested in them as well as in protein-coding sequences.
In the human genome, coding sequences cover only about 2% of the entire genome and the other 98% consists of non-coding DNA sequences. As figure 3 shows, this is the largest fraction of non-coding DNA among eukaryotes.
Figure 3: Proportions of Non-protein coding DNA of different genomes
Non-coding DNA can be separated into two parts:
Intronic regions (Introns)
In 1977 biologists discovered that the DNA of a eukaryotic gene is longer than its corresponding mRNA. Because mRNA is a direct copy of the DNA sequence, they had expected it to be the same length. It was then revealed that there are introns, which separate a gene into several pieces called exons. The exons are the coding parts and, as Figure 4 shows, the introns lie between the exons. The size and arrangement of introns are uneven and can be considered characteristics of a gene. The same pre-mRNA may have different introns removed in different cells. The gene therefore has alternative introns and can code for different proteins, which increases its potential use.
Example: Calcitonin gene
Two different forms of mRNA are produced by this gene, depending on which introns are removed on each occasion. One produces the protein calcitonin and the other produces CGRP (calcitonin gene-related peptide), which is similar to calcitonin.
Intergenic areas are the regions located between genes. The length of intergenic regions is not fixed and varies between chromosome locations.
Two concepts have been proposed about the origin of introns and intergenic regions.
The introns-early concept holds that (nearly) all introns in eukaryotic genes were inherited from the common ancestor of prokaryotes and eukaryotes. Differences in gene arrangement among homologous eukaryotic genes arise from different intron losses, as described by Figure 5.
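The way one gene can yield two transcripts can be sketched with exon coordinates. The example below is a toy: the sequence and coordinates are invented for illustration (they are not the calcitonin gene), `splice` is a hypothetical helper, and the transcripts are written in DNA letters rather than RNA.

```python
def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices of gene, given as 0-based half-open (start, end) pairs."""
    return "".join(gene[start:end] for start, end in sorted(exons))


# Toy gene: three exons (upper case) separated by two introns (lower case).
gene = "ATGGCC" + "gtaagtag" + "GATTAC" + "gtcccag" + "AGGTAA"
exon1, exon2, exon3 = (0, 6), (14, 20), (27, 33)

# Removing both introns keeps all three exons.
print(splice(gene, [exon1, exon2, exon3]))  # ATGGCCGATTACAGGTAA

# Removing a larger stretch (intron 1, exon 2 and intron 2) skips exon 2,
# giving a different transcript from the same gene.
print(splice(gene, [exon1, exon3]))         # ATGGCCAGGTAA
```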
Figure 2: Introns-early concept: Introns have been inherited by ancestor - Ancestor of prokaryotes and eukaryotes has had a lot of introns and the eukaryotes today have different structure of introns due to the loss of those introns
As Figure 6 describes, the introns-late concept argues instead that introns were a eukaryotic novelty and that new introns have been emerging continuously throughout eukaryotic evolution.
Figure 3: Introns were a eukaryotic novelty - Ancestor of prokaryotes and eukaryotes had no introns and through the eukaryotic evolution introns have been emerging continuously
Intron insertion may dominate during major evolutionary transitions, whereas intron loss dominates over short evolutionary distances. It has been proposed that the fraction of intron positions shared between species decreases as evolutionary distance increases. Accordingly, intron conservation could be useful as a phylogenetic marker, and intron positions have been used successfully as phylogenetic markers over shorter evolutionary distances.
Recent findings indicate that old introns are significantly over-represented in the 5'-portions of genes, whereas new introns are distributed much more uniformly. Moreover, in most of the genomes that are very rich in introns, introns are over-represented in the 3'-portions of the genes.
Since introns lie between exons, their position relative to the codons can be examined. Subsequent analysis has shown that introns located between two codons are, on average, located in more highly conserved portions of genes than introns located after the first or the second position in a codon.
The above findings are for eukaryotes; unlike in eukaryotes, non-coding DNA sequences make up only a small fraction of prokaryotic genomes.
Formerly it was thought that the complexity of a genome could be described by its size. But there is no association between genome size and complexity. As figure 7 shows, the genome sizes of many organisms have turned out to be not much different from that of the human. This shows that genome size does not correlate with the complexity of an organism.
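The three intron positions mentioned above are conventionally called phase 0 (between two codons), phase 1 (after the first position in a codon) and phase 2 (after the second position). A minimal sketch, assuming the number of coding bases upstream of the intron is already known; `intron_phase` is a hypothetical name:

```python
def intron_phase(coding_bases_before_intron: int) -> int:
    """Return the intron phase from the count of coding bases upstream of it.

    0: the intron lies between two codons
    1: the intron lies after the first position of a codon
    2: the intron lies after the second position of a codon
    """
    if coding_bases_before_intron < 0:
        raise ValueError("base count must be non-negative")
    return coding_bases_before_intron % 3


print([intron_phase(n) for n in (6, 7, 8)])  # [0, 1, 2]
```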
Figure 4. Genome size comparison
But among eukaryotes it is notable that the proportion of non-protein-coding DNA increases with genome complexity (Figure 8). Most functional theories therefore emphasize that cell size is adaptively important and that the relationship between genome size and cell volume is the key to explaining the continued presence of non-coding DNA.
Conserved DNA Sequences
Conservation of a DNA sequence indicates that the sequence is functionally constrained. Non-functional parts of the genome drift apart as species diverge from each other, because these parts accumulate mutations. Functional parts, however, remain recognizably similar between species over long periods of time. These parts may code for proteins, code for RNA, or act as regulatory regions such as enhancers, promoters and repressors. Functional DNA sequences are therefore likely to remain the same or very similar over millions of years of mutation pressure. This is the idea behind comparative genomic analyses between species: comparing DNA sequences from several species provides a way to identify common signatures that may have functional significance.
Conserved Non-coding DNA Sequences
The availability of full genome sequences has led to many comparative genomic studies. These studies have examined the non-coding part of genomes by genomic comparison. They have discovered that vertebrate genomes contain a large amount of non-coding DNA sequence that is strongly conserved across the phylogeny. Interpreting the functions of the non-coding sequences is a challenging but highly important problem facing genomic studies today.
Analysis of orthologous genes from animals and plants has discovered many shared non-coding DNA elements. Many intron positions are conserved in distant eukaryotes even though intron densities differ widely, yet the locations of introns in orthologous genes are not always the same in closely related species. About 50% of the DNA sequences that are conserved between human and mouse lie outside the coding sequences. Further, from genomic comparisons between human, mouse and dog, Frazer found that one half of the human/mouse conserved sequence was also conserved in the dog. The average percent pairwise sequence identity between human CNSs and the chimp, mouse, rat, chicken, frog, fugu, zebrafish and Tetraodon genomes is illustrated in figure 09. This is very interesting because it points to evolutionary constraints on non-coding DNA, which is therefore likely to be functional. Conserved non-coding regions are less likely to undergo mutations than random DNA sequences, which suggests that they are functionally significant. Conserved blocks of non-coding DNA likely result from the sequence-specific constraints of DNA-protein interactions, even though the association between functionally characterized binding sites and conserved non-coding DNA sequences has not yet been established. Researchers suggest that about 20%-30% of the nucleotide sites of eukaryotic genomes are expected to be conserved in functionally constrained non-coding regions.
Figure 5. Average percent pairwise sequence identity of human CNSs in comparisons with the chimp, mouse, rat, chicken, frog, fugu, zebrafish and Tetraodon genomes
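Percent pairwise sequence identity can be computed from two aligned sequences as the share of alignment columns in which both carry the same base. The sketch below is one common convention, stated as an assumption: columns where both sequences have a gap are ignored and a gap opposite a base counts as a mismatch. The name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which a and b carry the same base.

    a and b must already be aligned (equal length, gaps written as '-').
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if not (x == gap and y == gap)]
    if not columns:
        raise ValueError("no comparable columns")
    identical = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * identical / len(columns)


print(percent_identity("ACGTACGT", "ACGTTCGT"))  # 87.5
print(percent_identity("AC-T", "ACGT"))          # 75.0
```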
Regarding the distribution of CNSs, it is notable that the lengths of CNSs are distributed differently from the lengths of exons. CNSs are unevenly distributed throughout the genome and across chromosomes. This is a main characteristic of CNSs: observations show that their distribution is not uniform but highly clustered. Genomic regions that are very rich in genes may be poor in CNSs and, conversely, gene-poor areas may be rich in CNSs. Chromosome 21 is a good example: even though it is poor in genes, it is very rich in CNSs. A recent finding is that the amounts of CNSs and coding sequences are roughly equal on chromosome 21. Relative to chromosome length, chromosome 10 has the highest density of CNSs and chromosome X the lowest. Loots discovered that about one-half of the conserved elements are found in intergenic regions more than 10 kb from the closest known gene.
CNEs (conserved non-coding elements) are unlikely to be translated, because alignments of CNEs do not show the three-base periodicity that is characteristic of aligned coding sequences. These CNEs are therefore unlikely to be new protein-coding genes, but they could be novel RNA genes that are transcribed and not translated.
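The distance from a conserved element to its closest gene, as used in the 10 kb observation above, can be computed from interval coordinates. This is a sketch under stated assumptions: intervals are 0-based half-open (start, end) pairs on the same chromosome, the coordinates below are invented, and `distance_to_nearest_gene` is a hypothetical name.

```python
def distance_to_nearest_gene(cns: tuple[int, int],
                             genes: list[tuple[int, int]]) -> int:
    """Distance in bp from a CNS to the closest gene; 0 if they overlap."""
    if not genes:
        raise ValueError("no genes given")
    cns_start, cns_end = cns
    distances = []
    for gene_start, gene_end in genes:
        if gene_end <= cns_start:        # gene lies entirely upstream
            distances.append(cns_start - gene_end)
        elif gene_start >= cns_end:      # gene lies entirely downstream
            distances.append(gene_start - cns_end)
        else:                            # intervals overlap
            distances.append(0)
    return min(distances)


genes = [(1_000, 5_000), (40_000, 60_000)]   # made-up gene coordinates
d = distance_to_nearest_gene((20_000, 20_200), genes)
print(d, d > 10_000)  # 15000 True
```

For whole-genome data a sorted gene list with binary search would avoid scanning every gene for every element; the linear scan is kept here for readability.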
As interest in non-coding DNA sequences has grown recently, much research has been carried out to find the functionality and characteristics of these regions. As a result it has been discovered that conserved non-coding regions are strongly associated with genes that have critical roles during development [9, 10, 11, 12, 16, 18]. It is now generally accepted that conserved non-coding regions correspond to functional regulatory elements. Researchers believe CNSs are likely to function as negative or positive regulatory elements, either spatially or temporally. Indeed, a CNS may act as either an enhancer or a repressor of transcription, depending on the factors bound to it. However, the tissue or timing specificity of CNE enhancers is not known a priori.
Research has shown that elements conserved in a limited number of species are as likely to be functional as elements conserved in many species. There is a limited class of CNSs in the human genome that can be found only in a subset of mammals and that are shorter than CNSs common to other species. This limited class is a collection of rapidly evolving functional non-coding sequences, which suggests that these sequences may account for variation in gene expression between species.
Prabhakar suggests that changes in non-coding sequences may have contributed to the modifications in brain development and function that are responsible for uniquely human cognitive qualities. Not only the syntenic association of CNSs but also the distance between these elements might be important for controlling the expression of certain genes. A comparison between human and fugu (Takifugu rubripes) identified approximately 1400 CNSs that appear to be developmental regulators in vertebrates.
Loots says that large-scale analysis of the expression patterns of mammalian genes will classify sets of co-regulated genes that may share CNSs. The human genome may contain more than 1000 transcripts. Analyzing CNSs is therefore very important in identifying the regulatory circuits involved in human development and differentiation. Although a small fraction of the CNSs can be associated with transcriptional regulation, there remain a large number of CNSs with unexplained function.
An important demonstration of the function of conserved non-coding regions is a multi-species sequence comparison of the interleukin gene cluster. The selected regions were 1 Mb long, and deletion of a conserved non-coding element of 401 bp was shown to change interleukin expression in T cells of transgenic mice. Researchers at VIB (the Flanders Institute for Biotechnology) linked to K.U. Leuven and Harvard University say that non-coding DNA helps tune gene activity and enables organisms to adapt quickly to environmental changes. Dr. Craig Pikaard at Washington University and his research group have discovered that non-coding DNA acts as part of the cellular immune system by enhancing the ability of the cell to fight viruses and transposons.
The minimum amount of non-coding DNA that eukaryotes require increases quadratically with the amount of DNA located in exons. For the human genome the minimum non-coding amount is 5.4% of the whole genome length. Very short, specific CNSs have been discovered that give rise to ncRNAs (non-coding RNAs) such as microRNA and siRNA (small interfering RNA). These RNAs are involved in regulatory functions but have also been linked to genetic diseases in humans, as has already been shown for the Sonic Hedgehog (SHH) gene [14, 11]. The SHH gene controls a range of differentiation processes during vertebrate development, as well as patterning of the limb.
In the human genome the highest numbers of duplicated CNSs are located on chromosomes 18/19, 8/10 and 5/16. The frequencies with which these CNSs appear through the genome are not associated with the genomic distribution of paralogons. "Paralogs" are genes that are homologous to other genes in the same species, so they are likely to have originated from a common ancestral gene. Even though the highest numbers of human paralogs are on the chromosome pairs 1q/9q, 7q/17q, 2q/12q, 15q/18q, 1q/6q and 5q/15q, the CNS duplicates are normally clustered around 2p/14q, 8p/10q and 18p/19p/20p. McEwen reported 124 families of duplicated CNSs in the human genome, about 98% of which were assigned to single or multiple copies of paralogous genes. Their main role is related to transcription or development.
Since most CNSs are located in and around genes that act as developmental regulators, a number of studies have been carried out in this area. Functional data have been presented for highly conserved CNSs associated with four unrelated developmental regulators: SOX21, PAX6, HLXB9 and SHH. Trans-dev genes are mostly located in regions of low gene density because of a genome architecture that gives an additional level of transcriptional control. It has therefore been suggested that CNSs might play a role in structuring the genomic architecture, mostly around trans-dev genes. There are cases where a CNS cluster is found close to more than one trans-dev gene. Such occurrences illustrate the value of associating the endogenous expression patterns of genes with CNE enhancer activity [11, 15, 16, 18].
Human and Zebrafish
Humans and fish are both eukaryotes. According to the tree of life, which describes the evolutionary distance between these two species (Figure 10), the last common ancestor of humans and fish lived roughly 440 million years ago. They are therefore not closely related species, and genomic sequences shared by the two are likely to be evolutionarily constrained functional regions.
Figure 6. Tree of life describing the evolutionary distance between the zebrafish and human
Zebrafish (Danio rerio), a species of fish, is a useful model organism for studies of vertebrate development and gene function. The pioneering work of George Streisinger at the University of Oregon established the zebrafish as a model organism. Because the gene order and content of zebrafish mitochondrial DNA are identical to the common vertebrate form [ZFIN - db about zebrafish], it is an ideal model vertebrate for development and disease research. When performing genome comparisons to find regions conserved over evolutionary time, it is important to choose sequences that are orthologous. The evolutionary tree in Figure 10 describes the relation between the zebrafish and the human. As the figure shows, the two species have had about the same evolutionary time since their separation from their last common ancestor. A human-zebrafish genome comparison found 4799 CNEs that are more than 70% identical and more than 80 bp long. Another genomic comparison identified 73187 strand-specific CNEs, each more than 50 bp long and with more than 50% identity between the zebrafish and human [cneViewer]. cneViewer is the database built on that data, and it can be accessed as a web tool.
Figure 7: Part of evolutionary tree. Positions of the human and zebrafish (Danio rerio) are highlighted on the figure.
Design of the System
This chapter illustrates the design approach on analyzing conserved non-coding DNA sequences. Figure 8 depicts the data flow and the linkage of the system.
sequence similarity > 60%
sequence length > 100bp
cneViewer database
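The similarity and length thresholds can be applied as a simple filter over CNE records. The sketch below is only an assumption about how such a step might look: the record layout (`Cne`), the function name `filter_cnes` and the sample records are hypothetical and do not reflect the actual cneViewer schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cne:
    """Hypothetical CNE record; not the real cneViewer schema."""
    chrom: str
    start: int        # 0-based, inclusive
    end: int          # exclusive
    identity: float   # percent identity between human and zebrafish

    @property
    def length(self) -> int:
        return self.end - self.start


def filter_cnes(cnes, min_identity: float = 60.0, min_length: int = 100):
    """Keep CNEs with identity > min_identity (%) and length > min_length (bp)."""
    return [c for c in cnes
            if c.identity > min_identity and c.length > min_length]


records = [                       # made-up records for illustration
    Cne("chr1", 0, 150, 72.5),    # passes both thresholds
    Cne("chr1", 500, 580, 90.0),  # too short (80 bp)
    Cne("chr2", 0, 300, 55.0),    # identity too low
    Cne("chr2", 1000, 1100, 80.0) # exactly 100 bp, not strictly longer
]
print(filter_cnes(records))               # only the first record
print(len(filter_cnes(records, 50, 50)))  # 4
```

Passing other values for `min_identity` and `min_length` reproduces the looser or stricter cut-offs quoted earlier for the published human-zebrafish comparisons.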