Sunday, April 12, 2009

Genome Maps

A genome map is a linear representation of
genomic landmarks (genes and markers). It refers
either to a chromosome (cytogenetic map)
or to a stretch of DNA. A map provides knowledge
of the position of a particular genomic
landmark and its relation to others. Unraveling
the human genome in some respect resembles
the mapping of new continents five hundred
years ago.
A genetic map expresses the positions of genes
relative to each other without a physical anchor
on the chromosome. This is the type of map that
was first used by A. H. Sturtevant in 1913 when
working with T. H. Morgan (who started studying
Drosophila in 1910). Here, the distance between
markers is determined by the frequency
of recombination during meiosis, which is in
turn determined by the relative distance between
the loci (see p. 116). A physical map provides
knowledge of the exact position of a gene
or marker. Its distance to another locus on the
same chromosome is expressed by number of
base pairs (bp), a physical equivalent. A variety
of methods are employed to arrive at a physical
map.

Physical and genetic gene maps

The physical map gives the position of a gene
locus and its distance from other genes on the
same chromosome in absolute values, expressed
in base pairs and related to given positions
along the chromosome. The genetic map
gives the relative position of gene loci according
to the frequency of recombination, expressed as
recombination units or centimorgans (cM). One
centimorgan corresponds to a recombination
frequency of 1%. Since recombination occurs almost
twice as often in oocytes as in spermatocytes,
the genetic map in females is about 40%
longer than in males. Each marker locus has an official
designation with a defined abbreviation:
the letter D (for DNA), the number of the
chromosome, an S for single-copy DNA, and
the serial number of the marker, e.g.,
D1S77.
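
The relation between recombination frequency and map distance can be sketched in a few lines of Python. This is only an illustration: the function name and the offspring counts are invented for the example, and the simple proportionality (1% recombination = 1 cM) is a reasonable approximation only for closely linked loci, since double crossovers go undetected over longer distances.

```python
def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Genetic distance in centimorgans, using 1 cM = 1% recombination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    # Recombination frequency expressed as a percentage.
    return 100.0 * recombinants / total_offspring


# Hypothetical cross: 12 recombinants among 400 scored offspring.
print(map_distance_cm(12, 400))
```

With these invented counts the two loci would lie about 3 cM apart.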

STS mapping from a clone library

STS mapping plays a major role in genome mapping.
An STS (sequence-tagged site) is a short
stretch (60–1000 bp) of a unique DNA nucleotide
sequence. An STS has a specific location
and can be analyzed by PCR (see p. 66). The relevant
information, i.e., the sequence of the
oligonucleotide primers used for the PCR reaction
and other data can be stored electronically
and does not depend on biological specimens.
One can start with a clone library containing
DNA fragments in unknown order (1). Each end
of the chromosomal fragment is characterized
by a pattern of restriction sites (see p. 64). The
DNA fragments are ordered by determining
which ends overlap, then assembling them as a
contiguous array of overlapping fragments into
a clone contig (2). These are linearly arranged.
This establishes a map that shows the location
and the physical distance of the landmarks,
here A, B, C, etc. (3). Sequence-tagged sites
(STSs) are generated from the two ends of the
overlapping clones. This involves sequencing
100–300 bp of DNA (4).
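
The grouping of clones into contigs can be illustrated with a small Python sketch. It assumes, for simplicity, that two clones overlap whenever they share at least one STS. The clone names, the STS labels, and the function build_contigs are hypothetical. The sketch only groups clones and does not determine their linear order.

```python
from collections import defaultdict


def build_contigs(clone_sts: dict[str, set[str]]) -> list[list[str]]:
    """Group clones into contigs; clones sharing an STS are assumed to overlap."""
    # Index: which clones contain each STS?
    clones_by_sts = defaultdict(set)
    for clone, markers in clone_sts.items():
        for sts in markers:
            clones_by_sts[sts].add(clone)

    seen = set()
    contigs = []
    for start in clone_sts:
        if start in seen:
            continue
        # Depth-first search over the "shares an STS" relation.
        stack, members = [start], []
        seen.add(start)
        while stack:
            clone = stack.pop()
            members.append(clone)
            for sts in clone_sts[clone]:
                for neighbour in clones_by_sts[sts]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        contigs.append(sorted(members))
    return contigs


# Hypothetical library: clone name -> STSs detected in it by PCR.
library = {
    "clone1": {"A", "B"},
    "clone2": {"B", "C"},
    "clone3": {"C", "D"},
    "clone4": {"X"},
}
print(build_contigs(library))
```

Here clones 1 to 3 fall into one contig, while clone 4 shares no STS with the others and remains on its own.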

EST mapping

ESTs (expressed sequence tags) are short DNA
sequences obtained from cDNA clones (complementary
DNA, see p. 58). Each EST represents
part of a gene. Their location is determined
by hybridizing an assembly of different
cDNAs (1) to genomic DNA (2). Thus, the locations
of defined sequences of expressed genes
can be determined (3). These can be mapped to
a location on a chromosome to establish an EST
map.

Approach to Genome Analysis

The approach to genome analysis encompasses
several goals. Of primary interest is the number,
type, and distribution of genes. Knowing all
genes and their positions and structures in a
eukaryotic genome will provide the basis for
understanding their function. The size of a
genome needs to be taken into account for a
systematic study.
Two basic approaches to sequencing a genome
can be distinguished: clone-by-clone sequencing
and the so-called shotgun approach. In the
former, individual DNA clones of known relation
to each other are isolated, arranged in their
proper alignment, and sequenced. The shotgun
approach breaks the genome into millions of
fragments of unknown relation. The individual
DNA clones, for which prior knowledge of their
precise origin is lacking, are sequenced. Subsequently,
they are aligned by high-capacity
computers. The two approaches complement
each other.
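
The computational alignment step of the shotgun approach can be sketched as a greedy merge of overlapping reads. This is a toy illustration, not the method used by actual assembly programs: the function names and reads are invented, and the sketch ignores sequencing errors, reverse-complement reads, and repetitive sequences.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest suffix-prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads  # no remaining overlaps: these are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Hypothetical reads of unknown order.
print(greedy_assemble(["CGTACC", "ATGCGT", "ACCTGA"]))
```

The three invented reads overlap by three nucleotides each and are merged into a single sequence.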

Sizes of genomes and cloning vectors

The sizes of genomes of different organisms
vary considerably. In general, genome size reflects
the complexity of the organism. A mammalian
genome (human and mouse are known
best) contains 3 × 10⁹ base pairs (bp) or 3000
Mb. If each nucleotide pair were represented by
a 1-mm-wide letter, the text would be more
than 3000 km long or take up more than ten
sets of the Encyclopaedia Britannica or 750
megabytes of computer capacity. Thus, finding
all genes, mapping their position, and determining
their structure and function is an
enormous task (see Human Genome Project).
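
The orders of magnitude quoted here can be checked with simple arithmetic. In the sketch below, the encoding of each nucleotide in 2 bits is an assumption made for the example. It is one way to arrive at a figure of 750 megabytes. The variable names are invented.

```python
GENOME_BP = 3_000_000_000  # 3 × 10⁹ base pairs, as stated in the text
LETTER_WIDTH_MM = 1        # one 1-mm-wide letter per nucleotide pair
BITS_PER_BASE = 2          # assumption: A, C, G, T encoded in 2 bits each

# 1 km = 1,000,000 mm
length_km = GENOME_BP * LETTER_WIDTH_MM / 1_000_000
# bits -> bytes -> megabytes (10⁶ bytes)
storage_mb = GENOME_BP * BITS_PER_BASE / 8 / 1_000_000

print(length_km, storage_mb)
```
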
By comparison, the genome of important model
organisms such as Drosophila, the nematode C.
elegans, yeast, and bacteria are much smaller.
The genomes of some important plants such as
maize, rice, and wheat are even larger (5000–
17000 Mb) than mammalian genomes.
Since the size of DNA fragments that can be isolated
and multiplied in cloning vectors for
analysis is relatively small, a huge cloning
capacity is necessary for analysis of a large
genome. Yeast artificial chromosomes (YAC)
can accommodate about 1.4 Mb, bacterial artificial
chromosomes (BAC) about 0.5 Mb, whereas
bacteriophages and cosmids

Range of resolution within the genome

The resolution ranges from a whole chromosome
or part of a chromosome isolated from a
somatic hybrid cell line (1) to the sequence of
the nucleotide pairs (5) and cloned DNA fragments
(2). Each fragment is characterized by
distinct landmarks (restriction sites or
sequence-tagged sites, STS, see Genome Maps).
They are aligned according to their contiguous
linear orientation in a contig (3), which can be
mapped (4). The individual clones can be
sequenced (sequence map, 5). This approach is
called a clone-by-clone approach in contrast to
“shotgun sequencing”.

Alignment of overlapping DNA clones

From a stretch of genomic DNA to be characterized
by schematic segments A–E (1), a series of
overlapping clones is derived (2). In the first
step a radiolabeled clone is hybridized to
genomic DNA (clone 1) from a DNA library. A
probe (probe A) is used to find the adjacent
clone by hybridizing to a new clone (clone 2).
This establishes that clones 1 and 2 overlap.
Similar steps follow. DNA farther away can be
identified until one reaches the clone that contains
the gene of interest. This procedure, called
chromosome (DNA) walking, can start from
several points and proceed in both directions.
Modifications are used to speed up this process
and to cover large stretches of contiguous DNA.
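
The logic of chromosome walking can be sketched in Python as follows. The model is deliberately simplified and rests on assumptions: each clone is represented as a left-to-right list of the schematic segments it contains, the probe is taken from the last segment of the current clone, and "hybridization" is reduced to testing whether a clone contains that segment. The clone names and the function chromosome_walk are hypothetical, and the walk proceeds in one direction only.

```python
def chromosome_walk(library, start_clone, target_segment):
    """Return the clones visited, in order, until one contains target_segment."""
    path = [start_clone]
    visited = {start_clone}
    current = start_clone
    while target_segment not in library[current]:
        probe = library[current][-1]  # end segment of the current clone
        # Find an unvisited clone that "hybridizes" to the probe.
        next_clone = next(
            (name for name, segments in library.items()
             if name not in visited and probe in segments),
            None,
        )
        if next_clone is None:
            raise LookupError(f"walk stalled: no new clone contains probe {probe!r}")
        path.append(next_clone)
        visited.add(next_clone)
        current = next_clone
    return path


# Hypothetical library covering the schematic segments A–E.
library = {
    "clone1": ["A", "B"],
    "clone2": ["B", "C"],
    "clone3": ["C", "D"],
    "clone4": ["D", "E"],
}
print(chromosome_walk(library, "clone1", "E"))
```

Starting from clone 1, the walk passes through clones 2 and 3 and ends at clone 4, which contains segment E.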

Organization of Eukaryotic Genomes

The genomes of higher organisms contain considerably
more DNA than required for genes
since most of their DNA does not have coding
functions.

The components of the nuclear human genome

The human genome is huge, consisting of 3 billion
base pairs (3 × 10⁹ bp or 3000 Mb) per haploid
set of chromosomes. Only 30% of mammalian
DNA is related to genes (900 Mb),
whereas 70% of the DNA is not (2100 Mb).
Coding DNA in genes accounts for only 3% (90
Mb) of the total amount of DNA. The bulk of
DNA (70%) consists of sequences that are repeated
many times (repetitive DNA). Characteristic
types of repetitive DNA are tandem repeats.
Depending on their size and pattern,
different types are distinguished: classic satellite
DNA, minisatellites, and microsatellites. Together
they constitute 14% of the total DNA (420
Mb). More than half of human DNA (56%) consists
of repeats interspersed throughout the
genome. The most important types are long terminal
repeats (LTRs), LINEs (long interspersed
nuclear elements), SINEs (short interspersed
nuclear elements), and transposons.
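
The megabase values given in parentheses follow directly from the percentages and the genome size of 3000 Mb. A minimal check in Python (the dictionary labels and the function name are chosen for this example only):

```python
GENOME_MB = 3000  # haploid human genome size in megabases, as stated in the text

# Percentages quoted in the text.
percentages = {
    "gene-related DNA": 30,
    "DNA not related to genes": 70,
    "coding DNA": 3,
    "tandem repeats (satellite, mini-, microsatellite)": 14,
}


def megabases(percent: int, genome_mb: int = GENOME_MB) -> float:
    """Convert a percentage of the genome into megabases."""
    return percent * genome_mb / 100


for label, percent in percentages.items():
    print(f"{label}: {megabases(percent):.0f} Mb")
```
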

Satellite DNA

When fractionated human DNA is centrifuged
in a cesium chloride density gradient, the main
portion of DNA (1) forms a band at buoyant
density 1.701 g · cm⁻³. Three additional bands
(satellites) appear at 1.687, 1.693, and 1.697 g ·
cm⁻³, respectively. These are less dense because
their CG content differs from that of the main
DNA. One distinguishes classic satellite DNA (2)
made up of repeats of 100–6500 bp, minisatellites
(3) of 10–20 bp repeats, and microsatellites
(4) of 2–5 bp repeats. AT-rich and CG-rich segments
can be recognized. Microsatellites are the
most frequent form of repetitive DNA. Their
general structure is (CA)n where n equals about
2–10. The human genome contains 50000–
100000 polymorphic (CA)n blocks.
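
Blocks of the (CA)n type can be located in a sequence with a simple pattern search. The sketch below uses Python's standard re module. The function name, the minimum repeat number, and the test sequence are invented for illustration.

```python
import re


def find_ca_repeats(sequence: str, min_units: int = 2) -> list[tuple[int, int]]:
    """Return (start position, number of CA units) for each (CA)n block."""
    if min_units < 1:
        raise ValueError("min_units must be at least 1")
    # Non-capturing group repeated at least min_units times.
    pattern = re.compile(r"(?:CA){%d,}" % min_units)
    return [
        (match.start(), len(match.group()) // 2)
        for match in pattern.finditer(sequence.upper())
    ]


# Hypothetical sequence containing one (CA)5 block and one isolated CA.
print(find_ca_repeats("GGTCACACACACATTGCAGG"))
```

The search reports one block of five CA units beginning at position 3 (counting from 0). The isolated CA near the end falls below the minimum of two units.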

Long interspersed nuclear elements (LINEs)

Long interspersed repeat sequences (LINEs) are
mammalian retrotransposons that in contrast
to retroviruses lack long terminal repeats
(LTRs). They account for up to 70% of the human
genome by weight. They consist of repetitive
sequences up to 6500 bp long that are adenine-rich
at their 3′ ends. They may contain one or
two open reading frames (ORF), although they
are usually shorter and contain no ORF. At the 5′
end and at the 3′ end they have an untranslated
region (5′ UTR and 3′ UTR). LINE elements are
thought to have arisen by transposition (see
p. 76). Mammalian genomes contain 20000–
60000 copies of LINE sequences. The major
human LINE element is the L1 sequence, a segment
that spans up to 6.4 kb. Approximately
100 000 L1 elements are dispersed throughout
the human genome. Genetic disease can result
if one is inserted into a gene.
Short interspersed nuclear elements (SINEs)

Short interspersed repeat sequences (SINEs)
consist of midsize repetitive segments of similar
nucleotide sequences with an average of 300
bp. Their basic structure is a tandem duplication
of CG-rich segments separated by adenine-rich
segments. The most frequent SINE
sequences in humans are the Alu family (Alu
sequences). With about 500 000 copies, they
make up about 3–6% of the total genome of
humans. An Alu sequence consists of two 130-bp
tandem duplications with A-rich sections
between them. The 3′ side (“right side”) contains
an insertion of 32 bp.

Gene Identification

A frequent and important goal in genome research
is to identify a particular gene with its
structure and expression and relate these to
normal and abnormal function. This process has
been aided by the ever-increasing number of
genes with a known map location, by access to
extensive data banks that store sequence information,
by the availability of dense maps of
marker loci, by comparative data from different
organisms, and by information from other
sources. The principles applied in identifying a
gene are outlined here.

Different approaches to identifying a disease-related gene

Disease genes have been identified and isolated
by different approaches, the choice of which has
depended on prior information and technical
considerations. The three approaches that have
been useful to date are (i) positional cloning, (ii)
functional cloning, and (iii) cloning of a candidate
gene.

Positional cloning

Positional cloning has been used to identify
more than 20 human disease genes. The first
and crucial step is clinical identification of the
phenotype. Next, the putative gene is mapped
to a chromosomal position with defined limits
(see B). From here the gene can be isolated
(cloned). In the next step, mutations in the gene
are demonstrated in patients and are shown not
to be present in unaffected family members and
normal controls.
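
The last step, separating disease-related mutations from variants that also occur in unaffected persons, amounts to a comparison of sets. A minimal sketch, assuming that the sequence variants of each person are already known. The person labels, the variant names, and the function candidate_mutations are hypothetical.

```python
def candidate_mutations(patients: dict[str, set[str]],
                        unaffected: dict[str, set[str]]) -> set[str]:
    """Variants seen in at least one patient and in no unaffected person."""
    in_patients = set().union(*patients.values())
    in_unaffected = set().union(*unaffected.values())
    return in_patients - in_unaffected


# Hypothetical variants found in the gene under study.
patients = {
    "patient1": {"c.100A>G", "c.250C>T"},
    "patient2": {"c.100A>G"},
}
unaffected = {
    "mother": {"c.250C>T"},
    "control1": set(),
}
print(candidate_mutations(patients, unaffected))
```

In this invented example, c.250C>T is also present in the unaffected mother and is therefore more likely to be a polymorphism, whereas c.100A>G remains as a candidate mutation.
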
Functional cloning requires prior knowledge of
the function of the gene. As this information is
rarely available at the outset, this approach is
very limited.

Candidate gene

The candidate gene approach utilizes independent
paths of information. If a gene with a function
relevant to the disorder is known and has
been mapped, mutations of this gene can be
sought in patients. If mutations are present in
the candidate gene of patients, this gene is
likely to be causally related to the disease.

Principal steps in gene identification

To identify a suspected human disease gene,
clinical and family data together with blood
samples for DNA must be collected from
patients with the same particular disorder. The
disorder may follow one of the three modes of
monogenic inheritance (autosomal recessive,
autosomal dominant, X-chromosomal; 1) or
multigenic complex inheritance (not shown).
Genetic heterogeneity must be ruled out. A
chromosome region likely to harbor the gene is
then identified by one of the genetic mapping
techniques (linkage analysis or physical mapping
using a chromosomal structural aberration
such as a deletion or a translocation) (2). The
map position is refined and the gene location is
narrowed to a small area, e.g., a visible chromosome
band within 2–3 Mb (3). A contig of overlapping
DNA clones from a YAC or BAC (bacterial
artificial chromosome) or a cosmid library is established
from the region (4). This results in a
refined molecular map (5) characterized by a
set of localized polymorphic DNA marker loci.
ORF

The presence of open reading frames (ORF),
transcripts, exons, and polyadenylation sites in
this region indicates that one is dealing with a
gene. Such stretches of DNA most likely representing
genes are isolated and subjected to mutational
analysis (6). Those genes not containing
mutations in patients can be excluded (7).
When mutations are found and a polymorphism
is excluded, the correct gene has been
identified (8). For confirmation, the expression
pattern is analyzed, its structure (exons, introns)
and size are determined, the transcript is
analyzed, and these properties are compared
with those of similar genes in other organisms
(“zoo blot”, see p. 250). Finally, the DNA
sequence of the entire gene can be determined
and conclusions about the function drawn.
From here on, a gene-directed diagnostic procedure
can be developed.
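
The search for open reading frames mentioned above can be illustrated by a simple scan of a DNA sequence. The sketch below is a crude approximation under stated assumptions: it examines only the three reading frames of the forward strand, takes ATG as the start codon, and ignores introns, so it would apply to cDNA rather than to genomic DNA of eukaryotes. The function name, the minimum length, and the test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions, 0-based and end-exclusive, of ATG...stop ORFs.

    min_codons counts every codon of the ORF, including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence: CC ATG GCT TGG TAA GG
print(find_orfs("CCATGGCTTGGTAAGG"))
```

For the invented sequence, a single ORF of four codons is found, extending from position 2 to position 14.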