PopGen1: Introduction to population genetics
Introduction
MICROEVOLUTION is the term used to describe the dynamics of evolutionary change in populations and species
over time. The discipline devoted to the study of microevolution at the molecular level is called POPULATION
GENETICS. Note that the complementary term, Macroevolution, is most often used to describe patterns and
processes of diversification above the species level such as rates of speciation, and the origin of evolutionary
novelties. Disciplines such as PHYLOGENETICS, COMPARATIVE BIOLOGY, and PALAEONTOLOGY concern
themselves with macroevolution. It is important to understand that the process of microevolution can generate
(most, if not all) macroevolution, by operation over the relevant time spans.
Since we are interested in the outcome of evolution at the molecular level, i.e., the diversity of genes, proteins,
chromosomes, genomes, and organisms that we see in the world around us today, we must understand
population genetics. The principles of population genetics are used to (i) measure the extent of genetic variation
in natural populations, and (ii) explain natural genetic variation in terms of its origin, maintenance, and
evolutionary significance.
The practice of population genetics is based on using the information contained in a sample of a population to
make inferences about the evolutionary processes affecting that population. The discipline often uses explicit
models as the basis for testing hypotheses about the evolutionary processes that generated the sample of data,
and to estimate parameters; this approach is called PARAMETRIC INFERENCE. Sometimes, no model is assumed,
and that approach is called NONPARAMETRIC INFERENCE. The discipline of population genetics has practical value
in a wide variety of other disciplines. Medical biology relies heavily on population genetics to detect links
between genetic variation and a disease. Agriculture uses the principles in development of breeding programs.
Conservation, fisheries, wildlife, and forestry biologists all use population genetics. In fact, the combination of
molecular data and population genetic principles has revolutionized the fields of conservation biology and
ecology. Anthropologists are very interested in population genetics because it can be used to make inferences
about patterns of migration and interbreeding during the history of human evolution.
Until this point the term POPULATION has been used rather informally, but now is the time for a formal definition.
A population is a subgroup of individuals of the same species living within some set of restrictions, usually a
restricted geographic area. The term population is not used to refer to an entire species. Individuals of a
species are rarely distributed uniformly, having some form of aggregation according to the most favourable
aspects of the environment. For terrestrial vertebrates such aggregations are often obvious: herds, colonies,
habitat preferences such as marshes or deserts, etc. Less clear are the boundaries associated with genetic
similarity. In population genetics, the population is defined by the restriction of local interbreeding. Sometimes
such populations are called LOCAL POPULATIONS or DEMES, but they are very often simply called a POPULATION.
The practical definition will vary from species to species, and also from researcher to researcher. Although a
more precise definition is difficult, for our purposes we need only assume that we have defined the population
appropriately. We want to define the local population correctly because it is the unit within which the evolution of
adaptive characteristics occurs; i.e., the evolving unit of the species.
So, for the purposes of this course we come to a definition of EVOLUTION: the change in the genetic constitution
of a population over time, or simply the change in allele frequencies in a population over time.
Allele frequencies in populations
In population genetics, variation is often quantified using the notion of ALLELE FREQUENCIES. Alleles are the
alternative forms of genes, including the so-called “normal” form or “wild-type” form. The allele frequencies are
the simple proportions of different alleles within a local population. Note that sometimes the term GENE
FREQUENCY will be used synonymously with allele frequency. Long-term changes in allele frequencies, i.e.,
evolution, occur within the unit of the local population.
Before we can measure evolution by the change in allele frequencies in a population, we must be able to
measure the frequencies of those alleles. Consider an autosomal locus in a collection of diploid individuals.
Let’s take the human M-N blood group as one example. In this blood group there are two alleles, M and N. In
this case there would be three possible types of GENOTYPES, MM, MN, or NN. A sample of the population of
Icelandic Eskimos was typed according to the MN blood group and the data are presented in the table below:

Table 1. MN blood group genotype counts
Genotype    MM     MN     NN     Total (n)
Count       233    385    129    747
From the data in this table we can compute the allele frequencies of M and N in this sample. By convention we
use p and q to represent the frequencies of M and N, with p + q = 1.
$$p = \frac{2(MM) + 1(MN)}{2n} \quad \text{and} \quad q = 1 - p$$
Using the data in Table 1:
$$p = \frac{2(233) + 385}{2 \times 747} = 0.57 \quad \text{and} \quad q = 1 - 0.57 = 0.43$$
The exercise in calculating the frequencies of M and N leads us to an important point: the quantities p and q are
PARAMETERS of a population with unknown true values. We had to estimate the values of those parameters from
a sample of the population. Such estimates will have an error, and we must judge the reliability of our estimate
by estimating its VARIANCE. We can estimate the sampling variance of p with:
$$\mathrm{Var}(p) = \frac{p(1 - p)}{2n}$$
Assuming that repeated estimates would be normally distributed, we can use the variance to make predictions
about how close the true frequency of the allele in the population is to our estimate of the frequency. First we
take the square root of the variance; this is known as the STANDARD ERROR of the estimate. Under the normal
distribution approximately 68% of the estimates of a parameter would fall within the interval defined by the true
value ± 1 standard error. This is called a CONFIDENCE INTERVAL. The 95% confidence interval is defined by ± 2
standard errors, and the 99.7% confidence interval is defined by ± 3 standard errors.
Let’s use the variance to evaluate the effect of sample size on the precision of an estimate of allele frequencies.
Suppose we obtain the following sample from a population: AA = 8, Aa = 6, and aa = 2. Let’s use the above
formulas to estimate p and q and the 95% confidence interval.
p = 0.6875, var(p) = 0.0067, 95%CI = (0.524,0.851)
Now let’s recalculate the results after multiplying the sample by a factor of 10.
p = 0.6875, var(p) = 0.00067, 95%CI = (0.636,0.739)
Increasing the sample size narrowed our confidence interval, indicating that the precision of our estimate
has increased. Of course, had we actually taken another sample 10 times larger, we would not have
estimated p exactly as we had in our small sample. The 95%CI is a tool for quantifying this sort of sampling
error.
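The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `allele_freq_summary` and its argument names are assumptions made for this example, and the interval uses the rule of thumb from the text (± 2 standard errors for 95%).

```python
import math

def allele_freq_summary(n_AA, n_Aa, n_aa, z=2.0):
    """Estimate p (frequency of allele A), its sampling variance, and a CI.

    z is the number of standard errors on each side of the estimate;
    z=2.0 follows the text's rule of thumb for a 95% interval.
    """
    n = n_AA + n_Aa + n_aa              # number of diploid individuals
    p = (2 * n_AA + n_Aa) / (2 * n)     # each AA carries two A alleles, each Aa one
    var_p = p * (1 - p) / (2 * n)       # sampling variance over 2n sampled alleles
    se = math.sqrt(var_p)               # standard error of the estimate
    return p, var_p, (p - z * se, p + z * se)

# The small sample from the text, and the same sample multiplied by 10.
print(allele_freq_summary(8, 6, 2))
print(allele_freq_summary(80, 60, 20))
```

Calling the same function with the Table 1 counts, `allele_freq_summary(233, 385, 129)`, gives the estimate of p for the M allele together with its interval.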
Changes in gene frequencies between populations
Let’s return to our sample of the frequencies of the MN blood groups from Iceland, and compare it to a sample
from Greenland. The data, presented in the table below, clearly indicate a difference in genotype frequencies
between these populations. We can draw at least one conclusion from these data: evolution has occurred!
Genotype frequencies
Population    MM       MN       NN       Total (n)
Iceland       0.312    0.515    0.173    747
Greenland     0.835    0.156    0.009    569
We are now faced with the task of identifying the origin and evolutionary significance of such change. The
causes of change are the subject matter for later lectures. For now we will simply list the possible causes of
change:
1. Finite population size: The alleles passed from one generation to another are a sample of the genes of
the parental generation. Because the proportion of alleles in the sample is subject to sampling variation
between generations, allele frequencies will vary from generation to generation (evolution by our
definition). The size of the sample, in this case the number of members of the population that are
reproducing, influences the strength of this effect. Think back to the difference in the variance of our
samples of allele frequencies above.
2. Mutation: Of course if a mutation occurs then a change in allele frequencies has occurred. You have
probably already guessed that a single mutational event will have a very small effect in large populations.
It is other forces such as sampling variance or selection that determine the evolutionary significance of a
new mutant over evolutionary time.
3. Non-random mating: Suppose that potential mates in the parental generation have a preference for a
certain trait with a genetic basis. If they preferentially choose mates with such a trait then the frequency of
the gene encoding that trait will increase in the next generation. This is just one example of how non-random mating influences allele frequencies.
4. Natural selection: Individuals in a population differ in their ability to survive and reproduce (fitness). If
an allele confers an increased fitness to those individuals who carry it, then they will make a greater
contribution to subsequent generations, thus increasing the frequency of that allele over time. Of course it
can work the other way. If a new allele arose by mutation and reduced that individual’s fitness, then that
individual might not reproduce and the allele would not appear in the next generation.
5. Migration or Gene Flow: Given two or more populations, immigration of individuals from outside the
population will change the frequencies of alleles in that population.
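The first cause in the list, finite population size, can be illustrated with a small simulation. The sketch below is not taken from these notes: it assumes that each generation is formed by drawing 2N gametes at random from the parental allele pool, and the function name `simulate_drift` and its parameters are hypothetical names chosen for illustration.

```python
import random

def simulate_drift(p0, n_individuals, generations, seed=0):
    """Track the frequency of allele A when each generation is a random
    sample of 2N gametes from the previous one (assumed sampling scheme)."""
    rng = random.Random(seed)           # seeded so that runs are repeatable
    two_n = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # Each sampled gamete carries allele A with probability p.
        count_A = sum(1 for _gamete in range(two_n) if rng.random() < p)
        p = count_A / two_n
        trajectory.append(p)
    return trajectory

small_pop = simulate_drift(0.5, n_individuals=10, generations=50, seed=1)
large_pop = simulate_drift(0.5, n_individuals=1000, generations=50, seed=1)
print("range of p, N = 10:  ", max(small_pop) - min(small_pop))
print("range of p, N = 1000:", max(large_pop) - min(large_pop))
```

In runs of this kind the small population typically wanders much further from its starting frequency than the large one, which mirrors the effect of sample size on the variance of p seen earlier.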
In order to identify which of these is relevant, we must first understand the simple case where all these factors
are excluded from the system. This very simple model results in a type of equilibrium that will be the focus of the
remainder of this topic.
Hardy-Weinberg equilibrium
“A MODEL is an intentional simplification of a complex situation designed to eliminate extraneous detail in order to
focus attention on the essentials of the situation” (Daniel L. Hartl).
We need a model of an idealized, unchanging, population as our starting point. Under such a model gene
frequencies will not change between generations: i.e., no evolution occurs. Starting with such a model we can
then add different causes of change, such as natural selection or non-random mating, and see how such
processes affect the relationship between genotype frequencies and allele frequencies across generations.
Modern population and quantitative genetics depends on understanding this relationship.
The HARDY-WEINBERG (HW) MODEL provides us with the model we need. It represents a population at
equilibrium with respect to allele and genotype frequencies. Independently specified in 1908 by G.H. Hardy and
W. Weinberg, the model is based on the following assumptions:
Assumptions of the HW model
1. The organism is diploid
2. Reproduction is sexual
3. Mating is random
4. Generations are discrete
5. Population size is infinite (or very large)
6. No migration
7. No mutation
8. No natural selection
Under these assumptions we may specify the probabilities of all possible combinations of gametes in the next
generation as:
$$(p + q) \times (p + q)$$
Do you remember the PRODUCT RULE and ADDITIVE RULE from your genetics class? Also, the proportions of all
possible genotypes must sum to 1:
$$(p + q) \times (p + q) = 1$$
Multiplication gives:
$$p^2 + 2pq + q^2 = 1$$
Let’s say that p is the frequency of the A allele and q is the frequency of the a allele. Then the genotype
frequencies in the next generation are as follows:
$$f_{AA} = p^2, \qquad f_{Aa} = 2pq, \qquad f_{aa} = q^2$$
We can see this in the form of a table:

                          Male gametes
                          A (p)           a (q)
Female       A (p)        AA ($p^2$)      Aa ($pq$)
gametes      a (q)        Aa ($pq$)       aa ($q^2$)
Keynotes of the HW model:
1. The HW model specifies the relationship between allele frequencies and genotype frequencies in a population. Such
a relationship can be tested with real data. When a population is consistent with HW expectations we say
that population is in HARDY-WEINBERG EQUILIBRIUM.
2. The standard mechanisms of Mendelian inheritance ensure that allele frequencies do not change from
generation to generation unless some external pressure is acting.
3. No matter what the initial frequencies, just one generation of random mating will result in HW frequencies.
These features provide the foundation for many methods of investigation of the genetics of natural populations.
In particular it is the departure from HW equilibrium, and the cause of such a departure (i.e., which assumptions
of the model have been violated), that interests population geneticists. Unfortunately, HW is not very sensitive
to certain kinds of violations. This means that just because we cannot reject the notion of HW equilibrium for a
natural population does not mean that we can conclude all the assumptions are valid.
Rare recessive alleles can “hide” in the heterozygotes
The relationship between allele and genotype frequency under HW is informative: when a recessive allele is
rare, most of the individuals who carry the allele are heterozygous.
The ratio of heterozygote : rare homozygote = $2pq : q^2$

q        $2pq : q^2$
0.1      18:1
0.01     198:1
0.001    1998:1
Let’s look at a real example:
Cystic fibrosis is a serious human illness caused by an autosomal recessive gene. An individual homozygous
recessive for the CF “gene” produces abnormally thickened glandular secretions, which block the normal
function of organs such as the lungs and digestive organs, causing difficulties in breathing and digestion.
The frequency of CF is about 1 in 1700 newborn Caucasians.
$$q^2 = 1/1700$$
$$q = (1/1700)^{1/2} = 0.024$$
ASSUMING HW:
Frequency of heterozygotes in population = $2pq = 2 \times 0.024 \times (1 - 0.024) = 0.047$
So, if we allowed a generation of random mating and no selection, 1 in 21 people would be heterozygous
“carriers” of the CF allele, even though CF itself is quite rare in this population, at 1 in 1700!
Note that in this case we just assumed HW without testing if the assumption was reasonable. CF is clearly
under selection, and population sizes are finite. However, heterozygous individuals are not subject to selection,
and the example illustrates that deleterious alleles can “lurk” in a large fraction of the population without any
overt phenotypic effect.
More realistic models of CF require consideration of selection, mutation, and drift; the frequency of the CF allele
in human populations reflects the interaction of all these evolutionary forces. We will return to this example later
in the course.
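The carrier calculation, and the heterozygote-to-homozygote ratios tabulated earlier, can be reproduced with a short Python sketch. The function names are hypothetical, and the code simply assumes HW proportions, exactly as the worked example does.

```python
import math

def carrier_frequency(affected_freq):
    """Return (q, 2pq), assuming HW proportions so that affected_freq = q^2."""
    q = math.sqrt(affected_freq)
    p = 1 - q
    return q, 2 * p * q

def het_to_hom_ratio(q):
    """Ratio 2pq : q^2, which simplifies to 2(1 - q) / q."""
    return 2 * (1 - q) / q

q, carriers = carrier_frequency(1 / 1700)
print(round(q, 3), round(carriers, 3), "about 1 in", round(1 / carriers))

for rare_q in (0.1, 0.01, 0.001):
    print(rare_q, round(het_to_hom_ratio(rare_q)))
```

Writing the ratio as 2(1 - q)/q makes the pattern explicit: as q shrinks, almost every copy of the rare allele sits in a heterozygote.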
A proof of the HW principle
Let: $P_1 = p^2$, $P_2 = 2pq$, $P_3 = q^2$ in the parental generation

Genotypes               AA       Aa       aa
Genotype frequencies    $P_1$    $P_2$    $P_3$

Allele frequencies
A: $p = (P_1 \times 1) + (P_2 \times 1/2)$
a: $q = (P_3 \times 1) + (P_2 \times 1/2)$
The question is: what happens to the three genotype frequencies ($P_1'$, $P_2'$, and $P_3'$) in the next generation?
Under HW, $P_1 = P_1'$, $P_2 = P_2'$, and $P_3 = P_3'$. We can test for this outcome by listing all possible matings and
the frequencies of their offspring genotypes. Then we simply sum the products of each mating that give rise to
each of the three possible genotypes. The results are shown in the table below.
                                  Genotype frequencies of offspring
Mating      Frequency             AA        Aa        aa
AA × AA     $P_1^2$               1         0         0
AA × Aa     $2 P_1 P_2$           1/2       1/2       0
AA × aa     $2 P_1 P_3$           0         1         0
Aa × Aa     $P_2^2$               1/4       1/2       1/4
Aa × aa     $2 P_2 P_3$           0         1/2       1/2
aa × aa     $P_3^2$               0         0         1
Total in next generation:         $P_1'$    $P_2'$    $P_3'$
$$P_1' = P_1^2 + P_1 P_2 + \frac{1}{4} P_2^2 = \left(P_1 + \frac{P_2}{2}\right)^2 = p^2$$
$$P_2' = P_1 P_2 + 2 P_1 P_3 + \frac{1}{2} P_2^2 + P_2 P_3 = 2\left(P_1 + \frac{P_2}{2}\right)\left(P_3 + \frac{P_2}{2}\right) = 2pq$$
$$P_3' = \frac{1}{4} P_2^2 + P_2 P_3 + P_3^2 = \left(P_3 + \frac{P_2}{2}\right)^2 = q^2$$
So, we have our proof; after all this we see that allele frequencies do NOT change from generation to generation
under HW!
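The mating table can also be checked numerically. The sketch below is illustrative only: `next_generation` is a hypothetical helper that sums, for each offspring genotype, the product of mating frequency and offspring proportion. The starting frequencies (0.5, 0.2, 0.3) are made-up values deliberately chosen not to be in HW proportions.

```python
def next_generation(P1, P2, P3):
    """Offspring genotype frequencies (AA, Aa, aa) after one round of random mating."""
    # (mating frequency, offspring proportions for AA, Aa, aa), one row per mating type
    matings = [
        (P1 * P1,     (1.0, 0.0, 0.0)),     # AA x AA
        (2 * P1 * P2, (0.5, 0.5, 0.0)),     # AA x Aa
        (2 * P1 * P3, (0.0, 1.0, 0.0)),     # AA x aa
        (P2 * P2,     (0.25, 0.5, 0.25)),   # Aa x Aa
        (2 * P2 * P3, (0.0, 0.5, 0.5)),     # Aa x aa
        (P3 * P3,     (0.0, 0.0, 1.0)),     # aa x aa
    ]
    return tuple(
        sum(freq * offspring[i] for freq, offspring in matings)
        for i in range(3)
    )

start = (0.5, 0.2, 0.3)                 # hypothetical, not in HW proportions
p = start[0] + start[1] / 2             # frequency of allele A
gen1 = next_generation(*start)
gen2 = next_generation(*gen1)
print("p =", p)
print("generation 1:", gen1)
print("generation 2:", gen2)
```

With p = 0.6, the first generation lands on p^2, 2pq, q^2 (up to floating-point rounding), and a second round of random mating leaves those frequencies unchanged, which is the one-generation property listed among the keynotes.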
Testing for HW proportions
We test a population for HW frequencies in three steps: (i) compute observed genotype frequencies; (ii) compute
expected genotype frequencies under HW; and (iii) conduct a goodness of fit test.
We will use the following data as an example:
MN blood types in Pueblo Indians
Genotypes          MM     MN     NN     Total (n)
Observed counts    83     46     11     140
PART 1:
Observed Genotype frequencies:
MM = 83/140 = 0.59
MN = 46/140 = 0.33
NN = 11/140 = 0.08
Observed allele frequencies:
M = p = 0.59 + 0.33(1/2) = 0.755
N = q = (1-0.755) = 0.245
Do NOT compute the allele frequencies at this step by assuming HW (i.e., $p^2 \neq 0.59$, so do not take p to be the
square root of the observed MM frequency). Anyone who does this will automatically get an F in the class!
PART 2:
Expected genotype frequencies:
$p^2 = (0.755)^2 = 0.57$
$2pq = 2 \times 0.755 \times 0.245 = 0.37$
$q^2 = (0.245)^2 = 0.06$
Expected genotype counts:
$p^2 \times n = 0.57 \times 140 = 79.8$
$2pq \times n = 0.37 \times 140 = 51.8$
$q^2 \times n = 0.06 \times 140 = 8.4$
Compare these counts to the observed counts in the table above.
PART 3:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(83 - 79.8)^2}{79.8} + \frac{(46 - 51.8)^2}{51.8} + \frac{(11 - 8.4)^2}{8.4}$$
$$\chi^2 = 0.128 + 0.649 + 0.805$$
$$\chi^2 = 1.58$$
d.f. = [(number of genotype classes) – 1 (the class counts must sum to n) – 1 (p was estimated from the data)] = [3 – 1 – 1] = 1
P ≈ 0.2; i.e., there is roughly a 20% chance that we would observe a test statistic at least this large under HW,
so we cannot reject HW for these data.
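The three parts of the test can be put together in a short Python sketch. The function name `hw_chi_square` is an assumption made for this example. For one degree of freedom the upper-tail chi-square probability equals erfc(sqrt(χ²/2)), so only the standard library is needed. Because the code carries full precision instead of the rounded intermediate values used above, it gives a χ² of about 1.59 rather than 1.58.

```python
import math

def hw_chi_square(n_MM, n_MN, n_NN):
    """Goodness-of-fit test of HW proportions for one locus with two alleles."""
    n = n_MM + n_MN + n_NN
    # Observed allele frequency, counted from genotypes (NOT sqrt of the MM frequency).
    p = (2 * n_MM + n_MN) / (2 * n)
    q = 1 - p
    observed = (n_MM, n_MN, n_NN)
    expected = (p * p * n, 2 * p * q * n, q * q * n)   # HW expected counts
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = hw_chi_square(83, 46, 11)
print(round(chi2, 2), round(p_value, 2))
```

The p-value shortcut is specific to d.f. = 1; a locus with more alleles would need a general chi-square distribution function.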
Testing for HW is not possible under dominance
DOMINANCE occurs when the phenotypic effect of one allele conceals the effect of an alternate allele. An
example is the D allele of the human Rh+ blood group:
Genotype     DD     Dd     dd
Phenotype    Rh+    Rh+    Rh-
In North America the frequency of the Rh+ phenotype (DD or Dd) is 0.858 and the Rh- phenotype (dd) is 0.142.
If we are willing to assume HW then we can calculate the expected frequencies of D and d:
$q = (0.142)^{1/2} = 0.3768$
$p = (1 - 0.3768) = 0.6232$
Why not use these to test HW? These are EXPECTED frequencies under HW! The calculated frequencies fit
the expected frequencies exactly.
d.f. = 2 (for two classes of data) – 1 – 1 (for estimating q) = 0
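A brief Python sketch makes the circularity concrete. The function name is hypothetical; the code estimates q from the recessive phenotype frequency by assuming HW, then computes the phenotype frequencies that HW "predicts" from that estimate.

```python
import math

def estimate_under_dominance(recessive_phenotype_freq):
    """Estimate allele frequencies from the recessive phenotype, assuming HW."""
    q = math.sqrt(recessive_phenotype_freq)    # only valid if HW holds
    p = 1 - q
    expected_dominant = p * p + 2 * p * q      # DD + Dd, both Rh+
    expected_recessive = q * q                 # dd, Rh-
    return p, q, expected_dominant, expected_recessive

p, q, exp_dom, exp_rec = estimate_under_dominance(0.142)
print(round(p, 4), round(q, 4))
print(exp_dom, exp_rec)   # reproduces the observed 0.858 and 0.142
```

The expected phenotype frequencies reproduce the observed ones up to floating-point rounding, so the chi-square statistic is zero by construction and nothing has been tested.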
HW equilibrium with three alleles
There is no reason why the allelic variation at a locus must be limited to two variants. Micro-satellite loci
commonly have very large numbers of alleles per locus. Let’s look at HW for the case of three autosomal
alleles:
Alleles: A1, A2, and A3
Frequencies: $p_1$, $p_2$, and $p_3$
$p_1 + p_2 + p_3 = 1$
We can do a cross multiplication square that shows the HW frequencies at three alleles at a single autosomal
locus. Remember we are still making all the HW assumptions listed above.
                            Male gametes
                            A1 ($p_1$)          A2 ($p_2$)          A3 ($p_3$)
Female      A1 ($p_1$)      A1A1 $p_1^2$        A1A2 $p_1 p_2$      A1A3 $p_1 p_3$
gametes     A2 ($p_2$)      A1A2 $p_1 p_2$      A2A2 $p_2^2$        A2A3 $p_2 p_3$
            A3 ($p_3$)      A1A3 $p_1 p_3$      A2A3 $p_2 p_3$      A3A3 $p_3^2$
Frequencies in the next generation:
A1A1: $p_1^2$
A1A2: $2 p_1 p_2$
A1A3: $2 p_1 p_3$
A2A2: $p_2^2$
A2A3: $2 p_2 p_3$
A3A3: $p_3^2$
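The same pattern extends to any number of alleles: homozygotes have frequency $p_i^2$ and heterozygotes $2 p_i p_j$. A minimal Python sketch follows. The function name and the example allele frequencies (0.5, 0.3, 0.2) are hypothetical, chosen only to illustrate the rule.

```python
from itertools import combinations_with_replacement

def hw_genotype_frequencies(allele_freqs):
    """Map each unordered genotype to its expected HW frequency.

    allele_freqs: dict of allele name -> frequency; frequencies must sum to 1.
    """
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    genotypes = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        if a == b:
            genotypes[(a, b)] = allele_freqs[a] ** 2                   # homozygote
        else:
            genotypes[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote
    return genotypes

freqs = hw_genotype_frequencies({"A1": 0.5, "A2": 0.3, "A3": 0.2})
for genotype, freq in freqs.items():
    print("".join(genotype), round(freq, 4))
print("total:", round(sum(freqs.values()), 4))
```

The factor of 2 for heterozygotes corresponds to the two off-diagonal cells of the cross-multiplication square that hold the same genotype.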
Hardy-Weinberg as a null model
We have seen that under the assumptions of the Hardy-Weinberg model, i.e., a large random mating population
with no selection, the allele frequencies are stable over time. In other words, our model is the equivalent of
“no evolution”. Under this model nothing interesting ever happens: what good is such a model? It might not be
interesting, but it is valuable as our NULL MODEL. By systematically changing the assumptions of HW, through
models that explicitly relax one assumption at a time, we can investigate the effect of each on evolution. This approach is
the basis of the next five topics in population genetics. The table below summarizes some of the
assumptions involved and their effects.
                      Change in frequencies
Agency                Genotype    Allele    Notes
Linkage               no          no        Creates disequilibrium among loci
Inbreeding            yes         no        Acts on all loci in genome; results in loss of heterozygosity
Assortative Mating    yes         no        Only acts on the locus subject to assortment, and those loci linked to it
Migration (a)         yes         yes       Depends on migration rate and frequency differences between populations
Mutation              yes         yes       Very very very slow
Natural Selection     yes         yes       Acts on the locus subject to selection, and those loci linked to it
Genetic Drift         yes         yes       Acts on all loci in the genome; results in loss of heterozygosity and loss of alleles
(a) We use the term migration in population genetics to mean movement of genes or individuals among populations in terms of
“gene flow”. Migration has a different meaning in the discipline of ecology. We do not examine the impact of migration on
evolution in this course; a full course in population genetics will cover migration in detail.
Don’t worry if the table does not make much sense the first time you look at it. At some point you will look back on it and it will.